September 24, 2016
The next version of DC++ will require SSE2 CPU support.
This represents no change for the 64-bit builds, since x86-64 includes SSE2. The last widely used CPUs lacking SSE2 support are the Athlon XPs, the last of which were released in 2004. As such, not just DC++ but also Firefox 49, Chrome on both Windows and Linux since 2014, IE 11 since 2013, and Windows 8 since 2012 all require SSE2. Empirically, Firefox developers found that just 0.4% of their users as of this May lacked SSE2, and Chrome developers measured 0.33% of their Windows stable population lacking SSE2 in 2014, suggesting that, to the extent avoiding SSE2 imposes non-negligible development or runtime cost, there is increasingly thin justification for continuing to avoid it.
A straightforward advantage SSE2 provides derives from non-SIMD 32-bit x86 supporting only, arguably, between 6 and 8 general-purpose 32-bit registers. SSE2 in 32-bit environments adds 8 additional registers, substantially increasing the number of x86’s architecturally named registers.
Furthermore, these additional registers in 32-bit x86 are 128 bits wide, allowing 64-bit and 128-bit memory moves in single instructions rather than multiple 32-bit mov instructions, which also enables each register/memory move to align more efficiently on larger boundaries. Similarly, access to 64-bit arithmetic and comparisons on x86 allows native handling of all those 64-bit arithmetic, logic, and comparison operations which show up both in the Tiger hash code (designed for 64-bit CPUs, and it shows) and in the 64-bit file position handling pervasive in DC++.
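To illustrate the kind of code involved, here is a minimal sketch (not the actual TigerHash::update implementation, just the flavor of 64-bit arithmetic involved): without SSE2, a 32-bit build has to lower each 64-bit addition, subtraction, and exclusive-or into pairs of 32-bit instructions, while an SSE2 build can keep the values in XMM registers and handle each operation natively.
#include <cstdint>

// Minimal sketch only, not DC++ code: the sort of 64-bit arithmetic that
// Tiger hashing and file position handling rely on. On plain i686 each of
// these operations needs a pair of 32-bit instructions; with SSE2 the
// compiler may instead keep the values in 128-bit registers.
uint64_t mix64(uint64_t a, uint64_t b, uint64_t x) {
    a ^= x;         // two xorl without SSE2, potentially one pxor with it
    b -= a;         // subl/sbbl without SSE2, potentially one psubq with it
    b += a << 19;   // addl/adcl without SSE2, potentially one paddq with it
    return a ^ b;
}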
Finally, there’s substantial use of 2-wide SIMD, especially when common patterns such as
foo += bar; baz += foobar;
or
foo -= bar; baz -= foobar;
appear, which map to SSE2 packed integer addition (e.g., paddq) and packed integer subtraction (e.g., psubq) respectively.
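Expressed manually with SSE2 intrinsics, that packing looks roughly like the following hypothetical sketch; DC++ itself contains no intrinsics, since GCC performs the transformation automatically once SSE2 is enabled.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Hypothetical illustration only: two independent 64-bit additions collapse
// into a single paddq, and two independent 64-bit subtractions into a single
// psubq, rather than two addl/adcl or subl/sbbl pairs.
void add_pair(uint64_t v[2], const uint64_t w[2]) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w));
    a = _mm_add_epi64(a, b);   // v[0] += w[0]; v[1] += w[1]; in one paddq
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v), a);
}

void sub_pair(uint64_t v[2], const uint64_t w[2]) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w));
    a = _mm_sub_epi64(a, b);   // v[0] -= w[0]; v[1] -= w[1]; in one psubq
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v), a);
}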
Putting all this together in one of the more dramatic improvements in generated code quality as a result of this change, one can watch as enabling SSE2 automatically transforms part of TigerHash::update(…) from:
193:dcpp/TigerHash.cpp **** }
movl 168(%esp), %edi # %sfp, x7
movl 172(%esp), %ebp # %sfp, x7
movl 440(%esp), %ebx # %sfp, x1
movl 444(%esp), %esi # %sfp, x1
movl %edi, %eax # x7, tmp2058
movl 412(%esp), %edx # %sfp, x0
xorl $-1515870811, %eax #, tmp2058
movl %eax, 488(%esp) # tmp2058, %sfp
movl %ebp, %eax # x7, tmp2059
movl %ebx, %ecx # x1, tmp2062
xorl $-1515870811, %eax #, tmp2059
movl %esi, %ebx # x1, tmp2063
movl 156(%esp), %esi # %sfp, x2
movl %eax, 492(%esp) # tmp2059, %sfp
movl 408(%esp), %eax # %sfp, x0
subl 488(%esp), %eax # %sfp, x0
sbbl 492(%esp), %edx # %sfp, x0
xorl %eax, %ecx # x0, tmp2062
movl %ecx, 384(%esp) # tmp2062, %sfp
xorl %edx, %ebx # x0, tmp2063
movl 384(%esp), %edi # %sfp, x1
movl %ebx, 388(%esp) # tmp2063, %sfp
movl 152(%esp), %ebx # %sfp, x2
movl 388(%esp), %ebp # %sfp, x1
movl %edi, %ecx # x1, tmp2066
notl %ecx # tmp2066
addl %edi, %ebx # x1, x2
movl %ecx, 496(%esp) # tmp2066, %sfp
movl %ebp, %ecx # x1, tmp2067
adcl %ebp, %esi # x1, x2
notl %ecx # tmp2067
movl %ebx, (%esp) # x2, %sfp
movl %ecx, 500(%esp) # tmp2067, %sfp
movl 496(%esp), %ecx # %sfp, tmp1093
movl %esi, 4(%esp) # x2, %sfp
movl 500(%esp), %ebx # %sfp,
movl (%esp), %esi # %sfp, x2
movl 4(%esp), %edi # %sfp,
shldl $19, %ecx, %ebx #, tmp1093,
movl %esi, %ebp # x2, tmp2069
movl 460(%esp), %esi # %sfp, x3
sall $19, %ecx #, tmp1093
xorl %edi, %ebx #, tmp2070
xorl %ecx, %ebp # tmp1093, tmp2069
movl %ebp, 504(%esp) # tmp2069, %sfp
movl %ebx, 508(%esp) # tmp2070, %sfp
movl 456(%esp), %ebx # %sfp, x3
subl 504(%esp), %ebx # %sfp, x3
sbbl 508(%esp), %esi # %sfp, x3
movl %ebx, %edi # x3, x3
to something of comparative beauty:
193:dcpp/TigerHash.cpp **** }
movl 80(%esp), %eax # %sfp, tmp1091
movl 84(%esp), %edx # %sfp,
xorl $-1515870811, %eax #, tmp1091
xorl $-1515870811, %edx #,
movd %eax, %xmm0 # tmp1091, tmp1885
movd %edx, %xmm1 #, tmp1886
punpckldq %xmm1, %xmm0 # tmp1886, tmp1885
psubq %xmm0, %xmm7 # tmp1885, x0
movdqa 96(%esp), %xmm1 # %sfp, tmp2253
pxor %xmm7, %xmm1 # x0, tmp2253
movdqa %xmm1, %xmm0 # x1, tmp1843
psrlq $32, %xmm0 #, tmp1843
movd %xmm1, %edx # tmp21, tmp2105
notl %edx # tmp2105
movd %xmm0, %eax #, tmp2106
notl %eax # tmp2106
paddq %xmm1, %xmm6 # x1, x2
movl %edx, 192(%esp) # tmp2105, %sfp
movdqa %xmm1, %xmm3 # tmp2253, x1
movl %eax, 196(%esp) # tmp2106, %sfp
movl 192(%esp), %eax # %sfp, tmp1093
movl 196(%esp), %edx # %sfp,
shldl $19, %eax, %edx #, tmp1093,
sall $19, %eax #, tmp1093
movd %edx, %xmm1 #, tmp1888
movd %eax, %xmm0 # tmp1093, tmp1887
punpckldq %xmm1, %xmm0 # tmp1888, tmp1887
pxor %xmm6, %xmm0 # x2, tmp1094
psubq %xmm0, %xmm5 # tmp1094, tmp2630
Note the register-pressure spills and fills in the non-SSE version: from %eax to 492(%esp) and back into %edx three instructions later, so that %eax can be reused; from %ecx to 500(%esp) and back into %ebx three instructions later, so that 496(%esp) can be left-shifted a few instructions after that; and between %edi, %ecx, and that same 496(%esp), because evidently there isn’t enough room to keep both %ecx and its complemented copy (notl %ecx) live simultaneously with only a half-dozen GPRs.
Virtually no spills/fills remain because there are now ample registers; the movdqa from 96(%esp) to %xmm1 replaces multiple 32-bit movl instructions; the ugly addl/adcl and subl/sbbl pairs emulating 64-bit addition and subtraction using 32-bit arithmetic disappear in favor of native 64-bit arithmetic; and each pair of 32-bit xorl instructions becomes a single pxor.
While TigerHash.cpp especially shows off SSE2’s advantage over i686-generation 32-bit x86, each of these improvements appears sprinkled in thousands of places around DC++: in function prologues, every time certain Boost template functions show up, every time __builtin_memcpy is called, and in dozens of other mundane yet common situations.
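For anyone who wants to reproduce comparisons like the listings above, something along these lines works with GCC; the exact flags DC++’s build system passes may differ, but -fverbose-asm is what produces the annotated output shown (assuming DC++’s include paths are supplied as well):
g++ -m32 -march=i686 -O2 -S -fverbose-asm dcpp/TigerHash.cpp -o tiger-i686.s
g++ -m32 -march=i686 -msse2 -mfpmath=sse -O2 -S -fverbose-asm dcpp/TigerHash.cpp -o tiger-sse2.s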