The next version of DC++ will require SSE2 CPU support.
This represents no change for the 64-bit builds, since x86-64 includes SSE2. The last widely used CPUs lacking SSE2 are Athlon XPs, the last of which were released in 2004. Accordingly, not just DC++ but Firefox 49, Chrome on both Windows and Linux since 2014, IE 11 since 2013, and Windows 8 since 2012 all require SSE2. Empirically, Firefox developers found that just 0.4% of their users as of this May lacked SSE2, and Chrome developers measured 0.33% of their Windows stable population lacking SSE2 in 2014. To the extent that continuing to support non-SSE2 CPUs imposes a non-negligible development or runtime cost, support for doing so is likely to keep thinning.
A straightforward advantage of SSE2 stems from how few registers non-SIMD 32-bit x86 offers: between 6 and 8 general-purpose 32-bit registers, depending on how one counts. SSE2 in 32-bit environments adds 8 more, substantially increasing the number of architecturally named registers available.
Furthermore, these additional registers in 32-bit x86 are 128 bits wide, allowing 64-bit and 128-bit memory moves in single instructions rather than multiple 32-bit mov instructions, which also lets each register/memory move align more efficiently on larger boundaries. Similarly, access to 64-bit arithmetic and comparisons on x86 allows native handling of the 64-bit arithmetic, logic, and comparison operations which show up both in the Tiger hash code (designed for 64-bit CPUs, and it shows) and in the 64-bit file position handling pervasive in DC++.
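To make the cost of missing 64-bit arithmetic concrete, here is a minimal sketch of what a 32-bit compiler without SSE2 effectively has to do for every 64-bit addition or subtraction. The names (Split64, add64, sub64) are hypothetical and are not taken from DC++; the code only mirrors the addl/adcl and subl/sbbl instruction pairs.

```cpp
#include <cstdint>

// A 64-bit value held as two 32-bit halves, the way a 32-bit x86 compiler
// without SSE2 has to keep it in general-purpose registers.
// (Hypothetical helper type for illustration only.)
struct Split64 {
    std::uint32_t lo;
    std::uint32_t hi;
};

inline Split64 split(std::uint64_t v) {
    return { static_cast<std::uint32_t>(v),
             static_cast<std::uint32_t>(v >> 32) };
}

inline std::uint64_t join(Split64 s) {
    return (static_cast<std::uint64_t>(s.hi) << 32) | s.lo;
}

// Mirrors addl + adcl: add the low halves, then add the high halves
// plus the carry out of the low addition.
inline Split64 add64(Split64 a, Split64 b) {
    Split64 r;
    r.lo = a.lo + b.lo;                               // wraps modulo 2^32
    const std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Mirrors subl + sbbl: subtract the low halves, then subtract the high
// halves minus the borrow from the low subtraction.
inline Split64 sub64(Split64 a, Split64 b) {
    Split64 r;
    r.lo = a.lo - b.lo;                               // wraps modulo 2^32
    const std::uint32_t borrow = (a.lo < b.lo) ? 1u : 0u;
    r.hi = a.hi - b.hi - borrow;
    return r;
}
```

Each 64-bit operand occupies two of the handful of available registers here, which is part of why this style of code spills to the stack so readily.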
Finally, the compiler makes substantial use of 2-wide SIMD. Common patterns such as
foo += bar;
baz += foobar;
can be handled via SSE2 packed integer addition (e.g., paddq), while
foo -= bar;
baz -= foobar;
can be handled via packed integer subtraction (e.g., psubq).
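In DC++ the compiler picks these instructions on its own once SSE2 is enabled; nothing in the source changes. Purely as an illustration, the same pairing can be written out by hand with the standard SSE2 intrinsics _mm_add_epi64 and _mm_sub_epi64 from emmintrin.h. The Pair64 type, the function names, and the SKETCH_HAVE_SSE2 macro below are hypothetical, and the sketch falls back to plain scalar code when SSE2 is not available at compile time.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1   // hypothetical macro, used only in this sketch
#endif

// Two independent 64-bit values, standing in for (foo, baz) or (bar, foobar).
struct Pair64 {
    std::uint64_t a;
    std::uint64_t b;
};

// foo += bar; baz += foobar;  as one packed 64-bit addition (paddq).
inline Pair64 addPairs(Pair64 x, Pair64 y) {
#ifdef SKETCH_HAVE_SSE2
    const __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x));
    const __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y));
    const __m128i sum = _mm_add_epi64(vx, vy);
    Pair64 r;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&r), sum);
    return r;
#else
    return { x.a + y.a, x.b + y.b };   // scalar fallback
#endif
}

// foo -= bar; baz -= foobar;  as one packed 64-bit subtraction (psubq).
inline Pair64 subPairs(Pair64 x, Pair64 y) {
#ifdef SKETCH_HAVE_SSE2
    const __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x));
    const __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y));
    const __m128i diff = _mm_sub_epi64(vx, vy);
    Pair64 r;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&r), diff);
    return r;
#else
    return { x.a - y.a, x.b - y.b };   // scalar fallback
#endif
}
```

Both lanes wrap modulo 2^64 independently, exactly as the two separate scalar statements would.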
Putting all this together in one of the more dramatic improvements in generated code quality as a result of this change, one can watch as enabling SSE2 automatically transforms part of TigerHash::update(…) from:
193:dcpp/TigerHash.cpp **** }
movl 168(%esp), %edi # %sfp, x7
movl 172(%esp), %ebp # %sfp, x7
movl 440(%esp), %ebx # %sfp, x1
movl 444(%esp), %esi # %sfp, x1
movl %edi, %eax # x7, tmp2058
movl 412(%esp), %edx # %sfp, x0
xorl $-1515870811, %eax #, tmp2058
movl %eax, 488(%esp) # tmp2058, %sfp
movl %ebp, %eax # x7, tmp2059
movl %ebx, %ecx # x1, tmp2062
xorl $-1515870811, %eax #, tmp2059
movl %esi, %ebx # x1, tmp2063
movl 156(%esp), %esi # %sfp, x2
movl %eax, 492(%esp) # tmp2059, %sfp
movl 408(%esp), %eax # %sfp, x0
subl 488(%esp), %eax # %sfp, x0
sbbl 492(%esp), %edx # %sfp, x0
xorl %eax, %ecx # x0, tmp2062
movl %ecx, 384(%esp) # tmp2062, %sfp
xorl %edx, %ebx # x0, tmp2063
movl 384(%esp), %edi # %sfp, x1
movl %ebx, 388(%esp) # tmp2063, %sfp
movl 152(%esp), %ebx # %sfp, x2
movl 388(%esp), %ebp # %sfp, x1
movl %edi, %ecx # x1, tmp2066
notl %ecx # tmp2066
addl %edi, %ebx # x1, x2
movl %ecx, 496(%esp) # tmp2066, %sfp
movl %ebp, %ecx # x1, tmp2067
adcl %ebp, %esi # x1, x2
notl %ecx # tmp2067
movl %ebx, (%esp) # x2, %sfp
movl %ecx, 500(%esp) # tmp2067, %sfp
movl 496(%esp), %ecx # %sfp, tmp1093
movl %esi, 4(%esp) # x2, %sfp
movl 500(%esp), %ebx # %sfp,
movl (%esp), %esi # %sfp, x2
movl 4(%esp), %edi # %sfp,
shldl $19, %ecx, %ebx #, tmp1093,
movl %esi, %ebp # x2, tmp2069
movl 460(%esp), %esi # %sfp, x3
sall $19, %ecx #, tmp1093
xorl %edi, %ebx #, tmp2070
xorl %ecx, %ebp # tmp1093, tmp2069
movl %ebp, 504(%esp) # tmp2069, %sfp
movl %ebx, 508(%esp) # tmp2070, %sfp
movl 456(%esp), %ebx # %sfp, x3
subl 504(%esp), %ebx # %sfp, x3
sbbl 508(%esp), %esi # %sfp, x3
movl %ebx, %edi # x3, x3
to something of comparative beauty:
193:dcpp/TigerHash.cpp **** }
movl 80(%esp), %eax # %sfp, tmp1091
movl 84(%esp), %edx # %sfp,
xorl $-1515870811, %eax #, tmp1091
xorl $-1515870811, %edx #,
movd %eax, %xmm0 # tmp1091, tmp1885
movd %edx, %xmm1 #, tmp1886
punpckldq %xmm1, %xmm0 # tmp1886, tmp1885
psubq %xmm0, %xmm7 # tmp1885, x0
movdqa 96(%esp), %xmm1 # %sfp, tmp2253
pxor %xmm7, %xmm1 # x0, tmp2253
movdqa %xmm1, %xmm0 # x1, tmp1843
psrlq $32, %xmm0 #, tmp1843
movd %xmm1, %edx # tmp21, tmp2105
notl %edx # tmp2105
movd %xmm0, %eax #, tmp2106
notl %eax # tmp2106
paddq %xmm1, %xmm6 # x1, x2
movl %edx, 192(%esp) # tmp2105, %sfp
movdqa %xmm1, %xmm3 # tmp2253, x1
movl %eax, 196(%esp) # tmp2106, %sfp
movl 192(%esp), %eax # %sfp, tmp1093
movl 196(%esp), %edx # %sfp,
shldl $19, %eax, %edx #, tmp1093,
sall $19, %eax #, tmp1093
movd %edx, %xmm1 #, tmp1888
movd %eax, %xmm0 # tmp1093, tmp1887
punpckldq %xmm1, %xmm0 # tmp1888, tmp1887
pxor %xmm6, %xmm0 # x2, tmp1094
psubq %xmm0, %xmm5 # tmp1094, tmp2630
The non-SSE version is riddled with spills and fills caused by register pressure: from %eax to 492(%esp) and back into %edx three instructions later so that %eax can be reused; from %ecx to 500(%esp) and back into %ebx another three instructions later so that 496(%esp) can be left-shifted a few instructions after that; and between %edi, %ecx, and that same 496(%esp) because, evidently, a half-dozen GPRs leave too little room to store both %ecx and notl %ecx at once.
In the SSE2 version, virtually no spills/fills remain because there are now ample registers; the movdqa from 96(%esp) to %xmm1 replaces multiple 32-bit movl instructions; the ugly addl/adcl and subl/sbbl pairs emulating 64-bit addition and subtraction using 32-bit arithmetic give way to native 64-bit arithmetic; and each pair of 32-bit xorl instructions becomes a single pxor.
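For orientation, both listings appear to correspond to the first few statements of Tiger's key schedule: the constant -1515870811 is 0xA5A5A5A5 as a signed 32-bit value, and the x0, x1, x2, x3, x7 annotations and the shift by 19 match it. The sketch below writes those statements out following the published Tiger reference implementation; the function name and the array form of the state are assumptions made for illustration, not DC++'s actual code.

```cpp
#include <cstdint>

// First four statements of Tiger's key schedule, operating on the eight
// 64-bit words of the message block. keyScheduleHead is a hypothetical
// name; the real key schedule continues with twelve more statements.
inline void keyScheduleHead(std::uint64_t x[8]) {
    x[0] -= x[7] ^ 0xA5A5A5A5A5A5A5A5ULL;  // xorl/xorl + subl/sbbl, or psubq
    x[1] ^= x[0];                          // xorl/xorl, or pxor
    x[2] += x[1];                          // addl/adcl, or paddq
    x[3] -= x[2] ^ ((~x[1]) << 19);        // notl, shldl/sall, xor, then subtract
}
```

Four short source lines on 64-bit values are what expand into roughly fifty 32-bit instructions in the first listing and about thirty in the second.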
While TigerHash.cpp especially shows off SSE2’s advantage over i686-generation 32-bit x86, each of these improvements appears sprinkled in thousands of places around DC++: in function prologues, every time certain Boost template functions show up, every time __builtin_memcpy is called, and in dozens of other mundane yet common situations.