DC++ Will Require SSE2

The next version of DC++ will require SSE2 CPU support.

This represents no change for the 64-bit builds, since x86-64 includes SSE2. The last widely used CPUs lacking SSE2 support were the Athlon XPs, the last of which were released in 2004. As such, not just DC++ but also Firefox 49, Chrome on both Windows and Linux since 2014, IE 11 since 2013, and Windows 8 since 2012 all require SSE2. Empirically, Firefox developers found that just 0.4% of their users as of this May lacked SSE2, and Chrome developers measured 0.33% of their Windows stable population lacking SSE2 in 2014. To the extent that avoiding SSE2 imposes non-negligible development or runtime costs, there is increasingly little reason to keep paying them.

A straightforward advantage SSE2 provides derives from register count: non-SIMD 32-bit x86 offers only 8 general-purpose 32-bit registers, of which realistically 6 or 7 remain usable once the stack pointer and frame pointer are accounted for. SSE2 in 32-bit environments adds 8 additional registers (xmm0 through xmm7), substantially increasing x86’s architecturally visible register count.

Furthermore, these additional registers in 32-bit x86 are 128 bits wide, allowing 64-bit and 128-bit memory moves in single instructions rather than multiple 32-bit mov instructions, which also enables each register/memory move to align more efficiently on larger boundaries. Similarly, SSE2’s 64-bit arithmetic and comparisons allow native handling of all those 64-bit arithmetic, logic, and comparison operations which show up both in the Tiger hash code (designed for 64-bit CPUs, and it shows) and in the 64-bit file position handling pervasive in DC++.
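
For instance (a hypothetical helper, not code from DC++ or its Tiger implementation), a 64-bit mixing step of the kind Tiger performs might look like this:

#include <cstdint>

// Hypothetical sketch of Tiger-style 64-bit mixing. Compiled with
// g++ -O2 -m32, each uint64_t occupies a pair of 32-bit GPRs and the
// subtraction lowers to a subl/sbbl pair; adding -msse2 lets the
// compiler keep such values in xmm registers and use psubq/pxor instead.
uint64_t mix_step(uint64_t a, uint64_t b, uint64_t x) {
    a -= b ^ x;                          // 64-bit xor, then 64-bit subtract
    return a ^ 0xA5A5A5A5A5A5A5A5ULL;    // 64-bit xor against a constant
}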

Finally, there’s substantial use of 2-wide SIMD, especially when common patterns such as

foo += bar;
baz += foobar;

appear, which SSE2 handles with a single packed 64-bit integer addition (paddq), or

foo -= bar;
baz -= foobar;

which becomes a single packed 64-bit integer subtraction (psubq).
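
To make the pattern concrete, here’s a minimal hypothetical sketch (the names are invented; this isn’t DC++ code) of two adjacent 64-bit fields that the vectorizer can fuse:

#include <cstdint>

// Two adjacent 64-bit fields updated in lockstep: with -m32 -msse2 and
// optimization enabled, the compiler may load both into one xmm register
// and emit a single paddq in place of two addl/adcl pairs.
struct Counters {
    uint64_t foo;
    uint64_t baz;
};

void bump(Counters& c, uint64_t bar, uint64_t foobar) {
    c.foo += bar;
    c.baz += foobar;
}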

Putting all this together, in one of the more dramatic improvements in generated code quality resulting from this change, one can watch enabling SSE2 automatically transform part of TigerHash::update(…) from:

193:dcpp/TigerHash.cpp **** 	}
movl	168(%esp), %edi	 # %sfp, x7
movl	172(%esp), %ebp	 # %sfp, x7
movl	440(%esp), %ebx	 # %sfp, x1
movl	444(%esp), %esi	 # %sfp, x1
movl	%edi, %eax	 # x7, tmp2058
movl	412(%esp), %edx	 # %sfp, x0
xorl	$-1515870811, %eax	 #, tmp2058
movl	%eax, 488(%esp)	 # tmp2058, %sfp
movl	%ebp, %eax	 # x7, tmp2059
movl	%ebx, %ecx	 # x1, tmp2062
xorl	$-1515870811, %eax	 #, tmp2059
movl	%esi, %ebx	 # x1, tmp2063
movl	156(%esp), %esi	 # %sfp, x2
movl	%eax, 492(%esp)	 # tmp2059, %sfp
movl	408(%esp), %eax	 # %sfp, x0
subl	488(%esp), %eax	 # %sfp, x0
sbbl	492(%esp), %edx	 # %sfp, x0
xorl	%eax, %ecx	 # x0, tmp2062
movl	%ecx, 384(%esp)	 # tmp2062, %sfp
xorl	%edx, %ebx	 # x0, tmp2063
movl	384(%esp), %edi	 # %sfp, x1
movl	%ebx, 388(%esp)	 # tmp2063, %sfp
movl	152(%esp), %ebx	 # %sfp, x2
movl	388(%esp), %ebp	 # %sfp, x1
movl	%edi, %ecx	 # x1, tmp2066
notl	%ecx	 # tmp2066
addl	%edi, %ebx	 # x1, x2
movl	%ecx, 496(%esp)	 # tmp2066, %sfp
movl	%ebp, %ecx	 # x1, tmp2067
adcl	%ebp, %esi	 # x1, x2
notl	%ecx	 # tmp2067
movl	%ebx, (%esp)	 # x2, %sfp
movl	%ecx, 500(%esp)	 # tmp2067, %sfp
movl	496(%esp), %ecx	 # %sfp, tmp1093
movl	%esi, 4(%esp)	 # x2, %sfp
movl	500(%esp), %ebx	 # %sfp,
movl	(%esp), %esi	 # %sfp, x2
movl	4(%esp), %edi	 # %sfp,
shldl	$19, %ecx, %ebx	 #, tmp1093,
movl	%esi, %ebp	 # x2, tmp2069
movl	460(%esp), %esi	 # %sfp, x3
sall	$19, %ecx	 #, tmp1093
xorl	%edi, %ebx	 #, tmp2070
xorl	%ecx, %ebp	 # tmp1093, tmp2069
movl	%ebp, 504(%esp)	 # tmp2069, %sfp
movl	%ebx, 508(%esp)	 # tmp2070, %sfp
movl	456(%esp), %ebx	 # %sfp, x3
subl	504(%esp), %ebx	 # %sfp, x3
sbbl	508(%esp), %esi	 # %sfp, x3
movl	%ebx, %edi	 # x3, x3

to something of comparative beauty:

193:dcpp/TigerHash.cpp **** 	}
movl	80(%esp), %eax	 # %sfp, tmp1091
movl	84(%esp), %edx	 # %sfp,
xorl	$-1515870811, %eax	 #, tmp1091
xorl	$-1515870811, %edx	 #,
movd	%eax, %xmm0	 # tmp1091, tmp1885
movd	%edx, %xmm1	 #, tmp1886
punpckldq	%xmm1, %xmm0	 # tmp1886, tmp1885
psubq	%xmm0, %xmm7	 # tmp1885, x0
movdqa	96(%esp), %xmm1	 # %sfp, tmp2253
pxor	%xmm7, %xmm1	 # x0, tmp2253
movdqa	%xmm1, %xmm0	 # x1, tmp1843
psrlq	$32, %xmm0	 #, tmp1843
movd	%xmm1, %edx	 # tmp21, tmp2105
notl	%edx	 # tmp2105
movd	%xmm0, %eax	 #, tmp2106
notl	%eax	 # tmp2106
paddq	%xmm1, %xmm6	 # x1, x2
movl	%edx, 192(%esp)	 # tmp2105, %sfp
movdqa	%xmm1, %xmm3	 # tmp2253, x1
movl	%eax, 196(%esp)	 # tmp2106, %sfp
movl	192(%esp), %eax	 # %sfp, tmp1093
movl	196(%esp), %edx	 # %sfp,
shldl	$19, %eax, %edx	 #, tmp1093,
sall	$19, %eax	 #, tmp1093
movd	%edx, %xmm1	 #, tmp1888
movd	%eax, %xmm0	 # tmp1093, tmp1887
punpckldq	%xmm1, %xmm0	 # tmp1888, tmp1887
pxor	%xmm6, %xmm0	 # x2, tmp1094
psubq	%xmm0, %xmm5	 # tmp1094, tmp2630

Note the register-pressure spills and fills in the non-SSE version: from %eax to 492(%esp) and back into %edx three instructions later, to enable %eax to be reused; from %ecx to 500(%esp) and back into %ebx another three instructions on, to enable 496(%esp) to be left-shifted a few instructions later; and between %edi, %ecx, and that same 496(%esp), because evidently there’s not enough space to store both %ecx and its complement (notl %ecx) simultaneously with only a half-dozen GPRs.

In the SSE2 version, virtually no spills/fills remain because there are now ample registers; the movdqa from 96(%esp) to %xmm1 replaces multiple 32-bit movl instructions; the ugly addl/adcl and subl/sbbl pairs emulating 64-bit addition and subtraction with 32-bit arithmetic disappear in favor of native 64-bit arithmetic; and each pair of 32-bit xorl instructions becomes a single pxor.
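
For anyone who wants to experiment with these instructions directly, SSE2 is also exposed through intrinsics in <emmintrin.h>. The following hand-written fragment is purely illustrative (DC++ gets this code generation automatically from the compiler rather than using intrinsics), but it maps one-to-one onto the pxor/paddq/psubq seen above:

#include <emmintrin.h>
#include <cstdint>

// Illustrative only: process two 64-bit lanes at once, mirroring the
// packed instructions GCC emitted in the SSE2 listing.
void demo(uint64_t out[2], const uint64_t a[2], const uint64_t b[2]) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i x  = _mm_xor_si128(va, vb);  // pxor: two 64-bit xors at once
    __m128i s  = _mm_add_epi64(x, vb);   // paddq: two 64-bit additions
    __m128i d  = _mm_sub_epi64(s, va);   // psubq: two 64-bit subtractions
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), d);
}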

While TigerHash.cpp especially shows off SSE2’s advantage over i686-generation 32-bit x86, each of these improvements appears sprinkled in thousands of places around DC++: in function prologues, every time certain Boost template functions show up, every time __builtin_memcpy is called, and in dozens of other mundane yet common situations.
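
As one last hedged example (the struct here is invented for illustration), even a plain fixed-size copy of the sort __builtin_memcpy handles benefits:

#include <cstring>

// A hypothetical 16-byte value. With SSE2 available, GCC can copy it
// with a single 128-bit movdqu load/store pair; without SSE2, the same
// copy takes four 32-bit movl loads and four 32-bit movl stores.
struct Digest {
    unsigned char bytes[16];
};

void assign(Digest& dst, const Digest& src) {
    std::memcpy(&dst, &src, sizeof(Digest));
}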
