Disabling TLS 1.0 and 1.1 in DC++ by 2020

Following the IETF’s deprecation of TLS 1.0 and TLS 1.1, Chrome, Edge, Firefox, and Safari have announced that they’ll disable both TLS 1.0 and 1.1 during the first half of 2020. GitHub, Stripe, CloudFlare, PayPal, and KeyCDN have all already done so on the server side. The deprecated TLS 1.0 dates from 1999 and TLS 1.1 from 2006.

Meanwhile, TLS 1.2 has existed since 2008 and has been supported by OpenSSL since version 1.0.1 in 2012. DC++, and therefore modified versions based on it, has supported TLS 1.2 since version 0.850 in 2015. ncdc has likewise supported TLS 1.2 for many years. ADCH++, uhub, and Luadch all support TLS 1.2 or 1.3.

The posts “Hardening DC++ Cryptography: TLS, HTTPS, and KEYP” and “BEAST, CRIME, BREACH, and Lucky 13: Assessing TLS in ADCS” document vulnerabilities that TLS 1.0 and 1.1 allow or exacerbate, including but not limited to BEAST, Lucky 13, and any downgrade attacks discovered in TLS 1.0 or 1.1 in the future to which TLS 1.2 is not subject.

As such, DC++ has deprecated TLS 1.0 and 1.1 and will disable both by default in 2020 along with the browsers, while supporting TLS 1.2, 1.3, and newer versions, with an option to re-enable TLS 1.0 and 1.1 should that remain necessary.
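As a rough illustration of such a policy, the sketch below derives a minimum protocol version from a hypothetical allowLegacyTls option (not DC++’s actual setting name) and deliberately sets no maximum, so TLS 1.3 and newer versions stay negotiable. The numeric codes are the on-the-wire TLS version values, which are also what OpenSSL’s TLS1_VERSION through TLS1_3_VERSION constants expand to; with OpenSSL 1.1.0 or later, such a floor could be passed to SSL_CTX_set_min_proto_version. This is a sketch under those assumptions, not DC++’s implementation.

```cpp
#include <cassert>
#include <cstdint>

// TLS protocol version codes as sent on the wire. OpenSSL's TLS1_VERSION,
// TLS1_1_VERSION, TLS1_2_VERSION, and TLS1_3_VERSION use the same values.
enum class TlsVersion : std::uint16_t {
    Tls1_0 = 0x0301,
    Tls1_1 = 0x0302,
    Tls1_2 = 0x0303,
    Tls1_3 = 0x0304,
};

// allowLegacyTls is a hypothetical user option. By default the floor is
// TLS 1.2; opting in lowers it back to TLS 1.0. No maximum is imposed, so
// TLS 1.3 and newer remain available whenever the TLS library supports them.
constexpr TlsVersion minimumTlsVersion(bool allowLegacyTls) {
    return allowLegacyTls ? TlsVersion::Tls1_0 : TlsVersion::Tls1_2;
}

// Would a connection that negotiated `negotiated` satisfy the policy?
constexpr bool isAcceptable(std::uint16_t negotiated, bool allowLegacyTls) {
    return negotiated >= static_cast<std::uint16_t>(minimumTlsVersion(allowLegacyTls));
}
```

Keeping the policy as a floor only, rather than a fixed list of versions, matches the intent of supporting “TLS 1.2, 1.3, and newer versions” without code changes as new versions appear.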

SSE3 in DC++

The next DC++ release will require SSE3. Steam’s hardware survey currently lists SSE3 as having 99.96% penetration. All AMD and Intel x86 CPUs since the Athlon 64 X2 in 2005 and Intel Core in January 2006 have supported SSE3. Even earlier, though, all Pentium 4 steppings since Prescott which support the NX bit required by Windows 8 and 10 also support SSE3, which extends the effective Intel support back to 2004. I can’t find an Intel CPU which supports NX (required for Win8/10) but not SSE3. Finally, this effectively affects only 32-bit builds, since 64-bit builds exclusively use SSE for floating-point arithmetic.

Requiring SSE3 effects two basic transformations, one minor and one major, depending on which floating-point-to-integer conversion instructions g++ had previously chosen. The minor improvement derives from functions such as bool SettingsDialog::handleClosing() using one instruction rather than two, from

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    cvttss2si eax,DWORD PTR [esp+0x18] ;; eax is just a temporary
    mov    DWORD PTR [edx+0x87c],eax   ;; which is promptly stored to mem

to

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    fisttp DWORD PTR [edx+0x87c]      ;; no byway through eax (also, less register pressure)

However, cvttss2si and related SSE/SSE2 instructions sometimes don’t fit as well, so g++ had been relying on fistp instead, with terrible code generation as a result. With SSE2 but not SSE3, part of void SearchFrame::runSearch() compiles to:

	auto llsize = static_cast<int64_t>(lsize);
    fnstcw WORD PTR [ebp-0x50e]     ;; save FP control word to mem
    movzx  eax,WORD PTR [ebp-0x50e] ;; zero-extend-move it to eax
    mov    ah,0xc                   ;; build new control word
    mov    WORD PTR [ebp-0x510],ax  ;; place control word in mem for fldcw
    fld    QWORD PTR [ebp-0x520]    ;; load lsize from mem (same as below)
    fldcw  WORD PTR [ebp-0x510]     ;; load new control word
    fistp  QWORD PTR [ebp-0x548]    ;; with correct control word, round lsize
    fldcw  WORD PTR [ebp-0x50e]     ;; restore previous control word

All six lines other than fld and fistp merely scaffold the fistp that performs the actual floating-point-to-integer conversion, and this scaffolding can cost 80 cycles or more for a single innocuous-looking line of code. By contrast, using fisttp from SSE3, that same fragment collapses to:

	auto llsize = static_cast<int64_t>(lsize);
    fld    QWORD PTR [ebp-0x520]    ;; same as above; load lsize
    fisttp QWORD PTR [ebp-0x548]    ;; convert it. simple.

This pattern recurs many times throughout DC++, including in void AdcHub::handle(AdcCommand::GET, …), a portion of which halves in size and speeds up dramatically, from

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x1c],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x18]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fnstcw WORD PTR [esp+0x22]     ;; same control word dance as before
    movzx  eax,WORD PTR [esp+0x22]
    mov    ah,0xc                  ;; same control word
    mov    WORD PTR [esp+0x20],ax  ;; but fldcw loads from mem not reg
    fldcw  WORD PTR [esp+0x20]     ;; load C and C++-compatible rounding mode
    fistp  QWORD PTR [esp+0x58]    ;; the actual conversion
    fldcw  WORD PTR [esp+0x22]     ;; restore previous
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]

to, using the fisttp SSE3 instruction,

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x20],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x1c]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fisttp QWORD PTR [esp+0x58]    ;; replaces all seven lines from fnstcw to the last fldcw
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]

This specific control word save/convert float/control word restore pattern recurs 19 other times across the current codebase in the dcpp, dwt, and win32 directories, including DownloadManager::getRunningAverage(); HashBloom::get_m(size_t n, size_t k); QueueItem::getDownloadedBytes(); Transfer::getParams(); UploadManager::getRunningAverage(); Grid::calcSizes(…); HashProgressDlg::updateStats(); TransferView::on(HttpManagerListener::Updated, …); and TransferView::onTransferTick(…).
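The control-word dance exists because C and C++ define floating-point-to-integer conversion as truncation toward zero, whereas x87’s fistp rounds according to the current rounding mode, which defaults to round-to-nearest. The sketch below contrasts the two behaviors in portable C++; std::llrint stands in for an unadorned fistp because it also honors the current rounding mode. The function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// C/C++ semantics: conversion truncates toward zero. SSE3's fisttp
// implements exactly this, regardless of the x87 control word.
std::int64_t convertLikeCpp(double x) {
    return static_cast<std::int64_t>(x);
}

// std::llrint rounds using the current rounding mode, which defaults to
// round-to-nearest (ties to even), mirroring fistp under the default x87
// control word. To get C++ semantics out of fistp, compilers temporarily set
// the rounding-control field (bits 10 and 11) to 11b, round toward zero;
// the "mov ah,0xc" in the listings above sets those two bits.
std::int64_t convertLikeDefaultFistp(double x) {
    return std::llrint(x);
}
```

For 2.7 the first returns 2 and the second 3, which is why a plain fistp cannot implement static_cast without switching the rounding mode first.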

Know your FPU: Fixing Floating Fast provides microbenchmarks showing just how slow this fistp-based technique can be due to the fnstcw/fldcw 80+-cycle FPU pipeline flush and therefore how much faster code which replaces it can become:

Fixed tests...
Testing ANSI fixed() ... Time = 2974.57 ms
Testing fistp fixed()... Time = 3100.84 ms
Testing Sree fixed() ... Time =  606.80 ms

SSE3 therefore provides not merely a hidden aesthetic improvement in code generation, but a speed increase across much of DC++.
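Because a build compiled with -msse3 simply fails with an illegal-instruction fault on a CPU lacking SSE3, a program might prefer to check at startup and report the problem instead. The sketch below uses the GCC/Clang x86 builtin __builtin_cpu_supports; the function names are hypothetical, and nothing here describes how DC++ itself handles unsupported CPUs.

```cpp
#include <cassert>
#include <cstdio>

// Returns true if the running CPU reports SSE3 support (GCC/Clang, x86 only).
bool cpuHasSse3() {
    // Strictly required only when called before static constructors run;
    // calling it here as well is harmless.
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3") != 0;
}

// Hypothetical startup guard: print a readable message rather than
// crashing later on an unsupported instruction.
bool checkCpuRequirements() {
    if (!cpuHasSse3()) {
        std::fprintf(stderr, "This build requires a CPU with SSE3 support.\n");
        return false;
    }
    return true;
}
```

In practice such a check would have to live in code compiled without -msse3, or at least run before any SSE3 instruction executes, since the compiler may emit SSE3 anywhere in a translation unit built with that flag.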

Why DCNF uses HTTPS via Let’s Encrypt

All DCNF web services either use HTTPS or are being transitioned to HTTPS.

The US government’s HTTPS-only standard and Google’s “Why HTTPS Matters” describe how HTTPS enables increased website privacy, security, and integrity in general. ISPs, home routers, and antivirus software have all been caught modifying HTTP traffic, for example, which HTTPS hinders. HTTPS also improves a site’s Google search ranking and, via HTTP/2, decreases website loading time.

Somewhat more forcefully, Chrome 56 will warn users of non-HTTPS login forms, as the Firefox 50 beta already does and, according to schedule, Firefox 51 will. This will become important, for example, for the DCBase forums, which are currently under maintenance.

Beyond the obvious advantage of costing nothing, Let’s Encrypt reduces friction compared with alternatives by managing certificates for multiple subdomains automatically, and therefore scalably. Such automation also allows shorter-lived certificates and more rapid renewal, which lessens the importance of certificate revocation and of securing keys at rest, and thereby reduces HTTPS management overhead. Additionally, as cryptographic algorithms gain and lose favor, quick renewals make it easier to adapt. These advantages, of HTTPS in general and of Let’s Encrypt specifically, led DCNF to adopt HTTPS using Let’s Encrypt.

DC++ Will Require SSE2

The next version of DC++ will require SSE2 CPU support.

This represents no change for the 64-bit builds, since x86-64 includes SSE2. The last widely used CPUs lacking SSE2 support were Athlon XPs, the last of which were released in 2004. As such, not just DC++ but Firefox 49, Chrome on both Windows and Linux since 2014, IE 11 since 2013, and Windows 8 since 2012 all require SSE2. Empirically, Firefox developers found that just 0.4% of their users as of this May lacked SSE2, and Chrome developers measured 0.33% of their Windows stable population lacking SSE2 in 2014. To the extent that supporting CPUs without SSE2 imposes non-negligible development or runtime costs, the case for doing so grows increasingly thin.

One straightforward advantage of SSE2 stems from non-SIMD 32-bit x86 offering only 6 to 8 general-purpose 32-bit registers, depending on whether one counts the stack and frame pointers (esp and ebp). SSE2 in 32-bit environments adds 8 more registers, xmm0 through xmm7, substantially increasing x86’s count of architecturally named registers.

Furthermore, these additional registers are 128 bits wide, allowing 64-bit and 128-bit memory moves in single instructions rather than multiple 32-bit mov instructions, which also lets each register/memory move align more efficiently on larger boundaries. Similarly, SSE2’s 64-bit integer arithmetic and logic instructions allow native handling of the 64-bit addition, subtraction, shift, and bitwise operations that show up both in the Tiger hash code (designed for 64-bit CPUs, and it shows) and in the 64-bit file position handling pervasive in DC++.

Finally, there’s substantial use of 2-wide SIMD, especially when common patterns such as

foo += bar;
baz += foobar;

appear, which can use SSE2 packed integer addition (e.g., paddq), or

foo -= bar;
baz -= foobar;

which can use packed integer subtraction (e.g., psubq).
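To make that pairing explicit, the sketch below writes both update pairs with SSE2 intrinsics from <emmintrin.h>: _mm_add_epi64 compiles to paddq and _mm_sub_epi64 to psubq, each operating on two 64-bit lanes at once. It is x86-only, and the Pair64 type and function names are hypothetical; the text above describes the compiler forming such pairs on its own when SSE2 is enabled, without intrinsics.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; x86 and x86-64 only

// Two independent 64-bit accumulators, like foo and baz in the text.
struct Pair64 {
    std::uint64_t foo;
    std::uint64_t baz;
};

// Packs lo into the low 64-bit lane and hi into the high lane.
static __m128i load2(std::uint64_t lo, std::uint64_t hi) {
    return _mm_set_epi64x(static_cast<long long>(hi), static_cast<long long>(lo));
}

// Unpacks the two lanes: low lane -> foo, high lane -> baz.
static Pair64 store2(__m128i v) {
    alignas(16) std::uint64_t out[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), v);
    return Pair64{out[0], out[1]};
}

// foo += bar; baz += foobar;  as a single paddq
Pair64 addPairs(Pair64 p, std::uint64_t bar, std::uint64_t foobar) {
    return store2(_mm_add_epi64(load2(p.foo, p.baz), load2(bar, foobar)));
}

// foo -= bar; baz -= foobar;  as a single psubq
Pair64 subPairs(Pair64 p, std::uint64_t bar, std::uint64_t foobar) {
    return store2(_mm_sub_epi64(load2(p.foo, p.baz), load2(bar, foobar)));
}
```

Like uint64_t arithmetic, paddq and psubq wrap modulo 2^64 in each lane, so the vector form computes exactly what the two scalar statements would.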

Putting all this together in one of the more dramatic improvements in generated code quality as a result of this change, one can watch as enabling SSE2 automatically transforms part of TigerHash::update(…) from:

193:dcpp/TigerHash.cpp **** 	}
movl	168(%esp), %edi	 # %sfp, x7
movl	172(%esp), %ebp	 # %sfp, x7
movl	440(%esp), %ebx	 # %sfp, x1
movl	444(%esp), %esi	 # %sfp, x1
movl	%edi, %eax	 # x7, tmp2058
movl	412(%esp), %edx	 # %sfp, x0
xorl	$-1515870811, %eax	 #, tmp2058
movl	%eax, 488(%esp)	 # tmp2058, %sfp
movl	%ebp, %eax	 # x7, tmp2059
movl	%ebx, %ecx	 # x1, tmp2062
xorl	$-1515870811, %eax	 #, tmp2059
movl	%esi, %ebx	 # x1, tmp2063
movl	156(%esp), %esi	 # %sfp, x2
movl	%eax, 492(%esp)	 # tmp2059, %sfp
movl	408(%esp), %eax	 # %sfp, x0
subl	488(%esp), %eax	 # %sfp, x0
sbbl	492(%esp), %edx	 # %sfp, x0
xorl	%eax, %ecx	 # x0, tmp2062
movl	%ecx, 384(%esp)	 # tmp2062, %sfp
xorl	%edx, %ebx	 # x0, tmp2063
movl	384(%esp), %edi	 # %sfp, x1
movl	%ebx, 388(%esp)	 # tmp2063, %sfp
movl	152(%esp), %ebx	 # %sfp, x2
movl	388(%esp), %ebp	 # %sfp, x1
movl	%edi, %ecx	 # x1, tmp2066
notl	%ecx	 # tmp2066
addl	%edi, %ebx	 # x1, x2
movl	%ecx, 496(%esp)	 # tmp2066, %sfp
movl	%ebp, %ecx	 # x1, tmp2067
adcl	%ebp, %esi	 # x1, x2
notl	%ecx	 # tmp2067
movl	%ebx, (%esp)	 # x2, %sfp
movl	%ecx, 500(%esp)	 # tmp2067, %sfp
movl	496(%esp), %ecx	 # %sfp, tmp1093
movl	%esi, 4(%esp)	 # x2, %sfp
movl	500(%esp), %ebx	 # %sfp,
movl	(%esp), %esi	 # %sfp, x2
movl	4(%esp), %edi	 # %sfp,
shldl	$19, %ecx, %ebx	 #, tmp1093,
movl	%esi, %ebp	 # x2, tmp2069
movl	460(%esp), %esi	 # %sfp, x3
sall	$19, %ecx	 #, tmp1093
xorl	%edi, %ebx	 #, tmp2070
xorl	%ecx, %ebp	 # tmp1093, tmp2069
movl	%ebp, 504(%esp)	 # tmp2069, %sfp
movl	%ebx, 508(%esp)	 # tmp2070, %sfp
movl	456(%esp), %ebx	 # %sfp, x3
subl	504(%esp), %ebx	 # %sfp, x3
sbbl	508(%esp), %esi	 # %sfp, x3
movl	%ebx, %edi	 # x3, x3

to something of comparative beauty:

193:dcpp/TigerHash.cpp **** 	}
movl	80(%esp), %eax	 # %sfp, tmp1091
movl	84(%esp), %edx	 # %sfp,
xorl	$-1515870811, %eax	 #, tmp1091
xorl	$-1515870811, %edx	 #,
movd	%eax, %xmm0	 # tmp1091, tmp1885
movd	%edx, %xmm1	 #, tmp1886
punpckldq	%xmm1, %xmm0	 # tmp1886, tmp1885
psubq	%xmm0, %xmm7	 # tmp1885, x0
movdqa	96(%esp), %xmm1	 # %sfp, tmp2253
pxor	%xmm7, %xmm1	 # x0, tmp2253
movdqa	%xmm1, %xmm0	 # x1, tmp1843
psrlq	$32, %xmm0	 #, tmp1843
movd	%xmm1, %edx	 # tmp21, tmp2105
notl	%edx	 # tmp2105
movd	%xmm0, %eax	 #, tmp2106
notl	%eax	 # tmp2106
paddq	%xmm1, %xmm6	 # x1, x2
movl	%edx, 192(%esp)	 # tmp2105, %sfp
movdqa	%xmm1, %xmm3	 # tmp2253, x1
movl	%eax, 196(%esp)	 # tmp2106, %sfp
movl	192(%esp), %eax	 # %sfp, tmp1093
movl	196(%esp), %edx	 # %sfp,
shldl	$19, %eax, %edx	 #, tmp1093,
sall	$19, %eax	 #, tmp1093
movd	%edx, %xmm1	 #, tmp1888
movd	%eax, %xmm0	 # tmp1093, tmp1887
punpckldq	%xmm1, %xmm0	 # tmp1888, tmp1887
pxor	%xmm6, %xmm0	 # x2, tmp1094
psubq	%xmm0, %xmm5	 # tmp1094, tmp2630

The non-SSE version suffers repeated register spills and fills: from %eax to 492(%esp) and back into %edx three instructions later, so that %eax can be reused; from %ecx to 500(%esp) and back into %ebx another three instructions later, so that 496(%esp) can be left-shifted a few instructions after that; and between %edi, %ecx, and that same 496(%esp), because evidently there’s not enough room in a half-dozen GPRs to hold both %ecx and the result of notl %ecx simultaneously.

Virtually no spills or fills remain, because there are now ample registers; the movdqa from 96(%esp) to %xmm1 replaces multiple 32-bit movl instructions; the ugly addl/adcl and subl/sbbl pairs emulating 64-bit addition and subtraction with 32-bit arithmetic give way to native 64-bit arithmetic (paddq and psubq); and each pair of 32-bit xorl instructions becomes a single pxor.
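For readers less familiar with the carry chain, the sketch below spells out in portable C++ what an addl/adcl or subl/sbbl pair computes: the low 32-bit halves are combined first, and the carry (or borrow) out of that step feeds into the high halves. paddq and psubq perform the whole 64-bit operation in one instruction. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// addl/adcl: add the low halves, then add the high halves plus the carry.
std::uint64_t add64Via32(std::uint64_t a, std::uint64_t b) {
    std::uint32_t aLo = static_cast<std::uint32_t>(a), aHi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t bLo = static_cast<std::uint32_t>(b), bHi = static_cast<std::uint32_t>(b >> 32);
    std::uint32_t lo = aLo + bLo;              // addl (wraps modulo 2^32)
    std::uint32_t carry = lo < aLo ? 1u : 0u;  // the carry flag addl would set
    std::uint32_t hi = aHi + bHi + carry;      // adcl
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// subl/sbbl: subtract the low halves, then the high halves minus the borrow.
std::uint64_t sub64Via32(std::uint64_t a, std::uint64_t b) {
    std::uint32_t aLo = static_cast<std::uint32_t>(a), aHi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t bLo = static_cast<std::uint32_t>(b), bHi = static_cast<std::uint32_t>(b >> 32);
    std::uint32_t lo = aLo - bLo;                // subl (wraps modulo 2^32)
    std::uint32_t borrow = aLo < bLo ? 1u : 0u;  // the carry flag subl would set
    std::uint32_t hi = aHi - bHi - borrow;       // sbbl
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Each such operation needs two dependent instructions and four 32-bit registers for its operands, which is part of why the non-SSE listing runs out of GPRs so quickly.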

While TigerHash.cpp especially shows off SSE2’s advantage over i686-generation 32-bit x86, each of these improvements appears sprinkled across thousands of places in DC++: in function prologues, every time certain Boost template functions show up, every time __builtin_memcpy is called, and in dozens of other mundane yet common situations.

Setting up multiple-subdomain HTTPS with nginx, acme-tiny, and Let’s Encrypt

This guide briefly describes aspects of setting up nginx and acme-tiny to automatically register and renew multiple subdomains.

acme-tiny (Debian, Ubuntu, Arch, OpenBSD, FreeBSD, and Python Package Index) provides a more verifiable and more easily customizable alternative to the default Let’s Encrypt client. This proves especially useful in less mainstream contexts, where the main client tends to either work magically or fail magically, offering little between those two outcomes.

The first step is to create a multidomain CSR, which tells Let’s Encrypt which domains the certificate should cover. This configuration needs to be updated whenever subdomains are added or removed:
# OpenSSL configuration to generate a new key with signing request for an x509v3
# multidomain certificate
#
# openssl req -config bla.cnf -new | tee csr.pem
# or
# openssl req -config bla.cnf -new -out csr.pem
[ req ]
default_bits = 4096
default_md = sha512
default_keyfile = key.pem
prompt = no
encrypt_key = no

# base request
distinguished_name = req_distinguished_name

# extensions
req_extensions = v3_req

# distinguished_name
[ req_distinguished_name ]
countryName = "SE"
stateOrProvinceName = "Sollentuna"
organizationName = "Direct Connect Network Foundation"
commonName = "dcbase.org"

# req_extensions
[ v3_req ]
# https://www.openssl.org/docs/apps/x509v3_config.html
subjectAltName = DNS:dcbase.org,DNS:www.dcbase.org

Then, once satisfied with the changes, run the following in the appropriate directory to regenerate a CSR based on this configuration:
openssl req -new -key domain.key -config ~/dcbase_openssl.cnf > domain.csr
This CSR does not need to change unless the set of subdomains or other information it contains also changes. Simply renewing certificates does not require regenerating domain.csr.
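Since the subjectAltName line is the part of the configuration that changes as subdomains come and go, one could generate it from a plain list of hostnames. The helper below is a hypothetical sketch, not part of OpenSSL or acme-tiny; it also enforces Let’s Encrypt’s documented limit of 100 names per certificate.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: builds the [ v3_req ] subjectAltName line from a list
// of hostnames, so adding or removing a subdomain means editing one list.
std::string subjectAltNameLine(const std::vector<std::string>& names) {
    if (names.empty())
        throw std::invalid_argument("at least one name is required");
    // Let's Encrypt's documented limit is 100 names per certificate.
    if (names.size() > 100)
        throw std::invalid_argument("too many names for one certificate");
    std::string line = "subjectAltName = ";
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i != 0)
            line += ',';
        line += "DNS:" + names[i];
    }
    return line;
}
```

For {"dcbase.org", "www.dcbase.org"} this reproduces the subjectAltName line of the configuration above; the generated line would then replace that line before regenerating domain.csr.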

Having created a CSR, one then needs to ensure Let’s Encrypt can verify control of each domain. Under the ACME protocol Let’s Encrypt uses, it does so by fetching challenge files from /.well-known/acme-challenge/ on each domain, which nginx can serve as described in acme-tiny’s documentation:
# https://github.com/diafygi/acme-tiny#step-3-make-your-website-host-challenge-files
location /.well-known/acme-challenge/ {
    # placeholder: the directory passed to acme-tiny's --acme-dir
    alias $appropriate_challenge_location;

    allow all;
    log_not_found off;
    access_log off;

    try_files $uri =404;
}

This location needs to be reachable over plain HTTP on port 80 to work most conveniently, even if the entire rest of the site is HTTPS-only. Furthermore, this must hold even for otherwise dynamically generated sites: e.g., http://build.dcbase.org/.well-known/acme-challenge/, http://builds.dcbase.org/.well-known/acme-challenge/, http://archive.dcbase.org/.well-known/acme-challenge/, and http://forum.dcbase.org/.well-known/acme-challenge/ would all need to point to that same challenge location, even if disparate PHP CMSes generate each site or it ordinarily redirects elsewhere (such as to Google Drive).
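Under the ACME HTTP challenge, the CA fetches http://<domain>/.well-known/acme-challenge/<token> over plain HTTP. Before running acme-tiny, it can help to place a test file in the challenge directory and confirm that every subdomain actually serves it. The hypothetical helper below only builds the list of URLs to spot-check; fetching them (for instance with curl) is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds http://<domain>/.well-known/acme-challenge/<testFile> for each domain.
// The scheme is plain http because the challenge is fetched on port 80.
std::vector<std::string> challengeUrls(const std::vector<std::string>& domains,
                                       const std::string& testFile) {
    std::vector<std::string> urls;
    urls.reserve(domains.size());
    for (const std::string& domain : domains)
        urls.push_back("http://" + domain + "/.well-known/acme-challenge/" + testFile);
    return urls;
}
```

A subdomain whose CMS or redirect rule intercepts these URLs shows up here as a failed fetch, before it turns into a failed verification.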

If this works, running acme-tiny prints:
Parsing account key...
Parsing CSR...
Registering account...
Already registered!
Verifying dcbase.org...
dcbase.org verified!
Verifying www.dcbase.org...
www.dcbase.org verified!
Signing certificate...
Certificate signed!

Once this works reliably, the whole process should run automatically as a cron job, often enough to stay well ahead of Let’s Encrypt’s 90-day certificate lifetime. However, Let’s Encrypt’s rate limits mean one cannot renew too often:

The main limit is Certificates per Registered Domain (20 per week). A registered domain is, generally speaking, the part of the domain you purchased from your domain name registrar. For instance, in the name www.example.com, the registered domain is example.com. In new.blog.example.co.uk, the registered domain is example.co.uk. We use the Public Suffix List to calculate the registered domain.

If you have a lot of subdomains, you may want to combine them into a single certificate, up to a limit of 100 Names per Certificate. Combined with the above limit, that means you can issue certificates containing up to 2,000 unique subdomains per week. A certificate with multiple names is often called a SAN certificate, or sometimes a UCC certificate.
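The arithmetic in that quote, and a renewal policy that stays within it, can be sketched as follows. Because acme-tiny is given only the account key, the CSR, and the challenge directory, it has no notion of when the current certificate expires, so a cron wrapper has to decide when to invoke it (for instance, by testing expiry with openssl x509 -checkend). The 30-day renewal window below is an assumption, not a Let’s Encrypt rule, and all names are hypothetical.

```cpp
#include <cassert>

// Figures from the text above.
constexpr int kCertificateLifetimeDays = 90;
constexpr int kCertsPerRegisteredDomainPerWeek = 20;
constexpr int kNamesPerCertificate = 100;

// 100 names per certificate * 20 certificates per week = 2,000 unique names.
constexpr int kMaxNamesPerWeek = kNamesPerCertificate * kCertsPerRegisteredDomainPerWeek;
static_assert(kMaxNamesPerWeek == 2000, "matches the quoted figure");

// Hypothetical cron-wrapper policy: only invoke acme-tiny once at most
// renewWithinDays remain on the current certificate.
constexpr bool shouldRenew(int daysRemaining, int renewWithinDays = 30) {
    return daysRemaining <= renewWithinDays;
}

// Assuming each renewal succeeds, a daily job following shouldRenew()
// requests a new certificate only every (90 - renewWithinDays) days,
// far below the weekly per-domain limit.
constexpr int daysBetweenRenewals(int renewWithinDays = 30) {
    return kCertificateLifetimeDays - renewWithinDays;
}
```

Running the check daily while renewing only inside the window keeps a margin for transient failures without ever approaching 20 certificates per week.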

Once Let’s Encrypt certificate renewal is configured, Strong Ciphers for Apache, nginx and Lighttpd and BetterCrypto provide reasonable TLS configuration recommendations, while BetterCrypto’s Crypto Hardening guide discusses the rationale behind these choices in more depth.

Finally, SSL Server Test and Analyse your HTTP response headers offer sanity checks for multiple successfully secured subdomains served by nginx over HTTPS using Let’s Encrypt certificates.

Hardening DC++ Cryptography: TLS, HTTPS, and KEYP

BEAST, CRIME, BREACH, and Lucky 13 together left DC++ with no secure TLS support. Since then, the triple handshake attack, Heartbleed, POODLE for both SSL 3 and TLS, FREAK, and Logjam have multiplied hazards.

Fortunately, DC++ has responded over the intervening year and a half:

  • poy introduces direct, encrypted private messages in DC++ 0.830.
  • DC++ 0.840 sees substantial, wide-ranging improvements in KEYP and HTTPS support from Crise, anticipating Google sunsetting SHA1 by several months and detecting man-in-the-middle attempts across both KEYP and HTTPS.
  • OpenSSL 1.0.1g, included in DC++ 0.842, fixes Heartbleed.
  • DC++ 0.850 avoids CRIME and BREACH by disabling TLS compression; avoids RC4 vulnerabilities by removing support for RC4; prevents BEAST by supporting TLS 1.1 and 1.2; mitigates Lucky 13 by preferring AES-GCM ciphersuites; removes support for increasingly breakable 512-bit and 1024-bit ephemeral DH and RSA TLS keys; and, except for one deprecated ciphersuite, AES128-SHA, retained for compatibility with DC++ versions before 0.850, uses either DHE or ECDHE ciphersuites to provide perfect forward secrecy, limiting the damage from any future Heartbleed-like vulnerabilities.
  • DC++ 0.851 uses a new OpenSSL 1.0.2 API to constrain allowed elliptic curves to those for which OpenSSL provides constant-time assembly code to avoid timing side-channel attacks.
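The ciphersuite policy in the 0.850 entry can be expressed as simple checks on OpenSSL-style TLS 1.2 ciphersuite names, where an ECDHE- or DHE- prefix marks ephemeral key exchange (and thus forward secrecy) and -GCM- marks AES-GCM. The helpers below are hypothetical illustrations of that classification, not DC++ code; they ignore OpenSSL’s legacy EDH- aliases, and the 2048-bit key floor is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers classifying OpenSSL-style TLS 1.2 ciphersuite names,
// e.g. "ECDHE-RSA-AES128-GCM-SHA256" or "AES128-SHA".
bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Ephemeral (EC)DH key exchange is what provides perfect forward secrecy.
// OpenSSL's legacy "EDH-" aliases are deliberately not handled here.
bool hasForwardSecrecy(const std::string& suite) {
    return startsWith(suite, "ECDHE-") || startsWith(suite, "DHE-");
}

// AES-GCM suites are the ones preferred to mitigate Lucky 13.
bool usesAesGcm(const std::string& suite) {
    return suite.find("-GCM-") != std::string::npos;
}

// 512- and 1024-bit ephemeral DH/RSA keys are rejected; treating 2048 bits
// as the minimum that remains allowed is an assumption.
constexpr bool ephemeralKeySizeAllowed(int bits) {
    return bits >= 2048;
}
```

Under these checks, AES128-SHA is the one suite that fails hasForwardSecrecy, matching its status as the deprecated compatibility exception.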

These KEYP, TLS, and HTTPS improvements have not only fixed known weaknesses, but also meant that DC++ 0.850 and 0.851 were never vulnerable to either FREAK or Logjam. As with perfect forward secrecy, these changes increase DC++’s ongoing security against yet-unknown cryptographic developments.

The upcoming version switches URLs in documentation, in menu items, and for GeoIP downloads from HTTP to HTTPS. While these changes do not and cannot prevent every attack, they should provide improved and still-improving cryptographic security for all DC++ users.

DC Development hub revived

Following a two-month-long hiatus, adcs://hub.dcbase.org:16591 hosts the DC development hub again.