SSE3 in DC++

The next DC++ release will require SSE3. Steam’s hardware survey currently lists SSE3 as having 99.96% penetration. All AMD and Intel x86 CPUs since the Athlon 64 X2 in 2005 and Intel Core in January 2006 have supported SSE3. Even earlier, though, all Pentium 4 steppings since Prescott which support the NX bit required by Windows 8 and 10 also support SSE3, which extends the effective Intel support back to 2004. I can’t find an Intel CPU which supports NX (required for Win8/10) but not SSE3. Finally, this effectively affects only 32-bit builds, since 64-bit builds exclusively use SSE for floating-point arithmetic.

This effects two basic transformations, one minor and one major, depending on how well the existing code compiles. The minor improvement derives from functions such as bool SettingsDialog::handleClosing() using one instruction rather than two, from

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    cvttss2si eax,DWORD PTR [esp+0x18] ;; eax is just a temporary
    mov    DWORD PTR [edx+0x87c],eax   ;; which is promptly stored to mem

to

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    fisttp DWORD PTR [edx+0x87c]      ;; no byway through eax (also, less register pressure)

However, sometimes cvttss2si and related SSE/SSE2 instructions don’t fit as well, so g++ had been relying on fistp. These instances previously produced terrible code generation; without SSE3, using only instructions up through SSE2, part of void SearchFrame::runSearch() compiles to:

	auto llsize = static_cast<int64_t>(lsize);
    fnstcw WORD PTR [ebp-0x50e]     ;; save FP control word to mem
    movzx  eax,WORD PTR [ebp-0x50e] ;; zero-extend-move it to eax
    mov    ah,0xc                   ;; build new control word
    mov    WORD PTR [ebp-0x510],ax  ;; place control word in mem for fldcw
    fld    QWORD PTR [ebp-0x520]    ;; load lsize from mem (same as below)
    fldcw  WORD PTR [ebp-0x510]     ;; load new control word
    fistp  QWORD PTR [ebp-0x548]    ;; with correct control word, round lsize
    fldcw  WORD PTR [ebp-0x50e]     ;; restore previous control word

The six instructions other than fld and fistp (fnstcw, movzx, the two mov, and both fldcw) just scaffold around the actual fistp doing the floating point-to-int rounding, which can cost 80 cycles or more for this single innocuous-looking line of code. By contrast, using fisttp from SSE3, that same fragment collapses to:

	auto llsize = static_cast<int64_t>(lsize);
    fld    QWORD PTR [ebp-0x520]    ;; same as above; load lsize
    fisttp QWORD PTR [ebp-0x548]    ;; convert it. simple.

This pattern recurs many times through DC++, including void AdcHub::handle(AdcCommand::GET which has a portion halving in size and dramatically increasing in speed from

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x1c],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x18]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fnstcw WORD PTR [esp+0x22]     ;; same control word dance as before
    movzx  eax,WORD PTR [esp+0x22]
    mov    ah,0xc                  ;; same control word
    mov    WORD PTR [esp+0x20],ax  ;; but fldcw loads from mem not reg
    fldcw  WORD PTR [esp+0x20]     ;; load C and C++-compatible rounding mode
    fistp  QWORD PTR [esp+0x58]    ;; the actual conversion
    fldcw  WORD PTR [esp+0x22]     ;; restore previous
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]

to, using the fisttp SSE3 instruction,

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x20],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x1c]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fisttp QWORD PTR [esp+0x58]    ;; replaces the seven instructions from fnstcw through the final fldcw
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]
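
For readers who prefer the source-level view, the size check behind this listing can be sketched in isolation. This is only a sketch, not the DC++ code: roundUp is a hypothetical stand-in for Util::roundUp, assumed here to round up to the next multiple of its second argument, and bloomSizeTooLarge is an illustrative name. The cast of the double to int64_t is the conversion that turns into fistp or fisttp in a 32-bit build.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Util::roundUp: assumed to round size up to the
// next multiple of blockSize.
inline int64_t roundUp(int64_t size, int64_t blockSize) {
    return ((size + blockSize - 1) / blockSize) * blockSize;
}

// Illustrative helper (not a DC++ function): true when a requested Bloom
// filter of m bits exceeds the allowed slack for n items and k hashes,
// or cannot be addressed with h bits.
inline bool bloomSizeTooLarge(std::size_t m, std::size_t n, std::size_t k, std::size_t h) {
    // The double -> int64_t cast truncates toward zero. In 32-bit x87 code
    // this is the conversion that needs fistp plus control word changes,
    // or a single fisttp with SSE3.
    const int64_t ideal = static_cast<int64_t>(n * k / std::log(2.));
    const int64_t limit = 5 * roundUp(ideal, 64);
    return m > static_cast<std::size_t>(limit) ||
           (h < 32 && m > static_cast<std::size_t>(1U << h));
}
```

With n = 1000 and k = 4, the ideal size truncates to 5770 bits, rounds up to 5824, and the limit with slack is 29120 bits.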

This specific control word save/convert float/control word restore pattern recurs 19 other times across the current codebase in the dcpp, dwt, and win32 directories, including DownloadManager::getRunningAverage(); HashBloom::get_m(size_t n, size_t k); QueueItem::getDownloadedBytes(); Transfer::getParams(); UploadManager::getRunningAverage(); Grid::calcSizes(…); HashProgressDlg::updateStats(); TransferView::on(HttpManagerListener::Updated, …); and TransferView::onTransferTick(…).

Know your FPU: Fixing Floating Fast provides microbenchmarks showing just how slow this fistp-based technique can be due to the fnstcw/fldcw 80+-cycle FPU pipeline flush and therefore how much faster code which replaces it can become:

Fixed tests...
Testing ANSI fixed() ... Time = 2974.57 ms
Testing fistp fixed()... Time = 3100.84 ms
Testing Sree fixed() ... Time =  606.80 ms

SSE3 provides not simply some hidden code generation aesthetic quality improvement, but a speed increase across much of DC++.
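
The reason fistp needs the control word dance at all is that C and C++ require a floating-point-to-integer cast to truncate toward zero, while a bare fistp rounds according to whatever rounding mode the FPU control word currently holds. A minimal sketch of the two behaviours, with illustrative function names that do not appear in DC++:

```cpp
#include <cmath>
#include <cstdint>

// Truncating conversion, as required for C and C++ casts: the fractional
// part is discarded regardless of the current rounding mode. This is the
// operation fisttp performs in one instruction.
inline int64_t truncateToInt64(double value) {
    return static_cast<int64_t>(value);
}

// Conversion that honours the current rounding mode (round-to-nearest-even
// unless the program changed it), which is the behaviour of a bare fistp.
inline int64_t roundToInt64(double value) {
    return std::llrint(value);
}
```

Building a 32-bit test program with something like g++ -m32 -O2 with and without -msse3 and comparing the disassembly of the first function is one way to check whether a given compiler version makes the same substitution; the exact output may differ from the listings above.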

Splitting IDENTIFY to support multiple share profiles in ADC

ADC does not order the IDENTIFY and NORMAL states precisely enough for ADC clients to properly support multiple share profiles. Several client software-independent observations imply this protocol deficiency:

  • ADC clients define download queue sources by CID, such that if a sharing client presents multiple shares it must be through different CIDs, barring backwards-incompatible and queue-crippling requirements to only connect to a source via the hub through which it was queued.
  • A multiply-sharing ADC client in the server role must know the CTM token associated with a client-client connection to determine unambiguously which shares to present and therefore which CID to present to the non-server client.
  • ADC’s SUP specification, as illustrated by the example client-client connection, states that when “the server receives this message in a client-client connection in the PROTOCOL state, it should reply in kind, send an INF about itself, and move to the IDENTIFY state”; this implies the server client sending its CINF before the non-server client sends the CTM token in the TO field with its CINF.
  • Either the server or non-server client may be the downloader and vice versa. As such, by the time both the server and non-server clients in a client-client connection send their CINF commands, they must know which CTM token to associate with the connection, since either may be a multiply-sharing client about to upload files.
  • The non-server client can unambiguously track which client-client connections it should associate with each CTM token by locally associating that token with each outbound client-client connection it creates, an association a server client listening for inbound connections cannot reliably create until the non-server client sends it a CINF with a token field.

Together, these ADC properties show that a server client which uploads using multiple share profiles must know which CID to send, but must do so before it has enough information to determine via the CTM token the correct share profile and thus the correct CID. Such a putatively multiply-sharing ADC client cannot, therefore, remain consistent with all of the listed constraints.
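
The dependency can be made concrete with a small model. This is a hypothetical sketch, not code from any ADC client: ShareProfiles, expectConnection, and cidFor are illustrative names, and the tokens and CIDs are placeholder strings. It only shows that the CID lookup has no answer until a token is available.

```cpp
#include <map>
#include <string>

// Hypothetical model of a multiply-sharing client: each outstanding CTM
// token maps to the CID of the share profile it was issued for.
class ShareProfiles {
public:
    void expectConnection(const std::string& token, const std::string& cid) {
        cidByToken_[token] = cid;
    }

    // Returns the CID to present, or nullptr while the choice is still
    // ambiguous (no token received yet, or an unknown token).
    const std::string* cidFor(const std::string* token) const {
        if (token == nullptr) return nullptr;
        auto it = cidByToken_.find(*token);
        return it == cidByToken_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, std::string> cidByToken_;
};
```

A server client that must send its CINF before receiving the non-server client’s CINF is in the nullptr case: it has several candidate CIDs and nothing to choose between them.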

Most constraints prove impractical or undesirable to change, but by clarifying the SUP specification and IDENTIFY states, one can fix this ADC oversight while remaining compatible with DC++ and ncdc, with jucy apparently requiring adjustment. In particular, I propose to:

  1. Modify SUP and INF to require that the non-server client, rather than the server client, send the first INF; and
  2. in order to do so, split the IDENTIFY state into SERVER-IDENTIFY and CLIENT-IDENTIFY, whereby
  3. the next state after SUP in a client-client connection is CLIENT-IDENTIFY, which transitions to SERVER-IDENTIFY, which finally transitions as now to NORMAL

This effectively splits the IDENTIFY state into CLIENT-IDENTIFY and SERVER-IDENTIFY to ensure that they send their CINF commands in an order consistent with the requirement that both clients know the CTM token when they send their CINF command, finally allowing ADC to reliably support multiple share profiles.
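
The proposed ordering can be sketched as a tiny state machine. The state names follow the proposal above; the Event names and the next function are assumptions made for illustration and are not part of the ADC specification.

```cpp
#include <stdexcept>

// Client-client connection states as proposed: CLIENT_IDENTIFY and
// SERVER_IDENTIFY replace the single IDENTIFY state.
enum class State { PROTOCOL, CLIENT_IDENTIFY, SERVER_IDENTIFY, NORMAL };

// Illustrative events: SUP exchange finished, the non-server client's CINF
// (carrying the CTM token), and the server client's CINF.
enum class Event { SUP_EXCHANGED, CLIENT_INF, SERVER_INF };

// Returns the next state, or throws if the event is out of order.
inline State next(State s, Event e) {
    switch (s) {
    case State::PROTOCOL:
        if (e == Event::SUP_EXCHANGED) return State::CLIENT_IDENTIFY;
        break;
    case State::CLIENT_IDENTIFY:
        if (e == Event::CLIENT_INF) return State::SERVER_IDENTIFY;
        break;
    case State::SERVER_IDENTIFY:
        if (e == Event::SERVER_INF) return State::NORMAL;
        break;
    case State::NORMAL:
        break;
    }
    throw std::logic_error("event not allowed in this state");
}
```

In this model a server client’s CINF arriving in CLIENT_IDENTIFY is rejected, which is exactly the ordering that guarantees the token is known before the server client picks a CID.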

Such a change appears compatible with both DC++ and ncdc, because both simply respond to CSUP with CINF immediately, regardless of what its partner in a client-client connection does. The only change required in DC++ and ncdc is for the server client to wait for the non-server client to send its CINF before sending a reply CINF rather than replying immediately to the non-server client’s CSUP.

jucy would need adjustment because it currently triggers a non-server client’s CINF, containing the CTM token, only in response to the server client’s pre-token CINF. A server client which waits for a jucy-based non-server client to send the first CINF will wait indefinitely.

Thus, by simply requiring that the non-server client in a client-client connection send its CINF first, in a manner already compatible with DC++-based clients and ncdc and almost compatible with jucy, ADC-based clients can finally provide reliable multiple share profiles.

Software and code for TTH generation and validation

There are multiple sites dedicated to providing information about the generation and validation of TTHs. TTH handling has been implemented in many languages, so you should see this list as a small selection of all the implementations out there.

  • The TigerTree Hash Code project at SourceForge intends to provide implementations for multiple languages.
  • ThexCS – TTH (tiger tree hash) maker in C# at CodeProject intends to provide a UI for simple TTH generation.
  • tthsum is probably the most widely known stand-alone application that can generate TTHs. The TTH generation was the one from DC++, but later changed to the original Tiger authors’ implementation. tthsum is in most Linux distributions.
  • Obviously, DC clients can generate TTHs…

    Don’t forget that you can make topic suggestions for blog posts in our “Blog Topic Suggestion Box!”

    The GPL is not a EULA

    GPL software therefore gains nothing by prompting the user to agree or disagree with the GPL, and DC++ will stop doing so. This holds for both the GPLv2 and GPLv3:

    Some software packaging systems have a place which requires you to click through or otherwise indicate assent to the terms of the GPL. This is neither required nor forbidden. With or without a click through, the GPL’s rules remain the same.

    Merely agreeing to the GPL doesn’t place any obligations on you. You are not required to agree to anything to merely use software which is licensed under the GPL. You only have obligations if you modify or distribute the software. If it really bothers you to click through the GPL, nothing stops you from hacking the GPLed software to bypass this.

    The upcoming version of DC++ therefore does not ask the user to assent or otherwise to the GPL during installation.

    Interestingly, several other top-ranking GPL-using SourceForge projects, about half of the tested sample, equally uselessly also require Windows users to agree to the GPL before allowing installation:

    Copyright notice for Tiger implementations

    Jacek and I began discussing the Tiger implementation in DC++ a while back. The implementation is a C++ port of the original C code available on the official Tiger website: http://www.cs.technion.ac.il/~biham/Reports/Tiger/

    What was noticeable about the C code was that there was no license attached; this means that the implementation falls under default copyright laws (in this case, US laws). As such, any type of modification or derivative (which the C++ implementation might be considered to be) is most likely not directly allowed.

    I sent an e-mail to the authors to rectify the situation and make sure there is sound lawful ground for DC++, other derivatives and users of the C++ code.

    The following is what I sent to Eli Biham, one of the authors;

    Under what license is your C implementation of the Tiger (http://www.cs.technion.ac.il/~biham/Reports/Tiger/) algorithm? The source code doesn’t state any license explicitly, nor does the Tiger main page. As it stands now, it is not possible to use your implementation, as is, in an application without getting explicit permission from you.

    Biham responded;

    Dear Fredrik,

    I hereby allow you to use it, provided it will compute Tiger, and state
    the names of the authors in it.

    Clearly, the usual disclaimers hold, e.g., that it’s use will be legal,
    and that it will not be exported to countries banned by law, that the
    authors will not be responsible for the code, your software, nor
    anything else.

    Regards,

    Eli

    In an effort to create a cleaner phrasing that would suit source code, I rephrased and added the following to the DC++ source (in TigerHash.cpp/h, in DC++ 0.780);

    /*
    * The Tiger algorithm was written by Eli Biham and Ross Anderson and
    * is available on the official Tiger algorithm page <http://www.cs.technion.ac.il/~biham/Reports/Tiger/>.
    * The below Tiger implementation is a C++ version of their original C code.
    * Permission was granted by Eli Biham to use with the following conditions;
    * a) This note must be retained.
    * b) The algorithm must correctly compute Tiger.
    * c) The algorithm’s use must be legal.
    * d) The algorithm may not be exported to countries banned by law.
    * e) The authors of the C code are not responsible of this use of the code,
    * the software or anything else.
    */

    If you are using the C implementation or a derivative (including DC++’s implementation), you must include a similar notice.

    Feel free to use my phrasing or write your own (adhering to Biham’s restrictions).

    Translation update

    There’s a new release coming in a few days and the translation templates on launchpad have been updated – if you want your language to be complete please have a look…

    In the future I’ll only post these updates to http://sourceforge.net/mailarchive/forum.php?forum_name=dcplusplus-devel, so please subscribe if you’re interested in keeping your translation up to date…it’s a low-traffic list, so you don’t have to worry…

    DC++ at the local Bazaar

    The latest news in DC++ development is the move of the repository from Subversion to Bazaar.

    How does this affect the normal DC++ user? Not at all. It only affects the more advanced users who compile their own binaries from the latest source code, and the people who want to contribute patches.

    Bazaar is a new version control system similar to CVS and Subversion that presumably has more advantages than the other two. You can still use the old svn repository from sourceforge, because the new Bazaar one and this one will be synchronized automatically (with a delay of a few days at most); the sole difference is that commits are now made first to Bazaar and then appear in the svn repository.

    To access the Bazaar repository, you need the Bazaar client, available from here. To install bzr you also need at least Python 2.4.
    Once you have it installed and in your path, you can check out the repository with:
    bzr branch http://bazaar.launchpad.net/~dcplusplus-team/dcplusplus/trunk dcplusplus
    This will check out the entire repository into the dcplusplus folder.
    First difference from Subversion: the initial checkout takes much longer, because all the revisions are downloaded (as diffs between them, anyway), so after the checkout you don’t need an internet connection to update to any older revision or to see the diffs. It took about 5-8 minutes to complete here, so be patient.
    Don’t worry about the space this whole thing takes; even though it branches all the revisions, I heard it still uses less space than the svn checkout ( hi arne =).

    If you had commit access to the repository or you are in the dcplusplus-team, here is how to gain ssh access to the repository.
    If you use Linux, you need an ssh tool that can create a key and use it to log in (standard ssh should work, I think…)
    For Windows, I used PuTTY. You can get it from here. You need PuTTY.exe, Pageant.exe, Plink.exe and PuTTYgen.exe.

    Start up PuTTYgen.exe and generate new key (Notice the nice randomness generation =).
    Save both the public and private keys.
    Go to the launchpad site and, in your profile, go to Add SSH key and paste the public key information from PuTTYgen (the string that PuTTYgen shows in the box labelled “Public key for pasting into authorized files… ”).

    Now open up PuTTY.exe, and connect to bazaar.launchpad.net using SSH. You will be asked to add the server’s fingerprint to PuTTY’s cache. Pick “Yes” so that the fingerprint is added permanently. (This is required so that you can connect using plink; otherwise it can’t connect because it doesn’t recognize the fingerprint as “safe”.)

    Open up pageant.exe and load your private key into it ( the one generated with puttygen ).

    Now, add a new variable to your system ( Right click my computer, advanced, environment variables ) named BZR_SSH and set the value to plink.

    Open up console and try
    bzr branch bzr+ssh://<name>@bazaar.launchpad.net/~dcplusplus-team/dcplusplus/trunk dcplusplus
    Where <name> is your launchpad nickname.
    There you go.

    What arne suggests about committing: “then in the dcplusplus folder “bzr reconfigure --checkout”. That’ll configure bzr to work just like svn, committing each revision you make to the main repository when you do “bzr commit”. Alternatively, you don’t do the reconfigure thing, and all your “bzr commit” commits will be local until you do a bzr push”

    I’ve been reluctant about this whole Bazaar thing since I don’t see any real advantage in it over svn. Maybe time will tell.
    I’m also waiting for feedback or help requests if my post wasn’t explanatory enough.