SSE3 in DC++

The next DC++ release will require SSE3. Steam’s hardware survey currently lists SSE3 as having 99.96% penetration. All AMD and Intel x86 CPUs since the Athlon 64 X2 in 2005 and Intel Core in January 2006 have supported SSE3. Even earlier, though, all Pentium 4 steppings since Prescott which support the NX bit required by Windows 8 and 10 also support SSE3, which extends the effective Intel support back to 2004. I can’t find an Intel CPU which supports NX (required for Win8/10) but not SSE3. Finally, this effectively affects only 32-bit builds, since 64-bit builds exclusively use SSE for floating-point arithmetic.

This effects two basic transformations, one minor and one major, depending on how well the existing code compiles. The minor improvement derives from functions such as bool SettingsDialog::handleClosing() using one instruction rather than two, from

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    cvttss2si eax,DWORD PTR [esp+0x18] ;; eax is just a temporary
    mov    DWORD PTR [edx+0x87c],eax   ;; which is promptly stored to mem

to

bool SettingsDialog::handleClosing() {
	dwt::Point pt = getWindowSize();
	SettingsManager::getInstance()->set(SettingsManager::SETTINGS_WIDTH,
    fisttp DWORD PTR [edx+0x87c]      ;; no byway through eax (also, less register pressure)

However, sometimes cvttss2si and related SSE/SSE2 instructions don’t fit as well, so g++ had been relying on fistp. These instances previously produced terrible code generation; without SSE3, using only instructions up through SSE2, part of void SearchFrame::runSearch() compiles to:

	auto llsize = static_cast<int64_t>(lsize);
    fnstcw WORD PTR [ebp-0x50e]     ;; save FP control word to mem
    movzx  eax,WORD PTR [ebp-0x50e] ;; zero-extend-move it to eax
    mov    ah,0xc                   ;; build new control word
    mov    WORD PTR [ebp-0x510],ax  ;; place control word in mem for fldcw
    fld    QWORD PTR [ebp-0x520]    ;; load lsize from mem (same as below)
    fldcw  WORD PTR [ebp-0x510]     ;; load new control word
    fistp  QWORD PTR [ebp-0x548]    ;; with correct control word, round lsize
    fldcw  WORD PTR [ebp-0x50e]     ;; restore previous control word

The six control-word instructions (fnstcw, movzx, both mov, and both fldcw; highlighted in red in the original post) merely scaffold around the actual fistp doing the floating-point-to-integer conversion, and they can cost 80 cycles or more for this single innocuous-looking line of code. By contrast, using fisttp from SSE3, that same fragment collapses to:

	auto llsize = static_cast<int64_t>(lsize);
    fld    QWORD PTR [ebp-0x520]    ;; same as above; load lsize
    fisttp QWORD PTR [ebp-0x548]    ;; convert it. simple.
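
The control word dance exists because C++ requires a floating-point-to-integer cast to truncate toward zero, while the x87 unit rounds to nearest by default; fisttp always truncates, so it needs no mode switch. A minimal sketch of the two behaviours in portable C++ follows. The helper names are made up for illustration and are not DC++ functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Truncating conversion: what static_cast must do according to the C++
// standard, and what fisttp does regardless of the FP control word.
inline std::int64_t truncateToInt64(double value) {
    // NaN and infinities are rejected here; finite values outside the
    // int64_t range remain the caller's responsibility.
    assert(std::isfinite(value));
    return static_cast<std::int64_t>(value);
}

// Rounding conversion: follows the current rounding mode (round-to-nearest
// by default), which is what a bare fistp does when the control word is
// left untouched.
inline std::int64_t roundToInt64(double value) {
    assert(std::isfinite(value));
    return std::llrint(value);
}
```

With the default rounding mode, truncateToInt64(2.7) yields 2 while roundToInt64(2.7) yields 3, which is why the compiler cannot simply emit a lone fistp for a cast.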

This pattern recurs many times through DC++, including in void AdcHub::handle(AdcCommand::GET, a portion of which halves in size and becomes dramatically faster, going from

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x1c],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x18]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fnstcw WORD PTR [esp+0x22]     ;; same control word dance as before
    movzx  eax,WORD PTR [esp+0x22]
    mov    ah,0xc                  ;; same control word
    mov    WORD PTR [esp+0x20],ax  ;; but fldcw loads from mem not reg
    fldcw  WORD PTR [esp+0x20]     ;; load C and C++-compatible rounding mode
    fistp  QWORD PTR [esp+0x58]    ;; the actual conversion
    fldcw  WORD PTR [esp+0x22]     ;; restore previous
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]

to, using the fisttp SSE3 instruction,

		// Ideal size for m is n * k / ln(2), but we allow some slack
		// When h >= 32, m can't go above 2^h anyway since it's stored in a size_t.
		if(m > (5 * Util::roundUp((int64_t)(n * k / log(2.)), (int64_t)64)) || (h < 32 && m > static_cast<size_t>(1U << h))) {
    mov    DWORD PTR [esp+0x20],edi
    xor    ecx,ecx
    imul   eax,DWORD PTR [esp+0x1c]
    movd   xmm0,eax
    movq   QWORD PTR [esp+0x58],xmm0
    fild   QWORD PTR [esp+0x58]
    fdiv   QWORD PTR ds:0xca8
    fisttp QWORD PTR [esp+0x58]    ;; replaces all seven lines from fnstcw through the final fldcw
    mov    eax,DWORD PTR [esp+0x58]
    mov    edx,DWORD PTR [esp+0x5c]

This specific control word save/convert float/control word restore pattern recurs 19 other times across the current codebase in the dcpp, dwt, and win32 directories, including DownloadManager::getRunningAverage(); HashBloom::get_m(size_t n, size_t k); QueueItem::getDownloadedBytes(); Transfer::getParams(); UploadManager::getRunningAverage(); Grid::calcSizes(…); HashProgressDlg::updateStats(); TransferView::on(HttpManagerListener::Updated, …); and TransferView::onTransferTick(…).
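
For reference, the arithmetic behind the AdcHub listing can be sketched in isolation. This is not the DC++ source: roundUp here is an assumed stand-in for Util::roundUp (taken to round up to the next multiple of the block size), and idealBloomBits is a hypothetical name. The cast from double to int64_t in the middle is exactly the conversion that compiles to either the fistp sequence or a single fisttp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Assumed behaviour of Util::roundUp: round size up to the next multiple
// of blockSize.
inline std::int64_t roundUp(std::int64_t size, std::int64_t blockSize) {
    assert(size >= 0 && blockSize > 0);
    return ((size + blockSize - 1) / blockSize) * blockSize;
}

// Ideal Bloom filter size in bits for n items and k hash functions,
// n * k / ln(2), truncated to an integer and rounded up to a multiple of 64.
inline std::int64_t idealBloomBits(std::size_t n, std::size_t k) {
    const double ideal =
        static_cast<double>(n) * static_cast<double>(k) / std::log(2.0);
    // This double -> int64_t cast is the float-to-int conversion in question.
    return roundUp(static_cast<std::int64_t>(ideal), 64);
}
```

For n = 1000 and k = 4 the ideal size is about 5770.78 bits, which truncates to 5770 and rounds up to 5824.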

Know your FPU: Fixing Floating Fast provides microbenchmarks showing just how slow this fistp-based technique can be due to the fnstcw/fldcw 80+-cycle FPU pipeline flush and therefore how much faster code which replaces it can become:

Fixed tests...
Testing ANSI fixed() ... Time = 2974.57 ms
Testing fistp fixed()... Time = 3100.84 ms
Testing Sree fixed() ... Time =  606.80 ms

SSE3 provides not simply some hidden code generation aesthetic quality improvement, but a speed increase across much of DC++.

Splitting IDENTIFY to support multiple share profiles in ADC

ADC does not order the IDENTIFY and NORMAL states precisely enough for ADC clients to properly support multiple share profiles. Several client software-independent observations imply this protocol deficiency:

  • ADC clients define download queue sources by CID, such that if a sharing client presents multiple shares it must be through different CIDs, barring backwards-incompatible and queue-crippling requirements to only connect to a source via the hub through which it was queued.
  • A multiply-sharing ADC client in the server role must know the CTM token associated with a client-client connection to determine unambiguously which shares to present and therefore which CID to present to the non-server client.
  • ADC’s SUP specification, as illustrated by the example client-client connection, states that when “the server receives this message in a client-client connection in the PROTOCOL state, it should reply in kind, send an INF about itself, and move to the IDENTIFY state”; this implies the server client sending its CINF before the non-server client sends the CTM token in the TO field with its CINF.
  • Either the server or the non-server client may be the downloader, and vice versa. As such, by the time both the server and non-server clients in a client-client connection send their CINF commands, they must know with which CTM token to associate the connection, since either may be a multiply-sharing client about to upload files.
  • The non-server client can unambiguously track which client-client connections it should associate with each CTM token by locally associating that token with each outbound client-client connection it creates, an association a server client listening for inbound connections cannot reliably create until the non-server client sends it a CINF with a token field.

Together, these ADC properties show that a server client which uploads using multiple share profiles must know which CID to send, but must do so before it has enough information to determine via the CTM token the correct share profile and thus the correct CID. Such a putatively multiply-sharing ADC client cannot, therefore, remain consistent with all of the listed constraints.

Most constraints prove impractical or undesirable to change, but by clarifying the SUP specification and IDENTIFY states, one can fix this ADC oversight while remaining compatible with DC++ and ncdc, with jucy apparently requiring adjustment. In particular, I propose to:

  1. Modify SUP and INF to require that the non-server client, rather than the server client, send the first INF; and
  2. in order to do so, split the IDENTIFY state into SERVER-IDENTIFY and CLIENT-IDENTIFY, whereby
  3. the next state after SUP in a client-client connection is CLIENT-IDENTIFY, which transitions to SERVER-IDENTIFY, which finally transitions as now to NORMAL

This effectively splits the IDENTIFY state into CLIENT-IDENTIFY and SERVER-IDENTIFY to ensure that they send their CINF commands in an order consistent with the requirement that both clients know the CTM token when they send their CINF command, finally allowing ADC to reliably support multiple share profiles.
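
The proposed ordering can be sketched as a small state machine. The type and function names below are illustrative assumptions only; ADC itself names just the PROTOCOL, IDENTIFY, and NORMAL states here, and this is not code from any existing client.

```cpp
#include <cassert>

// States of a client-client connection under the proposal.
enum class State { Protocol, ClientIdentify, ServerIdentify, Normal };

// Events that move the connection along (hypothetical names).
enum class Event { SupExchanged, ClientInf, ServerInf };

// Returns the state after an event. An event that arrives out of order
// leaves the state unchanged.
inline State advance(State state, Event event) {
    switch (state) {
    case State::Protocol:
        return event == Event::SupExchanged ? State::ClientIdentify : state;
    case State::ClientIdentify:
        // The non-server client must send its CINF (with the token) first.
        return event == Event::ClientInf ? State::ServerIdentify : state;
    case State::ServerIdentify:
        // Only now does the server client send its CINF.
        return event == Event::ServerInf ? State::Normal : state;
    case State::Normal:
        return state;
    }
    assert(false && "unreachable: all states are handled above");
    return state;
}
```

A server CINF arriving while the connection is still in CLIENT-IDENTIFY does not advance it, which is the ordering guarantee the split is meant to provide.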

Such a change appears compatible with both DC++ and ncdc, because both simply respond to CSUP with CINF immediately, regardless of what its partner in a client-client connection does. The only change required in DC++ and ncdc is for the server client to wait for the non-server client to send its CINF before sending a reply CINF rather than replying immediately to the non-server client’s CSUP.
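
As a rough model of the server-side change, the sketch below defers the reply CINF until the token is known. The class, its methods, and the use of an empty string to mean "send nothing yet" are all assumptions made for illustration; they do not mirror DC++ or ncdc internals.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of a multiply-sharing server client.
class ServerClient {
public:
    // Remember which CID (that is, which share profile) belongs to a CTM
    // token that was exchanged through some hub.
    void registerToken(const std::string& token, const std::string& cid) {
        assert(!token.empty() && !cid.empty());
        cidByToken_[token] = cid;
    }

    // On CSUP: under the proposal the server client replies with CSUP only.
    // Returns the CID to put in a CINF, which is none (empty) at this point.
    std::string onSup() const { return std::string(); }

    // On the non-server client's CINF carrying the token: the CID to send
    // is now unambiguous. Returns empty if the token is unknown.
    std::string onClientInf(const std::string& token) const {
        const auto it = cidByToken_.find(token);
        return it == cidByToken_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> cidByToken_;
};
```

The point of the sketch is only that the CID lookup needs the token as input, so it cannot happen in the CSUP handler.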

jucy would need adjustment because it currently triggers a non-server client’s CINF, containing the CTM token, only in response to the server client’s pre-token CINF. A server client which waits for a jucy-based non-server client to send the first CINF will wait indefinitely.

Thus, by simply requiring that the non-server client in a client-client connection sends its CINF first, in a manner already compatible with DC++-based clients and ncdc and almost compatible with jucy, ADC-based clients can finally provide reliable multiple share profiles.

Software and code for TTH generation and validation

There are multiple sites dedicated to providing information about the generation and validation of TTHs. There are implementations written in many languages, so you should see this list as a small selection of everything that is out there.

  • The TigerTree Hash Code project at SourceForge intends to provide implementations for multiple languages.
  • ThexCS – TTH (tiger tree hash) maker in C# at CodeProject intends to provide a UI for simple TTH generation.
  • tthsum is probably the most widely known stand-alone application that can generate TTHs. The TTH generation was the one from DC++, but later changed to the original Tiger authors’ implementation. tthsum is in most Linux distributions.
  • Obviously, DC clients can generate TTHs…

    Don’t forget that you can make topic suggestions for blog posts in our “Blog Topic Suggestion Box!”

    The GPL is not a EULA

    GPL software therefore gains nothing by prompting the user to agree to or disagree with the GPL, and DC++ will stop doing so. This holds both for the GPLv2 and GPLv3:

    Some software packaging systems have a place which requires you to click through or otherwise indicate assent to the terms of the GPL. This is neither required nor forbidden. With or without a click through, the GPL’s rules remain the same.

    Merely agreeing to the GPL doesn’t place any obligations on you. You are not required to agree to anything to merely use software which is licensed under the GPL. You only have obligations if you modify or distribute the software. If it really bothers you to click through the GPL, nothing stops you from hacking the GPLed software to bypass this.

    The upcoming version of DC++ therefore does not ask the user to assent or otherwise to the GPL during installation.

    Interestingly, several other top-ranking GPL-using SourceForge projects, about half of the tested sample, equally uselessly also require Windows users to agree to the GPL before allowing installation:

    Copyright notice for Tiger implementations

    Jacek and I began discussing the Tiger implementation in DC++ a while back. The implementation is a C++ port of the original C code available on the official Tiger website; http://www.cs.technion.ac.il/~biham/Reports/Tiger/

    What was noticeable about the C code was that there was no license attached; this means that the implementation falls under default copyright law (in this case, US law). As such, any type of modification or derivative (which the C++ implementation might be considered to be) is most likely not directly allowed.

    I sent an e-mail to the authors to rectify the situation and make sure there is sound lawful ground for DC++, other derivatives and users of the C++ code.

    The following is what I sent to Eli Biham, one of the authors;

    Under what license is your C implementation of the Tiger (http://www.cs.technion.ac.il/~biham/Reports/Tiger/) algorithm? The source code doesn’t state any license explicitly, nor does the Tiger main page. As it stands now, it is not possible to use your implementation, as is, in an application without getting explicit permission from you.

    Biham responded;

    Dear Fredrik,

    I hereby allow you to use it, provided it will compute Tiger, and state
    the names of the authors in it.

    Clearly, the usual disclaimers hold, e.g., that it’s use will be legal,
    and that it will not be exported to countries banned by law, that the
    authors will not be responsible for the code, your software, nor
    anything else.

    Regards,

    Eli

    In an effort to create a cleaner phrasing that would suit source code, I rephrased and added the following to the DC++ source (in TigerHash.cpp/h, in DC++ 0.780);

    /*
    * The Tiger algorithm was written by Eli Biham and Ross Anderson and
    * is available on the official Tiger algorithm page .
    * The below Tiger implementation is a C++ version of their original C code.
    * Permission was granted by Eli Biham to use with the following conditions;
    * a) This note must be retained.
    * b) The algorithm must correctly compute Tiger.
    * c) The algorithm’s use must be legal.
    * d) The algorithm may not be exported to countries banned by law.
    * e) The authors of the C code are not responsible of this use of the code,
    * the software or anything else.
    */

    If you are using the C implementation or a derivative (including DC++’s implementation), you must include a similar notice.

    Feel free to use my phrasing or write your own (adhering to Biham’s restrictions).


    Translation update

    There’s a new release coming in a few days and the translation templates on Launchpad have been updated – if you want your language to be complete please have a look…

    In the future I’ll only post these updates to http://sourceforge.net/mailarchive/forum.php?forum_name=dcplusplus-devel, so please subscribe if you’re interested in keeping your translation up to date…it’s a low-traffic list, so you don’t have to worry…

    DC++ at the local Bazaar

    The latest news in DC++ development is the move of the repository from Subversion to Bazaar.

    How does this affect the normal DC++ user? The answer is: not at all. It only affects the more advanced users who compile their own binaries from the latest source code, or the people who want to contribute patches.

    Bazaar is a new version control system, similar to CVS and Subversion, that presumably has advantages over the other two. You can still use the old svn repository on SourceForge, because it and the new Bazaar repository are synchronized automatically (with a delay of a few days at most); the sole difference is that commits now go to the Bazaar repository first and then appear in the svn repository.

    To access the Bazaar repository you need the Bazaar client, available from here. To install bzr you also need at least Python 2.4.
    Once you have it installed and in your path, you can check out the repository with:
    bzr branch http://bazaar.launchpad.net/~dcplusplus-team/dcplusplus/trunk dcplusplus
    This will check out the entire repository into the dcplusplus folder.
    The first difference from Subversion: the initial checkout takes much longer, because all the revisions (or rather the diffs between them) are downloaded, so afterwards you don’t need an internet connection to update to any older revision or to see the diffs. It took about 5-8 minutes to complete here, so be patient.
    Don’t worry about the space all of this takes; even though it branches all the revisions, I heard it still uses less space than the svn checkout (hi arne =).

    If you had commit access to the repository or you are in the dcplusplus-team, here is how to gain SSH access to the repository.
    If you use Linux, you need an SSH tool that can create a key and use it to log in (standard ssh should work, I think…)
    For Windows, I used PuTTY. You can get it from here. You need PuTTY.exe, Pageant.exe, Plink.exe and PuTTYgen.exe.

    Start up PuTTYgen.exe and generate a new key (notice the nice randomness generation =).
    Save both the public and private keys.
    Go to the Launchpad site and, in your profile, go to Add SSH key and paste the public key information from PuTTYgen (the string that PuTTYgen shows in the box labelled “Public key for pasting into authorized files…”).

    Now open up PuTTY.exe and connect to bazaar.launchpad.net using SSH. You will be asked to add the server’s fingerprint to PuTTY’s cache. Pick “Yes” so that the fingerprint is added permanently. (This is required so that you can connect using plink; otherwise it can’t connect because it doesn’t recognize the fingerprint as “safe”.)

    Open up pageant.exe and load your private key into it (the one generated with PuTTYgen).

    Now, add a new environment variable to your system (right-click My Computer, Advanced, Environment Variables) named BZR_SSH and set its value to plink.

    Open up console and try
    bzr branch bzr+ssh://<name>@bazaar.launchpad.net/~dcplusplus-team/dcplusplus/trunk dcplusplus
    Where <name> is your launchpad nickname.
    There you go.

    What arne suggests about committing: “then in the dcplusplus folder “bzr reconfigure --checkout”. That’ll configure bzr to work just like svn, committing each revision you make to the main repository when you do “bzr commit”. Alternatively, you don’t do the reconfigure thing, and all your “bzr commit” commits will be local until you do a bzr push”

    I’ve been reluctant about this whole Bazaar move since I don’t see any real advantage in it over svn. Maybe time will tell.
    I’m also waiting for feedback or help requests if my post wasn’t explanatory enough.

    Coral network coming to a DC++ near you

    The upcoming version of DC++ will include more hub list addresses than it already does. However, while these additions are technically pointing to different addresses, they reference the same hub lists as the current set does.

    What DC++ is going to utilize is the Coral distributed network to download hub lists. It does that by appending “nyud.net” to the hub list host. Coral downloads the hub list and caches it in the distributed network, so when you use the hub list you are using the cached copy. This means that hub list owners will need less upload bandwidth to distribute their file.
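
    The host rewrite can be sketched as follows. This is a hypothetical helper, not the DC++ implementation; it assumes the suffix is joined to the host as ".nyud.net", that the URL uses plain http://, and that it carries no explicit port or user information. The example hosts in the usage note are made up.

```cpp
#include <cassert>
#include <string>

// Appends the Coral suffix to the host part of an http:// URL, leaving the
// path untouched.
inline std::string coralize(const std::string& url) {
    const std::string scheme = "http://";
    // This sketch only handles plain http:// URLs.
    assert(url.compare(0, scheme.size(), scheme) == 0);

    // The host ends at the first '/' after the scheme, or at end of string.
    const std::string::size_type hostEnd = url.find('/', scheme.size());
    const std::string host =
        hostEnd == std::string::npos
            ? url.substr(scheme.size())
            : url.substr(scheme.size(), hostEnd - scheme.size());
    const std::string rest =
        hostEnd == std::string::npos ? std::string() : url.substr(hostEnd);

    return scheme + host + ".nyud.net" + rest;
}
```

    With these assumptions, coralize("http://hublist.example.org/list.xml.bz2") gives "http://hublist.example.org.nyud.net/list.xml.bz2".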

    Coral will be able to catch newly updated material within 5 minutes. Beyond that, there’s an automatic expiry limit before the file will be discarded in the network, which is set to 12 hours.

    (I don’t think it’s possible to set the expiry limit if the lists are in .bz2. You might be able to pull it off in the XML file, but I don’t know if Coral will treat it properly. Use the HTTP directive HTTP-EQUIV="EXPIRES" to set the expiry limit.)

    (Yes, this update was intentionally timed after the 0.703 release.)

    Note that this is just a testing phase. If successful, the Coral:d lists will completely replace the others.


    Compiling DC++ in new format

    It’s been a while since the last release of DC++. What has been going on since? Well, a lot has changed, and this means that the compiling process has changed too.

    First of all, Microsoft’s Visual C++ is no longer required for the compile process. MinGW is used instead. MinGW stands for Minimalist GNU for Windows, and it is the gcc (and g++) compiler ported to Windows.

    To install MinGW, download the following packages from their site in tar.gz format (not the sources): w32api, binutils, mingw-runtime, gcc-core and gcc-g++. You need version 4.2+ of gcc (it still appears as a technology preview).

    There are two variants of gcc, dw2 and sjlj. It is possible to work with the sjlj variant, but I used dw2 and it worked fine. Untar all of those into a folder named, for example, MinGW, go to MinGW/bin and copy gcc-dw2.exe to gcc.exe and g++-dw2.exe to g++.exe. Now add MinGW/bin to your path.

    Install SCons and add it to your path (SCons requires Python, so you will need to install that first).

    Download HTML Help Workshop. Copy the include and library files to the respective directories in the htmlhelp folder. Make sure hhc.exe is in your PATH.

    Installing STLport was the main problem I faced while trying to compile. First, install Cygwin. After that, get STLport (version 5.1.3, the latest at this time). Unzip STLport to the stlport folder in the DC++ source. Now run Cygwin and browse to stlport/build/lib.

    Now, type:

    make -f gcc.mak

    After it’s finished, also type:

    make -f gcc.mak install

    make -f gcc.mak install-release-static
    make -f gcc.mak install-dbg-static
    make -f gcc.mak install-stldbg-static

    The last three are required so that DC++ will run stand-alone (with no required DLLs).

    Now, open a command prompt and run scons. The following options are available:

    “tools=mingw” – Use mingw for building (default)
    “tools=default” – Use msvc for building (yes, the option value is strange)
    “mode=debug” – Compile a debug build (default)
    “mode=release” – Compile an optimized release build

    I used scons tools=mingw mode=release

    If you get an error about UPnP or something like that, you need to get natupnp.h and paste it into the MinGW/include folder.

    If other reference errors show up, try running scons again.

    Don’t worry if lots of warnings appear (they don’t stop the compile, and they will be fixed in the near future).

    Don’t be scared if the .exe is a bit large (it contains redundant symbols). The exe will be optimised to a smaller size (I got an 88 MB exe, and after optimisation it should get below 8 MB).

    Cross-compiling DC++ on Linux

    Just some notes:

    • If you use a binary-based distribution, don’t use the default STLport – that’ll have been compiled for the native platform, not Win32. Compile your own copy from source, however you do it.
    • Your distribution might have a MinGW cross-compiler already packaged and available. Debian and Ubuntu do, and possibly RPM distributions as well (I don’t use any and thus can’t verify that). The Gentoo site is down, so I didn’t check.
    • The default MinGW filename prefix in build_util.py is i386-mingw32. You might need to adjust that, either by setting the MINGW_PREFIX variable or just editing build_util.py. For example, the correct prefix on my development system is i586-mingw32msvc.
    • On at least the versions of the MinGW runtime packaged with the Linux distribution I use, commdlg.h lacks the OPENFILENAME_SIZE_VERSION_400 constant (it’s 76 decimal) and winuser.h XBUTTON1/XBUTTON2 (VK_* and MK_* aren’t the same). Get the corrected header files from the official MinGW download site if necessary; they’re as portable as necessary.
    • Certain older versions of SCons on MinGW are vulnerable to a bug which DC++’s build system triggers. Either update SCons or apply the patch available if you experience that bug.
    • Finally, regardless of platform, using SCons’s --implicit-deps-unchanged option can dramatically speed up compiles if not too much has changed.
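
    As an alternative to replacing the header files mentioned above, guarded fallback definitions are one possible stopgap. This is only a sketch, not something the DC++ build does: the value 76 comes from the note above, while the XBUTTON1/XBUTTON2 values are the ones the Windows SDK's winuser.h uses, and the guards make the block a no-op wherever the headers already define them.

```cpp
#include <cassert>

// Fallbacks for constants missing from some older MinGW runtime headers.
// Place after the Windows headers so that real definitions take precedence.
#ifndef OPENFILENAME_SIZE_VERSION_400
#define OPENFILENAME_SIZE_VERSION_400 76
#endif

#ifndef XBUTTON1
#define XBUTTON1 0x0001
#endif

#ifndef XBUTTON2
#define XBUTTON2 0x0002
#endif

// Sanity check used by this sketch only; it fires if a header supplies a
// different value for the structure size than the one noted above.
inline void checkFallbackConstants() {
    assert(OPENFILENAME_SIZE_VERSION_400 == 76);
    assert(XBUTTON1 != XBUTTON2);
}
```

    Updating the headers remains the cleaner fix; this only avoids touching the toolchain installation.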
