Mixed-hash DC hubs

They work fine if clients and hubs support both TTH and its successor for an adequately long period.

While transitioning to a TTH successor, currently interoperable clients and hubs all supporting only TTH will diverge. In examining the consequences of such diversity, one can partition concerns into client-hub communication irrelevant to other clients; hub-mediated communication between two clients; and direct client-client communication. In each case, one can look at scenarios with complete, partial, and no supported hash function overlap. Complete overlap defines the all-TTH status quo and, clearly, works without complication for all forms of DC communication, so this post focuses on the remaining situations.

Almost as straightforwardly, ADC (but not NMDC) client-hub communication irrelevant to other clients requires partial, but not complete, hash function overlap, and only between each individual client/hub pair, so it does not create specific mixed-hash hub problems; absent such overlap, an ADC hub indicates STA error code 47. For ADC, this category consists of GPA, PAS, PID/CID negotiation (with length caveats as they relate to other clients interpreting the resulting CID), and the establishment of a session hash function; NMDC does not depend on hashing at all for analogous functionality. Thus, for NMDC, no problems occur here. ADC’s greater usage of hashing requires correspondingly more care.

Specifically, GPA and PAS require that SUP has established some shared hash function between the client logging in and the hub, but otherwise have no bearing on mixed-hash-function DC hubs. Deriving the CID from the PID involves the session hash algorithm, which as with GPA/PAS merely requires partial hash function support overlap between each separate client and a hub. Length concerns do exist here, but become relevant only with hub-mediated communication between two clients.

Indeed, clients communicating via a hub comprise the bulk of DC client-hub communication. Of these, INF, SCH, and RES directly involve hashed content or CIDs. SCH ($Search) allows one to search by TTH and would also allow one to search by TTH’s successor. Such searches can only return results from clients which support the hash in question, so as before, partial overlap between clients works adequately. However, to avoid incentivizing clients which support both TTH and its successor to broadcast both searches and double auto-search bandwidth, a combined search method containing both hashes might prove useful. Similarly, RES specifies that clients must provide the session hash of their file, but also “are encouraged to supply additional fields if available”, which might include non-session hash functions they happen to support, such that as with the first client-hub communication category, partial hash function support overlap between any pair of clients suffices, but no overlap does not.

A more subtle and ADC-specific issue arises via RES’s U-type message header and INF’s ID field, whereby ADC software commonly checks for exactly 39-byte CIDs. While clients need not support whatever specific hash algorithm produced a CID, the ADC specification requires that they support variable-length CIDs. Examples of other hash function output lengths which, minimally, should be supported include:

Bits   Bytes   Base32 characters   Supporting hashes
192    24      39                  Tiger
224    28      45                  Skein, Keccak, other SHA-3 finalists, SHA-2
256    32      52                  Skein, Keccak, other SHA-3 finalists, SHA-2
384    48      77                  Skein, Keccak, other SHA-3 finalists, SHA-2
512    64      103                 Skein, Keccak, other SHA-3 finalists, SHA-2
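
The Base32 lengths listed above follow from encoding five bits per character without padding. As a minimal sketch (the function name is illustrative and not taken from any ADC implementation), the lengths can be recomputed like this:

```python
def base32_length(bits: int) -> int:
    """Unpadded Base32 characters needed to encode a hash of `bits` bits."""
    # Each Base32 character carries 5 bits; round up for the final partial group.
    return -(-bits // 5)


for bits in (192, 224, 256, 384, 512):
    # Prints: hash size in bits, size in bytes, Base32 length
    print(bits, bits // 8, base32_length(bits))
```

A client that accepts any of these lengths, rather than hard-coding 39, is what the "variable-length CIDs" requirement asks for.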

Finally, direct client-client communication introduces CSUP ($Supports), GET/GFI/SND ($Get/$Send) via the TTH/ share root or its successor, and filelists, all of which work if and only if partial hash function support overlap exists. CSUP otherwise fails with error code 54, and some subset of hash roots and hash trees regarding some filelist must be mutually understood. So, as with the other cases, partial but not complete hash function support overlap between any given pair of clients is required.

Encouragingly, since client-hub communication irrelevant to other clients, hub-mediated communication between two clients, and direct client-client communication together cover all DC communication, partial hash function support overlap between any given pair of DC clients or servers suffices to ensure that all clients can interact with each other with full functionality. This results in a smooth, usable transition period for both NMDC and ADC so long as clients and hubs only drop TTH support once its successor becomes sufficiently ubiquitous. Further, relative to ADC, poy has observed that “all the hash function changes on NMDC is the file list (already a new, amendable format) and searches (an extension) so a protocol freeze shouldn’t matter there”, which makes the transition even easier in NMDC than in ADC.
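
To make the pairwise-overlap condition concrete, here is a small sketch that flags pairs of participants sharing no hash function. TIGR is the existing ADC feature name; "NEWH" and the participant names are placeholders invented for this example.

```python
from itertools import combinations


def pairs_without_overlap(supported):
    """Return the pairs of participants that share no hash function at all."""
    return [
        (a, b)
        for a, b in combinations(sorted(supported), 2)
        if not (supported[a] & supported[b])
    ]


# Hypothetical hub population during a transition period.
participants = {
    "hub": {"TIGR", "NEWH"},
    "old_client": {"TIGR"},
    "dual_client": {"TIGR", "NEWH"},
    "new_only_client": {"NEWH"},
}

print(pairs_without_overlap(participants))
# [('new_only_client', 'old_client')]
```

The only failing pair is the one where a client dropped TTH before the other side gained the successor, which is the situation the recommendations below try to avoid.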

In service of such an outcome, I suggest two parallel sets of recommendations, one whenever convenient and the other closer to a decision on a TTH replacement. More short-term:

  • Ensure ADC software obeys “Clients must be prepared to handle CIDs of varying lengths.”
  • Create an ADC mechanism by which clients supporting both TTH and its successor can search via both without doubling (broadcast) search traffic. Otherwise, malincentives propagate.
  • Ensure BLOM scales to multiple hash functions.
  • Update phrasing in ADC specification to clarify that all known hashes for a file should be included in RES, not just session hash.

As the choice of TTH’s successor approaches:

  • Disallow new hash function from being 192 bits to avoid ambiguity with Tiger or TTH hashes. I suggest 224 or 256-bit output; SHA-2 and all SHA-3 finalists (including Keccak and Skein) offer both sizes.
  • Pick either a single filelist with all supported hashes or multiple filelists, each of which only supports one hash. I favor the former; it especially helps during a transition period for even a client downloading via TTH’s successor to be able to autosearch and otherwise interact with clients which don’t yet support the new hash function, without needing to download an entire new filelist.
  • Barring a more dramatic break in Tiger than thus far seen, clients should retain TIGR support until the majority of ADC hubs and NMDC or ADC clients offer support for the successor hash function’s extension.

By doing so, clients supporting only TTH and clients supporting both TTH and the new hash function should be capable of interacting without problems, transparently to end-users, while over time creating a critical mass of clients supporting the new hash function such that eventually client and hub software might outright drop Tiger and TTH support.

A Decade of TTH: Its Selection and Uncertain Future

NMDC and ADC rely on the Tiger Tree Hash to identify files. DC requires a cryptographic hash function to avoid the previous morass of pervasive similar, but not identical, files. A bare cryptographic hash primitive such as SHA-1 did not suffice because not only did the files need identification as a whole but in separate parts, allowing reliable resuming and multi-source downloading, and per-segment integrity verification (RevConnect unsuccessfully attempted to reliably use multi-source downloading precisely because it could not rely on cryptographic hashes).

Looking for inspiration from other P2P software, I found that BitTorrent used (and uses) piecewise SHA-1 with per-torrent segment sizes. Since the DC share model asks that the same hash function work across entire shares, this does not work. eDonkey2000 and eMule, with per-user shares similar to those of DC, resolved this with fixed, 9MB piecewise MD4, but this segment size scaled poorly, ensured that fixing corruption demanded at least 9MB of retransmission, and used the weak and soon-broken MD4. Gnutella, though, had found an elegant, scalable solution in TTH.

This Tiger Tree hash, which I thus copied from Gnutella, scales to both large and small files while depending on what was at the time a secure-looking Tiger hash function. It smoothly, adaptively sizes a hash tree while retaining interoperability between all such sizes of files on a hub. By 2003, I had released BCDC++ which used TTH. However, the initial version of hash trees implemented by Gnutella and DC used the same hash primitive for leaf and internal tree nodes. This left it open to collisions, fixed by using different leaf and internal hash primitives. Both Gnutella and DC quickly adopted this fix and DC has followed this second version of THEX to specify TTH for the last decade.
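
As a rough illustration of the fixed tree construction, the sketch below builds a THEX-style root with distinct leaf (0x00) and internal (0x01) prefixes over 1024-byte blocks. It assumes SHA-256 as a stand-in primitive, since Tiger is not part of Python's standard hashlib, so the resulting values are not real TTH roots; the function names are my own.

```python
import hashlib

BLOCK_SIZE = 1024  # THEX leaf block size in bytes


def _h(data: bytes) -> bytes:
    # Stand-in primitive (assumption): SHA-256 instead of Tiger.
    return hashlib.sha256(data).digest()


def tree_root(data: bytes) -> bytes:
    """Compute a THEX-style hash tree root over `data`."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)] or [b""]
    # Leaf nodes are hashed with a 0x00 prefix.
    level = [_h(b"\x00" + block) for block in blocks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # Internal nodes are hashed with a 0x01 prefix over both children.
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0]


print(tree_root(b"x" * 3000).hex())
```

Because leaves and internal nodes are hashed under different prefixes, an internal node's input can no longer be passed off as leaf data, which is the collision the second THEX version closed.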

Though it has served DC well, TTH might soon need a replacement. The Tiger hash primitive underlying it by now lists as broken due to a combination of a practical 1-bit pseudocollision attack on all rounds, a similarly feasible full collision on all but 5 of its 24 rounds, and full, albeit theoretical, 24-round pre-images (“Advanced Meet-in-the-Middle Preimage Attacks”, 2010, Guo et al). If one can collide or find preimages of Tiger, one can also trivially collide or find preimages of TTH. We are therefore investigating alternative cryptographic hash primitives to which we might transition as Tiger looks increasingly insecure and collision-prone, focusing on SHA-2 and SHA-3.

ADC 1.0.2 released

A new version of the base ADC protocol is now released, version 1.0.2.

The document may look slightly different, especially with the addition of commands in the table of contents. The document itself (its content) is not that much modified (except for state management, see below).

An important part of the document is a new addition, a terminology section where difficult words or phrases are specified. This list is obviously meant to be much more than a mere four items but it’s at least a start.

The STA previously didn’t specify who had the responsibility for action when a STA is sent with the severity Fatal (2). This has always been the originator of the message, and this is now explicit.

The state management is re-worded and restructured. All information about state has now been moved to its own section, giving an implementer a quick and comprehensive overview of the state management requirements. Previously, the state management was sprinkled all across the document, making it difficult for a person to properly implement a state machine in their software. As a result, state management information is now removed from each command (the only thing remaining is an explicit note about the state in which each command is used). Certain information is also clarified, such as state transitions and what to call the parties in a client to client connection (“client party” and “server party”).

Version 1.0.1 of ADC was also ambiguous in state management when it came to one important part: who shall send the first INF in a client to client connection. This matters because the ambiguity makes multi-share difficult. The current specification is no longer ambiguous and takes the following stance: the first party to send the INF is the connecting party (“client party”). No known implementation suffers from this explicit note, as all manage this scenario just fine. Basically, this change means that multiple shares (per hub) may not be too far off.

The new version also brings in a new time where we can safely and appropriately update the base document. There was an announcement period when the document was going to be released which meant that developers have had time to adjust their software and give feedback in a timely manner.

Splitting IDENTIFY to support multiple share profiles in ADC

ADC orders the IDENTIFY and NORMAL states too imprecisely for ADC clients to properly support multiple share profiles. Several client software-independent observations imply this protocol deficiency:

  • ADC clients define download queue sources by CID, such that if a sharing client presents multiple shares it must be through different CIDs, barring backwards-incompatible and queue-crippling requirements to only connect to a source via the hub through which it was queued.
  • A multiply-sharing ADC client in the server role must know the CTM token associated with a client-client connection to determine unambiguously which shares to present and therefore which CID to present to the non-server client.
  • ADC’s SUP specification, as illustrated by the example client-client connection, states that when “the server receives this message in a client-client connection in the PROTOCOL state, it should reply in kind, send an INF about itself, and move to the IDENTIFY state”; this implies the server client sending its CINF before the non-server client sends the CTM token in the TO field with its CINF.
  • Either the server or non-server client may be the downloader and vice versa. As such, by the time both the server and non-server clients in a client-client connection send their CINF commands, they must know, since either may be a multiply-sharing client about to upload files, with which CTM token to associate the connection.
  • The non-server client can unambiguously track which client-client connections it should associate with each CTM token by locally associating that token with each outbound client-client connection it creates, an association a server client listening for inbound connections cannot reliably create until the non-server client sends it a CINF with a token field.

Together, these ADC properties show that a server client which uploads using multiple share profiles must know which CID to send, but must do so before it has enough information to determine via the CTM token the correct share profile and thus the correct CID. Such a putatively multiply-sharing ADC client cannot, therefore, remain consistent with all of the listed constraints.

Most constraints prove impractical or undesirable to change, but by clarifying the SUP specification and IDENTIFY states, one can fix this ADC oversight while remaining compatible with DC++ and ncdc, with jucy apparently requiring adjustment. In particular, I propose to:

  1. Modify SUP and INF to require that the non-server client, rather than the server client, send the first INF; and
  2. in order to do so, split the IDENTIFY state into SERVER-IDENTIFY and CLIENT-IDENTIFY, whereby
  3. the next state after SUP in a client-client connection is CLIENT-IDENTIFY, which transitions to SERVER-IDENTIFY, which finally transitions as now to NORMAL

This effectively splits the IDENTIFY state into CLIENT-IDENTIFY and SERVER-IDENTIFY to ensure that they send their CINF commands in an order consistent with the requirement that both clients know the CTM token when they send their CINF command, finally allowing ADC to reliably support multiple share profiles.
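
The proposed ordering can be sketched as a tiny state machine for the server party. Everything here is illustrative: the class, the token-to-CID mapping, and the simplified message strings are assumptions made for the example, not part of the ADC specification or any client.

```python
from enum import Enum


class State(Enum):
    PROTOCOL = "PROTOCOL"
    CLIENT_IDENTIFY = "CLIENT-IDENTIFY"
    SERVER_IDENTIFY = "SERVER-IDENTIFY"
    NORMAL = "NORMAL"


class ServerParty:
    """Server side of a client-client connection under the proposed ordering."""

    def __init__(self, cid_by_token):
        # Hypothetical mapping: CTM token -> CID of the share profile to present.
        self.cid_by_token = cid_by_token
        self.state = State.PROTOCOL
        self.sent = []

    def on_csup(self):
        if self.state is not State.PROTOCOL:
            raise RuntimeError("CSUP is only valid in PROTOCOL")
        self.sent.append("CSUP")  # reply in kind, but do NOT send CINF yet
        self.state = State.CLIENT_IDENTIFY

    def on_cinf(self, token):
        if self.state is not State.CLIENT_IDENTIFY:
            raise RuntimeError("CINF is only valid in CLIENT-IDENTIFY")
        cid = self.cid_by_token[token]  # the token is known, so the CID is unambiguous
        self.state = State.SERVER_IDENTIFY
        self.sent.append(f"CINF ID{cid}")
        self.state = State.NORMAL


server = ServerParty({"token-a": "CIDOFPROFILEA", "token-b": "CIDOFPROFILEB"})
server.on_csup()
server.on_cinf("token-b")
print(server.state.value, server.sent)
```

The point of the split is visible in `on_cinf`: the server party only chooses a CID after it has seen the token, which the current single IDENTIFY state does not guarantee.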

Such a change appears compatible with both DC++ and ncdc, because both simply respond to CSUP with CINF immediately, regardless of what their partner in a client-client connection does. The only change required in DC++ and ncdc is for the server client to wait for the non-server client to send its CINF before sending a reply CINF rather than replying immediately to the non-server client’s CSUP.

jucy would need adjustment because it currently only triggers a non-server client’s CINF, containing the CTM token, in response to the server client’s pre-token CINF. A server client which waits for a jucy-based non-server client to send the first CINF will wait indefinitely.

Thus, by simply requiring that the non-server client in a client-client connection sends its CINF first, in a manner already compatible with DC++-based clients and ncdc and almost compatible with jucy, ADC-based clients can finally provide reliable multiple share profiles.

ADC Recommendations

A while back (a really long time ago, it appears), I started the document ADC Recommendations. The intent is to create a document that can be reviewed for best-practices, common implementations and other useful information that need not be in the official specification(s).

Also, my intent was to have the document be updated more frequently (once done), so that it can quickly reference the latest software without the specifications needing a new version just to add guidance.

If you want to add more or revise the existing content, leave a comment below or go to the ADCPortal forum post.

Don’t forget that you can make topic suggestions for blog posts in our “Blog Topic Suggestion Box!”

Base32 encoding in ADC

The ADC Protocol specification is well-defined for the most part, but there is a lack of information on Base32 encoded strings. Hopefully this post helps clear up a few things relating to them.

The specification for the Base32 encoding is defined in RFC 4648. The method to use when converting bytes into a Base32 encoded string is to take the first 40 bits of data, divide them up into 8 groups of 5, then convert each group of 5 bits to its character representation (using the Base32 lookup table). This is to be done repeatedly until there is no more data left to encode. In the case where the number of total bits is not divisible by 40, there will be a shortage of bits in the final group of 40. In this case, the last group of 5 bits there is data for is to be padded with 0s (if needed) and the remainder of the 8 characters in the group are set to a padding character (‘=’).

The padding character can be excluded but only if the specification of the standard referring to the RFC explicitly states so. When the padding character is omitted, taking the input in 40 bit chunks becomes unnecessary. Simply take the input 5 bits at a time until the end of the data. If the number of bits in the input doesn’t divide by 5, pad the last set of 5 with 0s.

When converting a Base32 encoded string to raw bytes of data, generate a binary representation of the encoded string (using the Base32 lookup table), then take every 8 bits and store them in a byte. If the length of the Base32 encoded string multiplied by 5 doesn’t divide evenly by 8, the extra bits are discarded.

The only clue as to whether Base32 encoded strings should be padded or not is the line ‘base32_character ::= simple_alpha | [2-7]’. This (in my mind) does not explicitly state that padding should be omitted; it simply states what Base32 characters can be and leaves the interpretation up to developers. It’s a small leap, but enough that it could make people look through alternate sources for confirmation.

A quick recap:
When converting from raw bytes to Base32, pad the extra bits with 0s to make the final character and omit the padding character(s). When converting from Base32 to raw bytes, discard the extra bits.
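
The recap can be turned into a short sketch. This is an illustration of the procedure described in this post, not code taken from DC++ or any other client; the function names are made up for the example.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 Base32 lookup table


def b32_encode_unpadded(data: bytes) -> str:
    """Encode 5 bits at a time, zero-padding the last group, with no '=' padding."""
    bits, nbits, out = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 0x1F])
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append(ALPHABET[(bits << (5 - nbits)) & 0x1F])  # pad final group with 0s
    return "".join(out)


def b32_decode_unpadded(text: str) -> bytes:
    """Collect 8 bits at a time; leftover bits at the end are discarded."""
    bits, nbits, out = 0, 0, bytearray()
    for ch in text:
        bits = (bits << 5) | ALPHABET.index(ch)  # raises ValueError on invalid input
        nbits += 5
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    return bytes(out)


sample = bytes(range(24))  # 24 bytes, the size of a Tiger hash
encoded = b32_encode_unpadded(sample)
print(len(encoded), encoded == base64.b32encode(sample).decode().rstrip("="))
print(b32_decode_unpadded(encoded) == sample)
```

Comparing against Python's `base64.b32encode` with the trailing ‘=’ characters stripped is a convenient cross-check when writing an implementation from scratch.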

This information was learned through searching through the Base32 specification, DC++ source code, Googling, guesswork and trial-and-error. A few additional footnotes in the ADC protocol specification would go a long way for developers who choose to implement an ADC-compliant application from scratch, without using the DC++ core (which is developed by the author of the ADC protocol).

This post was written by pR0Ps, the author of NetChatLink.

If you have something you want to post, drop a note in the suggestion box or mail me.

Syntax diagram of ADC BNF

I went ahead and generated some syntax diagrams for the ADC BNF at http://www-cgi.uni-regensburg.de/~brf09510/syntax.html.

I used the W3C-BNF since that is what the ADC specification (almost) states its BNF in. The following is the input;

...

[1] message ::= message_body? eol
[2] message_body ::= (b_message_header | cih_message_header | de_message_header | f_message_header | u_message_header | message_header)
(separator positional_parameter)* (separator named_parameter)*
[3] b_message_header ::= 'B' command_name separator my_sid
[4] cih_message_header ::= ('C' | 'I' | 'H') command_name
[5] de_message_header ::= ('D' | 'E') command_name separator my_sid separator target_sid
[6] f_message_header ::= 'F' command_name separator my_sid separator (('+'|'-') feature_name)+
[7] u_message_header ::= 'U' command_name separator my_cid
[8] command_name ::= simple_alpha simple_alphanum simple_alphanum
[9] positional_parameter ::= parameter_value
[10] named_parameter ::= parameter_name parameter_value?
[11] parameter_name ::= simple_alpha simple_alphanum
[12] parameter_value ::= escaped_letter+
[13] target_sid ::= encoded_sid
[14] my_sid ::= encoded_sid
[15] encoded_sid ::= base32_character base32_character base32_character base32_character
[16] my_cid ::= encoded_cid
[17] encoded_cid ::= base32_character+
[18] base32_character ::= simple_alpha | [2-7]
[19] feature_name ::= simple_alpha simple_alphanum simple_alphanum simple_alphanum
[20] escaped_letter ::= [^ \#x0a] | escape 's' | escape 'n' | escape escape
[21] escape ::= '\\'
[22] simple_alpha ::= [A-Z]
[23] simple_alphanum ::= [A-Z0-9]
[24] eol ::= #x0a
[25] separator ::= ' '

...

(Note that the W3C-BNF doesn’t support ‘{3}’ etc so I had to extend those instances.)
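
To show how the grammar reads in practice, here is a rough parsing sketch that follows the productions above. It only handles the header types listed there, does not validate SIDs, CIDs or feature names, and uses function names of my own choosing.

```python
import re

# [3]-[8]: one type letter followed by a three-character command name.
COMMAND = re.compile(r"[BCIHDEFU][A-Z][A-Z0-9]{2}")
# Number of space-separated header fields that follow the command name.
HEADER_FIELDS = {"B": 1, "C": 0, "I": 0, "H": 0, "D": 2, "E": 2, "F": 2, "U": 1}
# [20]: escape sequences allowed inside parameter values.
ESCAPES = {"s": " ", "n": "\n", "\\": "\\"}


def unescape(value: str) -> str:
    # An unknown escape raises KeyError in this sketch.
    return re.sub(r"\\(.)", lambda m: ESCAPES[m.group(1)], value)


def parse_message(line: str):
    """Split one ADC message into (type, command, header fields, parameters)."""
    if not line.endswith("\n"):
        raise ValueError("message must end with eol")
    body = line[:-1]
    if not body:
        return None  # [1]: message_body is optional
    tokens = body.split(" ")
    if not COMMAND.fullmatch(tokens[0]):
        raise ValueError("unrecognised message header")
    kind, command = tokens[0][0], tokens[0][1:]
    count = HEADER_FIELDS[kind]
    if len(tokens) < 1 + count:
        raise ValueError("missing header fields")
    params = [unescape(token) for token in tokens[1 + count:]]
    return kind, command, tokens[1:1 + count], params


print(parse_message("BINF AAAB NIsome\\suser\n"))
```

Positional and named parameters are returned together here; telling them apart requires knowing the specific command, which the BNF alone does not encode.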

The following is the output;

I4/I6 should be broadcasted regardless of TCP4/TCP6

In ADC, it’s possible for clients to announce from which IP they’re connecting. This IP is usually used later to identify what address to connect to if they accept incoming TCP connections. That is, when you want to connect to someone, you announce ‘I4’ (for IPv4; ‘I6’ for IPv6) and send a “connect to me” message, or CTM. You also announce TCP4 or TCP6 in the feature field for others.

In ADC (in NMDC as well, really), there are two types of users; those who accept incoming TCP connections and those that do not. The former is usually called ‘active’ and the latter ‘passive’. Active users can connect to other active users. Active users can connect to passive users. Passive users can connect to active users (by a ‘reverse’ CTM message). Passive users can’t connect to other passive users (I’m going to purposefully ignore NAT traversal since it allows passive to passive connections, but it isn’t useful in this discussion).

The reverse CTM for passive to active connections works by a simple mechanism. The passive user says to the active user “RCM”, reverse connect to me. The active user will then send a CTM, the passive user will connect to the active user, and the downloading will commence.

Active clients are basically required to signal their address in the I4/I6 field (that is simply the nature of the field), whereas passive users are not required (but keep reading).

So far so good, nothing bad has happened.

Now, imagine the case where an active user wants to connect to another active user. They both signal I4 (I’ll use it for simplicity’s sake, but it applies to I6 as well). The downloader signals address 1.1.1.1, the uploader signals 2.2.2.2. The downloader sends to the uploader “connect to me”. The uploader connects from 2.2.2.2 to 1.1.1.1, and the communication continues. Everything’s all right for now.

Imagine instead that the downloader is a passive user. The passive user will say to the active user ‘send a connect to me so I can connect to you’. The active user will say ‘connect to 1.1.1.1’, which the passive user connects to. However, the connection can come from anywhere, and not necessarily from the IP the passive user connected to the hub with! That is because the active user has no knowledge of which IP the passive user should connect from.

(Now, obviously, there will be a token sent which the passive user will need to verify, but the problem of the connection point does not go away.)

To solve the problem, passive clients should publish I4/I6 regardless of whether they support TCP4/TCP6 or not. If you look at the specification, TCP4/TCP6 require I4/I6 but not the other way around, so this change (i.e. ‘everyone should send I4/I6’) should not have any effect on existing implementations.
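
As a sketch of what an active client could do once passive users publish I4, the check below compares the source address of an incoming connection with the address the peer announced. The function name and the way the published address is looked up are assumptions for illustration only.

```python
import ipaddress


def source_matches_published(published_i4, remote_ip):
    """Return True only if the connection comes from the address the peer announced."""
    if not published_i4:
        return False  # nothing was published, so nothing can be verified
    return ipaddress.ip_address(published_i4) == ipaddress.ip_address(remote_ip)


# The passive user announced 3.3.3.3 in its INF (I4 field).
print(source_matches_published("3.3.3.3", "3.3.3.3"))  # True
print(source_matches_published("3.3.3.3", "4.4.4.4"))  # False
print(source_matches_published(None, "3.3.3.3"))       # False
```

Such a check would complement, not replace, the token verification mentioned above.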

Do note that the hub can of course send I4/I6, regardless of whether the client sends it or not.

Propose your ideas

Hi people, just wanted to share relevant information regarding ADC development.

We recently updated our MediaWiki installation, and now that it’s updated I decided to revamp the proposal list so that Mr. Ullner gets a good tool when looking at extensions to include in the protocol.

Proposed Extensions

The idea is that everyone with a wiki account helps out, adding ideas and removing them once they are included or denied entry, and that we use the protocol idea forum on ADCPortal as the official place. If you want to add just the spec or a link to the spec, that’s fine, as long as it gets posted there so Ullner doesn’t have to chase the idea over the net.

Hope this will improve development and document who did what in the future :)

ADC Extensions – 1.0.6

A new version of ADC Extensions is out that brings much to ADC. Have a look at http://adc.sourceforge.net/versions/ADC-EXT-1.0.6.html.

The list;
  • Added KEYP extension for providing certificate substitution protection in ADCS.
  • Added note to signal DFAV.
  • Added SUDP extension for encryption of UDP traffic.
  • Added TYPE extension for chat state notifications.
  • Added FEED extension for RSS feeds.
  • Added SEGA extension for grouping of file extensions in SCH.
  • Added failover hub addresses to the hub’s INF.
  • Added free slots to the client’s INF.
  • Added ADCS extension for encryption in ADC.

Most of these extensions are ones people are already familiar with. I think one or two of them deserve their own posts, so a detailed description of each item will have to wait for another day.
