Bloom filters

Currently, to ensure that a search query reaches all clients that might have files in their share matching it, DC hubs must broadcast that search query to all connected clients. This works, but scales poorly bandwidthwise with the number of distinct clients in a hub. DC++ 0.704 adds support for Bloom filters to ameliorate this bandwidth issue.

A Bloom filter, as used by a DC client or hub, concisely summarizes the contents of a share by constructing a bit vector which allows for probabalistic queries as to the presence of a file with a specified hash in that share. A motivating advantage of such a construction that whereas even filelists which contain only hashes will grow linearly with the number of files in the share. Further, such a fully accurate representation should not be compressible, because cryptographic hashes aim for each output bit to be independently 50% likely enabled and thus maximize apparent entropy. Instead, Bloom filters sacrifice losslessness in favour of an output that has a small size which need not grow with that of the share.

However, to remain useful in this filesharing context, such lossiness must remain limited. In particular, a Bloom filter produces false positives but not false negatives: it can state that an object exists when it does not, but it will never state conversely that an object does not exist if it does. A hub, given a Bloom filter representation of a share can thus safely refrain from sending a hash search query to a connected client the Bloom filter of the share of which has no record of a hash. This can save considerable bandwidth and allow a combination of more clients within the same bandwidth or a similar number of clients using less bandwidth.

A Bloom filter is determined by the hash keys in use and the table size. The smaller the ratio between the number of objects in the table and size of the table, the lower the false positive rate. Further, an optimal table does not utilize just one key but multiple keys simultaneously, such that the presence of an object in the table implies that the bits corresponding to each keys as applied to the input object are all enabled. Thus, when inserting an object for which hash functions used as keys map to bits 58, 1093, and 9865, then those bits are all enabled. Then to check whether that object might exist in the table, then one checks whether any one of bits 58, 1093, and 9865 are enabled. If all of them are, then it is possible that the object is in the table. However, if any of them are not, then the object is not in the table.

Bloom filters, then, allow for a scalable, efficiently constructed, efficiently queried, and efficiently communicated summary of the contents of a share which allows for hubs to spend less bandwidth wastefully broadcasting search queries.

Don’t forget that you can make topic suggestions for blog posts in our “Blog Topic Suggestion Box!”

3 Responses to Bloom filters

  1. pietry says:

    If the hub keeps a map of all shares inside the hub, is it possible for the hub to search inside its map and forward the search query only to the clients in question ( ones having the files ) ? Or even not forward at all, just send the results. The clients could query the hub instead of broadcasting the search ( HSCH … ) then the hub sends the results ( IRES .. ) and if the client searching wants to get that file it can initiate some connection ( DCTM.. )

  2. cologic says:

    Yes, it’s possible for the hub to forward search queries only to clients which plausibly have a file, but no, this representation is far too sparse (somewhat intentionally, due to other concerns) for the hub to return search results on its own. To get an idea how that might work, look at the centralized form of eDonkey/eMule – avoiding broadcast messages and queries appears to be how they support 100k+ clients per server; to do so the servers function exactly as you describe, but that results in limits on share size to, commonly, a few thousand files per client (look up hard and soft limits).

    The Bloom filter ADC extension provides the hub with only that bit vector and therefore not the means to respond to queries on its own but merely to prune their routing back to other connected clients.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: