bzip2 remains optimal for filelists
July 5, 2011 Leave a comment
Several open-source file compressors are available. Currently, DC++ uses bzip2 with the default 900KiB block size, corresponding to bzip2 -9. Potential free alternatives include other bzip2 block sizes, zlib, xz, XMill, and FreeArc. I benchmarked the compression times and compression efficiencies of each alternative across a test set of 441 filelists; this ADCPortal thread has methodological details and test scripts.
Across the 9 zlib settings, 9 bzip2 settings, 10 XMill settings, 16 xz settings, and 21 FreeArc settings tested, many were both slower than and resulted in larger files than another than those of another compressor/setting pair. I started by filtering these out. Thus, all compressor/setting pairs charted have some advantage over all other such compressor/setting pairs.
Because the ratios are relative to the status quo (bzip2 -9 at (1,1) in the scatter plot), all compressor/setting pairs to the left of the points around (1,1) in the scatter plot create larger filelists more quickly than bzip2 -9. The few points to the right, mostly far to the right, are slower but result in slightly smaller files, up to 4% smaller. After filtering out the more space-efficient but 4 times slower (and and dependent on hundreds of MiB of RAM to compress) FreeArc settings on the high-end of the x axis, corresponding to FreeArc settings -5 and above, one is left with a more compact scatter plot:
This allows one to distinguish between all the other points more easily, but demonstrates that across the set of all tested filelists, the status quo at (1, 1) beats all others in compression efficiency. To the left lie mostly zlib and XMill variations which preprocess the XML filelist input then feed it to zlib, as well as a couple of low FreeArc settings. This distribution of compressor time and space efficiency holds true with minor adjustments when dividing filelists into quintiles by raw list size as well:
Those three points clustered around (4.4, 0.97) are, as before, high FreeArc settings. Though this was obscured by the previous charts, they only become advantageous at all with the largest quintile of filelists, those above 3.4MiB. For at least 80% of the 441 tested filelists, they provide negligible advantage of any sort. Because those large lists for which the 4x speed penalty would bite most are also the only filelists they would provide much size advantage to (e.g. 5% of a 5MiB-bzip2’ed list would save 256KiB, but even 5% of a 1MiB-bzip2’d list saves only 51KiB), the more CPU-intensive FreeArc settings are uselessly wasteful for DC filelists. Removing them creates a more readable scatter plot:
The bzip2 -9 status quo resides, as stated, at (1,1) by construction of ratios. At best, of the speed-wise feasible alternative compressors/settings, one might extract perhaps a 2-3% compressed list size reduction, at the cost of 20% slower compression; having to maintain two compressors indefinitely and thus in practice having to compress twice; and adopting nonstandard or abandoned (XMill), albeit still open-source, compressors.
bzip2 -9, the existing status quo, has proven a remarkably durable compromise between speed and compressed size and even given an opportunity to select a compression algorithm without regard for backwards compatibility, I would choose it again for a DC-like protocol. Combined with the high costs of transitioning to anything else, bzip2 -9 thus remains an the optimal filelist compression algorithm for DC.