What do HashIndex.xml and HashData.dat do?

These two files contain the hashes for your shared files and your queued files.

HashIndex.xml is divided into two sections. The first section contains the root hash of every shared and queued file, each file’s TTH block size and the file’s full size. The second section contains the root hash of every shared and queued file, the files’ names and a timestamp.
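Schematically, the two sections look something like the sketch below. The attribute names follow a typical DC++ HashIndex.xml, but treat the exact names and values as illustrative placeholders rather than a format specification:

```xml
<!-- Illustrative sketch of HashIndex.xml; values are placeholders. -->
<HashStore Version="2">
	<!-- Section one: root hash, TTH block size and full size per file -->
	<Trees>
		<Hash Type="TTH" Index="8" BlockSize="65536" Size="1048576" Root="..."/>
	</Trees>
	<!-- Section two: root hash, file name and timestamp per file -->
	<Files>
		<File Name="C:\Share\example.bin" TimeStamp="..." Root="..."/>
	</Files>
</HashStore>
```

The `Root` attribute ties the two sections together, and `Index` points into HashData.dat, as described below.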

HashData.dat contains the entire hash tree of every shared and queued file. A comment in HashManager.cpp describes its format:

* The data file is very simple in its format. The first 8 bytes
* are filled with an int64_t (little endian) of the next write position
* in the file counting from the start (so that file can be grown in chunks).
* We start with a 1 mb file, and then grow it as needed to avoid fragmentation.
* To find data inside the file, use the corresponding index file.
* Since file is never deleted, space will eventually be wasted, so a rebuild
* should occasionally be done.

Where the ‘index file’ is, of course, HashIndex.xml.
Rebuild by using the /rebuild chat command. A rebuild flushes out old hashes for files that are no longer queued or shared. Note: if you do /rebuild, make sure everything you don’t want re-hashed is still shared.

4 Responses to What do HashIndex.xml and HashData.dat do?

  1. Ullner says:

    From the old blog:
    #####
    “How big will the hash get on 100,000 files = 500 gig?”
I have no exact figures, but I currently share around 300 GB, which is around 4200 files. My HashData.dat is 80 MB and HashIndex.xml is 7.5 MB.
    It isn’t so much the size of the files that will increase HashData.dat and HashIndex.xml, it is the amount of files. Having 100,000 files will definitely make the two files large.

Remember that it isn’t solely the files you share that are in HashIndex.xml/HashData.dat; the files you have queued are there too.
Posted by Fredrik Ullner – 2006-01-30 21:27:12
I like the hash speed, and booting is wonderful with 400 gig.
I run DC++ in its own 10 gig hard partition formatted FAT32, based on Microsoft tech notes of faster writing compared to XP-NTFS, which does a lot of drive write back-checking. I maintain safety files on a different drive. Defrag of 10 gig is very easy; everything seems 5% faster. I don’t do any defragmentation of the Windows drive or data. Norton Utilities would show me thousands of deleted fragments in the recycle bin each day on older versions. How big will the hash get on 100,000 files = 500 gig? I may have to bump the drive to 20 gig.
I can send the Microsoft tech notes if you send an email.
    Thanks
    Jim
Posted by jim sherwood – 2006-01-30 14:01:56

  2. applegrew says:

I have some doubts about HashData.dat and HashIndex.xml that I hope you can help clear up.
First, HashData.dat actually stores the bytes of the hashes of the blocks of a Tiger tree. That is, a file X is divided into blocks (of the size given by the XML file’s BLOCK property); from these blocks the tree is formed, eventually giving the root hash. The root hash is stored in the XML file, but the leaves and the rest of the nodes are stored in the .dat file. The node hashes in the .dat file are laid out breadth-first to map the tree into serialized form. The starting offset of this serialized tree is given in the XML file by the INDEX property.

Am I correct up to this point? Now I have some questions I couldn’t find the answers to. How do you know up to which point in the .dat file you need to read in order to read out a file’s full tree? I presume it is calculated using the BLOCK size and the SIZE property; if so, that’s fine. But how is the block size decided? I have seen that this value takes certain standard values, yet it is not a single constant. How is it calculated, then? Will you please clear my doubts?
    The XML file’s index property gives the byte offset of the starting

  3. applegrew says:

Oops, the last line of my previous post was unintended. My post ended at

    “……….Will you please clear my doubts?”

  4. Pingback: The case of a missing tree « DC++: Just These Guys, Ya Know?
