-rw-r--r-- | doc/internals.mdwn | 5 |
-rw-r--r-- | doc/internals/hashing.mdwn | 34 |
2 files changed, 38 insertions, 1 deletions
diff --git a/doc/internals.mdwn b/doc/internals.mdwn
index 8ca035510..de8167965 100644
--- a/doc/internals.mdwn
+++ b/doc/internals.mdwn
@@ -10,6 +10,7 @@ to the file content.
 First there are two levels of directories used for hashing, to prevent too many things ending up in any one directory.
+See [[hashing]] for details.
 Each subdirectory has the [[name_of_a_key|key_format]] in one of the [[key-value_backends|backends]]. The file inside also has the name of the key.
@@ -107,7 +108,9 @@ somewhere else.
 These log files record [[location_tracking]] information for file contents. Again these are placed in two levels of subdirectories
-for hashing. The name of the key is the filename, and the content
+for hashing. See [[hashing]] for details.
+
+The name of the key is the filename, and the content
 consists of a timestamp, either 1 (present) or 0 (not present), and the UUID of the repository that has or lacks the file content.
diff --git a/doc/internals/hashing.mdwn b/doc/internals/hashing.mdwn
new file mode 100644
index 000000000..3c1d86b0c
--- /dev/null
+++ b/doc/internals/hashing.mdwn
@@ -0,0 +1,34 @@
+In both the .git/annex directory and the git-annex branch, two levels of
+hash directories are used, to avoid issues with too many files in one
+directory.
+
+Two separate hash methods are used. One, the old hash format, is only used
+for non-bare git repositories. The other, the new hash format, is used for
+bare git repositories, the git-annex branch, and on special remotes as
+well.
+
+## new hash format
+
+This uses two directories, each with a three-letter name, such as "f87/4d5"
+
+The directory names come from the md5sum of the [[key|key_format]].
+
+Note that you cannot use the `md5sum` utility from coreutils to generate
+the same hash. Why it generates something else is unknown. The md5 hash
+libraries for programming languages will work though.
+
+For example:
+
+    python -c 'import hashlib, sys; print hashlib.md5(sys.argv[1]).hexdigest()'
+
+## old hash format
+
+This uses two directories, each with a two-letter name, such as "pX/1J"
+
+It takes the md5sum of the key, but rather than a string, represents it as 4
+32bit words. Only the first word is used. It is converted into a string by the
+same mechanism that would be used to encode a normal md5sum value into a
+string, but where that would normally encode the bits using the 16 characters
+0-9a-f, this instead uses the 32 characters "0123456789zqjxkmvwgpfZQJXKMVWGPF".
+The first 2 letters of the resulting string are the first directory, and the
+second 2 are the second directory.
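
Note on the new hash format: the one-liner in the patch is Python 2 syntax. The following is a minimal Python 3 sketch of how the two directory names might be derived, not a reference implementation. It assumes the names are the first three and the next three hex characters of the key's MD5 digest, and that the key is encoded as UTF-8; the patch only says the names "come from the md5sum" of the key. The function name `new_hash_dirs` and the sample key are hypothetical.

```python
import hashlib


def new_hash_dirs(key: str) -> str:
    """Return a two-level directory prefix such as "f87/4d5" for a key.

    Assumption (not stated in the patch): the two names are hex
    characters 0-2 and 3-5 of the MD5 digest of the UTF-8 encoded key.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:3]}/{digest[3:6]}"


if __name__ == "__main__":
    key = "abc"  # placeholder input, not a real git-annex key
    print(new_hash_dirs(key))
    # Hashing the key plus a trailing newline gives a different result.
    print(new_hash_dirs(key + "\n"))
```

A frequent reason for a mismatch with the `md5sum` utility is a trailing newline in its input, as with `echo KEY | md5sum`. Whether that explains the discrepancy described in the patch is an assumption; the last line of the sketch only shows that a newline changes the digest.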