author | Joey Hess <joey@kitenet.net> | 2010-10-16 15:23:03 -0400 |
---|---|---|
committer | Joey Hess <joey@kitenet.net> | 2010-10-16 15:23:03 -0400 |
commit | a31dc74806f165e01f56dbc3322e738a921cc6e9 (patch) | |
tree | 71b50e71c36b086a4dd0cb89d3e372c58e231177 /doc | |
parent | 684011175cc75bb6a667e65ba0ec6cabd1f0897a (diff) |
update
Diffstat (limited to 'doc')
-rw-r--r-- | doc/git-annex.mdwn | 182 |
1 files changed, 182 insertions, 0 deletions
diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn
new file mode 100644
index 000000000..bb216f038
--- /dev/null
+++ b/doc/git-annex.mdwn
@@ -0,0 +1,182 @@
+git-annex allows managing files with git, without checking the file
+contents into git. While that may seem paradoxical, it is useful when
+dealing with files larger than git can currently easily handle, whether due
+to limitations in memory, checksumming time, or disk space.
+
+Even without file content tracking, being able to manage files with git,
+move files around and delete files with versioned directory trees, and use
+branches and distributed clones, are all very handy reasons to use git. And
+annexed files can co-exist in the same git repository with regularly
+versioned files, which is convenient for maintaining documents, Makefiles,
+etc that are associated with annexed files but that benefit from full
+revision control.
+
+Enough broad picture, here's how it actually looks:
+
+* `git annex add $file` moves the file into `.git/annex/`, and replaces
+  it with a symlink pointing at the annexed file, and then calls `git add`
+  to version the *symlink*. (If the file has already been annexed, it does
+  nothing.)
+* If you use normal git push/pull commands, the annexed file content
+  won't be transferred, but the symlinks will be. So different clones of a
+  repository can have different sets of annexed files available.
+* You can move the symlink around, copy it, delete it, etc, and commit changes
+  as desired using git. Reading the symlink will always get you the annexed
+  file content, or the link may be broken if the content is not currently
+  available.
+* `git annex push $repository` pushes *all* annexed files to the specified
+  repository.
+* `git annex pull $repository` pulls *all* annexed files from the specified
+  repository.
+* `git annex want $file` indicates that you want access to a file's
+  content, without immediately transferring it.
+* `git annex get $file` is used to transfer a specified file, and/or
+  files previously indicated with `git annex want`. If a configured
+  repository has it, or it is available from other key/value storage,
+  it will be immediately downloaded.
+* `git annex drop $file` indicates that you no longer want the file's
+  content to be available in this repository.
+* `git annex unannex $file` undoes a `git annex add`. But use `git annex drop`
+  if you're just done with a file; only use `unannex` if you
+  accidentally added a file.
+
+Oh yeah, "$file" in the above can be any number of files, or directories,
+same as you'd pass to "git add" or "git rm".
+So "git annex add ." or "git annex get dir/" work fine.
+
+## copies
+
+git-annex can be configured to try to keep N copies of a file's content
+available across all repositories. By default, N is 1; it is configured by
+annex.numcopies.
+
+`git annex drop` attempts to check with other git remotes, to check that N
+copies of the file exist. If enough repositories cannot be verified to have
+it, it will retain the file content to avoid data loss.
+
+For example, consider three repositories: Server, Laptop, and USB. Both Server
+and USB have a copy of a file, and N=1. If on Laptop, you `git annex get
+$file`, this will transfer it from either Server or USB (depending on which
+is available), and there are now 3 copies of the file.
+
+Suppose you want to free up space on Laptop again, and you `git annex drop` the file
+there. If USB is connected, or Server can be contacted, git-annex can check
+that it still has a copy of the file, and the content is removed from
+Laptop. But if USB is currently disconnected, and Server also cannot be
+contacted, it can't verify that it is safe to drop the file, and will
+refuse to do so.
+
+With N=2, in order to drop the file content from Laptop, it would need access
+to both USB and Server.
+
+Note that different repositories can be configured with different values of
+N. So just because Laptop has N=2, this does not prevent the number of
+copies falling to 1, when USB and Server have N=1.
+
+## key/value storage
+
+git-annex uses a key/value abstraction layer to allow file contents to be
+stored in different ways. In theory, any key/value storage system could be
+used to store the file contents, and git-annex would then retrieve them
+as needed and put them in `.git/annex/`.
+
+When a file is annexed, a key is generated from its content and/or metadata.
+The file checked into git symlinks to the key. This key can later be used
+to retrieve the file's content (its value). This key generation must be
+stable for a given file content, name, and size.
+
+Multiple pluggable backends are supported, and more than one can be used
+to store different files' contents in a given repository.
+
+* `WORM` ("Write Once, Read Many") -- This backend stores the file's content
+  only in `.git/annex/`, and assumes that any file with the same basename,
+  size, and modification time has the same content. So with this backend,
+  files can be moved around, but should never be added to or changed.
+  This is the default, and the least expensive backend.
+* `SHA1` -- This backend stores the file's content in
+  `.git/annex/`, with a name based on its sha1 checksum. This backend allows
+  modifications of files to be tracked. Its need to generate checksums
+  can make it slow for large files.
+* `URL` -- This backend downloads the file's content from an external URL.
+
+## location tracking
+
+git-annex keeps track of which repository it last saw a file's content on.
+This can be useful when using it for archiving with offline storage. When
+you indicate you want a file, git-annex will tell you which repositories
+have the file's content. For example:
+
+    # git annex get myfile
+    git-annex: unable to get: myfile
+    To get that file, need access to one of these remotes: usbdrive
+
+Location tracking information is stored in `.git-annex/$key.log`.
+Repositories record their UUID and the date when they get or drop
+a file's content. (Git is configured to use a union merge for this file,
+so the lines may be in arbitrary order, but it will never conflict.)
+
+The optional file `.git-annex/uuid.log` can be created to add a description
+to a UUID. If git-annex needs a file from some repository, and it cannot find
+the repository among the remotes, it will use the description from this
+file when asking for the repository to be made available. The file format
+is a UUID, a space, and the rest of the line is its description. For
+example:
+
+    UUID d3d2474c-d5c3-11df-80a9-002170d25c55 USB drive in red enclosure
+    UUID 60cf39c8-d5c6-11df-aa8b-93fda39008d6 my colocated server
+
+## configuration
+
+* `annex.uuid` -- a unique UUID for this repository
+* `annex.numcopies` -- number of copies of files to keep (default: 1)
+* `annex.backends` -- space-separated list of names of
+  the key/value backends to use. The first listed is used to store
+  new files. (default: "WORM SHA1 URL")
+* `remote.<name>.annex-cost` -- When determining which repository to
+  transfer annexed files from or to, ones with lower costs are preferred.
+  The default cost is 100 for local repositories, and 200 for remote
+  repositories. Note that other factors may be configured when pushing
+  files to repositories, in particular, whether the repository is on
+  a filesystem with sufficient free space.
+* `remote.<name>.annex-uuid` -- git-annex caches UUIDs of repositories
+  here.
+
+## issues
+
+### symlinks
+
+If the symlink to annexed content is relative, moving it to a subdir will
+break it. But if it's absolute, moving the git repo (or mounting its drive
+elsewhere) will break it. Either:
+
+* Use relative links and need `git annex mv` to move (or post-commit
+  hook that caches moves and updates links).
+* Use absolute links and need `git annex fixlinks` when location changes;
+  note that would also mean that git would see the symlink targets changed
+  and want to commit the change. And, other clones of the repo would
+  diverge and there would be conflicts on the symlink text. Ugh.
+
+Hard links are not an option, because git would then happily commit the
+file content. Among other reasons.
+
+### free space determination
+
+Need a way to tell how much free space is available on the disk containing
+a given repository. The repository may be remote, so ssh may need to be
+used.
+
+Similarly, need a way to tell the size of a file before downloading it from
+remote, to check local disk space.
+
+### auto-drop files on rm
+
+When git-rm removes a file, it should get dropped too. Of course, it may
+not be dropped right away, depending on number of copies available.
+
+### branching
+
+The use of `.git-annex` to store logs means that if a repo has branches
+and the user switches between them, git-annex will see different logs in
+the different branches, and so may miss info about what remotes have which
+files (though it can re-learn). An alternative would be to
+store the log data directly in the git repo as `pristine-tar` does.
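
The `git annex drop` rule described in the "copies" section of the added document can be sketched in a few lines. This is only an illustration of the rule as the text states it, not git-annex's own code: the function `can_drop` and its parameters `holders` and `reachable` are hypothetical names, and the sketch assumes that a copy counts toward N only if the repository holding it can be contacted at the time of the drop.

```python
def can_drop(numcopies, holders, reachable):
    """Decide whether the local copy of a file may be dropped.

    numcopies -- the N configured via annex.numcopies
    holders   -- names of other repositories believed to hold the content
    reachable -- names of repositories that can currently be contacted

    All three parameter names are hypothetical, chosen for this sketch.
    """
    # Only copies that can be verified right now count toward N.
    verified = set(holders) & set(reachable)
    return len(verified) >= numcopies


# The Server / Laptop / USB scenario from the text, seen from Laptop.
others = {"Server", "USB"}
print(can_drop(1, others, {"USB"}))            # N=1, USB connected: True
print(can_drop(1, others, set()))              # N=1, nothing reachable: False
print(can_drop(2, others, {"USB"}))            # N=2, only USB reachable: False
print(can_drop(2, others, {"USB", "Server"}))  # N=2, both reachable: True
```

Counting only verified copies is what makes the refusal in the disconnected case fall out naturally: an unreachable repository contributes nothing, so the drop is blocked rather than risking data loss.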