From 3a549f9bc47b966216b73f465d77bfc3381856da Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 23 Nov 2015 16:53:05 -0400 Subject: smudge design --- doc/devblog/day_339_smudging_out_direct_mode.mdwn | 56 ++++++ doc/todo/smudge.mdwn | 203 ++++++++++++++++++---- 2 files changed, 225 insertions(+), 34 deletions(-) create mode 100644 doc/devblog/day_339_smudging_out_direct_mode.mdwn diff --git a/doc/devblog/day_339_smudging_out_direct_mode.mdwn b/doc/devblog/day_339_smudging_out_direct_mode.mdwn new file mode 100644 index 000000000..8e82f31af --- /dev/null +++ b/doc/devblog/day_339_smudging_out_direct_mode.mdwn @@ -0,0 +1,56 @@ +I'm considering ways to get rid of direct mode, replacing it with something +better implemented using [[todo/smudge]] filters. + +## git-lfs + +I started by trying out git-lfs, to see what I can learn from it. My +feeling is that git-lfs brings an admirable simplicity to using git with +large files. For example, it uses a push-hook to automatically +upload file contents before pushing a branch. + +But its simplicity comes at the cost of being centralized. You can't make a +git-lfs repository locally and clone it onto another drive and have the local +repositories interoperate to pass file contents around. Everything has to +go back through a centralized server. I'm willing to pay complexity costs +for decentralization. + +Its simplicity also means that the user doesn't have much control over what +files are present in their checkout of a repository. git-lfs downloads +all the files in the work tree. It doesn't have facilities for dropping +files to free up space, or for configuring a repository to only want to get +a subset of files in the first place. Some of this could be added to it +I suppose. + +## replacing direct mode + +Anyway, as smudge/clean filters stand now, they can't be used to set up +git-annex symlinks; their interface doesn't allow it. 
But, I was able to +think up a design that uses smudge/clean filters to cover the same use +cases that direct mode covers now. + +Thanks to the clean filter, adding a file with `git add` would check in a +small file that points to the git-annex object. When a file has been added +this way, the file in the work tree remains the only copy of the object +until you use git-annex to copy it to another repository. So if you modify +the work tree file, you can lose the old version of the object. + +This is analogous to how direct mode works now, and it avoids needing to +store 2 copies of every file in the local repository. + +In the same repository, you could also use `git annex add` to check +in a git-annex symlink, which would protect the object from modification, +in the good old indirect mode way. `git annex lock` and `git annex unlock` +could switch a file between those two modes. + +So this allows mixing directly writable annexed files and locked down +annexed files in the same repository. All regular git commands and all +git-annex commands can be used on both sorts of files. + +That's much more flexible than the current direct mode, and I think it will +be able to be implemented in a simpler, more scalable, and robust way too. +I can lose the direct mode merge code, and remove hundreds of lines of +other special cases for direct mode. + +The downside, perhaps, is that for a repository to be usable on a crippled +filesystem, all the files in it will need to be unlocked. A file can't +easily be unlocked in one checkout and locked in another checkout. diff --git a/doc/todo/smudge.mdwn b/doc/todo/smudge.mdwn index b11b1dedc..0982d7288 100644 --- a/doc/todo/smudge.mdwn +++ b/doc/todo/smudge.mdwn @@ -15,6 +15,10 @@ available files, and checksum them, which is too expensive. > git to handle this sort of case in an efficient way.. just needs someone > to do the work. 
--[[Joey]] +>> Update 2015: git status only calls the clean filter for files +>> that the index says are modified, so this is no longer a problem. +>> --[[Joey]] + ---- The clean filter is run when files are staged for commit. So a user could copy @@ -36,35 +40,26 @@ add` files, and just being able to use `git add` or `git commit -a`, and have it use git-annex when .gitattributes says to. Also, annexed files can be directly modified without having to `git annex unlock`. -### design +### configuration In .gitattributes, the user would put something like "* filter=git-annex". This way they could control which files are annexed vs added normally. -(git-annex could have further controls to allow eg, passing small files -through to regular processing. At least .gitattributes is a special case, -it should never be annexed...) - -For files not configured this way, git-annex could continue to use -its symlink method -- this would preserve backwards compatability, -and even allow mixing the two methods in a repo as desired. - -To find files in the repository that are annexed, git-annex would do -`ls-files` as now, but would check if found files have the appropriate -filter, rather than the current symlink checks. To determine the key -of a file, rather than reading its symlink, git-annex would need to -look up the git blob associated with the file -- this can be done -efficiently using the existing code in `Branch.catFile`. - -The clean filter would inject the file's content into the annex, and hard -link from the annex to the file. Avoiding duplication of data. +It would also be good to allow using this without having to specify +the files in .gitattributes. Just use "* filter=git-annex" there, and then +let git-annex decide which files to annex and which to pass through the +smudge and clean filters as-is. The smudge filter can just read a little of +its input to see if it's a pointer to an annexed file. 
The clean filter +could apply annex.largefiles to decide whether to annex a file's content or +not. -The smudge filter can't do that, so to avoid duplication of data, it -might always create an empty file. To get the content, `git annex get` -could be used (which would hard link it). A `post-checkout` hook might -be used to set up hard links for all currently available content. +For files not configured this way in .gitattributes, git-annex could +continue to use its symlink method -- this would preserve backwards +compatibility, and even allow mixing the two methods in a repo as desired. +(But not switching an existing repo between indirect and direct modes; +the user decides which mode to use when adding files to the repo.) -#### clean +### clean The trick is doing it efficiently. Since git a2b665d, v1.7.4.1, something like this works to provide a filename to the clean script: @@ -100,26 +95,166 @@ can't be fixed.) > but it seems to avoid this problem. > --[[Joey]] -#### smudge +### smudge The smudge script can also be provided a filename with %f, but it cannot directly write to the file or git gets unhappy. > Still the case in 2015. Means an unnecessary read and pipe of the file -> even if the content is already locally available on disk. --[[Joey]] +> even if the content is already locally available on disk. --[[Joey]] + +### partial checkouts + +It's important that git-annex supports partial checkouts of the content of +a repository. This allows repositories to be checked out when there's not +enough disk space for all files in the repository. + +The way git-lfs uses smudge/clean filters, which is similar to that +described above, does not support partial checkouts; it always tries to +download the contents of all files. Indeed, git-lfs seems to keep 2 copies +of newly added files; one in the work tree and one in .git/lfs/objects/, +at least before it sends the latter to the server. 
This lack of control +over which data is checked out, and the duplication of data, limit the +usefulness of git-lfs on truly large amounts of data. + +To support partial checkouts, `git annex get` and `git annex drop` need to +be usable. + +To avoid data duplication when adding a new object, the clean filter could +hard link from the work tree file to the annex object. However, the +user could change the work tree file without breaking the hard link, and this +would corrupt the annexed object. Could remove write permissions to avoid +that (mostly), but that would lose some of the benefits of smudge/clean as +the user wouldn't be able to modify annexed files. +> This may be one of those things where different tradeoffs meet different +> users' needs and so a repo could be switched between the two modes as +> needed. + +The smudge filter can't modify the work tree file on its own -- git always +modifies the file after getting the output of the smudge filter, and will +stumble over any modifications that the smudge filter makes. And, it's +important that the smudge filter never fail as that will leave the repo in +a bad state. + +So, to support partial checkouts and avoid data duplication, the smudge +filter should provide some dummy content, probably including the key of the +file. (The clean filter should detect when it's operating on that dummy +content, and provide the same key as it would if the file content was +present.) + +To get the real content, use `git annex get`. (A `post-checkout` hook could +run that on all files if the user wants that behavior, or a config setting +could make the smudge filter automatically get files' contents.) -### dealing with partial content availability +I've a demo implementation of this technique in the scripts below. -The smudge filter cannot be allowed to fail, that leaves the tree and -index in a weird state. 
So if a file's content is requested by calling -the smudge filter, the trick is to instead provide dummy content, -indicating it is not available (and perhaps saying to run "git-annex get"). +### design -Then, in the clean filter, it has to detect that it's cleaning a file -with that dummy content, and make sure to provide the same identifier as -it would if the file content was there. +Goal: Get rid of current direct mode, using smudge/clean filters instead to +cover the same use cases, more flexibly and robustly. -I've a demo implementation of this technique in the scripts below. +Use case 1: + +A user wants to be able to edit files, and git-add, git commit, +without needing to worry about using git-annex to unlock files, add files, +etc. + +Use case 2: + +Using git-annex on a crippled filesystem that does not support symlinks. + +Data: + +* An annex pointer file has as its first line the git-annex key + that it's standing in for. Subsequent lines of the file might + be a message saying that the file's content is not currently available. + An annex pointer file is checked into the git repository the same way + that an annex symlink is checked in. +* file2key maps are maintained by git-annex, to keep track of + what files are pointers at keys. + +Configuration: + +* .gitattributes tells git which files to use git-annex's smudge/clean + filters with. Typically, all files except for dotfiles: + + * filter=annex + .* !filter + +* annex.largefiles tells git-annex which files should in fact be put in + the annex. Other files are passed through the smudge/clean as-is and + have their contents stored in git. + +git-annex clean: + +* Run by `git add` (and diff and status, etc), and passed the + filename, as well as fed the file content on stdin. + + Look at configuration to decide if this file's content belongs in the + annex. If not, output the file content to stdout. + + Generate annex key from filename and content from stdin. 
+ + Hard link .git/annex/objects to the file, if it doesn't already exist. + (On platforms not supporting hardlinks, copy the file to + .git/annex/objects.) + + This is done to prevent losing the only copy of a file when eg + doing a git checkout of a different branch. But, no attempt is made to + protect the object from being modified. If a user wants to + protect object contents from modification, they should use + `git annex add`, not `git add`, or they can `git annex lock` after adding. + + There could be a configuration knob to cause a copy to be made to + .git/annex/objects -- useful for those crippled filesystems. It might + also drop that copy once the object gets uploaded to another repo ... + But that gets complicated quickly. + + Update file2key map. + + Output the pointer file content to stdout. + +git-annex smudge: + +* Run by eg `git checkout` and passed the filename, as well as fed + the pointer file content on stdin. + + Updates file2key map. + + Outputs the same pointer file content to stdout. + +git annex direct/indirect: + + Previously these commands switched in and out of direct mode. + Now they become no-ops. + +git annex lock/unlock: + + Makes sense for these to change to switch files between using + git-annex symlinks and pointers. So, this provides both a way to + transition repositories to using pointers, and a cleaner unlock/lock + for repos using symlinks. + + unlock will stage a pointer file, and will copy the content of the object + out of .git/annex/objects to the work tree file. (Might want a --hardlink + switch.) + + lock will replace the current work tree file with the symlink, and stage it. + Note that multiple work tree files could point to the same object. + So, if the link count is > 1, replace the annex object with a copy of + itself to break such a hard link. Always finish by locking down the + permissions of the annex object. 
+ +All other git-annex commands that look at annex symlinks to get keys will +need to fall back to checking if a given work tree file is stored in git as +a pointer file. This can be done by checking the file2key map (or by looking +it up in the index). + +Note that I have not verified if file2key maps can be maintained +consistently using the smudge/clean filters. Seems likely to work, +based on when I see smudge/clean filters being run. The file2key +optimisation may not be needed though; looking at the index +might be fast enough. ---- -- cgit v1.2.3
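
The clean and smudge steps described in the design above could be sketched roughly as follows. This is a minimal illustration in Python, not git-annex's implementation: the function names (`make_key`, `parse_pointer`, `pointer_for`, `clean`, `smudge`), the `is_large` callback standing in for annex.largefiles, the simplified `SHA256-s<size>--<hash>` key syntax, the wording of the "not available" note, and the use of a plain dict as the file2key map are all assumptions made for the example.

```python
import hashlib
import os
import re
import shutil

# Hypothetical, simplified key syntax used only for this sketch.
KEY_RE = re.compile(rb"^SHA256-s\d+--[0-9a-f]{64}$")
# Hypothetical wording for the optional message after the key line.
NOTE = b"(content not present here; `git annex get` would fetch it)\n"


def make_key(content):
    """Derive a key from the content (size plus SHA-256 digest)."""
    digest = hashlib.sha256(content).hexdigest().encode()
    return b"SHA256-s%d--%s" % (len(content), digest)


def parse_pointer(content):
    """Return the key if the first line looks like a key, else None."""
    first = content.split(b"\n", 1)[0]
    return first if KEY_RE.match(first) else None


def pointer_for(key):
    """Pointer file content: the key on the first line, then a note."""
    return key + b"\n" + NOTE


def clean(path, content, objects_dir, file2key, is_large):
    """Return what git should store for the work tree file at `path`."""
    key = parse_pointer(content)
    if key is None:
        if not is_large(path, content):
            return content  # small file: passed through, stored in git
        key = make_key(content)
        obj = os.path.join(objects_dir, key.decode())
        if not os.path.exists(obj):
            os.makedirs(objects_dir, exist_ok=True)
            try:
                os.link(path, obj)  # hard link, so no second copy
            except OSError:
                shutil.copyfile(path, obj)  # fallback without hard links
    # Cleaning a pointer (dummy content) yields the same key again.
    file2key[path] = key
    return pointer_for(key)


def smudge(path, content, file2key):
    """Never fails: records the key if any and echoes the content."""
    key = parse_pointer(content)
    if key is not None:
        file2key[path] = key
    return content
```

In a real filter, `content` would come from stdin and the return value would be written to stdout, with git supplying the filename through `%f` in the filter configuration.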