smudge design

author: Joey Hess <joeyh@joeyh.name> 2015-11-23 16:53:05 -0400
committer: Joey Hess <joeyh@joeyh.name> 2015-11-23 16:53:05 -0400
commit: 3a549f9bc47b966216b73f465d77bfc3381856da (patch)
tree: 5673e271508f4a618f1023b199962f33129a5af7
parent: f038c6c472755ff80dc98dcc8c3dc713124f34a7 (diff)
2 files changed, 225 insertions, 34 deletions
diff --git a/doc/devblog/day_339_smudging_out_direct_mode.mdwn b/doc/devblog/day_339_smudging_out_direct_mode.mdwn
new file mode 100644
index 000000000..8e82f31af
--- /dev/null
+++ b/doc/devblog/day_339_smudging_out_direct_mode.mdwn
@@ -0,0 +1,56 @@
+I'm considering ways to get rid of direct mode, replacing it with something
+better implemented using [[todo/smudge]] filters.
+
+## git-lfs
+
+I started by trying out git-lfs, to see what I can learn from it. My
+feeling is that git-lfs brings an admirable simplicity to using git with
+large files. For example, it uses a push-hook to automatically
+upload file contents before pushing a branch.
+
+But its simplicity comes at the cost of being centralized. You can't make a
+git-lfs repository locally and clone it onto another drive and have the local
+repositories interoperate to pass file contents around. Everything has to
+go back through a centralized server. I'm willing to pay complexity costs
+for decentralization.
+
+Its simplicity also means that the user doesn't have much control over what
+files are present in their checkout of a repository. git-lfs downloads
+all the files in the work tree. It doesn't have facilities for dropping
+files to free up space, or for configuring a repository to only want to get
+a subset of files in the first place. Some of this could be added to it,
+I suppose.
+
+## replacing direct mode
+
+Anyway, as smudge/clean filters stand now, they can't be used to set up
+git-annex symlinks; their interface doesn't allow it. But, I was able to
+think up a design that uses smudge/clean filters to cover the same use
+cases that direct mode covers now.
+
+Thanks to the clean filter, adding a file with `git add` would check in a
+small file that points to the git-annex object. When a file has been added
+this way, the file in the work tree remains the only copy of the object
+until you use git-annex to copy it to another repository. So if you modify
+the work tree file, you can lose the old version of the object.
+
+This is analogous to how direct mode works now, and it avoids needing to
+store 2 copies of every file in the local repository.
+
+In the same repository, you could also use `git annex add` to check
+in a git-annex symlink, which would protect the object from modification,
+in the good old indirect mode way. `git annex lock` and `git annex unlock` 
+could switch a file between those two modes.
+
+So this allows mixing directly writable annexed files and locked down
+annexed files in the same repository. All regular git commands and all
+git-annex commands can be used on both sorts of files.
+
+That's much more flexible than the current direct mode, and I think it can
+be implemented in a simpler, more scalable, and more robust way too.
+I can lose the direct mode merge code, and remove hundreds of lines of
+other special cases for direct mode.
+
+The downside, perhaps, is that for a repository to be usable on a crippled
+filesystem, all the files in it will need to be unlocked. A file can't
+easily be unlocked in one checkout and locked in another checkout.
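
The crippled-filesystem downside above depends on knowing whether a filesystem can hold symlinks at all. A minimal sketch of such a probe follows, in Python. The name `supports_symlinks` is hypothetical, and this is an illustration only, not how git-annex itself detects crippled filesystems.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Try to create a symlink inside `directory`; report whether it worked.

    Hypothetical helper for illustration only.
    """
    # The probe runs in a throwaway subdirectory that is removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as probe_dir:
        link = os.path.join(probe_dir, "probe-link")
        try:
            # The target does not need to exist; a dangling link is enough.
            os.symlink("probe-target", link)
        except (OSError, NotImplementedError, AttributeError):
            return False
        return os.path.islink(link)
```

If the probe returns `False` for a checkout's directory, every annexed file in that checkout would have to stay unlocked under the design described above.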
diff --git a/doc/todo/smudge.mdwn b/doc/todo/smudge.mdwn
index b11b1dedc..0982d7288 100644
--- a/doc/todo/smudge.mdwn
+++ b/doc/todo/smudge.mdwn
@@ -15,6 +15,10 @@ available files, and checksum them, which is too expensive.
 > git to handle this sort of case in an efficient way.. just needs someone
 > to do the work. --[[Joey]] 
 
+>> Update 2015: git status only calls the clean filter for files
+>> that the index says are modified, so this is no longer a problem.
+>> --[[Joey]]
+
 ----
 
 The clean filter is run when files are staged for commit. So a user could copy
@@ -36,35 +40,26 @@ add` files, and just being able to use `git add` or `git commit -a`,
 and have it use git-annex when .gitattributes says to. Also, annexed
 files can be directly modified without having to `git annex unlock`.
 
-### design
+### configuration
 
 In .gitattributes, the user would put something like "* filter=git-annex".
 This way they could control which files are annexed vs added normally.
 
-(git-annex could have further controls to allow eg, passing small files
-through to regular processing. At least .gitattributes is a special case,
-it should never be annexed...)
-
-For files not configured this way, git-annex could continue to use
-its symlink method -- this would preserve backwards compatability,
-and even allow mixing the two methods in a repo as desired.
-
-To find files in the repository that are annexed, git-annex would do
-`ls-files` as now, but would check if found files have the appropriate
-filter, rather than the current symlink checks. To determine the key
-of a file, rather than reading its symlink, git-annex would need to
-look up the git blob associated with the file -- this can be done
-efficiently using the existing code in `Branch.catFile`.
-
-The clean filter would inject the file's content into the annex, and hard
-link from the annex to the file. Avoiding duplication of data.
+It would also be good to allow using this without having to specify
+the files in .gitattributes. Just use "* filter=git-annex" there, and then
+let git-annex decide which files to annex and which to pass through the
+smudge and clean filters as-is. The smudge filter can just read a little of
+its input to see if it's a pointer to an annexed file. The clean filter
+could apply annex.largefiles to decide whether to annex a file's content or
+not.
 
-The smudge filter can't do that, so to avoid duplication of data, it
-might always create an empty file. To get the content, `git annex get`
-could be used (which would hard link it). A `post-checkout` hook might
-be used to set up hard links for all currently available content.
+For files not configured this way in .gitattributes, git-annex could
+continue to use its symlink method -- this would preserve backwards
+compatibility, and even allow mixing the two methods in a repo as desired.
+(But not switching an existing repo between indirect and direct modes;
+the user decides which mode to use when adding files to the repo.)
 
-#### clean
+### clean
 
 The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
 something like this works to provide a filename to the clean script:
@@ -100,26 +95,166 @@ can't be fixed.)
 > but it seems to avoid this problem.
 > --[[Joey]]
 
-#### smudge
+### smudge
 
 The smudge script can also be provided a filename with %f, but it
 cannot directly write to the file or git gets unhappy.
 
 > Still the case in 2015. Means an unnecesary read and pipe of the file
-> even if the content is already locally available on disk. --[[Joey]]
+> even if the content is already locally available on disk. --[[Joey]]
+
+### partial checkouts
+
+It's important that git-annex supports partial checkouts of the content of
+a repository. This allows repositories to be checked out when there isn't
+enough disk space for all files in the repository.
+
+The way git-lfs uses smudge/clean filters, which is similar to that
+described above, does not support partial checkouts; it always tries to
+download the contents of all files. Indeed, git-lfs seems to keep 2 copies
+of newly added files; one in the work tree and one in .git/lfs/objects/,
+at least before it sends the latter to the server. This lack of control
+over which data is checked out and duplication of the data limits the
+usefulness of git-lfs on truly large amounts of data.
+
+To support partial checkouts, `git annex get` and `git annex drop` need to
+be able to be used.
+
+To avoid data duplication when adding a new object, the clean filter could
+hard link from the work tree file to the annex object. However, the
+user could change the work tree file without breaking the hard link, and
+this would corrupt the annexed object. Removing write permissions would
+(mostly) avoid that, but would lose some of the benefits of smudge/clean,
+as the user wouldn't be able to modify annexed files.
+> This may be one of those things where different tradeoffs meet different
+> users' needs, and so a repo could be switched between the two modes as
+> needed.
+
+The smudge filter can't modify the work tree file on its own -- git always
+modifies the file after getting the output of the smudge filter, and will
+stumble over any modifications that the smudge filter makes. And, it's
+important that the smudge filter never fail as that will leave the repo in
+a bad state.
+
+So, to support partial checkouts and avoid data duplication, the smudge
+filter should provide some dummy content, probably including the key of the
+file. (The clean filter should detect when it's operating on that dummy
+content, and provide the same key as it would if the file content was
+present.)
+
+To get the real content, use `git annex get`. (A `post-checkout` hook could
+run that on all files if the user wants that behavior, or a config setting
+could make the smudge filter automatically get files' contents.)
 
-### dealing with partial content availability
+I've a demo implementation of this technique in the scripts below.
 
-The smudge filter cannot be allowed to fail, that leaves the tree and
-index in a weird state. So if a file's content is requested by calling
-the smudge filter, the trick is to instead provide dummy content,
-indicating it is not available (and perhaps saying to run "git-annex get").
+### design
 
-Then, in the clean filter, it has to detect that it's cleaning a file
-with that dummy content, and make sure to provide the same identifier as
-it would if the file content was there. 
+Goal: Get rid of current direct mode, using smudge/clean filters instead to
+cover the same use cases, more flexibly and robustly.
 
-I've a demo implementation of this technique in the scripts below.
+Use case 1:
+
+A user wants to be able to edit files, and `git add`, `git commit`,
+without needing to worry about using git-annex to unlock files, add files,
+etc.
+
+Use case 2:
+
+Using git-annex on a crippled filesystem that does not support symlinks.
+
+Data:
+
+* An annex pointer file has as its first line the git-annex key
+  that it's standing in for. Subsequent lines of the file might
+  be a message saying that the file's content is not currently available.
+  An annex pointer file is checked into the git repository the same way
+  that an annex symlink is checked in.
+* file2key maps are maintained by git-annex, to keep track of
+  which files are pointers to keys.
+
+Configuration: 
+
+* .gitattributes tells git which files to use git-annex's smudge/clean
+  filters with. Typically, all files except for dotfiles:
+
+	* filter=annex
+	.* !filter
+
+* annex.largefiles tells git-annex which files should in fact be put in 
+  the annex. Other files are passed through the smudge/clean as-is and
+  have their contents stored in git.
+
+git-annex clean:
+
+* Run by `git add` (and diff and status, etc), and passed the
+  filename, as well as fed the file content on stdin.
+
+  Look at configuration to decide if this file's content belongs in the
+  annex. If not, output the file content to stdout.
+
+  Generate annex key from filename and content from stdin.
+
+  Hard link .git/annex/objects to the file, if it doesn't already exist.
+  (On platforms not supporting hardlinks, copy the file to
+  .git/annex/objects.)
+
+  This is done to prevent losing the only copy of a file when eg
+  doing a git checkout of a different branch. But, no attempt is made to 
+  protect the object from being modified. If a user wants to
+  protect object contents from modification, they should use
+  `git annex add`, not `git add`, or they can `git annex lock` after adding.
+
+  There could be a configuration knob to cause a copy to be made to
+  .git/annex/objects -- useful for those crippled filesystems. It might
+  also drop that copy once the object gets uploaded to another repo ...
+  But that gets complicated quickly.
+
+  Update file2key map.
+
+  Output the pointer file content to stdout.
+
+git-annex smudge:
+
+* Run by eg `git checkout` and passed the filename, as well as fed
+  the pointer file content on stdin.
+
+  Updates file2key map.
+
+  Outputs the same pointer file content to stdout.
+
+git annex direct/indirect:
+
+  Previously these commands switched in and out of direct mode.
+  Now they become no-ops.
+
+git annex lock/unlock:
+
+  Makes sense for these to change to switch files between using
+  git-annex symlinks and pointers. So, this provides both a way to
+  transition repositories to using pointers, and a cleaner unlock/lock
+  for repos using symlinks.
+
+  unlock will stage a pointer file, and will copy the content of the object
+  out of .git/annex/objects to the work tree file. (Might want a --hardlink
+  switch.)
+  
+  lock will replace the current work tree file with the symlink, and stage it.
+  Note that multiple work tree files could point to the same object.
+  So, if the link count is > 1, replace the annex object with a copy of
+  itself to break such a hard link. Always finish by locking down the
+  permissions of the annex object.
+
+All other git-annex commands that look at annex symlinks to get keys will
+need to fall back to checking if a given work tree file is stored in git as
+a pointer file. This can be done by checking the file2key map (or by looking
+it up in the index).
+
+Note that I have not verified if file2key maps can be maintained
+consistently using the smudge/clean filters. Seems likely to work,
+based on when I see smudge/clean filters being run. The file2key
+optimisation may not be needed though; looking at the index
+might be fast enough.
 
 ----
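
The clean and smudge steps from the design in `doc/todo/smudge.mdwn` can be sketched roughly as follows, in Python. This is a sketch under stated assumptions, not the git-annex implementation. The names `make_key`, `pointer_key`, `clean`, and `smudge` are hypothetical. The `SHA256-s<size>--<hash>` key layout and the regex used to recognise a pointer are assumptions made for the example, and the key is derived from the content only. A fixed size threshold stands in for `annex.largefiles`, and plain dicts stand in for `.git/annex/objects` and the file2key map.

```python
import hashlib
import re

# Assumed key layout for this sketch: backend, size, then content hash.
KEY_RE = re.compile(rb"^SHA256-s\d+--[0-9a-f]{64}$")
# Stand-in for annex.largefiles: annex anything at or above this size.
LARGE_FILE_BYTES = 1024


def make_key(content: bytes) -> bytes:
    """Build a key from the content (hypothetical helper)."""
    digest = hashlib.sha256(content).hexdigest().encode()
    return b"SHA256-s%d--%s" % (len(content), digest)


def pointer_key(data: bytes):
    """Return the key if the first line of data looks like one, else None."""
    first_line = data.split(b"\n", 1)[0]
    return first_line if KEY_RE.match(first_line) else None


def clean(filename, data, objects, file2key):
    """Model of the clean filter: file content in, pointer (or content) out."""
    key = pointer_key(data)
    if key is None:
        if len(data) < LARGE_FILE_BYTES:
            return data  # not annexed: content passes through into git as-is
        key = make_key(data)
        # The design hard links the work tree file into .git/annex/objects;
        # a dict entry stands in for that here.
        objects.setdefault(key, data)
    file2key[filename] = key
    return key + b"\n"


def smudge(filename, data, file2key):
    """Model of the smudge filter: record the key, echo the input unchanged."""
    key = pointer_key(data)
    if key is not None:
        file2key[filename] = key
    return data
```

Running `clean` on a pointer file, including one that carries a "content not available" message after the key line, yields the same key as cleaning the real content. That property lets dummy content round-trip through `git add` without appearing modified.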