author Joey Hess <joeyh@joeyh.name> 2015-11-23 18:10:50 -0400
committer Joey Hess <joeyh@joeyh.name> 2015-11-23 18:10:50 -0400
commit 1f2d57d064cdb594de6311a827b72e4ebcc4141d
tree dc8071228e580f6154a789a90a7da6a6417e7fdd
parent 5ca4be908563759cc9b722fd81a4142a0afa9e98
notes on merge
-rw-r--r-- doc/todo/smudge.mdwn 80
1 files changed, 37 insertions, 43 deletions
diff --git a/doc/todo/smudge.mdwn b/doc/todo/smudge.mdwn
index 3bb6b1dfd..335f69be8 100644
--- a/doc/todo/smudge.mdwn
+++ b/doc/todo/smudge.mdwn
@@ -101,53 +101,45 @@ The smudge script can also be provided a filename with %f, but it
cannot directly write to the file or git gets unhappy.
> Still the case in 2015. Means an unnecesary read and pipe of the file
-P> even if the content is already locally available on disk. --[[Joey]]
+> even if the content is already locally available on disk. --[[Joey]]
### partial checkouts
-It's important that git-annex supports partial checkouts of the content of
-a repository. This allows repositories to be checked out when there's not
-available disk space for all files in the repository.
-
-The way git-lfs uses smudge/clean filters, which is similar to that
-described above, does not support partial checkouts; it always tries to
-download the contents of all files. Indeed, git-lfs seems to keep 2 copies
-of newly added files; one in the work tree and one in .git/lfs/objects/,
-at least before it sends the latter to the server. This lack of control
-over which data is checked out and duplication of the data limits the
-usefulness of git-lfs on truely large amounts of data.
-
-To support partial checkouts, `git annex get` and `git annex drop` need to
-be able to be used.
-
-To avoid data duplication when adding a new object, the clean filter could
-hard link from the work tree file to the annex object. Although the
-user could change the work tree file w/o breaking the hard link and this
-would corrupt the annexed object. Could remove write permissions to avoid
-that (mostly), but that would lose some of the benefits of smudge/clean as
-the user wouldn't be able to modify annexed files.
-> This may be one of those things where different tradeoffs meet different
-> user's needs and so a repo could be switched between the two modes as
-> needed.)
-
-The smudge filter can't modify the work tree file on its own -- git always
-modifies the file after getting the output of the smudge filter, and will
-stumble over any modifications that the smudge filter makes. And, it's
-important that the smudge filter never fail as that will leave the repo in
-a bad state.
-
-So, to support partial checkouts and avoid data dupliciation, the smudge
-filter should provide some dummy content, probably including the key of the
-file. (The clean filter should detect when it's operating on that dummy
-content, and provide the same key as it would if the file content was
-present.)
-
-To get the real content, use `git annex get`. (A `post-checkout` hook could
-run that on all files if the user wants that behavior, or a config setting
-could make the smudge filter automatically get file's contents.)
+.. Are very important, otherwise a repo can't scale past the size of the
+smallest client's disk!
+
+It would be nice if the smudge filter could hard link or symlink a work
+tree file to the annex object.
+
+But currently, the smudge filter can't modify the work tree file on its own
+-- git always modifies the file after getting the output of the smudge
+filter, and will stumble over any modifications that the smudge filter
+makes. And, it's important that the smudge filter never fail as that will
+leave the repo in a bad state.
+
+Seems the best that can be done is for the smudge filter to copy from the
+annex object when the object is present. When it's not present, the smudge
+filter should provide a pointer to its content.
+
+The clean filter should detect when it's operating on that pointer file.
I've a demo implementation of this technique in the scripts below.
+### deduplication
+
+.. Is nice; needing 2 copies of every annexed file is annoying.
+
+Unfortunately, when using smudge/clean, `git merge` does not preserve a
+smudged file in the work tree when renaming it. It instead deletes the old
+file and asks the smudge filter to smudge the new filename.
+
+So, copies need to be maintained in .git/annex/objects, though it's ok
+to use hard links to the work tree files.
+
+Even if hard links are used, smudge needs to output the content of an
+annexed file, which will result in duplication when merging in renames of
+files.
+
### design
Goal: Get rid of current direct mode, using smudge/clean filters instead to
@@ -203,7 +195,8 @@ git-annex clean:
.git/annex/objects.)
This is done to prevent losing the only copy of a file when eg
- doing a git checkout of a different branch. But, no attempt is made to
+ doing a git checkout of a different branch, or merging a commit that
+ renames or deletes a file. But, no attempt is made to
protect the object from being modified. If a user wants to
protect object contents from modification, they should use
`git annex add`, not `git add`, or they can `git annex lock` after adding,.
@@ -224,7 +217,8 @@ git-annex smudge:
Updates file2key map.
- Outputs the same pointer file content to stdout.
+ When an object is present in the annex, outputs its content to stdout.
+ Otherwise, outputs the file pointer content.
git annex direct/indirect:
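
The smudge behaviour introduced by the last hunk (output the object's content when it is present in the annex, otherwise output the pointer) could be sketched roughly as follows. This is only an illustration, not the git-annex implementation: the `POINTER_PREFIX` value, the flat `objects_dir` layout, and the function names `parse_pointer` and `smudge` are all assumptions made for the example.

```python
import os

# Assumed pointer format for this sketch: one line holding a prefix and the key.
POINTER_PREFIX = b"/annex/objects/"


def parse_pointer(data):
    """Return the key if data looks like a pointer file, else None."""
    if not data.startswith(POINTER_PREFIX):
        return None
    first_line = data.split(b"\n", 1)[0]
    key = first_line[len(POINTER_PREFIX):]
    # Reject empty keys and anything that could escape the objects directory.
    if not key or b"/" in key:
        return None
    return os.fsdecode(key)


def smudge(data, objects_dir):
    """Return the bytes a smudge filter would write to stdout.

    objects_dir is a hypothetical flat directory with one file per key.
    """
    key = parse_pointer(data)
    if key is None:
        # Not a pointer file: pass the content through unchanged.
        return data
    try:
        with open(os.path.join(objects_dir, key), "rb") as f:
            # Object is present: output its content.
            return f.read()
    except OSError:
        # Object is missing or unreadable: fall back to the pointer, so the
        # filter never fails and the repo is not left in a bad state.
        return data
```

A real filter would read `data` from stdin and write the result to stdout; a clean filter could reuse `parse_pointer` to detect when it is operating on a pointer file.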