author | Joey Hess <joeyh@joeyh.name> | 2015-11-23 18:10:50 -0400 |
---|---|---|
committer | Joey Hess <joeyh@joeyh.name> | 2015-11-23 18:10:50 -0400 |
commit | 1f2d57d064cdb594de6311a827b72e4ebcc4141d (patch) | |
tree | dc8071228e580f6154a789a90a7da6a6417e7fdd /doc | |
parent | 5ca4be908563759cc9b722fd81a4142a0afa9e98 (diff) |
notes on merge
Diffstat (limited to 'doc')
-rw-r--r-- | doc/todo/smudge.mdwn | 80 |
1 files changed, 37 insertions, 43 deletions
diff --git a/doc/todo/smudge.mdwn b/doc/todo/smudge.mdwn
index 3bb6b1dfd..335f69be8 100644
--- a/doc/todo/smudge.mdwn
+++ b/doc/todo/smudge.mdwn
@@ -101,53 +101,45 @@ The smudge script can also be provided a filename with %f, but it
 cannot directly write to the file or git gets unhappy.
 
 > Still the case in 2015. Means an unnecesary read and pipe of the file
-P> even if the content is already locally available on disk. --[[Joey]]
+> even if the content is already locally available on disk. --[[Joey]]
 
 ### partial checkouts
 
-It's important that git-annex supports partial checkouts of the content of
-a repository. This allows repositories to be checked out when there's not
-available disk space for all files in the repository.
-
-The way git-lfs uses smudge/clean filters, which is similar to that
-described above, does not support partial checkouts; it always tries to
-download the contents of all files. Indeed, git-lfs seems to keep 2 copies
-of newly added files; one in the work tree and one in .git/lfs/objects/,
-at least before it sends the latter to the server. This lack of control
-over which data is checked out and duplication of the data limits the
-usefulness of git-lfs on truely large amounts of data.
-
-To support partial checkouts, `git annex get` and `git annex drop` need to
-be able to be used.
-
-To avoid data duplication when adding a new object, the clean filter could
-hard link from the work tree file to the annex object. Although the
-user could change the work tree file w/o breaking the hard link and this
-would corrupt the annexed object. Could remove write permissions to avoid
-that (mostly), but that would lose some of the benefits of smudge/clean as
-the user wouldn't be able to modify annexed files.
-> This may be one of those things where different tradeoffs meet different
-> user's needs and so a repo could be switched between the two modes as
-> needed.)
-
-The smudge filter can't modify the work tree file on its own -- git always
-modifies the file after getting the output of the smudge filter, and will
-stumble over any modifications that the smudge filter makes. And, it's
-important that the smudge filter never fail as that will leave the repo in
-a bad state.
-
-So, to support partial checkouts and avoid data dupliciation, the smudge
-filter should provide some dummy content, probably including the key of the
-file. (The clean filter should detect when it's operating on that dummy
-content, and provide the same key as it would if the file content was
-present.)
-
-To get the real content, use `git annex get`. (A `post-checkout` hook could
-run that on all files if the user wants that behavior, or a config setting
-could make the smudge filter automatically get file's contents.)
+.. Are very important, otherwise a repo can't scale past the size of the
+smallest client's disk!
+
+It would be nice if the smudge filter could hard link or symlink a work
+tree file to the annex object.
+
+But currently, the smudge filter can't modify the work tree file on its own
+-- git always modifies the file after getting the output of the smudge
+filter, and will stumble over any modifications that the smudge filter
+makes. And, it's important that the smudge filter never fail as that will
+leave the repo in a bad state.
+
+Seems the best that can be done is for the smudge filter to copy from the
+annex object when the object is present. When it's not present, the smudge
+filter should provide a pointer to its content.
+
+The clean filter should detect when it's operating on that pointer file.
 
 I've a demo implementation of this technique in the scripts below.
 
+### deduplication
+
+.. Is nice; needing 2 copies of every annexed file is annoying.
+
+Unfortunately, when using smudge/clean, `git merge` does not preserve a
+smudged file in the work tree when renaming it. It instead deletes the old
+file and asks the smudge filter to smudge the new filename.
+
+So, copies need to be maintained in .git/annex/objects, though it's ok
+to use hard links to the work tree files.
+
+Even if hard links are used, smudge needs to output the content of an
+annexed file, which will result in duplication when merging in renames of
+files.
+
 ### design
 
 Goal: Get rid of current direct mode, using smudge/clean filters instead to
@@ -203,7 +195,8 @@ git-annex clean:
 	.git/annex/objects.)
 
 	This is done to prevent losing the only copy of a file when eg
-	doing a git checkout of a different branch. But, no attempt is made to
+	doing a git checkout of a different branch, or merging a commit that
+	renames or deletes a file. But, no attempt is made to
 	protect the object from being modified. If a user wants to
 	protect object contents from modification, they should use
 	`git annex add`, not `git add`, or they can `git annex lock` after adding,.
@@ -224,7 +217,8 @@ git-annex smudge:
 
 	Updates file2key map.
 
-	Outputs the same pointer file content to stdout.
+	When an object is present in the annex, outputs its content to stdout.
+	Otherwise, outputs the file pointer content.
 
 git annex direct/indirect:
 
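
The smudge and clean behaviour this patch describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not the demo scripts the document refers to: the pointer format (`/annex/objects/KEY`), the key naming scheme, and the function names `parse_pointer`, `make_pointer`, `clean`, and `smudge` are all hypothetical, and a real filter would read from stdin and write to stdout rather than take byte strings.

```python
import hashlib
import os

# Assumed pointer format, for illustration only.
POINTER_PREFIX = b"/annex/objects/"


def parse_pointer(data):
    """Return the key if data looks like a pointer file, else None."""
    if not data.startswith(POINTER_PREFIX):
        return None
    first_line = data.split(b"\n", 1)[0]
    try:
        key = first_line[len(POINTER_PREFIX):].decode("ascii")
    except UnicodeDecodeError:
        return None
    # Reject empty keys and anything that could escape the objects directory.
    if not key or "/" in key:
        return None
    return key


def make_pointer(key):
    """Build the pointer file content for a key."""
    return POINTER_PREFIX + key.encode("ascii") + b"\n"


def clean(data, objects_dir):
    """Clean filter: turn work tree content into a pointer.

    If the input is already a pointer, the same pointer is returned.
    Otherwise a copy of the content is kept in objects_dir, so the
    only copy is not lost when git later replaces the work tree file.
    """
    key = parse_pointer(data)
    if key is not None:
        return make_pointer(key)
    # Hypothetical key scheme: size plus SHA-256 of the content.
    key = "SHA256-s%d--%s" % (len(data), hashlib.sha256(data).hexdigest())
    path = os.path.join(objects_dir, key)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return make_pointer(key)


def smudge(data, objects_dir):
    """Smudge filter: output the object's content when it is present,
    otherwise pass the pointer through. It never raises on a missing
    object, since a failing smudge filter leaves the repo in a bad state.
    """
    key = parse_pointer(data)
    if key is None:
        return data
    try:
        with open(os.path.join(objects_dir, key), "rb") as f:
            return f.read()
    except OSError:
        return data
```

Note that `smudge` returns a full copy of the object's content, which is the duplication the deduplication notes in the patch describe.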