author | Joey Hess <joey@kitenet.net> | 2012-06-13 14:03:38 -0400 |
---|---|---|
committer | Joey Hess <joey@kitenet.net> | 2012-06-13 14:03:38 -0400 |
commit | 24da48816d936316753be6cabfe83e3e51af647b (patch) | |
tree | ea776c216e9786c68edb40d6917a364ce8d59582 | |
parent | 59a7b3a51a7cdfb8528ebc44a26a7577f28254d4 (diff) | |
parent | c1566757972e4774476a10bc6f865a88823e8938 (diff) |
Merge branch 'master' into watch
-rw-r--r-- | doc/design/assistant/blog/day_7__bugfixes.mdwn | 45 |
-rw-r--r-- | doc/design/assistant/blog/day_7__bugfixes/profile.png | bin | 0 -> 47098 bytes |
-rw-r--r-- | doc/design/assistant/blog/day_7__bugfixes/profile2.png | bin | 0 -> 230937 bytes |
-rw-r--r-- | doc/design/assistant/inotify.mdwn | 116 |
4 files changed, 111 insertions, 50 deletions
diff --git a/doc/design/assistant/blog/day_7__bugfixes.mdwn b/doc/design/assistant/blog/day_7__bugfixes.mdwn
new file mode 100644
index 000000000..3704969e3
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes.mdwn
@@ -0,0 +1,45 @@
+Kickstarter is over. Yay!
+
+Today I worked on the bug where `git annex watch` turned regular files
+that were already checked into git into symlinks. So I made it check
+if a file is already in git before trying to add it to the annex.
+
+The tricky part was doing this check quickly. Unless I want to write my
+own git index parser (or use one from Hackage), this check requires running
+`git ls-files`, once per file to be added. That won't fly if a huge
+tree of files is being moved or unpacked into the watched directory.
+
+Instead, I made it only do the check during `git annex watch`'s initial
+scan of the tree. This should be ok, because once it's running, you
+won't be adding new files to git anyway, since it'll automatically annex
+new files. This is good enough for now, but there are at least two problems
+with it:
+
+* Someone might `git merge` in a branch that has some regular files,
+  and it would add the merged in files to the annex.
+* Once `git annex watch` is running, if you modify a file that was
+  checked into git as a regular file, the new version will be added
+  to the annex.
+
+I'll probably come back to this issue, and may well find myself directly
+querying git's index.
+
+---
+
+I've started work to fix the memory leak I see when running `git annex
+watch` in a large repository (40 thousand files). As always with a Haskell
+memory leak, I crack open [Real World Haskell's chapter on profiling](http://book.realworldhaskell.org/read/profiling-and-optimization.html).
+
+Eventually this yields a nice graph of the problem:
+
+[[!img profile.png alt="memory profile"]]
+
+So, looks like a few minor memory leaks, and one huge leak. Stared
+at this for a while and trying a few things, and got a much better result:
+
+[[!img profile2.png alt="memory profile"]]
+
+I may come back later and try to improve this further, but it's not bad memory
+usage. But, it's still rather slow to start up in such a large repository,
+and its initial scan is still doing too much work. I need to optimize
+more..
diff --git a/doc/design/assistant/blog/day_7__bugfixes/profile.png b/doc/design/assistant/blog/day_7__bugfixes/profile.png
Binary files differ
new file mode 100644
index 000000000..702af1ca0
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes/profile.png
diff --git a/doc/design/assistant/blog/day_7__bugfixes/profile2.png b/doc/design/assistant/blog/day_7__bugfixes/profile2.png
Binary files differ
new file mode 100644
index 000000000..4d487d02a
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes/profile2.png
diff --git a/doc/design/assistant/inotify.mdwn b/doc/design/assistant/inotify.mdwn
index 9fe6938c4..60c598673 100644
--- a/doc/design/assistant/inotify.mdwn
+++ b/doc/design/assistant/inotify.mdwn
@@ -1,44 +1,53 @@
 Finish "git annex watch" command, which runs, in the background, watching via
 inotify for changes, and automatically annexing new files, etc.
 
-There is a `watch` branch in git that adds such a command. To make this
-really useful, it needs to:
+There is a `watch` branch in git that adds the command.
 
-- on startup, add any files that have appeared since last run **done**
-- on startup, fix the symlinks for any renamed links **done**
-- on startup, stage any files that have been deleted since last run
-  (seems to require a `git commit -a` on startup, or at least a
-  `git add --update`, which will notice deleted files) **done**
-- notice new files, and git annex add **done**
-- notice renamed files, auto-fix the symlink, and stage the new file location
-  **done**
-- handle cases where directories are moved outside the repo, and stop
-  watching them **done**
-- when a whole directory is deleted or moved, stage removal of its
-  contents from the index **done**
-- notice deleted files and stage the deletion
-  (tricky; there's a race with add since it replaces the file with a symlink..)
-  **done**
-- Gracefully handle when the default limit of 8192 inotified directories
-  is exceeded. This can be tuned by root, so help the user fix it.
-  **done**
-- periodically auto-commit staged changes (avoid autocommitting when
-  lots of changes are coming in) **done**
-- coleasce related add/rm events for speed and less disk IO **done**
-- don't annex `.gitignore` and `.gitattributes` files **done**
-- run as a daemon **done**
-- tunable delays before adding new files, etc
+## known bugs
+
+* A process has a file open for write, another one closes it,
+  and so it's added. Then the first process modifies it.
+
+  Or, a process has a file open for write when `git annex watch` starts
+  up, it will be added to the annex. If the process later continues
+  writing, it will change content in the annex.
+
+  This changes content in the annex, and fsck will later catch
+  the inconsistency.
+
+  Possible fixes:
+
+  * Somehow track or detect if a file is open for write by any processes.
+  * Or, when possible, making a copy on write copy before adding the file
+    would avoid this.
+  * Or, as a last resort, make an expensive copy of the file and add that.
+  * Tracking file opens and closes with inotify could tell if any other
+    processes have the file open. But there are problems.. It doesn't
+    seem to differentiate between files opened for read and for write.
+    And there would still be a race after the last close and before it's
+    injected into the annex, where it could be opened for write again.
+    Would need to detect that and undo the annex injection or something.
+
+* If a file is checked into git as a normal file and gets modified
+  (or merged, etc), it will be converted into an annexed file.
+  See [[blog/day_7__bugfixes]]
+
+## todo
+
+- Support OSes other than Linux; it only uses inotify currently.
+  OSX and FreeBSD use the same mechanism, and there is a Haskell interface
+  for it,
+- Run niced and ioniced? Seems to make sense, this is a background job.
 - configurable option to only annex files meeting certian size or
   filename criteria
-- option to check files not meeting annex criteria into git directly
+- option to check files not meeting annex criteria into git directly,
+  automatically
 - honor .gitignore, not adding files it excludes (difficult, probably
   needs my own .gitignore parser to avoid excessive running of git commands
   to check for ignored files)
 - Possibly, when a directory is moved out of the annex location,
-  unannex its contents.
-- Support OSes other than Linux; it only uses inotify currently.
-  OSX and FreeBSD use the same mechanism, and there is a Haskell interface
-  for it,
+  unannex its contents. (Does inotify tell us where the directory moved
+  to so we can access it?)
 
 ## the races
 
@@ -61,25 +70,6 @@ Many races need to be dealt with by this code. Here are some of them.
   Fixed this problem; Now it hard links the file to a temp directory and
   operates on the hard link, which is also made unwritable.
 
-* A process has a file open for write, another one closes it, and so it's
-  added. Then the first process modifies it.
-
-  **Currently unfixed**; This changes content in the annex, and fsck will
-  later catch the inconsistency.
-
-  Possible fixes:
-
-  * Somehow track or detect if a file is open for write by any processes.
-  * Or, when possible, making a copy on write copy before adding the file
-    would avoid this.
-  * Or, as a last resort, make an expensive copy of the file and add that.
-  * Tracking file opens and closes with inotify could tell if any other
-    processes have the file open. But there are problems.. It doesn't
-    seem to differentiate between files opened for read and for write.
-    And there would still be a race after the last close and before it's
-    injected into the annex, where it could be opened for write again.
-    Would need to detect that and undo the annex injection or something.
-
 * File is added and then replaced with another file before the annex add
   makes its symlink.
 
@@ -108,3 +98,29 @@ Many races need to be dealt with by this code. Here are some of them.
 
   Not a problem; The removal event removes the old file from the index, and the
   add event adds the new one.
+
+## done
+
+- on startup, add any files that have appeared since last run **done**
+- on startup, fix the symlinks for any renamed links **done**
+- on startup, stage any files that have been deleted since last run
+  (seems to require a `git commit -a` on startup, or at least a
+  `git add --update`, which will notice deleted files) **done**
+- notice new files, and git annex add **done**
+- notice renamed files, auto-fix the symlink, and stage the new file location
+  **done**
+- handle cases where directories are moved outside the repo, and stop
+  watching them **done**
+- when a whole directory is deleted or moved, stage removal of its
+  contents from the index **done**
+- notice deleted files and stage the deletion
+  (tricky; there's a race with add since it replaces the file with a symlink..)
+  **done**
+- Gracefully handle when the default limit of 8192 inotified directories
+  is exceeded. This can be tuned by root, so help the user fix it.
+  **done**
+- periodically auto-commit staged changes (avoid autocommitting when
+  lots of changes are coming in) **done**
+- coleasce related add/rm events for speed and less disk IO **done**
+- don't annex `.gitignore` and `.gitattributes` files **done**
+- run as a daemon **done**
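
The blog post added in this commit observes that checking whether a file is already in git means running `git ls-files` once per file, which does not scale to a large tree. One possible alternative, sketched below, is to list the tracked files a single time with `git ls-files -z` and test membership in a set. This is only an illustration and not what the commit does (git-annex is written in Haskell, and the commit restricts the check to the initial scan); the function names `parse_ls_files`, `tracked_files`, and `needs_annexing` are hypothetical.

```python
import subprocess


def parse_ls_files(output):
    """Split NUL-separated `git ls-files -z` output into a set of paths."""
    return {
        entry.decode("utf-8", "surrogateescape")
        for entry in output.split(b"\0")
        if entry  # skip the empty piece after the final NUL
    }


def tracked_files(repo_dir):
    """Run `git ls-files -z` once for the whole work tree (hypothetical helper)."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir,
        check=True,
        stdout=subprocess.PIPE,
    )
    return parse_ls_files(result.stdout)


def needs_annexing(path, tracked):
    """A file found during the scan is a candidate only if git does not track it."""
    return path not in tracked
```

A scan would call `tracked_files` once and then `needs_annexing` per file. The set goes stale as soon as the index changes, so a sketch like this has the same limitation the post describes for merges and for later modifications of regular files.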