author Joey Hess <joey@kitenet.net> 2012-06-13 14:03:38 -0400
committer Joey Hess <joey@kitenet.net> 2012-06-13 14:03:38 -0400
commit 24da48816d936316753be6cabfe83e3e51af647b
tree ea776c216e9786c68edb40d6917a364ce8d59582
parent 59a7b3a51a7cdfb8528ebc44a26a7577f28254d4
parent c1566757972e4774476a10bc6f865a88823e8938
Merge branch 'master' into watch
-rw-r--r-- doc/design/assistant/blog/day_7__bugfixes.mdwn | 45
-rw-r--r-- doc/design/assistant/blog/day_7__bugfixes/profile.png | bin 0 -> 47098 bytes
-rw-r--r-- doc/design/assistant/blog/day_7__bugfixes/profile2.png | bin 0 -> 230937 bytes
-rw-r--r-- doc/design/assistant/inotify.mdwn | 116
4 files changed, 111 insertions, 50 deletions
diff --git a/doc/design/assistant/blog/day_7__bugfixes.mdwn b/doc/design/assistant/blog/day_7__bugfixes.mdwn
new file mode 100644
index 000000000..3704969e3
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes.mdwn
@@ -0,0 +1,45 @@
+Kickstarter is over. Yay!
+
+Today I worked on the bug where `git annex watch` turned regular files
+that were already checked into git into symlinks. So I made it check
+if a file is already in git before trying to add it to the annex.
+
+The tricky part was doing this check quickly. Unless I want to write my
+own git index parser (or use one from Hackage), this check requires running
+`git ls-files`, once per file to be added. That won't fly if a huge
+tree of files is being moved or unpacked into the watched directory.
+
+Instead, I made it only do the check during `git annex watch`'s initial
+scan of the tree. This should be ok, because once it's running, you
+won't be adding new files to git anyway, since it'll automatically annex
+new files. This is good enough for now, but there are at least two problems
+with it:
+
+* Someone might `git merge` in a branch that has some regular files,
+ and it would add the merged in files to the annex.
+* Once `git annex watch` is running, if you modify a file that was
+ checked into git as a regular file, the new version will be added
+ to the annex.
+
+I'll probably come back to this issue, and may well find myself directly
+querying git's index.
+
+---
+
+I've started work to fix the memory leak I see when running `git annex
+watch` in a large repository (40 thousand files). As always with a Haskell
+memory leak, I crack open [Real World Haskell's chapter on profiling](http://book.realworldhaskell.org/read/profiling-and-optimization.html).
+
+Eventually this yields a nice graph of the problem:
+
+[[!img profile.png alt="memory profile"]]
+
+So, looks like a few minor memory leaks, and one huge leak. Stared
+at this for a while and trying a few things, and got a much better result:
+
+[[!img profile2.png alt="memory profile"]]
+
+I may come back later and try to improve this further, but it's not bad memory
+usage. But, it's still rather slow to start up in such a large repository,
+and its initial scan is still doing too much work. I need to optimize
+more..
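
A note on the `git ls-files` cost described in the blog post above: one way to avoid spawning a process per file is to hand git a whole batch of paths in a single invocation. The following is a minimal Python sketch of that idea. It is not what this commit does (the post says the check is limited to the initial scan), and the helper names `parse_ls_files`, `run_git_ls_files` and `split_tracked` are hypothetical. The only git features assumed are `git ls-files -z` and the global `--literal-pathspecs` option.

```python
import subprocess
from typing import Callable, Iterable, List, Set, Tuple


def parse_ls_files(raw: bytes) -> Set[str]:
    """Split NUL-separated `git ls-files -z` output into a set of paths."""
    return {p.decode("utf-8", "surrogateescape") for p in raw.split(b"\0") if p}


def run_git_ls_files(paths: List[str]) -> bytes:
    """Ask git, in one process, which of the given paths are in the index.

    --literal-pathspecs stops git from treating characters such as `*`
    in a filename as glob patterns.
    """
    cmd = ["git", "--literal-pathspecs", "ls-files", "-z", "--", *paths]
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout


def split_tracked(
    paths: Iterable[str],
    ls_files: Callable[[List[str]], bytes] = run_git_ls_files,
) -> Tuple[Set[str], Set[str]]:
    """Partition paths into (already in git, not in git) with one query."""
    candidates = list(paths)
    if not candidates:
        return set(), set()
    tracked = parse_ls_files(ls_files(candidates))
    wanted = set(candidates)
    return wanted & tracked, wanted - tracked
```

Paths are assumed to be relative to the directory git runs in, since that is how `git ls-files` reports them. A very large batch would also have to be split into chunks to stay under the operating system's argument-length limit, which this sketch does not attempt.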
diff --git a/doc/design/assistant/blog/day_7__bugfixes/profile.png b/doc/design/assistant/blog/day_7__bugfixes/profile.png
new file mode 100644
index 000000000..702af1ca0
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes/profile.png
Binary files differ
diff --git a/doc/design/assistant/blog/day_7__bugfixes/profile2.png b/doc/design/assistant/blog/day_7__bugfixes/profile2.png
new file mode 100644
index 000000000..4d487d02a
--- /dev/null
+++ b/doc/design/assistant/blog/day_7__bugfixes/profile2.png
Binary files differ
diff --git a/doc/design/assistant/inotify.mdwn b/doc/design/assistant/inotify.mdwn
index 9fe6938c4..60c598673 100644
--- a/doc/design/assistant/inotify.mdwn
+++ b/doc/design/assistant/inotify.mdwn
@@ -1,44 +1,53 @@
Finish "git annex watch" command, which runs, in the background, watching via
inotify for changes, and automatically annexing new files, etc.
-There is a `watch` branch in git that adds such a command. To make this
-really useful, it needs to:
+There is a `watch` branch in git that adds the command.
-- on startup, add any files that have appeared since last run **done**
-- on startup, fix the symlinks for any renamed links **done**
-- on startup, stage any files that have been deleted since last run
- (seems to require a `git commit -a` on startup, or at least a
- `git add --update`, which will notice deleted files) **done**
-- notice new files, and git annex add **done**
-- notice renamed files, auto-fix the symlink, and stage the new file location
- **done**
-- handle cases where directories are moved outside the repo, and stop
- watching them **done**
-- when a whole directory is deleted or moved, stage removal of its
- contents from the index **done**
-- notice deleted files and stage the deletion
- (tricky; there's a race with add since it replaces the file with a symlink..)
- **done**
-- Gracefully handle when the default limit of 8192 inotified directories
- is exceeded. This can be tuned by root, so help the user fix it.
- **done**
-- periodically auto-commit staged changes (avoid autocommitting when
- lots of changes are coming in) **done**
-- coleasce related add/rm events for speed and less disk IO **done**
-- don't annex `.gitignore` and `.gitattributes` files **done**
-- run as a daemon **done**
-- tunable delays before adding new files, etc
+## known bugs
+
+* A process has a file open for write, another one closes it,
+ and so it's added. Then the first process modifies it.
+
+ Or, a process has a file open for write when `git annex watch` starts
+ up, it will be added to the annex. If the process later continues
+ writing, it will change content in the annex.
+
+ This changes content in the annex, and fsck will later catch
+ the inconsistency.
+
+ Possible fixes:
+
+ * Somehow track or detect if a file is open for write by any processes.
+ * Or, when possible, making a copy on write copy before adding the file
+ would avoid this.
+ * Or, as a last resort, make an expensive copy of the file and add that.
+ * Tracking file opens and closes with inotify could tell if any other
+ processes have the file open. But there are problems.. It doesn't
+ seem to differentiate between files opened for read and for write.
+ And there would still be a race after the last close and before it's
+ injected into the annex, where it could be opened for write again.
+ Would need to detect that and undo the annex injection or something.
+
+* If a file is checked into git as a normal file and gets modified
+ (or merged, etc), it will be converted into an annexed file.
+ See [[blog/day_7__bugfixes]]
+
+## todo
+
+- Support OSes other than Linux; it only uses inotify currently.
+ OSX and FreeBSD use the same mechanism, and there is a Haskell interface
+ for it,
+- Run niced and ioniced? Seems to make sense, this is a background job.
- configurable option to only annex files meeting certian size or
filename criteria
-- option to check files not meeting annex criteria into git directly
+- option to check files not meeting annex criteria into git directly,
+ automatically
- honor .gitignore, not adding files it excludes (difficult, probably
needs my own .gitignore parser to avoid excessive running of git commands
to check for ignored files)
- Possibly, when a directory is moved out of the annex location,
- unannex its contents.
-- Support OSes other than Linux; it only uses inotify currently.
- OSX and FreeBSD use the same mechanism, and there is a Haskell interface
- for it,
+ unannex its contents. (Does inotify tell us where the directory moved
+ to so we can access it?)
## the races
@@ -61,25 +70,6 @@ Many races need to be dealt with by this code. Here are some of them.
Fixed this problem; Now it hard links the file to a temp directory and
operates on the hard link, which is also made unwritable.
-* A process has a file open for write, another one closes it, and so it's
- added. Then the first process modifies it.
-
- **Currently unfixed**; This changes content in the annex, and fsck will
- later catch the inconsistency.
-
- Possible fixes:
-
- * Somehow track or detect if a file is open for write by any processes.
- * Or, when possible, making a copy on write copy before adding the file
- would avoid this.
- * Or, as a last resort, make an expensive copy of the file and add that.
- * Tracking file opens and closes with inotify could tell if any other
- processes have the file open. But there are problems.. It doesn't
- seem to differentiate between files opened for read and for write.
- And there would still be a race after the last close and before it's
- injected into the annex, where it could be opened for write again.
- Would need to detect that and undo the annex injection or something.
-
* File is added and then replaced with another file before the annex add
makes its symlink.
@@ -108,3 +98,29 @@ Many races need to be dealt with by this code. Here are some of them.
Not a problem; The removal event removes the old file from the index, and
the add event adds the new one.
+
+## done
+
+- on startup, add any files that have appeared since last run **done**
+- on startup, fix the symlinks for any renamed links **done**
+- on startup, stage any files that have been deleted since last run
+ (seems to require a `git commit -a` on startup, or at least a
+ `git add --update`, which will notice deleted files) **done**
+- notice new files, and git annex add **done**
+- notice renamed files, auto-fix the symlink, and stage the new file location
+ **done**
+- handle cases where directories are moved outside the repo, and stop
+ watching them **done**
+- when a whole directory is deleted or moved, stage removal of its
+ contents from the index **done**
+- notice deleted files and stage the deletion
+ (tricky; there's a race with add since it replaces the file with a symlink..)
+ **done**
+- Gracefully handle when the default limit of 8192 inotified directories
+ is exceeded. This can be tuned by root, so help the user fix it.
+ **done**
+- periodically auto-commit staged changes (avoid autocommitting when
+ lots of changes are coming in) **done**
+- coleasce related add/rm events for speed and less disk IO **done**
+- don't annex `.gitignore` and `.gitattributes` files **done**
+- run as a daemon **done**
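
On the "done" item about the inotify directory limit: a rough illustration of how a tool could notice the limit in advance and tell the user what to change. This is a hedged Python sketch, not git-annex's implementation. It assumes one watch per directory, that the limit is exposed on Linux at `/proc/sys/fs/inotify/max_user_watches`, and that root can raise it with `sysctl`. The function names are hypothetical.

```python
import os
from typing import Optional

# Assumed location of the per-user inotify watch limit on Linux.
WATCH_LIMIT_FILE = "/proc/sys/fs/inotify/max_user_watches"


def read_watch_limit(path: str = WATCH_LIMIT_FILE) -> Optional[int]:
    """Return the current watch limit, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def count_directories(root: str) -> int:
    """Count root and every directory below it (symlinks are not followed)."""
    return sum(1 for _ in os.walk(root))


def watch_limit_hint(root: str, limit: Optional[int] = None) -> Optional[str]:
    """Return advice for the user if watching root would exceed the limit."""
    if limit is None:
        limit = read_watch_limit()
    needed = count_directories(root)
    if limit is None or needed <= limit:
        return None
    return (
        f"{root} contains {needed} directories but the inotify limit is "
        f"{limit}; as root, try: sysctl -w fs.inotify.max_user_watches={needed}"
    )
```

Calling `watch_limit_hint(".")` in a repository returns `None` when everything fits, and otherwise a message naming the sysctl to raise. Other processes owned by the same user also consume watches, so the suggested value is only a lower bound.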