author Joey Hess <joey@kitenet.net> 2012-06-13 19:26:22 -0400
committer Joey Hess <joey@kitenet.net> 2012-06-13 19:26:22 -0400
commit 6be8cc18024ffa49545a03cbdd977f00a6fc0c2f
tree 5c6e4844bcf31d3555e5f9ca21ab67a018e63d4d
parent e7bb454bedb84dda27ae8d82bf8be0063c403108
blog for the day
-rw-r--r-- doc/design/assistant/blog/day_8__speed.mdwn 67
1 files changed, 67 insertions, 0 deletions
diff --git a/doc/design/assistant/blog/day_8__speed.mdwn b/doc/design/assistant/blog/day_8__speed.mdwn
new file mode 100644
index 000000000..52c4de7a2
--- /dev/null
+++ b/doc/design/assistant/blog/day_8__speed.mdwn
@@ -0,0 +1,67 @@
+Since last post, I've worked on speeding up `git annex watch`'s startup time
+in a large repository.
+
+The problem was that its initial scan was naively staging every symlink in
+the repository, even though most of them are, presumably, staged correctly
+already. This was done in case the user copied or moved some symlinks
+around while `git annex watch` was not running -- we want to notice and
+commit such changes at startup.
+
+Since the scan already has the `stat` info for each symlink, it can look at the
+`ctime` to see if the symlink was made recently, and only stage it if so.
+This sped up startup in my big repo from longer than I cared to wait (10+
+minutes, or half an hour while profiling) to a minute or so. Of course,
+inotify events are already serviced during startup, so making it scan
+quickly is really only important so people don't think it's a resource hog.
+First impressions are important. :)
+
+But what does "made recently" mean exactly? Well, my answer is possibly
+overengineered, but most of it is really groundwork for things I'll need
+later anyway. I added a new data structure for tracking the status of the
+daemon, which is periodically written to disk by another thread (thread #6!)
+to `.git/annex/daemon.status`. Currently it looks like this; I anticipate
+adding lots more info as I move into the [[syncing]] stage:
+
+    lastRunning:1339610482.47928s
+    scanComplete:True
+
+So, only symlinks made after the daemon was last running need to be
+expensively staged on startup. As RichiH pointed out, though,
+this fails if the clock is changed. But I have been planning to have a
+cleanup thread anyway that will handle this and other
+potential problems, so I think that's ok.
+
+Judging by an strace of its startup scan, it's fairly tight now. There are some repeated
+`getcwd` syscalls that could be optimised out for a minor speedup.
+
+----
+
+Added the sanity check thread. Thread #8! It currently only does one sanity
+check per day, but the sanity check is a fairly lightweight job,
+so I may make it run more frequently. OTOH, it may never ever find a
+problem, so once per day seems a good compromise.
+
+Currently it's only checking that all files in the tree are properly staged
+in git. I might make it run `git annex fsck` later, but fscking the whole tree
+once per day is a bit much. Perhaps it should only fsck a few files per
+day? TBD
+
+Currently any problems found in the sanity check are just fixed and logged.
+It would be good to do something about getting problems that might indicate
+bugs fed back to me, in a privacy-respecting way. TBD
+
+----
+
+I also refactored the code, which was getting far too large to all be in
+one module.
+
+I have been thinking about renaming `git annex watch` to `git annex assistant`,
+but I think I'll leave the command name as-is. Some users might
+want a simple watcher and stager, without the assistant's other features
+like syncing and the webapp. So the next stage of the
+[[roadmap|design/assistant]] will be a different command that also runs
+`watch`.
+
+At this point, I feel I'm done with the first phase of [[inotify]].
+It has a couple of known bugs, but it's ready for brave beta testers to try.
+I trust it enough to be running it on my live data.
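
The startup check described in the post (stage a symlink only if its `ctime` is later than the `lastRunning` value saved in `.git/annex/daemon.status`) can be sketched roughly as below. This is an illustrative Python sketch, not git-annex's actual Haskell code. The function names `parse_daemon_status`, `last_running_time` and `needs_staging` are hypothetical, and reading the trailing `s` as "seconds since the Unix epoch" is an assumption based on the sample file shown in the post.

```python
import os


def parse_daemon_status(text):
    """Parse 'key:value' lines, like the sample in the post, into a dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        status[key] = value
    return status


def last_running_time(status):
    """Return lastRunning as float seconds (assumed Unix epoch, 's' suffix dropped)."""
    return float(status["lastRunning"].rstrip("s"))


def needs_staging(path, last_running):
    """True if the symlink itself changed after the daemon was last running."""
    # lstat rather than stat: we want the symlink's own ctime, not its
    # target's, and the target may not exist at all.
    return os.lstat(path).st_ctime > last_running


if __name__ == "__main__":
    sample = "lastRunning:1339610482.47928s\nscanComplete:True\n"
    status = parse_daemon_status(sample)
    print(last_running_time(status))
```

With a check like this, the scan still walks the whole tree, but only the symlinks that pass `needs_staging` pay the cost of being staged. As the post notes, the comparison depends on the wall clock, so a clock change can make it miss symlinks; the sketch does nothing to guard against that.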