author | 2016-09-14 12:18:48 -0400 | |
---|---|---|
committer | 2016-09-14 12:18:48 -0400 | |
commit | 9e850e61cf6cea296d118e77fc7188a0794cc354 (patch) | |
tree | 98550ba9ca9b910d2d53723cb1f0ed1d02489c7d /doc/todo/make_copy_--fast__faster | |
parent | c78821314c9dfbfab91a365d06418feb6f70b795 (diff) |
comment
Diffstat (limited to 'doc/todo/make_copy_--fast__faster')
-rw-r--r-- | doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment | 58 |
1 files changed, 58 insertions, 0 deletions
diff --git a/doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment b/doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment
new file mode 100644
index 000000000..c47132591
--- /dev/null
+++ b/doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment
@@ -0,0 +1,58 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 7"""
+ date="2016-09-14T15:28:23Z"
+ content="""
+First, note that git-annex 6.20160619 sped up the git-annex
+command startup time significantly. Please be sure to use a current
+version in benchmarks, and state the version.
+
+`git archive` (and `git cat-file --batch --batch-all-objects`) are just
+reading packs and loose objects in disk order and dumping out the contents.
+`git cat-file --batch` has to look up objects in the pack index files, seek
+in the pack, etc. It's not a fair comparison.
+
+Note that `git annex find`, when used without options like --in or --copies,
+does not need to read anything from `git cat-file` at all. The
+`GIT_TRACE_PERFORMANCE` you show is misleading; it's just showing how long
+the git command is left running, idle.
+
+`git annex find`'s overhead should be purely traversing the filesystem tree
+and checking what symlinks point to files. You can write programs that do
+the same thing without using git at all (or only `git ls-files`), and
+compare them to git-annex's time; that would be a fairer comparison.
+Ideally, `git annex find` would be entirely system call bound and would use
+very little CPU itself.
+
+By contrast, `git annex copy` makes significant use of `git cat-file --batch`,
+since it needs to look up location log information to see if the
+--to/--from remote has the files.
+
+`git annex copy -J` already parallelizes the parts of the code that look at
+the location log, including spinning up a separate `git cat-file --batch`
+process for each thread, so they won't contend on such queries. So I
+would expect that to make it faster, even leaving aside the speed benefits
+of doing the actual copies in parallel.
+
+My feeling is that the best way to speed these up is going to be in one
+of these classes:
+
+* It's possible that `git cat-file --batch` is somehow slower than it needs
+  to be. Perhaps it's not doing good caching between queries or has
+  inefficient serialization/bad stdio buffering. It might just be the case
+  that using something like libgit2 instead would be faster.
+  (Due to libgit2's poor interface stability, it would have to be an
+  optional build flag.)
+
+* Many small optimisations to the code. The use of Strings throughout
+  git-annex could well be a source of systematic small inefficiencies,
+  and using ByteString might eliminate those. (But this would be a huge job.)
+  (The `git cat-file --batch` communication is already done using
+  bytestrings.)
+
+* A completely lateral move. For example, if git-annex kept its own
+  database recording which files are present, then `git annex find`
+  could do a simple database query and not need to chase all the symlinks.
+  But such a database needs to somehow be kept in sync or reconciled
+  with the git index; it's not an easy thing.
+"""]]
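
The comment suggests comparing `git annex find` against a program that does the same filesystem work without using git. A minimal sketch of such a baseline follows; it is not part of the commit. It assumes annexed files are symlinks whose target contains `.git/annex/objects/`, and that a file counts as present when the symlink resolves. The names `find_present_annexed` and `time_baseline` are hypothetical, chosen for illustration.

```python
import os
import time


def find_present_annexed(root):
    """Yield work-tree paths (relative to root) of symlinks that look
    annexed and whose target currently exists on disk."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the repository's own .git directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            # Assumption: annexed files point into .git/annex/objects/.
            if ".git/annex/objects/" not in os.readlink(path):
                continue
            # os.path.exists follows the link, so a dangling symlink
            # (content not present) is skipped here.
            if os.path.exists(path):
                yield os.path.relpath(path, root)


def time_baseline(root):
    """Return (number of present annexed files, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in find_present_annexed(root))
    return count, time.perf_counter() - start
```

Running `time_baseline(".")` at the top of a repository and comparing the elapsed time with a wall-clock timing of `git annex find` on the same tree gives a rough lower bound for the traversal cost. Both runs should start from the same cache state (both cold or both warm), since the work is dominated by system calls.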