[[!comment format=mdwn username="joey" subject="""comment 7""" date="2016-09-14T15:28:23Z" content=""" First, note that git-annex 6.20160619 sped up the git-annex command startup time significantly. Please be sure to use a current version in benchmarks, and state the version. `git archive` (and `git cat-file --batch --batch-all-objects`) are just reading packs and loose objects in disk order and dumping out the contents. `git cat-file --batch` has to look up objects in the pack index files, seek in the pack, etc. It's not a fair comparison. Note that `git annex find`, when used without options like --in or --copies, does not need to read anything from `git cat-file` at all. The `GIT_TRACE_PERFORMANCE` you show is misleading; it's just showing how long the git command is left running, idle. `git annex find`'s overhead should be purely traversing the filesystem tree and checking what symlinks point to files. You can write programs that do the same thing without using git at all (or only `git ls-files`), and compare them to git-annex's time; that would be a fairer comparison. Ideally, `git annex find` would be entirely system call bound and would use very little CPU itself. By contrast, `git annex copy` makes significant use of `git cat-file --batch`, since it needs to look up location log information to see if the --to/--from remote has the files. `git annex copy -J` already parallelizes the parts of the code that look at the location log. Including spinning up a separate `git cat-file --batch` processes for each thread, so they won't contend on such queries. So I would expect that to make it faster, even leaving aside the speed benefits of doing the actual copies in parallel. My feeling is that the best way to speed these up is going to be in one of these classes: * It's possible that `git cat-file --batch` is somehow slower than it needs to be. Perhaps it's not doing good caching between queries or has inneficient seralization/bad stdio buffering. It might just be the case that using something like libgit2 instead would be faster. (Due to libgit2's poor interface stability, it would have to be an optional build flag.) * Many small optimisations to the code. The use of Strings throughout git-annex could well be a source of systematic small innefficiences, and using ByteString might eliminate those. (But this would be a huge job.) (The `git cat-file --batch` communication is already done using bytestrings.) * A completely lateral move. For example, if git-annex kept its own database recording which files are present, then `git annex find` could do a simple database query and not need to chase all the symlinks. But such a database needs to somehow be kept in sync or reconciled with the git index, it's not an easy thing. """]]