author Joey Hess <joeyh@joeyh.name> 2015-04-02 00:19:49 -0400
committer Joey Hess <joeyh@joeyh.name> 2015-04-02 00:19:49 -0400
commit 5492d8f4ddbd398e0188da9daed840908d1198c0
tree 8a4642c9f4ff1e6f7629c2ba78fb20fc011f1cdb
parent 48af4b01cd00f6f470aeb1f56f258ce2eeb060c4
Significantly sped up processing of large numbers of directories passed to a single git-annex command.
-rw-r--r-- Utility/Path.hs 6
-rw-r--r-- debian/changelog 2
-rw-r--r-- doc/bugs/feeding_git_annex_with_xargs_can_fail.mdwn 4
3 files changed, 10 insertions, 2 deletions
diff --git a/Utility/Path.hs b/Utility/Path.hs
index 755436448..ceecf3829 100644
--- a/Utility/Path.hs
+++ b/Utility/Path.hs
@@ -170,17 +170,19 @@ prop_relPathDirToFile_regressionTest = same_dir_shortcurcuits_at_difference
== joinPath ["..", "..", "..", "..", ".git", "annex", "objects", "18", "gk", "SHA256-foo", "SHA256-foo"]
{- Given an original list of paths, and an expanded list derived from it,
- - generates a list of lists, where each sublist corresponds to one of the
+ - partitions the expanded list, so that sublist corresponds to one of the
- original paths. When the original path is a directory, any items
- in the expanded list that are contained in that directory will appear in
- its segment.
+ -
+ - The expanded list must have the same ordering as the original list.
-}
segmentPaths :: [FilePath] -> [FilePath] -> [[FilePath]]
segmentPaths [] new = [new]
segmentPaths [_] new = [new] -- optimisation
segmentPaths (l:ls) new = found : segmentPaths ls rest
where
- (found, rest)=partition (l `dirContains`) new
+ (found, rest) = break (\p -> not (l `dirContains` p)) new
{- This assumes that it's cheaper to call segmentPaths on the result,
- than it would be to run the action separately with each path. In
diff --git a/debian/changelog b/debian/changelog
index 6f9ab15f7..dbea98877 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -22,6 +22,8 @@ git-annex (5.20150328) UNRELEASED; urgency=medium
corresponding to duplicated files they process.
* fsck: Added --distributed and --expire options,
for distributed fsck.
+ * Significantly sped up processing of large numbers of directories
+ passed to a single git-annex command.
-- Joey Hess <id@joeyh.name> Fri, 27 Mar 2015 16:04:43 -0400
diff --git a/doc/bugs/feeding_git_annex_with_xargs_can_fail.mdwn b/doc/bugs/feeding_git_annex_with_xargs_can_fail.mdwn
index 748fb1e55..729d0d2fb 100644
--- a/doc/bugs/feeding_git_annex_with_xargs_can_fail.mdwn
+++ b/doc/bugs/feeding_git_annex_with_xargs_can_fail.mdwn
@@ -9,3 +9,7 @@ Feeding git-annex a long list off directories, eg with xargs can have
git-ls-files results. There is probably an exponential blowup in the time
relative to the number of parameters. Some of the stuff being done to
preserve original ordering etc is likely at fault.
+
+ > I think I've managed to speed this up something like
+ > 1000x or some such. segmentPaths on an utterly insane list of 6 million
+ > files now runs in about 10 seconds. --[[Joey]]
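
To illustrate why replacing `partition` with `break` helps, and why the expanded list must now follow the ordering of the original list, here is a minimal standalone sketch. It is not the git-annex code: `dirContainsSimple` is a hypothetical stand-in for `dirContains` that only does a string prefix check, and the names `segmentPathsOld` and `segmentPathsNew` are made up for the comparison. With `partition`, every original path scans the whole remaining expanded list, so the work grows with the product of the two list lengths. With `break`, each call stops at the first item outside the directory, so the expanded list is traversed roughly once overall.

```haskell
import Data.List (isPrefixOf, partition)

-- Hypothetical, simplified stand-in for git-annex's dirContains:
-- a path is contained in a directory if it is that directory or
-- lies beneath it. No path normalisation is attempted here.
dirContainsSimple :: FilePath -> FilePath -> Bool
dirContainsSimple dir p = p == dir || (dir ++ "/") `isPrefixOf` p

-- Shape of the code before the commit: partition walks the entire
-- remaining list once per original path.
segmentPathsOld :: [FilePath] -> [FilePath] -> [[FilePath]]
segmentPathsOld [] new = [new]
segmentPathsOld [_] new = [new]
segmentPathsOld (l:ls) new = found : segmentPathsOld ls rest
  where
    (found, rest) = partition (l `dirContainsSimple`) new

-- Shape of the code after the commit: break stops at the first item
-- that is not inside the current directory. This is only correct when
-- the expanded list is ordered like the original list.
segmentPathsNew :: [FilePath] -> [FilePath] -> [[FilePath]]
segmentPathsNew [] new = [new]
segmentPathsNew [_] new = [new]
segmentPathsNew (l:ls) new = found : segmentPathsNew ls rest
  where
    (found, rest) = break (\p -> not (l `dirContainsSimple` p)) new

main :: IO ()
main = do
  let dirs = ["a", "b", "c"]
      ordered = ["a/1", "a/2", "b/1", "c/1", "c/2"]
      shuffled = ["b/1", "a/1"]
  -- On correctly ordered input both versions agree.
  print (segmentPathsNew dirs ordered)
  print (segmentPathsOld dirs ordered == segmentPathsNew dirs ordered)
  -- On out-of-order input the results differ, which is why the
  -- ordering requirement was added to the comment.
  print (segmentPathsOld dirs shuffled)
  print (segmentPathsNew dirs shuffled)
```

In this sketch, `break (\p -> not (l `dirContainsSimple` p))` behaves the same as `span (l `dirContainsSimple`)`; the commit's spelling is kept so the comparison with the diff stays direct.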