author Joey Hess <joeyh@joeyh.name> 2016-03-14 15:54:46 -0400
committer Joey Hess <joeyh@joeyh.name> 2016-03-14 15:54:46 -0400
commit 9b29bd39c8dbf23bdf6930b51aba13992ccc49de
tree 4cca35e79b28f6a27d7b688eb9d61d7e0c30aee1
parent 0e2dab692046d242ac68ebc8359493ca76ef51d1
followup
-rw-r--r-- Annex/HashObject.hs 54
-rw-r--r-- doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment 46
2 files changed, 100 insertions, 0 deletions
diff --git a/Annex/HashObject.hs b/Annex/HashObject.hs
new file mode 100644
index 000000000..aa8c2a174
--- /dev/null
+++ b/Annex/HashObject.hs
@@ -0,0 +1,54 @@
+{- git hash-object interface, with handle automatically stored in the Annex monad
+ -
+ - Copyright 2016 Joey Hess <id@joeyh.name>
+ -
+ - Licensed under the GNU GPL version 3 or higher.
+ -}
+
+module Annex.HashObject (
+ hashFile,
+ hashBlob,
+ hashObjectHandle,
+ hashObjectStop,
+) where
+
+import qualified Data.ByteString.Lazy as L
+import qualified Data.Map as M
+import System.PosixCompat.Types
+
+import Annex.Common
+import qualified Git
+import qualified Git.HashObject
+import qualified Annex
+import Git.Types
+import Git.FilePath
+import qualified Git.Ref
+import Annex.Link
+
+hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
+hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
+  where
+    startup = do
+        h <- inRepo $ Git.HashObject.hashObjectStart
+        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
+        return h
+
+hashObjectStop :: Annex ()
+hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
+  where
+    stop h = do
+        liftIO $ Git.HashObject.hashObjectStop h
+        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }
+
+hashFile :: FilePath -> Annex Sha
+hashFile f = do
+ h <- hashObjectHandle
+ liftIO $ Git.HashObject.hashFile h f
+
+{- Note that the content will be written to a temp file.
+ - So it may be faster to use Git.HashObject.hashObject for large
+ - blob contents. -}
+hashBlob :: String -> Annex Sha
+hashBlob content = do
+ h <- hashObjectHandle
+ liftIO $ Git.HashObject.hashBlob h content
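
The module above starts the hash-object helper on first use, caches its handle in the Annex state, and clears the cached handle on stop. A minimal standalone sketch of that start-once pattern follows. It is not git-annex code: `Helper`, `Cache`, `newCache`, `getHelper`, and `stopHelper` are hypothetical names, an `IORef` stands in for the Annex state, and a counter stands in for actually spawning a process.

```haskell
import Data.IORef

-- Hypothetical stand-in for a handle to a long-running helper process.
newtype Helper = Helper Int deriving (Eq, Show)

-- Hypothetical stand-in for the slice of monad state that caches the handle.
data Cache = Cache
  { cachedHelper :: IORef (Maybe Helper)
  , startCount   :: IORef Int  -- how many times the helper has been started
  }

newCache :: IO Cache
newCache = Cache <$> newIORef Nothing <*> newIORef 0

-- Return the cached helper, starting one only if none is cached yet.
getHelper :: Cache -> IO Helper
getHelper c = maybe startup return =<< readIORef (cachedHelper c)
  where
    startup = do
        modifyIORef' (startCount c) (+ 1)
        n <- readIORef (startCount c)
        let h = Helper n
        writeIORef (cachedHelper c) (Just h)
        return h

-- Drop the cached helper, if any, so the next getHelper starts a fresh one.
stopHelper :: Cache -> IO ()
stopHelper c = maybe (return ()) stop =<< readIORef (cachedHelper c)
  where
    stop _ = writeIORef (cachedHelper c) Nothing
```

In this sketch, calling `getHelper` twice on the same cache returns the same helper and increments `startCount` only once; after `stopHelper`, the next `getHelper` starts a new one.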
diff --git a/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment b/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment
new file mode 100644
index 000000000..4bd283aa2
--- /dev/null
+++ b/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment
@@ -0,0 +1,46 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2016-03-14T17:59:08Z"
+ content="""
+If I've done the math right, 5 seconds per file over 3 hours is only about 2000 files.
+The size of the files does matter, since git-annex has to read them all.
+You said the repo grew to 28 gb; does that mean you added 2000 files
+totalling 28 gb in size?
+
+I can add 2000 tiny files (5 bytes each) in 2 seconds on an SSD on Linux.
+
+By using a FAT filesystem, you've forced git-annex to use direct mode.
+Direct mode can be a little slower, but not a great deal. Adding 2000 files
+to a direct mode repo takes around 11 seconds here. (I did a little
+optimisation and sped that up to 7 seconds.)
+
+Doing the same benchmark on a removable USB stick with a FAT filesystem
+was still not slow; 7 seconds again.
+
+But then I had Linux mount that FAT filesystem sync (so it flushes each
+file write to disk rather than buffering it), and I started getting closer to
+your slow speed; the benchmark took 53 minutes.
+
+So, I think the slow speed you're seeing is quite likely due to a
+combination of, in order from most to least important:
+
+1. Synchronous writes to your disk drive. Fixable on Linux by, e.g., running
+ "mount -o remount,async /path/to/repo", and there's probably something
+ similar for OSX.
+2. External drive being slow to access. (And if a spinning disk, slow to
+ seek.)
+3. git-annex using direct mode on FAT
+
+Also, there is a fair amount of faff that git-annex does when adding a file,
+calling rename, stat, mkdir, etc. multiple times. It may be possible
+to optimise some of that to get some speedup on synchronous disks.
+But I'd not expect more than a few percentage points of speedup from such
+optimisation.
+
+One other possibility is that you're hitting an edge case where direct mode's
+performance is bad. One known such edge case is if you have a lot of files
+that all have the same content. For example, I made 2000 files that were
+all empty; adding them to a direct mode repository gets slower and slower
+to the point it's spending 10 or more seconds per file.
+"""]]
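
The figures in the comment can be checked with a little arithmetic: about 2000 files in 3 hours corresponds to roughly 5 seconds per file, and the 53-minute sync-mounted benchmark over 2000 files works out to about 1.6 seconds per file. A small sketch of that calculation follows; the function names are made up for illustration and are not part of git-annex.

```haskell
-- Number of files handled in the given number of hours
-- when each file takes the given number of whole seconds.
filesInHours :: Int -> Int -> Int
filesInHours hours secsPerFile = (hours * 3600) `div` secsPerFile

-- Average seconds per file for a run that took the given number of minutes.
secondsPerFile :: Double -> Int -> Double
secondsPerFile minutes files = minutes * 60 / fromIntegral files
```

Here `filesInHours 3 5` gives 2160, in line with the "only 2000 files" estimate, and `secondsPerFile 53 2000` gives about 1.59, which is of the same order as the reported slowdown.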