author | Joey Hess <joeyh@joeyh.name> | 2016-03-14 15:54:46 -0400 |
---|---|---|
committer | Joey Hess <joeyh@joeyh.name> | 2016-03-14 15:54:46 -0400 |
commit | 9b29bd39c8dbf23bdf6930b51aba13992ccc49de (patch) | |
tree | 4cca35e79b28f6a27d7b688eb9d61d7e0c30aee1 | |
parent | 0e2dab692046d242ac68ebc8359493ca76ef51d1 (diff) |
followup
-rw-r--r-- | Annex/HashObject.hs | 54 |
-rw-r--r-- | doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment | 46 |
2 files changed, 100 insertions, 0 deletions
diff --git a/Annex/HashObject.hs b/Annex/HashObject.hs
new file mode 100644
index 000000000..aa8c2a174
--- /dev/null
+++ b/Annex/HashObject.hs
@@ -0,0 +1,54 @@
+{- git hash-object interface, with handle automatically stored in the Annex monad
+ -
+ - Copyright 2016 Joey Hess <id@joeyh.name>
+ -
+ - Licensed under the GNU GPL version 3 or higher.
+ -}
+
+module Annex.HashObject (
+    hashFile,
+    hashBlob,
+    hashObjectHandle,
+    hashObjectStop,
+) where
+
+import qualified Data.ByteString.Lazy as L
+import qualified Data.Map as M
+import System.PosixCompat.Types
+
+import Annex.Common
+import qualified Git
+import qualified Git.HashObject
+import qualified Annex
+import Git.Types
+import Git.FilePath
+import qualified Git.Ref
+import Annex.Link
+
+hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
+hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
+  where
+    startup = do
+        h <- inRepo $ Git.HashObject.hashObjectStart
+        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
+        return h
+
+hashObjectStop :: Annex ()
+hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
+  where
+    stop h = do
+        liftIO $ Git.HashObject.hashObjectStop h
+        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }
+
+hashFile :: FilePath -> Annex Sha
+hashFile f = do
+    h <- hashObjectHandle
+    liftIO $ Git.HashObject.hashFile h f
+
+{- Note that the content will be written to a temp file.
+ - So it may be faster to use Git.HashObject.hashObject for large
+ - blob contents. -}
+hashBlob :: String -> Annex Sha
+hashBlob content = do
+    h <- hashObjectHandle
+    liftIO $ Git.HashObject.hashBlob h content
diff --git a/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment b/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment
new file mode 100644
index 000000000..4bd283aa2
--- /dev/null
+++ b/doc/bugs/__39__add__39___results_in_max_cpu__44___long_run_and_huge_repo/comment_1_3233c29405da296360d57af7d5eb418d._comment
@@ -0,0 +1,46 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2016-03-14T17:59:08Z"
+ content="""
+If I've done the math right, 5 seconds per file over 3 hours is only 2000 files.
+The size of the files does matter, since git-annex has to read them all.
+You said the repo grew to 28 gb; does that mean you added 2000 files
+totalling 28 gb in size?
+
+I can add 2000 tiny files (5 bytes each) in 2 seconds on a SSD on Linux.
+
+By using a FAT filesystem, you've forced git-annex to use direct mode.
+Direct mode can be a little slower, but not a great deal. Adding 2000 files
+to a direct mode repo takes around 11 seconds here. (I did a little
+optimisation and sped that up to 7 seconds.)
+
+Doing the same benchmark on a removable USB stick with a FAT filesystem
+was still not slow; 7 seconds again.
+
+But then I had linux mount that FAT filesystem sync (so, it flushes each
+file write to disk, not buffering them), and I start getting closer to your
+slow speed; benchmark took 53 minutes.
+
+So, I think the slow speed you're seeing is quite likely due to a
+combination of, in order from most to least important:
+
+1. Synchronous writes to your disk drive. Fixable in linux by eg, running
+   "mount -o remount,async /path/to/repo" and there's probably something
+   similar for OSX.
+2. External drive being slow to access. (And if a spinning disk, slow to
+   seek.)
+3. git-annex using direct mode on FAT
+
+Also there is a fair amount of faff that git-annex does when adding a file
+around calling rename, stat, mkdir, etc multiple times. It may be possible
+to optimize some of that to get at some speedup on synchronous disks.
+But, I'd not expect more than a few percentage points speedup from such
+optimisation.
+
+One other possibility is you could be hitting an edge case where direct mode's
+performance is bad. One known such edge case is if you have a lot of files
+that all have the same content. For example, I made 2000 files that were
+all empty; adding them to a direct mode repository gets slower and slower
+to the point it's spending 10 or more seconds per file.
+"""]]
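
The new `Annex.HashObject` module caches a long-running `git hash-object` handle in the Annex state: it starts the process on first use and reuses it afterwards. The following is a minimal standalone sketch of that start-on-demand pattern, not git-annex's actual code. It assumes a hypothetical `HelperHandle` type and an `Env` record of `IORef`s standing in for the real `HashObjectHandle` and Annex state, and it only counts starts rather than spawning a process.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')

-- Hypothetical stand-in for a long-running helper process handle.
newtype HelperHandle = HelperHandle Int
  deriving (Eq, Show)

-- Hypothetical stand-in for the Annex state: the cached handle, plus a
-- counter recording how many times the helper has been started.
data Env = Env
  { cachedHandle :: IORef (Maybe HelperHandle)
  , startCount   :: IORef Int
  }

newEnv :: IO Env
newEnv = Env <$> newIORef Nothing <*> newIORef 0

-- Return the cached handle, starting the helper only when none is cached.
getHandle :: Env -> IO HelperHandle
getHandle env = maybe startup return =<< readIORef (cachedHandle env)
  where
    startup = do
      modifyIORef' (startCount env) (+ 1)
      n <- readIORef (startCount env)
      let h = HelperHandle n
      writeIORef (cachedHandle env) (Just h)
      return h

-- Stop the helper if one is cached and clear the cache; otherwise a no-op.
stopHandle :: Env -> IO ()
stopHandle env = maybe (return ()) stop =<< readIORef (cachedHandle env)
  where
    stop _ = writeIORef (cachedHandle env) Nothing
```

Calling `getHandle` twice on the same `Env` returns the same handle and increments `startCount` once. After `stopHandle`, the next `getHandle` starts a fresh helper, which mirrors how `hashObjectStop` resets the state field to `Nothing`.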