From 867caf3d7160926fa8cd83f331d93f60b3b2585a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 31 Oct 2017 13:13:40 -0400 Subject: close bug; copy benchmarking info to new page --- doc/benchmarking.mdwn | 3 ++ ...nt_10_1af4ac0d37c876912678522895c1656b._comment | 61 ++++++++++++++++++++++ ...nt_11_1ca8d9765e6e3a18ae09df74bc390a00._comment | 49 +++++++++++++++++ ...ent_8_c1f99493f5e5c362d5c39f048280b11b._comment | 45 ++++++++++++++++ ...ent_9_f4d802a28b79905da0cb24af6cb65b0a._comment | 42 +++++++++++++++ doc/todo/make_copy_--fast__faster.mdwn | 4 ++ 6 files changed, 204 insertions(+) create mode 100644 doc/benchmarking.mdwn create mode 100644 doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment create mode 100644 doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment create mode 100644 doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment create mode 100644 doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment diff --git a/doc/benchmarking.mdwn b/doc/benchmarking.mdwn new file mode 100644 index 000000000..68fc328d0 --- /dev/null +++ b/doc/benchmarking.mdwn @@ -0,0 +1,3 @@ +This page exsists to collect benchmarking data about git-annex, so it can +be referred to later. If you have a specific instance where git-annex seems +unncessarily slow, please file a bug report about it. diff --git a/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment b/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment new file mode 100644 index 000000000..868b10364 --- /dev/null +++ b/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment @@ -0,0 +1,61 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 10""" + date="2016-09-29T18:33:33Z" + content=""" +* Optimised key2file and file2key. 18% scanning time speedup. +* Optimised adjustGitEnv. 50% git-annex branch query speedup +* Optimised parsePOSIXTime. 10% git-annex branch query speedup +* Tried making catObjectDetails.receive use ByteString for parsing, + but that did not seem to speed it up significantly. + So it parsing is already fairly optimal, it's just that a + lot of data passes through it when querying the git-annex + branch. + +After all that, profiling `git-annex find`: + + Thu Sep 29 16:51 2016 Time and Allocation Profiling Report (Final) + + git-annex.1 +RTS -p -RTS find + + total time = 1.73 secs (1730 ticks @ 1000 us, 1 processor) + total alloc = 1,812,406,632 bytes (excludes profiling overheads) + + COST CENTRE MODULE %time %alloc + + md5 Data.Hash.MD5 28.0 37.9 + catchIO Utility.Exception 10.2 12.5 + inAnnex'.checkindirect Annex.Content 9.9 3.7 + catches Control.Monad.Catch 8.7 5.7 + readish Utility.PartialPrelude 5.7 3.0 + isAnnexLink Annex.Link 5.0 8.4 + keyFile Annex.Locations 4.2 5.8 + spanList Data.List.Utils 4.0 6.3 + startswith Data.List.Utils 2.0 1.3 + +And `git-annex find --not --in web`: + + Thu Sep 29 16:35 2016 Time and Allocation Profiling Report (Final) + + git-annex +RTS -p -RTS find --not --in web + + total time = 5.24 secs (5238 ticks @ 1000 us, 1 processor) + total alloc = 3,293,314,472 bytes (excludes profiling overheads) + + COST CENTRE MODULE %time %alloc + + catObjectDetails.receive Git.CatFile 12.9 5.5 + md5 Data.Hash.MD5 10.6 20.8 + readish Utility.PartialPrelude 7.3 8.2 + catchIO Utility.Exception 6.7 7.3 + spanList Data.List.Utils 4.1 7.4 + readFileStrictAnyEncoding Utility.Misc 3.5 1.3 + catches Control.Monad.Catch 3.3 3.2 + +So, quite a large speedup overall! + +This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching +it to cryptohash's implementation, but it would require quite a lot of +bit-banging math to pull the used values out of the ByteString containing +the md5sum. +"""]] diff --git a/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment b/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment new file mode 100644 index 000000000..619351d4c --- /dev/null +++ b/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment @@ -0,0 +1,49 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 11""" + date="2017-05-15T21:56:52Z" + content=""" +Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but +the overall runtime didn't change much. Memory allocations did go down by a +good amount. + +Updated profiles: + + git-annex +RTS -p -RTS find + + total time = 1.63 secs (1629 ticks @ 1000 us, 1 processor) + total alloc = 1,496,336,496 bytes (excludes profiling overheads) + + COST CENTRE MODULE SRC %time %alloc + + catchIO Utility.Exception Utility/Exception.hs:79:1-17 14.1 15.1 + inAnnex'.checkindirect Annex.Content Annex/Content.hs:(108,9)-(119,39) 10.6 4.8 + catches Control.Monad.Catch src/Control/Monad/Catch.hs:(432,1)-(436,76) 8.6 6.9 + spanList Data.List.Utils src/Data/List/Utils.hs:(150,1)-(155,36) 6.7 11.1 + isAnnexLink Annex.Link Annex/Link.hs:35:1-85 5.0 10.2 + keyFile Annex.Locations Annex/Locations.hs:(456,1)-(462,19) 5.0 7.0 + readish Utility.PartialPrelude Utility/PartialPrelude.hs:(48,1)-(50,20) 3.8 2.0 + startswith Data.List.Utils src/Data/List/Utils.hs:103:1-23 3.6 2.3 + splitc Utility.Misc Utility/Misc.hs:(52,1)-(54,25) 3.4 6.5 + s2w8 Data.Bits.Utils src/Data/Bits/Utils.hs:65:1-15 2.6 6.4 + keyPath Annex.Locations Annex/Locations.hs:(492,1)-(494,23) 2.5 4.4 + fileKey.unesc Annex.Locations Annex/Locations.hs:(469,9)-(474,39) 2.0 3.5 + copyAndFreeze Data.ByteArray.Methods Data/ByteArray/Methods.hs:(224,1)-(227,21) 1.8 0.5 + + git-annex +RTS -p -RTS find --not --in web + + total time = 5.33 secs (5327 ticks @ 1000 us, 1 processor) + total alloc = 2,908,489,000 bytes (excludes profiling overheads) + + COST CENTRE MODULE SRC %time %alloc + + catObjectDetails.\ Git.CatFile Git/CatFile.hs:(80,72)-(88,97) 7.8 2.8 + catchIO Utility.Exception Utility/Exception.hs:79:1-17 7.6 8.3 + spanList Data.List.Utils src/Data/List/Utils.hs:(150,1)-(155,36) 5.8 9.1 + readish Utility.PartialPrelude Utility/PartialPrelude.hs:(48,1)-(50,20) 4.5 4.0 + parseResp Git.CatFile Git/CatFile.hs:(113,1)-(124,28) 4.4 2.9 + readFileStrict Utility.Misc Utility/Misc.hs:33:1-59 3.7 1.6 + catches Control.Monad.Catch src/Control/Monad/Catch.hs:(432,1)-(436,76) 3.1 3.6 + encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(131,1)-(133,70) 3.1 2.3 + +"""]] diff --git a/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment b/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment new file mode 100644 index 000000000..e0f4987a0 --- /dev/null +++ b/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment @@ -0,0 +1,45 @@ +[[!comment format=mdwn + username="joey" + subject="""profiling""" + date="2016-09-26T19:20:36Z" + content=""" +Built git-annex with profiling, using `stack build --profile` + +(For reproduciblity, running git-annex in a clone of the git-annex repo +https://github.com/RichiH/conference_proceedings with rev +2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed +objects.) + +Profiling `git-annex find +RTS -p`: + + total time = 3.53 secs (3530 ticks @ 1000 us, 1 processor) + total alloc = 3,772,700,720 bytes (excludes profiling overheads) + + COST CENTRE MODULE %time %alloc + + spanList Data.List.Utils 32.6 37.7 + startswith Data.List.Utils 14.3 8.1 + md5 Data.Hash.MD5 12.4 18.2 + join Data.List.Utils 6.9 13.7 + catchIO Utility.Exception 5.9 6.0 + catches Control.Monad.Catch 5.0 2.8 + inAnnex'.checkindirect Annex.Content 4.6 1.8 + readish Utility.PartialPrelude 3.0 1.4 + isAnnexLink Annex.Link 2.6 4.0 + split Data.List.Utils 1.5 0.8 + keyPath Annex.Locations 1.2 1.7 + + +This is interesting! + +Fully 40% of CPU time and allocations are in list (really String) processing, +and the details of the profiling report show that `spanList` and `startsWith` +and `join` are all coming from calls to `replace` in `keyFile` and `fileKey`. +Both functions nest several calls to replace, so perhaps that could be unwound +into a single pass and/or a ByteString used to do it more efficiently. + +12% of run time is spent calculating the md5 hashes for the hash +directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and +it is probably a quite unoptimised version. Switching to the version +if cryptonite would probably speed it up a lot. +"""]] diff --git a/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment b/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment new file mode 100644 index 000000000..9692ad2d7 --- /dev/null +++ b/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment @@ -0,0 +1,42 @@ +[[!comment format=mdwn + username="joey" + subject="""more profiling""" + date="2016-09-26T19:59:43Z" + content=""" +Instead of profiling `git annex copy --to remote`, I profiled `git annex +find --not --in web`, which needs to do the same kind of location log lookup. + + total time = 12.41 secs (12413 ticks @ 1000 us, 1 processor) + total alloc = 8,645,057,104 bytes (excludes profiling overheads) + + COST CENTRE MODULE %time %alloc + + adjustGitEnv Git.Env 21.4 37.0 + catchIO Utility.Exception 13.2 2.8 + spanList Data.List.Utils 12.6 17.9 + parsePOSIXTime Logs.TimeStamp 6.1 5.0 + catObjectDetails.receive Git.CatFile 5.9 2.1 + startswith Data.List.Utils 5.7 3.8 + md5 Data.Hash.MD5 5.1 7.9 + join Data.List.Utils 2.4 6.0 + readFileStrictAnyEncoding Utility.Misc 2.2 0.5 + +The adjustGitEnv overhead is a surprise! It seems it is getting called once +per file, and allocating a new copy of the environment each time. Call stack: +withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv. +Looks like simply making gitEnv be cached at startup would avoid most of +the adjustGitEnv slowdown. + +(The catchIO overhead is a false reading; the detailed profile shows +that all its time and allocations are inherited. getAnnexLinkTarget +is running catchIO in the expensive case, so readSymbolicLink is +the actual expensive bit.) + +The parsePOSIXTime comes from reading location logs. It's implemented +using a generic Data.Time.Format.parseTime, which uses a format string +"%s%Qs". A custom parser that splits into seconds and picoseconds +and simply reads both numbers might be more efficient. + +catObjectDetails.receive is implemented using mostly String and could +probably be sped up by being converted to use ByteString. +"""]] diff --git a/doc/todo/make_copy_--fast__faster.mdwn b/doc/todo/make_copy_--fast__faster.mdwn index 2aea94669..4731806ea 100644 --- a/doc/todo/make_copy_--fast__faster.mdwn +++ b/doc/todo/make_copy_--fast__faster.mdwn @@ -1,3 +1,7 @@ I was trying to copy files which failed to copy (3 out of 6,000) to remote host after copy -J4. Succeeded. But with subsequent runs, apparently even with copy --fast it takes annex 10 sec for annex to realize there is nothing to copy. git ls-files which annex calls returns list immediately, so it is really some parsing/access to data under git-annex branch which takes awhile. I think we had similar discussion before but couldn't find. So I wondered to whine again to see if some optimization is possible to make --fast copies faster, especially whenever there is nothing to copy. [[!meta author=yoh]] + +> closing as yoh is happy. [[done]]. Note that I copied benechmarking +> related comments to the [[/benchmarking]] page for future reference. +> --[[Joey]] -- cgit v1.2.3