aboutsummaryrefslogtreecommitdiff
path: root/doc/benchmarking
diff options
context:
space:
mode:
authorGravatar Joey Hess <joeyh@joeyh.name>2017-10-31 13:13:40 -0400
committerGravatar Joey Hess <joeyh@joeyh.name>2017-10-31 13:13:40 -0400
commit867caf3d7160926fa8cd83f331d93f60b3b2585a (patch)
treeb504ee243ff2811351aa6286ba2c0b953183c7ef /doc/benchmarking
parentcfb23a73fd03627a52deda125b795dd1b86bef09 (diff)
close bug; copy benchmarking info to new page
Diffstat (limited to 'doc/benchmarking')
-rw-r--r--doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment61
-rw-r--r--doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment49
-rw-r--r--doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment45
-rw-r--r--doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment42
4 files changed, 197 insertions, 0 deletions
diff --git a/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment b/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment
new file mode 100644
index 000000000..868b10364
--- /dev/null
+++ b/doc/benchmarking/comment_10_1af4ac0d37c876912678522895c1656b._comment
@@ -0,0 +1,61 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 10"""
+ date="2016-09-29T18:33:33Z"
+ content="""
+* Optimised key2file and file2key. 18% scanning time speedup.
+* Optimised adjustGitEnv. 50% git-annex branch query speedup
+* Optimised parsePOSIXTime. 10% git-annex branch query speedup
+* Tried making catObjectDetails.receive use ByteString for parsing,
+ but that did not seem to speed it up significantly.
+ So it parsing is already fairly optimal, it's just that a
+ lot of data passes through it when querying the git-annex
+ branch.
+
+After all that, profiling `git-annex find`:
+
+ Thu Sep 29 16:51 2016 Time and Allocation Profiling Report (Final)
+
+ git-annex.1 +RTS -p -RTS find
+
+ total time = 1.73 secs (1730 ticks @ 1000 us, 1 processor)
+ total alloc = 1,812,406,632 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE %time %alloc
+
+ md5 Data.Hash.MD5 28.0 37.9
+ catchIO Utility.Exception 10.2 12.5
+ inAnnex'.checkindirect Annex.Content 9.9 3.7
+ catches Control.Monad.Catch 8.7 5.7
+ readish Utility.PartialPrelude 5.7 3.0
+ isAnnexLink Annex.Link 5.0 8.4
+ keyFile Annex.Locations 4.2 5.8
+ spanList Data.List.Utils 4.0 6.3
+ startswith Data.List.Utils 2.0 1.3
+
+And `git-annex find --not --in web`:
+
+ Thu Sep 29 16:35 2016 Time and Allocation Profiling Report (Final)
+
+ git-annex +RTS -p -RTS find --not --in web
+
+ total time = 5.24 secs (5238 ticks @ 1000 us, 1 processor)
+ total alloc = 3,293,314,472 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE %time %alloc
+
+ catObjectDetails.receive Git.CatFile 12.9 5.5
+ md5 Data.Hash.MD5 10.6 20.8
+ readish Utility.PartialPrelude 7.3 8.2
+ catchIO Utility.Exception 6.7 7.3
+ spanList Data.List.Utils 4.1 7.4
+ readFileStrictAnyEncoding Utility.Misc 3.5 1.3
+ catches Control.Monad.Catch 3.3 3.2
+
+So, quite a large speedup overall!
+
+This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching
+it to cryptohash's implementation, but it would require quite a lot of
+bit-banging math to pull the used values out of the ByteString containing
+the md5sum.
+"""]]
diff --git a/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment b/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment
new file mode 100644
index 000000000..619351d4c
--- /dev/null
+++ b/doc/benchmarking/comment_11_1ca8d9765e6e3a18ae09df74bc390a00._comment
@@ -0,0 +1,49 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 11"""
+ date="2017-05-15T21:56:52Z"
+ content="""
+Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but
+the overall runtime didn't change much. Memory allocations did go down by a
+good amount.
+
+Updated profiles:
+
+ git-annex +RTS -p -RTS find
+
+ total time = 1.63 secs (1629 ticks @ 1000 us, 1 processor)
+ total alloc = 1,496,336,496 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE SRC %time %alloc
+
+ catchIO Utility.Exception Utility/Exception.hs:79:1-17 14.1 15.1
+ inAnnex'.checkindirect Annex.Content Annex/Content.hs:(108,9)-(119,39) 10.6 4.8
+ catches Control.Monad.Catch src/Control/Monad/Catch.hs:(432,1)-(436,76) 8.6 6.9
+ spanList Data.List.Utils src/Data/List/Utils.hs:(150,1)-(155,36) 6.7 11.1
+ isAnnexLink Annex.Link Annex/Link.hs:35:1-85 5.0 10.2
+ keyFile Annex.Locations Annex/Locations.hs:(456,1)-(462,19) 5.0 7.0
+ readish Utility.PartialPrelude Utility/PartialPrelude.hs:(48,1)-(50,20) 3.8 2.0
+ startswith Data.List.Utils src/Data/List/Utils.hs:103:1-23 3.6 2.3
+ splitc Utility.Misc Utility/Misc.hs:(52,1)-(54,25) 3.4 6.5
+ s2w8 Data.Bits.Utils src/Data/Bits/Utils.hs:65:1-15 2.6 6.4
+ keyPath Annex.Locations Annex/Locations.hs:(492,1)-(494,23) 2.5 4.4
+ fileKey.unesc Annex.Locations Annex/Locations.hs:(469,9)-(474,39) 2.0 3.5
+ copyAndFreeze Data.ByteArray.Methods Data/ByteArray/Methods.hs:(224,1)-(227,21) 1.8 0.5
+
+ git-annex +RTS -p -RTS find --not --in web
+
+ total time = 5.33 secs (5327 ticks @ 1000 us, 1 processor)
+ total alloc = 2,908,489,000 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE SRC %time %alloc
+
+ catObjectDetails.\ Git.CatFile Git/CatFile.hs:(80,72)-(88,97) 7.8 2.8
+ catchIO Utility.Exception Utility/Exception.hs:79:1-17 7.6 8.3
+ spanList Data.List.Utils src/Data/List/Utils.hs:(150,1)-(155,36) 5.8 9.1
+ readish Utility.PartialPrelude Utility/PartialPrelude.hs:(48,1)-(50,20) 4.5 4.0
+ parseResp Git.CatFile Git/CatFile.hs:(113,1)-(124,28) 4.4 2.9
+ readFileStrict Utility.Misc Utility/Misc.hs:33:1-59 3.7 1.6
+ catches Control.Monad.Catch src/Control/Monad/Catch.hs:(432,1)-(436,76) 3.1 3.6
+ encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(131,1)-(133,70) 3.1 2.3
+
+"""]]
diff --git a/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment b/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment
new file mode 100644
index 000000000..e0f4987a0
--- /dev/null
+++ b/doc/benchmarking/comment_8_c1f99493f5e5c362d5c39f048280b11b._comment
@@ -0,0 +1,45 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""profiling"""
+ date="2016-09-26T19:20:36Z"
+ content="""
+Built git-annex with profiling, using `stack build --profile`
+
+(For reproduciblity, running git-annex in a clone of the git-annex repo
+https://github.com/RichiH/conference_proceedings with rev
+2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed
+objects.)
+
+Profiling `git-annex find +RTS -p`:
+
+ total time = 3.53 secs (3530 ticks @ 1000 us, 1 processor)
+ total alloc = 3,772,700,720 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE %time %alloc
+
+ spanList Data.List.Utils 32.6 37.7
+ startswith Data.List.Utils 14.3 8.1
+ md5 Data.Hash.MD5 12.4 18.2
+ join Data.List.Utils 6.9 13.7
+ catchIO Utility.Exception 5.9 6.0
+ catches Control.Monad.Catch 5.0 2.8
+ inAnnex'.checkindirect Annex.Content 4.6 1.8
+ readish Utility.PartialPrelude 3.0 1.4
+ isAnnexLink Annex.Link 2.6 4.0
+ split Data.List.Utils 1.5 0.8
+ keyPath Annex.Locations 1.2 1.7
+
+
+This is interesting!
+
+Fully 40% of CPU time and allocations are in list (really String) processing,
+and the details of the profiling report show that `spanList` and `startsWith`
+and `join` are all coming from calls to `replace` in `keyFile` and `fileKey`.
+Both functions nest several calls to replace, so perhaps that could be unwound
+into a single pass and/or a ByteString used to do it more efficiently.
+
+12% of run time is spent calculating the md5 hashes for the hash
+directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and
+it is probably a quite unoptimised version. Switching to the version
+if cryptonite would probably speed it up a lot.
+"""]]
diff --git a/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment b/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment
new file mode 100644
index 000000000..9692ad2d7
--- /dev/null
+++ b/doc/benchmarking/comment_9_f4d802a28b79905da0cb24af6cb65b0a._comment
@@ -0,0 +1,42 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""more profiling"""
+ date="2016-09-26T19:59:43Z"
+ content="""
+Instead of profiling `git annex copy --to remote`, I profiled `git annex
+find --not --in web`, which needs to do the same kind of location log lookup.
+
+ total time = 12.41 secs (12413 ticks @ 1000 us, 1 processor)
+ total alloc = 8,645,057,104 bytes (excludes profiling overheads)
+
+ COST CENTRE MODULE %time %alloc
+
+ adjustGitEnv Git.Env 21.4 37.0
+ catchIO Utility.Exception 13.2 2.8
+ spanList Data.List.Utils 12.6 17.9
+ parsePOSIXTime Logs.TimeStamp 6.1 5.0
+ catObjectDetails.receive Git.CatFile 5.9 2.1
+ startswith Data.List.Utils 5.7 3.8
+ md5 Data.Hash.MD5 5.1 7.9
+ join Data.List.Utils 2.4 6.0
+ readFileStrictAnyEncoding Utility.Misc 2.2 0.5
+
+The adjustGitEnv overhead is a surprise! It seems it is getting called once
+per file, and allocating a new copy of the environment each time. Call stack:
+withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv.
+Looks like simply making gitEnv be cached at startup would avoid most of
+the adjustGitEnv slowdown.
+
+(The catchIO overhead is a false reading; the detailed profile shows
+that all its time and allocations are inherited. getAnnexLinkTarget
+is running catchIO in the expensive case, so readSymbolicLink is
+the actual expensive bit.)
+
+The parsePOSIXTime comes from reading location logs. It's implemented
+using a generic Data.Time.Format.parseTime, which uses a format string
+"%s%Qs". A custom parser that splits into seconds and picoseconds
+and simply reads both numbers might be more efficient.
+
+catObjectDetails.receive is implemented using mostly String and could
+probably be sped up by being converted to use ByteString.
+"""]]