[[!comment format=mdwn
 username="http://joeyh.name/"
 ip="209.250.56.55"
 subject="comment 3"
 date="2014-07-04T19:26:00Z"
 content="""
Looking at the code, it's pretty clear why this is using a lot of memory:

<pre>
        fs <- getJournalFiles jl
        liftIO $ do
                h <- hashObjectStart g
                Git.UpdateIndex.streamUpdateIndex g
                        [genstream dir h fs]
                hashObjectStop h
        -- note: this cleanup action closes over the entire fs list
        return $ liftIO $ mapM_ (removeFile . (dir </>)) fs
</pre>

So the big list in `fs` has to be retained in memory after the files are streamed to update-index, in order for them to be deleted!
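To make the retention concrete, here's a toy illustration (not the actual git-annex code): the first traversal could stream the list in constant space, but because the returned action mentions `fs` again, the closure keeps every cell of the list live until it finally runs.

<pre>
-- Toy illustration of the retention, not git-annex code.
retainsList :: IO (IO ())
retainsList = do
        let fs = map show [1 .. 2000000 :: Int]
        print (length fs)           -- first traversal; could stream
        return (print (length fs))  -- closure roots fs until it runs

main :: IO ()
main = do
        cleanup <- retainsList
        cleanup                     -- only now can fs be collected
</pre>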

Fixing this is a bit tricky: new journal files can appear while this is going on, so it can't simply run getJournalFiles a second time to get the list of files to clean up.
Maybe delete each file right after it's been sent to git-update-index? But git-update-index will want to read the file, and we don't really know when it will choose to do so; it could potentially wait a while after we've sent it the filename.

Also, per [[!commit 750c4ac6c282d14d19f79e0711f858367da145e4]], we cannot delete the journal files until *after* the commit, or another git-annex process would see inconsistent data!

I actually think I am going to need to use a temp file to hold the list of files.
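A minimal sketch of that idea (hypothetical names; the actual update-index plumbing is elided): append each journal file name to a temp file as it is streamed, and have the cleanup action that runs after the commit read the names back from disk, so it never closes over the in-memory list.

<pre>
import System.Directory (getTemporaryDirectory, removeFile)
import System.FilePath ((</>))
import System.IO (hClose, hPutStrLn, openTempFile)

-- Hypothetical sketch, not the real stageJournal: record each file
-- name in a temp file while it is streamed to git update-index.
stageJournalSketch :: FilePath -> [FilePath] -> IO (IO ())
stageJournalSketch dir fs = do
        tmpdir <- getTemporaryDirectory
        (tmp, tmph) <- openTempFile tmpdir "journalfiles"
        mapM_ (\f -> do
                -- (elided: feed f to git update-index here)
                hPutStrLn tmph f) fs
        hClose tmph
        -- The cleanup action, run after the commit, reads the names
        -- back from disk; it captures only tmp and dir, not fs.
        return $ do
                contents <- readFile tmp
                mapM_ (removeFile . (dir </>)) (lines contents)
                removeFile tmp
</pre>

Since readFile is lazy, the deletion pass should also run in constant space, no matter how many journal files there were.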
"""]]