summaryrefslogtreecommitdiff
path: root/doc/bugs/copy_doesn__39__t_scale.mdwn
blob: a5ca9d9ee696d9c99ffe3da6e49fe9f0c3f67c39 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
It seems that git-annex copies every individual file in a separate
transaction. This is quite costly for mass transfers: each file involves a
separate rsync invocation and the creation of a new commit. Even with a
meager thousand files or so in the annex, I have to wait for fifteen
minutes to copy the contents to another disk, simply because every
individual file involves some disk thrashing. Also, it seems suspicious
that the git-annex branch would get a thousands commits of history from the
simple procedure of copying everything to a new repository. Surely it would
be better to first copy everything and then create only a single commit
that registers the changes to the files' availability?

> git-annex is very careful to commit as infrequently as possible,
> and the current version makes *1* commit after all the copies are
> complete, even if it transferred a billion files. The only overhead
> incurred for each file is writing a journal file.
> You must have an old version.
> --[[Joey]]

(I'm also not quite clear on why rsync is being used when both repositories
are local. It seems to be just overhead.)

> Even when copying to another disk it's often on 
> some slow bus, and the file is by definition large. So it's
> nice to support resumes of interrupted transfers of files.
> Also because rsync has a handy progress display that is hard to get with cp.
> 
> (However, if the copy is to another directory in the same disk, it does
> use cp, and even supports really fast copies on COW filesystems.)
> --[[Joey]]

---

Oneshot mode is now implemented, making git-annex-shell and other
short lifetime processes not bother with committing changes.
[[done]] --[[Joey]] 

Update: Now it makes one commit at the very end of such a mass transfer.
--[[Joey]]