Well, I've spent all week making `git annex drop --from` safe.

On Tuesday I got a sinking feeling in my stomach, as I realized that
there was a hole in the armor git-annex uses to prevent concurrent drops
from violating numcopies or even losing the last copy of a file.
[[The bug involved an unlikely race condition|bugs/concurrent_drop--from_presence_checking_failures]],
and for all I know it's never happened in real life, but still this is not
good.

Since this is a potential data loss bug, expect a release pretty soon 
with the fix. And, there are 2 things to keep in mind about the fix:

1. If an ssh remote is using an old version of git-annex, a drop may fail.
   The solution is to upgrade git-annex on the remote to the
   fixed version.
2. When a file is present in several special remotes, but not in any
   accessible git repositories, dropping it from one of the special
   remotes will now fail, where before it was allowed.

   Instead, the file has to be moved from one of the special remotes to
   the git repository, and can then safely be dropped from the git repository.

   This is a worrisome behavior change, but unavoidable.

Solving this clearly called for more locking, to prevent concurrency
problems. But, at first I couldn't find a solution that would allow
dropping content that was only located on special remotes. I didn't want to
make special remotes need to involve locking; that would be a nightmare to
implement, and probably some existing special remotes don't have any
way to do locking anyway.

Happily, after thinking about it all through Wednesday, I found a solution
that, while imperfect (see above), is probably the best one feasible. If my
analysis is correct (and it seems so, although I'd like to write a more
formal proof than the ad-hoc one I have so far), no locking is needed on
special remotes, as long as the locking is done just right on the git repos
and remotes. While this is not able to guarantee that numcopies is always
preserved, it is able to guarantee that the last copy of a file is never
removed. And, numcopies *will* always be preserved except for when this
rare race condition occurs.
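
To make the rule concrete, here is a toy model of the decision as described above. It is only a sketch under assumptions, not git-annex's actual implementation: the names `Copy` and `can_drop` are hypothetical, a "lockable" copy stands in for one held in a git repository or git remote, and the real locks (which must be held for the duration of the drop) are not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One known copy of a file (hypothetical model, not a git-annex type)."""
    location: str   # name of the repository or remote holding the copy
    lockable: bool  # True for git repos/remotes, False for special remotes


def can_drop(copies, drop_from, numcopies):
    """Decide whether dropping the copy at `drop_from` looks safe.

    Assumed rule: the copies left behind must satisfy numcopies, and at
    least one of them must be in a location that supports locking, so it
    cannot vanish in a concurrent drop.
    """
    remaining = [c for c in copies if c.location != drop_from]
    has_locked_copy = any(c.lockable for c in remaining)
    return len(remaining) >= numcopies and has_locked_copy


# Content only on special remotes: the drop is refused.
only_special = [Copy("s3", False), Copy("glacier", False)]
print(can_drop(only_special, "s3", numcopies=1))

# Once a git repository also holds the content, the same drop is allowed.
with_repo = only_special + [Copy("laptop", True)]
print(can_drop(with_repo, "s3", numcopies=1))
```

The first case is the behavior change from item 2 in the list above: two special remotes hold the file, but with no lockable copy anywhere, neither can be dropped from.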

So, I've been implementing that all of yesterday and today. Getting it
right involves building up 4 different kinds of evidence, which can be
used to make sure that the last copy of a file can't possibly be
dropped, no matter what other concurrent drops could be happening.
I ended up with a very clean and robust implementation of this, and
a 2,000-line diff.

Whew!