summaryrefslogtreecommitdiff
path: root/doc/bugs/transfer_lock_file_removal_bug.mdwn
blob: 021ce08a6354ceb943619d96766b79fbe7cd4497 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
There's a race that can result in 2 concurrent downloads
of the same key. This can happen because the transfer lock files get
deleted after a transfer completes.

Scenario:

1. first process creates lock file, takes lock, starts download
2. second process opens existing lock file
3. first process finishes/fails download, so deletes lock file, and then
   drops lock
4. second process takes lock (exclusive, non-blocking) of deleted file!
5. third process sees no lock file, so creates a new one, takes lock,
   starts download

Worst-case, this might result in data loss, if the two concurrent
downloaders confuse one-another. Perhaps the second one overwrites the file
the first was downloading, and then the first, thinking it's written the
file, moves it into the object directory.

Note that the window between 4-5 can be as long as it takes for a
file to download. However, the window between 2-4 is very narrow indeed,
since the second process is not blocking on the lock.
So, this is very unlikely to happen.

But, it could. Can it be prevented?

Yes: The second process, after taking the lock, can check that
the lock file exists, and has the same dev and fileno as the fd it has
locked. If the lock file doesn't exist, or is different, then the
second process can give up.

Oh BTW, this race cannot happen on windows, since on windows, an open lock
file cannot be deleted by another process.

> [[fixed|done]]

Let's also consider if this change will allow for safely expiring stale 
lock files..

Situation before with expiry would have been buggy (which is why it never
tried to expire them):

1. first process creates lock file, takes lock, is interrupted (no more
   lock)
2. second process wants to delete stale lock file; checks if it's locked;
   isn't
3. third process opens existing lock file, prepares to take lock
4. second process deletes lock file
5. third process takes lock
6. third process is now doing a transfer w/o a (visible) lock

But, after this bug fix, the third process checks if it's lock file
exists after taking the lock. Since the second process deleted it,
the third process fails with an error "transfer in progress"
which is perhaps not accurate, but at least it stopped.

For this to work though, the second process needs to actually take
the lock, in non-blocking mode. If it succeeds, it can keep the lock
held while deleting the lock file. This ensures that when the third process
takes the lock, the lock file will already be deleted by the time it checks
if it's there.