This is not really a proper bug report, but I thought I should post this here
in case someone can find any sane, non-supernatural reason for a strange case
of data loss I have experienced with git-annex.

Some time ago I cloned a bunch of git-annex repos from an external drive (let's
call it disk1) to a new computer (computer3). On one of my repos git-annex
marked a bunch of files corrupt and moved them to .git/annex/bad. Oops, I
thought, I must have a failing disk. Luckily I had offsite backups -- no less
than two other external hard disks (disk2-3), each having a full copy of the
repo in question. However, **both of these** had the same, corrupt files. The
files have the correct size, but are filled with zeroes. Other files in the
repo are fine, and so are other repos.

I have been trying to wrap my head around this, but I can't think of any way
this could occur. No matter how the files got corrupted in the first
place, the corruption should have been picked up when copying the content to
the external drives disk2 and disk3, right? I have to rule out NSA/MIB/aliens
messing with me because these files are not that valuable or sensitive.

The files in question were added to git-annex back in 2012, so the trail is
cold on this one. Naturally, I have no idea how to reproduce this, nor can I
reliably say that git-annex is to blame. I can gather some hints though. The
files were all added on the same commit in 2012, but not all files from that
commit are corrupted. The corrupted files have consecutive file names. The
files were never modified since (except for the corruption), and the content
*may* have been copied via an encrypted rsync transfer repository. I have
always used git-annex on Arch Linux and in indirect mode. The files used the
SHA-1 backend.

All these files have a similar tracking log that looks something like this
(uuids replaced with symbolic names):

    1356690700.542152s 1 computer1			<- first added
    1356691074.253815s 1 disk1				<- copied to disk1
    1356719321.145126s 1 rsync				<- copied to rsync repo
    1358070999.435676s 1 rsync				<- copied to rsync repo (again?)
    1362166895.310332s 1 disk2				<- copied to disk2
    1362906850.555869s 1 computer2 (dead)	<- copied to another computer
    1364926664.362195s 0 computer1			<- dropped from computer1 as enough copies in disks
    1374412057.409496s 0 computer2 (dead)	<- dropped from computer2, now dead
    1445691595.764108s 1 disk3				<- copied to disk3
    1445770764.165792s 0 rsync				<- dropped from rsync repo to save space
    1482077052.217353646s 0 disk1			<- first noticed as corrupted on disk1
    1482741278.318274404s 0 disk3			<- WTF, also corrupted on disk3
    1482926246.268440532s 0 disk2			<- double-WTF, also corrupted on disk2

The only thing that strikes me as odd is the double entry with the rsync
remote. The non-corrupted files from the same commit do not seem to have such a
double entry.
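
To make the log above easier to read and to spot such double entries automatically, something like the following sketch could be used. It assumes each line has the form `<seconds since epoch>s <1 or 0> <repository>`, with 1 meaning "content present" and 0 meaning "not present", as in the annotated log. The names `parse_log` and `repeated_states` are made up for illustration, and the symbolic repository names stand in for the real uuids.

```python
from datetime import datetime, timezone

# The log from above, with the annotations stripped.
LOG = """\
1356690700.542152s 1 computer1
1356691074.253815s 1 disk1
1356719321.145126s 1 rsync
1358070999.435676s 1 rsync
1362166895.310332s 1 disk2
1362906850.555869s 1 computer2
1364926664.362195s 0 computer1
1374412057.409496s 0 computer2
1445691595.764108s 1 disk3
1445770764.165792s 0 rsync
1482077052.217353646s 0 disk1
1482741278.318274404s 0 disk3
1482926246.268440532s 0 disk2
"""


def parse_log(text):
    """Parse log lines into (utc_datetime, present, repository) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        seconds = float(parts[0].rstrip("s"))
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        entries.append((when, parts[1] == "1", parts[2]))
    return entries


def repeated_states(entries):
    """Return (repository, when) for entries that repeat the previous state."""
    last_state = {}
    repeats = []
    for when, present, repo in sorted(entries):
        if last_state.get(repo) == present:
            repeats.append((repo, when))
        last_state[repo] = present
    return repeats


if __name__ == "__main__":
    for when, present, repo in parse_log(LOG):
        print(when.isoformat(), "present" if present else "absent ", repo)
    for repo, when in repeated_states(parse_log(LOG)):
        print("repeated entry:", repo, when.isoformat())
```

On the log above, the only repeated state it reports is the second rsync entry from January 2013. Running the same check on the logs of the non-corrupted files would confirm whether the double entry really is unique to the damaged ones.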

So my main question is, has there ever been a bug in git-annex that could have
caused this behavior? Or is there any other realistic explanation for this? In
case this is an existing bug, is there any other evidence I can gather?
Needless to say, the lesson here is to run `git annex fsck` regularly even if
you have offsite backups...
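
For what it's worth, the kind of check that catches this can be sketched outside git-annex too. This is not how `git annex fsck` is implemented, just an illustration. It assumes keys of the form `SHA1-s<size>--<hex digest>` (plain SHA1 backend, no file extension in the key), and `verify_sha1_key` is a made-up name.

```python
import hashlib
import re

# Assumed key layout for the plain SHA1 backend: SHA1-s<size>--<40 hex chars>
KEY_RE = re.compile(r"^SHA1-s(?P<size>\d+)--(?P<digest>[0-9a-f]{40})$")


def verify_sha1_key(key, path, chunk_size=1 << 20):
    """Return True if the file's size and SHA-1 digest match the key.

    Hypothetical helper for illustration; use `git annex fsck` for real checks.
    """
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError("not a plain SHA1 key: %r" % key)
    digest = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(m.group("size")) and digest.hexdigest() == m.group("digest")
```

A zero-filled file of the correct size passes the size comparison but fails the digest comparison, which matches what I saw: the sizes looked right and only the checksum gave it away.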