Added two options to `git annex fsck` that allow for a form of distributed
fsck.  This is useful in situations where repositories cannot be trusted to
continue to exist, and cannot be checked directly, but you'd still like to
keep track of their status. [[design/iabackup]] is one use case for this.

By running a periodic fsck with the --distributed option,
the repositories can verify that they still exist and that the
information about their contents is still accurate. This is done by
doing an extra update of the location log each time a file is verified by
fsck to still be in the repository.
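As a rough illustration of that extra update, here is a minimal Python sketch. It is a toy model, not git-annex's actual code: `LocationLog`, `record_present`, and `distributed_fsck` are hypothetical names, and the log is reduced to a mapping from key to the last time each repository claimed to have it.

```python
from dataclasses import dataclass, field


@dataclass
class LocationLog:
    """Toy stand-in for the location log (hypothetical structure).

    Maps key -> {repository name -> timestamp of the latest "present" record}.
    """
    present: dict = field(default_factory=dict)

    def record_present(self, key, repo, timestamp):
        # Overwrites any older record this repository made for the key.
        self.present.setdefault(key, {})[repo] = timestamp


def distributed_fsck(log, repo, verified_keys, now):
    """Refresh the log record of every key that fsck verified in `repo`.

    A plain fsck would leave already-correct records alone; the distributed
    variant rewrites them so the timestamp shows the repository still exists.
    """
    keys = list(verified_keys)
    for key in keys:
        log.record_present(key, repo, now)
    return len(keys)
```

Anyone reading the log later can then tell how recently a repository last vouched for its contents.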

The other option looks like --expire="30d somerepo:60d". It checks that
each specified repository has recorded a distributed fsck within the specified
time period. If not, the repository is dropped from the location tracking
log. Of course, a dropped repository can always update the log again later
if it's really still around.
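To make the expiry check concrete, here is a hedged Python sketch of how a value like `30d somerepo:60d` could be interpreted. The function names are hypothetical, only day-suffixed durations are handled (the only form shown above), and `last_fsck` is an assumed mapping from repository name to the time of its last recorded distributed fsck.

```python
DAY = 24 * 60 * 60  # seconds


def parse_days(text):
    """Turn '30d' into seconds. Other units are out of scope for this sketch."""
    if not text.endswith("d"):
        raise ValueError(f"unsupported duration: {text!r}")
    return int(text[:-1]) * DAY


def parse_expire(spec):
    """Parse '30d somerepo:60d' into (default_limit, {repo: limit})."""
    default = None
    per_repo = {}
    for word in spec.split():
        repo, sep, duration = word.rpartition(":")
        if sep:
            per_repo[repo] = parse_days(duration)
        else:
            default = parse_days(duration)
    return default, per_repo


def expired_repos(spec, last_fsck, now):
    """Return the repositories whose last distributed fsck is too old."""
    default, per_repo = parse_expire(spec)
    expired = []
    for repo, last in last_fsck.items():
        limit = per_repo.get(repo, default)
        if limit is not None and now - last > limit:
            expired.append(repo)
    return sorted(expired)
```

With this reading, a repository last seen 40 days ago is expired under the 30 day default, while `somerepo` with the same age is still within its own 60 day limit.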

Distributed fsck is not the default because those extra location log updates
increase the size of the git-annex branch. I did one thing to keep the size
increase small: an identical line is logged for each key, including the
timestamp, so git's delta compression will work as well as is possible. But
there's still commit and tree update overhead.
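The effect of sharing one timestamp can be approximated with a small Python experiment. This is only a rough stand-in: it uses zlib on concatenated lines rather than git's packfile delta compression, and the log line layout and UUID are assumptions made for illustration.

```python
import zlib

UUID = "00000000-0000-0000-0000-000000000001"  # hypothetical repository UUID


def log_line(timestamp):
    # Assumed, simplified layout: "<timestamp>s <present flag> <uuid>"
    return f"{timestamp:.5f}s 1 {UUID}\n"


def compressed_size(lines):
    """Size in bytes of the zlib-compressed concatenation of the lines."""
    return len(zlib.compress("".join(lines).encode()))


def compare(num_keys=1000, start=1425000000.0):
    """Compare one shared timestamp against a distinct timestamp per key."""
    shared = [log_line(start) for _ in range(num_keys)]
    distinct = [log_line(start + i * 1.37) for i in range(num_keys)]
    return compressed_size(shared), compressed_size(distinct)
```

Calling `compare()` returns the two compressed sizes; the shared-timestamp case should come out far smaller, since every line is byte-for-byte identical.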

It probably doesn't make sense to run distributed fscks too often, for that and
other reasons. If the git-annex branch does get too large, there's always
`git annex forget` ...

**(Update: This was later rethought and works much more efficiently now.)**