The assistant should help the user recover their repository when things go
wrong.

## dangling lock files

There are a few ways a git repository can get broken that are easily fixed.
One is leftover index.lock files. When a commit to a repository fails,
check that nothing else is using it, fix the problem, and redo the commit.

* **done** for .git/annex/index.lock, can be handled safely and automatically.
* **done** for .git/index.lock, only when the assistant is starting up.
* What about local remotes, eg removable drives? git-annex does attempt
  to commit to the git-annex branch of those. It will use the automatic
  fix if any are dangling. It does not commit to the master branch; indeed
  a removable drive typically has a bare repository. So I think nothing to
  do here.
* What about git-annex-shell? If the ssh remote has the assistant running,
  it can take care of it, and if not, it's a server, and perhaps the user
  should be required to fix up if it crashes during a commit. This should
  not affect the assistant anyway.
* **done** Seems that refs can also have stale lock files, for example
  '/storage/emulated/legacy/DCIM/.git/refs/remotes/flick_phonecamera/synced/git-annex.lock'
  All git lock files are now handled (except gc lock files).
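
As a rough illustration of the detection step, here is a minimal Python sketch that walks a git directory and lists `*.lock` files that have not been touched for a while. The age threshold is an assumption made for this example, standing in for the "nothing else is using it" check; it is not how the assistant decides. The function name `find_stale_locks` and its parameters are hypothetical.

```python
import os
import time


def find_stale_locks(git_dir, min_age_seconds=3600, now=None):
    """Return sorted paths of *.lock files under git_dir that are at least
    min_age_seconds old (by mtime).

    Assumption for this sketch: age is used as a stand-in for "no process
    is still using the lock". Only files ending in .lock are considered.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(git_dir):
        for name in filenames:
            if not name.endswith(".lock"):
                continue
            path = os.path.join(dirpath, name)
            try:
                age = now - os.path.getmtime(path)
            except OSError:
                # The lock went away between listing and stat; not stale.
                continue
            if age >= min_age_seconds:
                stale.append(path)
    return sorted(stale)
```

The sketch only reports candidates; whether to delete them is left to the caller.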

## incremental fsck

Add webapp UI to enable incremental fsck **done**

Of course, incremental fsck will run as a niced (and ioniced) background
job. There will need to be a button in the webapp to stop it, in case it's
annoying. **done**
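
To make "niced and ioniced" concrete, here is a small Python sketch that builds a low-priority command line for an incremental fsck. It assumes a Linux system with `nice` and `ionice` on the PATH; the helper name `background_fsck_command` is hypothetical, and this is not necessarily how the assistant launches the job.

```python
def background_fsck_command(resume=False):
    """Build an argv list that runs git-annex's incremental fsck at low
    CPU priority (nice) and in the idle I/O class (ionice -c3).

    resume=False starts a new incremental pass (--incremental);
    resume=True continues the previous pass (--more).
    """
    pass_flag = "--more" if resume else "--incremental"
    return ["nice", "ionice", "-c3", "git", "annex", "fsck", pass_flag]
```

The resulting list could be passed to `subprocess.Popen`, and a stop button would then amount to calling `terminate()` on the returned process object.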

When fsck finds a damaged file, queue a download of the file from a
remote. **done**

Detect in the Cronner when a removable drive is connected, and try to
run its remote fsck jobs. **done** (Same mechanism will work for
network remotes becoming connected.)

TODO: If no accessible remote has a file that fsck reported missing,
prompt the user to eg, connect a drive containing it. Or perhaps this is a
special case of a general problem, and the webapp should prompt the user
when any desired file is available on a remote that's not mounted?

TODO: Enhance the Recurrance type to be able to do eg, events that run
once per month on any day, or once per year, or once per week. This
would be especially useful for removable drives, which might not be
plugged in on the 1st of the month. This should be the default in the
webapp (it's already worded to suggest this.)
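
One way to read "once per month on any day" is: the job is due whenever it has not yet run in the current calendar period. The Python sketch below illustrates that reading only. The names `period_key` and `is_due` and the period strings are made up for this example and are not part of the Recurrance type.

```python
from datetime import date


def period_key(day, period):
    """Map a date to an identifier for the calendar period containing it."""
    if period == "weekly":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if period == "monthly":
        return (day.year, day.month)
    if period == "yearly":
        return (day.year,)
    raise ValueError("unknown period: %r" % (period,))


def is_due(last_run, today, period):
    """True if the job has never run, or last ran in a different period
    than the one containing today."""
    if last_run is None:
        return True
    return period_key(last_run, period) != period_key(today, period)
```

With this rule, a drive that is plugged in on the 17th still gets its monthly fsck, as long as none has run earlier that month.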

## git-annex-shell remote fsck

TODO: git-annex-shell fsck support, which would allow cheap fast fscks
of ssh remotes.

Would be nice; otherwise remote fsck is too expensive (downloads
everything) to have the assistant do.

Note that Remote.Git already tries to use this, but the assistant does not
call it for non-local remotes.

## git fsck

Have the sanity checker run git fsck periodically (it's fairly inexpensive,
but still not too often, and should be ioniced and niced). 

If committing to the repository fails, after resolving any dangling lock
files (see above), it should git fsck.

If git fsck finds problems, launch git repository repair.
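
The order of steps after a failed commit can be sketched as below. This is only an outline under assumptions: `clear_locks` and `repair` are hypothetical callables supplied by the caller, and `git_fsck_ok` simply treats a nonzero exit status from `git fsck` as "problems found".

```python
import subprocess


def git_fsck_ok(repo_dir):
    """Return True if `git fsck` exits with status 0 in repo_dir."""
    result = subprocess.run(
        ["git", "fsck"],
        cwd=repo_dir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def after_failed_commit(repo_dir, clear_locks, repair, fsck=git_fsck_ok):
    """Follow the sequence described above after a commit has failed.

    clear_locks and repair are hypothetical callbacks taking repo_dir.
    Returns "clean" if git fsck found nothing, else "repair-launched".
    """
    clear_locks(repo_dir)   # resolve dangling lock files first
    if fsck(repo_dir):      # then check the repository
        return "clean"
    repair(repo_dir)        # fsck found problems: launch repair
    return "repair-launched"
```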

## git repository repair

There are several ways git repositories can get damaged.

The most common is empty files in .git/annex/objects and commits that refer
to those objects, when the objects have not yet been pushed anywhere.
I've several times recovered from this manually by removing the bad files,
resetting to before the commits that referred to them, and then re-staging
any divergence in the working tree. This could perhaps be automated.

As long as the git repository has at least one remote, another method is to
clone the remote, sync from all other remotes, move over .git/config and
.git/annex/objects, and tar up the old broken git repo and `git annex add`
it. This should be automatable and get the user back on their feet. The
user could just click a button and have this done.