Diffstat (limited to 'doc/design/assistant/disaster_recovery.mdwn')
-rw-r--r-- | doc/design/assistant/disaster_recovery.mdwn | 160 |
1 files changed, 146 insertions, 14 deletions
diff --git a/doc/design/assistant/disaster_recovery.mdwn b/doc/design/assistant/disaster_recovery.mdwn
index 28dd41c5a..6fcf95519 100644
--- a/doc/design/assistant/disaster_recovery.mdwn
+++ b/doc/design/assistant/disaster_recovery.mdwn
@@ -1,39 +1,105 @@
 The assistant should help the user recover their repository when things go
 wrong.
 
+[[!toc ]]
+
 ## dangling lock files
 
 There are a few ways a git repository can get broken that are easily fixed.
 One is left over index.lock files. When a commit to a repository fails,
 check that nothing else is using it, fix the problem, and redo the commit.
 
-This should be done on both the current repository and any local
-repositories. Maybe also make git-annex-shell be able to do it remotely?
+* **done** for .git/annex/index.lock, can be handled safely and automatically.
+* **done** for .git/index.lock, only when the assistant is starting up.
+* What about local remotes, eg removable drives? git-annex does attempt
+  to commit to the git-annex branch of those. It will use the automatic
+  fix if any are dangling. It does not commit to the master branch; indeed
+  a removable drive typically has a bare repository.
+  However, it does a scan for broken locks anyway if there's a problem
+  syncing. **done**
+* What about git-annex-shell? If the ssh remote has the assistant running,
+  it can take care of it, and if not, it's a server, and perhaps the user
+  should be required to fix up if it crashes during a commit. This should
+  not affect the assistant anyway.
+* **done** Seems that refs can also have stale lock files, for example
+  '/storage/emulated/legacy/DCIM/.git/refs/remotes/flick_phonecamera/synced/git-annex.lock'
+  All git lock files are now handled (except gc lock files).
 
 ## incremental fsck
 
-Add webapp UI to enable incremental fsck, and choose when to start and how
-long to run each day.
+Add webapp UI to enable incremental fsck **done**
+
+Of course, incremental fsck will run as a niced (and ioniced) background
+job. There will need to be a button in the webapp to stop it, in case it's
+annoying. **done**
 
 When fsck finds a damaged file, queue a download of the file from a
-remote. If no accessible remote has the file, prompt the user to eg, connect
-a drive containing it.
+remote. **done**
+
+Detect when a removable drive is connected in the Cronner, and check
+and try to run its remote fsck jobs. **done** (Same mechanism will work for
+network remotes becoming connected.)
+
+TODO: If no accessible remote has a file that fsck reported missing,
+prompt the user to eg, connect a drive containing it. Or perhaps this is a
+special case of a general problem, and the webapp should prompt the user
+when any desired file is available on a remote that's not mounted?
 
 ## git-annex-shell remote fsck
 
+TODO: git-annex-shell fsck support, which would allow cheap fast fscks
+of ssh remotes.
+
 Would be nice; otherwise remote fsck is too expensive (downloads
-everything) to have the assistant do. (remote fsck --fast might be worth
-having the assistant do)
+everything) to have the assistant do.
+
+Note that Remote.Git already tries to use this, but the assistant does not
+call it for non-local remotes.
+
+## git fsck and repair
+
+Add git fsck to scheduled self fsck **done**
+
+TODO: git fsck on ssh remotes? Probably not worth the complexity..
 
-## git fsck
+TODO: If committing to the repository fails, after resolving any dangling
+lock files (see above), it should git fsck. This is difficult, because
+git commit will also fail if the commit turns out to be empty, or due to
+other transient problems.. So commit failures are currently ignored by the
+assistant.
 
-Have the sanity checker run git fsck periodically (it's fairly inexpensive,
-but still not too often, and should be ioniced and niced).
+If git fsck finds problems, launch git repository repair. **done**
 
-If committing to the repository fails, after resolving any dangling lock
-files (see above), it should git fsck.
+git annex fsck --fast at end of repository repair to ensure
+git-annex branch is accurate. **done**
 
-If git fsck finds problems, launch git repository repair.
+If syncing with a local repository fails, try to repair it. **done**
+
+TODO: "Repair" gcrypt remotes, by removing all refs and objects,
+and re-pushing. (Since the objects are encrypted data, there is no way
+to pull missing ones from anywhere..)
+Need to preserve gcrypt-id while doing this!
+
+TODO: along with displaying alert when there is a problem detected
+by consistency check, send an email alert. (Using system MTA?)
+
+## nudge user to schedule fscks
+
+Make the webapp encourage users to schedule fscks of their
+local repositories. The goal here was that it should not be obnoxious about
+repeatedly pestering the user to set that up, but should still encourage
+anyone who cares to set it up.
+
+Maybe: Display a message only once per week, and only after the repository
+has existed for at least one full day. But, this will require storing
+quite a lot of state.
+
+Or: Display a message whenever a removable drive is detected to have been
+connected. I like this, but what about nudging the main repo? Could do it
+every webapp startup, perhaps? **done**
+
+There should be a "No thanks" button that prevents it nudging again for a
+repo. **done**
 
 ## git repository repair
 
@@ -51,3 +117,69 @@ clone the remote, sync from all other remotes, move over .git/config and
 .git/annex/objects, and tar up the old broken git repo and `git annex add` it.
 This should be automatable and get the user back on their feet. User could
 just click a button and have this be done.
+
+This is useful outside git-annex as well, so make it a
+git-recover-repository command.
+
+### detailed design
+
+Run `git fsck` and parse output to find bad objects. **done** Note that
+fsck may fall over and fail to print out all bad objects, when
+files are corrupt. So if the fsck exits nonzero, need to collect all
+bad objects it did find, and:
+
+1. If the local repository contains packs, the packs may be corrupt.
+   So, start by using `git unpack-objects` to unpack all
+   packs it can handle (which may include parts of corrupt packs)
+   back to loose objects. And delete all packs. **done**
+2. Delete all loose corrupt objects. **done**
+
+Repeat until fsck finds no new problems. **done**
+
+Check if there's a remote. If so, and if the bad objects are all
+present on it, can simply get all bad objects from the remote,
+and inject them back into .git/objects to recover:
+
+3. Make a new (bare) clone from the remote.
+   (Note: git does not seem to provide a way to fetch specific missing
+   objects from the remote. Also, cannot use `--reference` against
+   a repository with missing refs. So this seems unavoidably
+   network-expensive.) **done**
+5. Rsync objects over. (Turned out to work better than git-cat-file,
+   because we don't have to walk the graph to add missing objects.)
+   **done**
+6. If each bad object was able to be repaired this way, we're done!
+   (If not, can reuse the clone for getting objects from the next remote.)
+   **done**
+
+If some missing objects cannot be recovered from remotes, find commits in each
+local branch that are broken by all remaining missing objects. Some of this can
+be parsed from git fsck output, but for eg blobs, the commits need to
+be walked to walk the trees, to find trees that refer to the blobs. **done**
+
+For each branch that is affected, look in the reflog and/or `git log
+$branch` to find the last good commit that predates all broken commits. (If
+the head commit of a branch is broken, git log is not going to show
+anything useful, but the reflog can be used to find past refs for the
+branch -- have to first delete the .git/HEAD file if it points to the
+broken ref.) **done**
+
+The basic idea then is to reset the branch to the last good commit
+that was found for it.
+
+* For the HEAD branch, can just reset it. (If no last good commit was found
+  for the HEAD branch, reset it to a dummy empty commit.) This will
+  leave git showing any changes made since then as staged in the index and
+  uncommitted. Or if the index is missing/corrupt, any files in the tree will
+  show as modified and uncommitted. User (or git-annex assistant) can then
+  commit as appropriate. Print appropriate warning message. **done**
+* Special handling for git-annex branch and index. **done**
+* Remote tracking branches can just be removed, and then `git fetch`
+  from the remote, which will re-download missing objects from it and
+  reinstate the tracking branch. **done**
+* For other branches, reset them to last good commit, or delete
+  if none was found. **done**
+* (Decided not to touch tags.)
+
+The index file can still refer to objects that were missing.
+Rewrite to remove them. **done**
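
The first step of the detailed design in this patch ("Run `git fsck` and parse output to find bad objects") might look roughly like the Python sketch below. It is only an illustration, not how git-annex or git-recover-repository does it. The helper names `parse_fsck_output` and `run_fsck` are hypothetical, and the parsing rule (collect every 40-hex object id on any line that is not a "dangling" notice) is an assumption that deliberately over-approximates. Each candidate would still need to be checked for actual damage, and object ids that appear only as split `.git/objects/xx/...` paths are not caught.

```python
import re
import subprocess

# Assumption: object ids are printed as 40-character lowercase hex (SHA-1).
SHA1_RE = re.compile(r"\b[0-9a-f]{40}\b")


def parse_fsck_output(text):
    """Return the set of object ids mentioned in fsck output.

    Lines starting with "dangling " are skipped, since those objects are
    merely unreachable. Everything else is treated as a suspect.
    """
    suspects = set()
    for line in text.splitlines():
        if line.startswith("dangling "):
            continue
        suspects.update(SHA1_RE.findall(line))
    return suspects


def run_fsck(repo_dir):
    """Run git fsck in repo_dir and return (exit code, suspect object ids).

    Output is parsed even on a nonzero exit, because fsck may stop early
    on a corrupt repository and report only some of the bad objects.
    """
    proc = subprocess.run(
        ["git", "fsck", "--no-dangling"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return proc.returncode, parse_fsck_output(proc.stdout + proc.stderr)
```

A caller would repeat `run_fsck` after each round of unpacking packs and deleting corrupt loose objects, stopping once the returned set no longer grows, which corresponds to "Repeat until fsck finds no new problems."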