doc/tips/antipatterns.mdwn



This page collects a set of Really Bad Ideas people have had with
git-annex in the past that can lead to catastrophic data loss, excessive
disk usage, improper swearing and other unfortunate experiences.

This could also be called the "git-annex worst practices". It differs
from [[not|what git annex is not]] in that it covers normal use cases
of git-annex that are merely implemented the wrong way. Ideally,
git-annex would make these mistakes as hard as possible to commit, but
sometimes it can't be helped: people figure out the worst possible
ways of doing things.

[[!toc]]

---

# **Antipattern**

Symlinking the `.git/annex` directory, in the hope of saving
disk space, is a horrible idea. The general antipattern is:

    git clone repoA repoB
    mv repoB/.git/annex repoB/.git/annex.bak
    ln -s repoA/.git/annex repoB/.git/annex

This is bad because git-annex will believe it has two copies of the
files and will then let you drop what is really the only copy,
leading to data loss.

Proper pattern
--------------

The proper way of doing this is through git-annex's hardlink support,
by cloning the repository with the `--shared` option:

    git clone --shared repoA repoB

This will set up repoB as an "untrusted" repository and use hardlinks
to copy files between the two repos, so the space is used only once.
This only works on filesystems that support hardlinks, but that is
usually the case for filesystems that support symlinks.
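The difference between the two approaches comes down to how the filesystem treats symlinks and hardlinks. The following is a minimal sketch using only standard shell tools. The directory and file names are made up for illustration, and nothing here invokes git-annex or reproduces its internal layout.

```shell
#!/bin/sh
# Illustration only: plain files in a scratch directory, not a real annex.
set -eu
demo=$(mktemp -d)
cd "$demo"
mkdir repoA repoB
echo "precious data" > repoA/object

# Symlink: repoB only holds a pointer to repoA's file.
ln -s ../repoA/object repoB/symlinked
# Hardlink: both names refer to the same inode, so the data is stored
# once and survives until the last name is removed.
ln repoA/object repoB/hardlinked

# Simulate dropping the content from repoA.
rm repoA/object

# [ -e ] follows symlinks, so it fails for a dangling link.
if [ -e repoB/symlinked ]; then
    symlink_state=intact
else
    symlink_state=dangling
fi
hardlink_content=$(cat repoB/hardlinked)

echo "symlink: $symlink_state"
echo "hardlink: $hardlink_content"
```

With the symlink, removing the file from repoA leaves repoB with a dangling pointer, which is the data loss described above. With the hardlink, the content stays readable from repoB.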

Real world cases
----------------

 * [[forum/share_.git__47__annex__47__objects_across_multiple_repositories_on_one_machine/]]
 * at least one IRC discussion

Fixes
-----

Probably no way to fix this in git-annex - if users want to shoot
themselves in the foot by messing with the backend, there's not much
we can do to change that in this case.

---

# **Antipattern**

Running [[git-annex-reinit]] with the UUID of an existing repository,
without following up with [[git-annex-fsck]], is another way to lose
data.

To quote the [[git-annex-reinit]] manpage:

> Normally, initializing a repository generates a new, unique
> identifier (UUID) for that repository. Occasionally it may be useful
> to reuse a UUID -- for example, if a repository got deleted, and
> you're setting it back up.

[[git-annex-reinit]] can be used to reuse UUIDs for deleted
repositories. But what happens if you reuse the UUID of an *existing*
repository, or a repository that hasn't been properly emptied before
being declared dead? This can lead to data loss because, in that case,
git-annex may think some files are still present in the revived
repository (while they may not actually be).

Proper pattern
--------------

The proper way of using reinit is to make sure you run
[[git-annex-fsck]] (optionally with `--fast` to save time) on the
revived repo right after running reinit. This will ensure that at
least the location log will be updated, and git-annex will notice if
files are missing.
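One way to avoid forgetting the second step is to wrap both commands together. This is only a sketch: `revive_repo` is a hypothetical helper name, not something shipped with git-annex, and it assumes it is run from inside the repository being revived.

```shell
#!/bin/sh
# Hypothetical helper: reuse the UUID of a lost repository, then
# immediately check which files are actually present.
revive_repo() {
    # $1: UUID (or description) of the repository being revived
    if [ -z "${1:-}" ]; then
        echo "usage: revive_repo UUID" >&2
        return 1
    fi
    git annex reinit "$1" || return 1
    # --fast avoids expensive checksum calculations; drop it for a
    # full verification of the file contents.
    git annex fsck --fast
}

# Example (not run here):
# revive_repo "uuid-of-the-lost-repository"
```

Because fsck only runs if reinit succeeded, a failed reinit does not get hidden behind fsck output.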

Real world cases
----------------

 * [[bugs/remotes_disappeared]]

Fixes
-----

An improvement to git-annex here would be to allow
[[todo/reinit_should_work_without_arguments|reinit to work without arguments]]
to at least not encourage UUID reuse. reinit could also recommend
running fsck explicitly. It could even trigger an fsck directly.

The [[git-annex-reinit]] manpage has always suggested running `fsck`,
but the wording was changed on 2017-01-17.

Other cases
===========

Feel free to add your lessons in catastrophe here! It's educational
and fun, and will improve git-annex for everyone.

PS: should this be a toplevel page instead of being drowned in the
[[tips]] section? Where should it be linked to?  -- [[anarcat]]