summaryrefslogtreecommitdiff
path: root/doc/design/assistant/inotify.mdwn
blob: c2a25673e1412bfcfba7b056eb439ad30af1c100 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
Finish "git annex watch" command, which runs, in the background, watching via
inotify for changes, and automatically annexing new files, etc.

There is a `watch` branch in git that adds the command.

## known bugs

* A process has a file open for write, another one closes it,
  and so it's added. Then the first process modifies it.

  Or, a process has a file open for write when `git annex watch` starts
  up, it will be added to the annex. If the process later continues
  writing, it will change content in the annex.

  This changes content in the annex, and fsck will later catch
  the inconsistency.

  Possible fixes: 

  * Somehow track or detect if a file is open for write by any processes.
    `lsof` could be used, although it would be a little slow, and not
    avoid every possible race.
  * Or, when possible, making a copy on write copy before adding the file
    would avoid this.
  * Or, as a last resort, make an expensive copy of the file and add that.
  * Tracking file opens and closes with inotify could tell if any other
    processes have the file open. But there are problems.. It doesn't
    seem to differentiate between files opened for read and for write.
    And there would still be a race after the last close and before it's
    injected into the annex, where it could be opened for write again.
    Would need to detect that and undo the annex injection or something.

* If a file is checked into git as a normal file and gets modified
  (or merged, etc), it will be converted into an annexed file.
  See [[blog/day_7__bugfixes]]

* When you `git annex unlock` a file, it will immediately be re-locked.

## todo

- Support OSes other than Linux; it only uses inotify currently.
  OSX and FreeBSD use the same mechanism, and there is a Haskell interface
  for it,
- Run niced and ioniced? Seems to make sense, this is a background job.
- configurable option to only annex files meeting certian size or
  filename criteria
- option to check files not meeting annex criteria into git directly,
  automatically
- honor .gitignore, not adding files it excludes (difficult, probably
  needs my own .gitignore parser to avoid excessive running of git commands
  to check for ignored files)
- Possibly, when a directory is moved out of the annex location,
  unannex its contents. (Does inotify tell us where the directory moved
  to so we can access it?)

## the races

Many races need to be dealt with by this code. Here are some of them.

* File is added and then removed before the add event starts.

  Not a problem; The add event does nothing since the file is not present.

* File is added and then removed before the add event has finished
  processing it.
  
  **Minor problem**; When the add's processing of the file (checksum and so
  on) fails due to it going away, there is an ugly error message, but
  things are otherwise ok.

* File is added and then replaced with another file before the annex add
  moves its content into the annex.

  Fixed this problem; Now it hard links the file to a temp directory and
  operates on the hard link, which is also made unwritable.

* File is added and then replaced with another file before the annex add
  makes its symlink.

  **Minor problem**; The annex add will fail creating its symlink since
  the file exists. There is an ugly error message, but the second add
  event will add the new file.

* File is added and then replaced with another file before the annex add
  stages the symlink in git.

  Now fixed; `git annex watch` avoids running `git add` because of this
  race. Instead, it stages symlinks directly into the index, without
  looking at what's currently on disk.

* Link is moved, fixed link is written by fix event, but then that is
  removed by the user and replaced with a file before the event finishes.

  Now fixed; same fix as previous race above.

* File is removed and then re-added before the removal event starts.

  Not a problem; The removal event does nothing since the file exists,
  and the add event replaces it in git with the new one.

* File is removed and then re-added before the removal event finishes.

  Not a problem; The removal event removes the old file from the index, and
  the add event adds the new one.

## done

- on startup, add any files that have appeared since last run **done**
- on startup, fix the symlinks for any renamed links **done**
- on startup, stage any files that have been deleted since last run
  (seems to require a `git commit -a` on startup, or at least a
  `git add --update`, which will notice deleted files) **done**
- notice new files, and git annex add **done**
- notice renamed files, auto-fix the symlink, and stage the new file location
  **done**
- handle cases where directories are moved outside the repo, and stop
  watching them **done**
- when a whole directory is deleted or moved, stage removal of its
  contents from the index **done**
- notice deleted files and stage the deletion
  (tricky; there's a race with add since it replaces the file with a symlink..)
  **done**
- Gracefully handle when the default limit of 8192 inotified directories
  is exceeded. This can be tuned by root, so help the user fix it.
  **done**
- periodically auto-commit staged changes (avoid autocommitting when
  lots of changes are coming in) **done**
- coleasce related add/rm events for speed and less disk IO **done**
- don't annex `.gitignore` and `.gitattributes` files **done**
- run as a daemon **done**