In the world of git, we're not scared of internal implementation
details, and sometimes we like to dive in and tweak things by hand. Here's
some documentation to that end.

## The .git/ directory

### `.git/annex/objects/aa/bb/*/*`

This is where locally available file contents are actually stored.
Files added to the annex get a symlink checked into git that points
to the file content.

First there are two levels of directories used for hashing, to prevent
too many things ending up in any one directory.
See [[hashing]] for details.

Each subdirectory has the [[name_of_a_key|key_format]] in one of the
[[key-value_backends|backends]]. The file inside also has the name of the key.
This two-level structure is used because it allows the write bit to be removed
from the subdirectories as well as from the files. That prevents accidentally
deleting or changing the file contents. See [[lockdown]] for details.

In [[direct_mode]], file contents are not stored in here, and instead
are stored directly in the file. However, the same symlinks are still
committed to git, internally.

Also in [[direct_mode]], some additional data is stored in these directories.
`.cache` files contain cached file stats used in detecting when a file has
changed, and `.map` files contain a list of file(s) in the work directory
that contain the key.

### `.git/annex/tmp/`

This directory contains partially transferred objects.

### `.git/annex/misctmp/`

This is a temp directory for miscellaneous other temp files.

While `.git/annex/objects` and `.git/annex/tmp` can be put on different
filesystems if desired, `.git/annex/misctmp`
has to be on the same filesystem as the work tree and git repository.

### `.git/annex/bad/`

git-annex fsck puts any bad objects it finds in here.

### `.git/annex/transfers/`

Contains information files for uploads and downloads that are in progress,
as well as any that have failed. Used especially by the assistant.
It is safe to delete these files.

### `.git/annex/ssh/`

ssh connection caching files are written in here.

### `.git/annex/index`

This is a git index file which git-annex uses to stage files
when preparing commits to the git-annex branch. 

### `.git/annex/journal/`

git-annex uses this to journal changes to the git-annex branch,
before committing a set of changes.

## The git-annex branch

This branch is managed by git-annex, with the contents listed below.

This branch is not connected to your master or other branches. It is used for
internal tracking of information about git-annex repositories and annexed
objects.

The files stored in this branch are all designed to be auto-merged
using git's [[union merge driver|git-union-merge]]. So each line
has a timestamp, to allow the most recent information to be identified.

### `uuid.log`

Records the UUIDs of known repositories, and associates them with a
description of the repository. This allows git-annex to display something
more useful than a UUID when it refers to a repository that does not have
a configured git remote pointing at it.

The file format is simply one line per repository, with the uuid followed by a
space and then the description, followed by a timestamp. Example:

	e605dca6-446a-11e0-8b2a-002170d25c55 laptop timestamp=1317929189.157237s
	26339d22-446b-11e0-9101-002170d25c55 usb disk timestamp=1317929330.769997s
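
As a rough illustration of how such a log could be read, here is a minimal Python sketch. The function name `parse_uuid_log` is hypothetical, and the rule that the newest line wins for each uuid is an assumption based on the note above that timestamps identify the most recent information. It is not how git-annex itself is implemented.

```python
def parse_uuid_log(text):
    """Map each UUID to its most recent description (illustrative only)."""
    newest = {}  # uuid -> (timestamp, description)
    for line in text.splitlines():
        line = line.strip()
        # The timestamp is the last field; the description may contain spaces.
        body, _, stamp = line.rpartition(" timestamp=")
        if not body:
            continue  # blank line or no timestamp field; skipped in this sketch
        uuid, _, description = body.partition(" ")
        when = float(stamp.rstrip("s"))
        # Assumption: after a union merge, the newest line for a uuid wins.
        if uuid not in newest or when > newest[uuid][0]:
            newest[uuid] = (when, description)
    return {uuid: desc for uuid, (when, desc) in newest.items()}
```

Feeding it the two example lines above yields one description per uuid, with `usb disk` kept intact despite the embedded space.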

### `numcopies.log`

Records the global numcopies setting.

The file format is simply a timestamp followed by a number.

### `remote.log`

Holds persistent configuration settings for [[special_remotes]] such as
Amazon S3.

The file format is one line per remote, starting with the uuid of the
remote, followed by a space, and then a series of var=value pairs,
each separated by whitespace, and finally a timestamp.
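
A line in this format can be pulled apart with a few lines of Python. This is only a sketch: `parse_remote_log_line` is a hypothetical name, it assumes values contain no whitespace, and it assumes the timestamp is written as `timestamp=<seconds>s`, as in the `uuid.log` example.

```python
def parse_remote_log_line(line):
    """Split one non-empty remote.log line into (uuid, config, timestamp)."""
    uuid, *pairs = line.split()
    config = {}
    for pair in pairs:
        # Split on the first "=" only, so values that themselves contain
        # "=" (such as base64 padding) are kept whole.
        name, _, value = pair.partition("=")
        config[name] = value
    # Assumption: the timestamp looks like timestamp=1317929330.769997s.
    stamp = config.pop("timestamp", None)
    when = float(stamp.rstrip("s")) if stamp else None
    return uuid, config, when
```

For instance, a made-up line such as `26339d22-446b-11e0-9101-002170d25c55 type=S3 bucket=mybucket timestamp=1317929330.769997s` would come back as the uuid, the dictionary `{"type": "S3", "bucket": "mybucket"}`, and the timestamp as a float.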

Encrypted special remotes store their encryption key here,
in the "cipher" value. It is base64 encoded, and unless shared [[encryption]]
is used, is encrypted to one or more gpg keys. The first 256 bytes of
the cipher are used as the HMAC SHA1 encryption key, to encrypt filenames
stored on the special remote. The remainder of the cipher is used as a gpg
symmetric encryption key, to encrypt the content of files stored on the special
remote.
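
The split described above can be sketched in Python as follows. This assumes you already hold the cipher bytes in usable form (base64 decoded, and gpg decrypted where applicable); the helper names are hypothetical, and exactly which string git-annex feeds to the HMAC, and how the result is turned into a filename, is not covered here.

```python
import hashlib
import hmac

HMAC_KEY_BYTES = 256  # "the first 256 bytes of the cipher"


def split_cipher(cipher):
    """Split cipher bytes into (HMAC key, symmetric encryption key)."""
    return cipher[:HMAC_KEY_BYTES], cipher[HMAC_KEY_BYTES:]


def hmac_sha1_name(hmac_key, name):
    """Return the HMAC SHA1 of a name as a hex string (illustrative only)."""
    return hmac.new(hmac_key, name.encode("utf-8"), hashlib.sha1).hexdigest()
```

The same name always maps to the same 40 character hex digest under a given key, which is what lets a remote be queried for an object without revealing its real name.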

### `trust.log`

Records the [[trust]] information for repositories. Does not exist unless
[[trust]] values are configured.

The file format is one line per repository, with the uuid followed by a
space, and then either `1` (trusted), `0` (untrusted), `?` (semi-trusted),
or `X` (dead), and finally a timestamp.

Example:

	e605dca6-446a-11e0-8b2a-002170d25c55 1 timestamp=1317929189.157237s
	26339d22-446b-11e0-9101-002170d25c55 ? timestamp=1317929330.769997s

Repositories not listed are semi-trusted.
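
A small Python sketch of reading this file, including the semi-trusted default for unlisted repositories. The names `parse_trust_log` and `trust_level` are hypothetical, and letting the newest line win per uuid is an assumption drawn from the timestamp note earlier in this page.

```python
TRUST_LEVELS = {"1": "trusted", "0": "untrusted", "?": "semi-trusted", "X": "dead"}
PREFIX = "timestamp="


def parse_trust_log(text):
    """Map each listed UUID to its most recently recorded trust level."""
    newest = {}  # uuid -> (timestamp, level name)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # not a "uuid level timestamp" line; skipped
        uuid, level, stamp = fields
        if level not in TRUST_LEVELS or not stamp.startswith(PREFIX):
            continue
        when = float(stamp[len(PREFIX):].rstrip("s"))
        if uuid not in newest or when > newest[uuid][0]:
            newest[uuid] = (when, TRUST_LEVELS[level])
    return {uuid: level for uuid, (when, level) in newest.items()}


def trust_level(levels, uuid):
    """Repositories not listed are semi-trusted."""
    return levels.get(uuid, "semi-trusted")
```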

### `group.log`

Used to group repositories together.

The file format is one line per repository, with the uuid followed by a space,
and then a space-separated list of groups this repository is part of,
and finally a timestamp.

### `preferred-content.log`

Used to indicate which repositories prefer to contain which file contents.

The file format is one line per repository, with the uuid followed by a space,
then a boolean expression, and finally a timestamp.

Files matching the expression are preferred to be retained in the
repository, while files not matching it are preferred to be stored
somewhere else.

### `required-content.log`

Used to indicate which repositories are required to contain which file
contents.

The file format is identical to that of `preferred-content.log`.

### `group-preferred-content.log`

Contains standard preferred content settings for groups, overriding or
supplementing the ones built into git-annex.

The file format is one line per group, starting with a timestamp, then a
space, then the group name followed by a space and then the preferred
content expression.

### `aaa/bbb/*.log`

These log files record [[location_tracking]] information
for file contents. These are placed in two levels of subdirectories
for hashing. See [[hashing]] for details.

The name of the key is the filename, and the content
consists of a timestamp, either 1 (present) or 0 (not present), and
the UUID of the repository that has or lacks the file content.

Example:

	1287290776.765152s 1 e605dca6-446a-11e0-8b2a-002170d25c55
	1287290767.478634s 0 26339d22-446b-11e0-9101-002170d25c55
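
To work out which repositories currently have the content, the newest line for each UUID is the one that counts. Here is a minimal Python sketch under that assumption; `present_repositories` is a hypothetical name, not part of git-annex.

```python
def present_repositories(text):
    """Return the set of UUIDs whose newest log line says 1 (present)."""
    latest = {}  # uuid -> (timestamp, present)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[1] not in ("0", "1"):
            continue  # not a "timestamp status uuid" line; skipped
        stamp, status, uuid = fields
        when = float(stamp.rstrip("s"))
        # Assumption: the newest line for a UUID overrides older ones.
        if uuid not in latest or when > latest[uuid][0]:
            latest[uuid] = (when, status == "1")
    return {uuid for uuid, (when, present) in latest.items() if present}
```

With the example above, only `e605dca6-446a-11e0-8b2a-002170d25c55` is reported as having the content.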

### `aaa/bbb/*.log.web`

These log files record urls used by the
[[web_special_remote|special_remotes/web]]. Their format is similar
to the location tracking files, but with urls rather than UUIDs.

### `aaa/bbb/*.log.rmt`

These log files are used by remotes that need to record their own state
about keys. Each remote can store one line of data about a key, in
its own format.

Example:

	1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55 blah blah
	1287290767.478634s 26339d22-446b-11e0-9101-002170d25c55 foo=bar

### `aaa/bbb/*.log.met`

These log files are used to store arbitrary [[design/metadata]] about keys.
Each key can have any number of metadata fields. Each field has a set of
values.

Lines are timestamped, and record when values are added (`field +value`),
but also when values are removed (`field -value`). Removed values
are retained in the log so that when merging an old line that sets a value
that was later unset, the value is not accidentally added back.

For example:

	1287290776.765152s tag +foo +bar author +joey
	1291237510.141453s tag -bar +baz

The value can be completely arbitrary data, although it's typically
reasonably short. If the value contains any whitespace
(including \r or \n), it will be base64 encoded. Base64 encoded values
are indicated by prefixing them with "!".
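
The add/remove bookkeeping can be sketched in Python like this. It is a simplified reading of the format described above, not git-annex's own parser: the names are hypothetical, field names are assumed never to start with `+` or `-`, and decoded values are assumed to be UTF-8 text.

```python
import base64
from collections import defaultdict


def decode_value(raw):
    """Values prefixed with "!" are base64 encoded."""
    if raw.startswith("!"):
        return base64.b64decode(raw[1:]).decode("utf-8")
    return raw


def current_metadata(text):
    """Return {field: set of values} after applying all log lines."""
    latest = {}  # (field, value) -> (timestamp, is_set)
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        when = float(tokens[0].rstrip("s"))
        field = None
        for token in tokens[1:]:
            if token[0] in "+-" and field is not None:
                key = (field, decode_value(token[1:]))
                # The newest record for a (field, value) pair wins, so an
                # old "+value" cannot undo a later "-value" after a merge.
                if key not in latest or when > latest[key][0]:
                    latest[key] = (when, token[0] == "+")
            else:
                field = token
    result = defaultdict(set)
    for (field, value), (when, is_set) in latest.items():
        if is_set:
            result[field].add(value)
    return dict(result)
```

Because each value carries its own newest timestamp, the result does not depend on the order of the lines in the file, which is what a union merge needs.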

### `aaa/bbb/*.log.cnk`

These log files are used when objects are stored in chunked form on
remotes. They record the size(s) of the chunks, and the number of chunks.

For example, this logs that a remote has an object stored using both
9 chunks of 1 mb size, and 1 chunk of 10 mb size.

	1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
	1287290776.765153s e605dca6-446a-11e0-8b2a-002170d25c55:102400 1

(When those chunks are removed from the remote, the 9 is changed to 0.)
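
Reading the example lines as `timestamp uuid:chunksize count`, a chunk log could be summarized with a sketch like the one below. The function name is hypothetical, the unit of the chunk size is left as whatever the log records, and treating a count of 0 as "no longer stored" follows the note above.

```python
def parse_chunk_log(text):
    """Map (uuid, chunk size) to the newest non-zero chunk count."""
    latest = {}  # (uuid, chunk size) -> (timestamp, count)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        stamp, target, count = fields
        uuid, _, size = target.rpartition(":")
        if not uuid:
            continue  # no "uuid:size" pair on this line; skipped
        when = float(stamp.rstrip("s"))
        key = (uuid, int(size))
        if key not in latest or when > latest[key][0]:
            latest[key] = (when, int(count))
    # A count of 0 means those chunks were removed from the remote.
    return {key: count for key, (when, count) in latest.items() if count > 0}
```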

### `schedule.log`

Used to record scheduled events, such as periodic fscks.

The file format is simply one line per repository, with the uuid followed by a
space and then its schedule, followed by a timestamp.

There can be multiple events in the schedule, separated by "; ".

The format of the scheduled events is the same as described in
the SCHEDULED JOBS section of the man page.

Example:

	42bf2035-0636-461d-a367-49e9dfd361dd fsck self 30m every day at any time; fsck 4b3ebc86-0faf-4892-83c5-ce00cbe30f0a 1h every year at any time timestamp=1385646997.053162s
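
Splitting such a line into its parts is straightforward. In this Python sketch, `parse_schedule_line` is a hypothetical helper, and the individual events are left as plain strings rather than being parsed further.

```python
def parse_schedule_line(line):
    """Split one schedule.log line into (uuid, list of events, timestamp)."""
    body, _, stamp = line.strip().rpartition(" timestamp=")
    if not body:
        raise ValueError("no timestamp field found")
    uuid, _, schedule = body.partition(" ")
    # Multiple events are separated by "; ".
    events = [event for event in schedule.split("; ") if event]
    return uuid, events, float(stamp.rstrip("s"))
```

Applied to the example line above, this gives the uuid `42bf2035-0636-461d-a367-49e9dfd361dd` and two events, one for `fsck self` and one for the fsck of the other repository.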

### `transitions.log`

Used to record transitions, e.g. by `git annex forget`.

Each line of the file is a transition, followed by a timestamp.

Example:

	ForgetGitHistory 1387325539.685136s
	ForgetDeadRemotes 1387325539.685136s