`git annex export`, corresponding to import. This might be useful for eg,
datalad. There are some requests to make eg an S3 bucket mirror the
filenames in the git annex repository with incremental updates. That seems
out of scope (there are many tools to do that kind of thing; search
"deploy files to S3 bucket"), but something simpler like `git annex export`
could be worth doing.

`git annex export --to remote files` would copy the files to the remote,
using the names in the working tree. For remotes like S3, it could add the
url of the exported file, so that another clone of the repo could use the
exported data.

Would this be able to reuse the existing `storeKey` interface, or would
there need to be a new interface in supported remotes?

--[[Joey]]
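
If a new interface turns out to be needed, one possible shape for it is
sketched below in Haskell. Every name and signature here is hypothetical,
not existing git-annex API; the point is that each operation is
parameterized on the filename in the exported tree, since the remote
stores files under tree names rather than key names.

    -- Hypothetical stand-ins for git-annex's real types.
    newtype Key = Key String
    newtype ExportLocation = ExportLocation FilePath

    -- Sketch: per-remote export operations, analogous to storeKey etc,
    -- but keyed on the filename in the exported tree.
    data ExportActions m = ExportActions
      { storeExport :: FilePath -> Key -> ExportLocation -> m Bool
        -- upload a local file's content under the given name
      , retrieveExport :: Key -> ExportLocation -> FilePath -> m Bool
        -- download an exported file (what get --from would use)
      , removeExport :: Key -> ExportLocation -> m Bool
        -- delete a file that is no longer in the exported tree
      , renameExport :: Key -> ExportLocation -> ExportLocation -> m Bool
        -- rename on the remote, avoiding a re-upload when a file
        -- merely moves within the tree
      }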

Work is in progress. Todo list:

* `git annex get --from export` works in the repo that exported to it,
  but in another repo, the export db won't be populated, so it won't work.
  Maybe just show a useful error message in this case?  

  However, exporting from one repository and then trying to update the
  export from another repository also doesn't work right, because the
  export database is not populated. So, seems that the export database needs
  to get populated based on the export log in these cases.  

  This needs the db to contain a record of the data source,
  the tree that most recently populated it.

  When the export log contains a different tree than the data source,
  the export was updated in another repository, and so the
  export db needs to be updated.

  Updating the export db could diff the data source against the
  logged treeish, and add/delete exported files from the database to
  bring it to the same state as the database in the repository that
  last updated the export.

  When an export is incomplete, the database is in some
  state in between the data source tree and the incompletely
  exported tree. Diffing won't resolve this.

  When to record the data source? If it's done at the same time the export
  is recorded (as no longer incomplete) in the export log, not all the
  files have yet been uploaded to the export, and the database is not
  fully updated to match the data source.

  Seems that we need a separate table, to be able to look up filenames
  from the export tree by key. That table can be fully populated
  before the Exported table is. (See the sketch after this list.)

* tracking exports

* Support configuring export in the assistant
  (when eg setting up an S3 special remote).

  This is similar to the little-used preferreddir= preferred content
  setting and the "public" repository group. The assistant uses
  those for IA, which could be replaced with setting up an export
  tracking branch.
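
To make the database discussion above concrete, here is a minimal Haskell
sketch of what the export database needs to record. The names are
illustrative only, and the real thing is an sqlite database rather than an
in-memory value; the shape is what matters.

    import Data.Map (Map)

    -- Illustrative stand-ins for git-annex's real types.
    newtype Sha = Sha String
    newtype Key = Key String
    newtype ExportLocation = ExportLocation FilePath

    data ExportDb = ExportDb
      { dataSource :: Sha
        -- the tree that most recently populated this database; when
        -- the export log records a different tree, the export was
        -- updated in another repository and this db must be caught up
      , exported :: Map Key [ExportLocation]
        -- files believed to be currently stored on the remote; while
        -- an export is incomplete this is somewhere between the data
        -- source tree and the new tree
      , exportTree :: Map Key [ExportLocation]
        -- the separate key -> filename table for the tree being
        -- exported, fully populated before uploads begin, so filenames
        -- can be looked up by key even mid-export
      }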

Low priority:

* When there are two pairs of duplicate files, and the filenames are
  swapped around, the current rename handling renames both dups to a single
  temp file, and so the other file in the pair gets re-uploaded
  unnecessarily. This could be improved.

  Perhaps: Find pairs of renames that swap content between two files.
  Run each pair in turn. Then run the current rename code. Although this
  still probably misses cases where, eg, content cycles among 3 files, and
  the same content cycles among 3 other files. Is there a general
  algorithm? (One possibility is sketched below.)
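
One general approach would be the cycle decomposition used for in-place
permutation: treat the conflicting renames as a permutation of locations,
split it into cycles, and spend one temp file per cycle rather than one
per file. A minimal Haskell sketch, assuming a hypothetical `rename`
action supplied by the remote:

    import qualified Data.Map.Strict as M

    -- Illustrative stand-in: a location (filename) on the remote.
    type Loc = FilePath

    -- Decompose a rename map (content at the key location must end up
    -- at the value location) into its cycles.
    cycles :: M.Map Loc Loc -> [[Loc]]
    cycles = go
      where
        go m = case M.lookupMin m of
          Nothing -> []
          Just (start, _) ->
            let (c, m') = walk start start m
            in c : go m'
        walk start cur m = case M.lookup cur m of
          Just nxt | nxt /= start ->
            let (rest, m') = walk start nxt (M.delete cur m)
            in (cur : rest, m')
          _ -> ([cur], M.delete cur m)

    -- Run one cycle [x1, x2, ...] (x1's content belongs at x2, and the
    -- last location's content belongs at x1) using one temp location:
    -- move x1 aside, fill each vacated slot from its predecessor in
    -- reverse order, then move the temp file into x2.
    runCycle :: Monad m => (Loc -> Loc -> m ()) -> Loc -> [Loc] -> m ()
    runCycle rename tmp (x1 : xs@(x2 : _)) = do
      rename x1 tmp
      mapM_ (uncurry rename) (reverse (zip xs (drop 1 xs ++ [x1])))
      rename tmp x2
    runCycle _ _ _ = return () -- singleton cycles are already in place

Each cycle then costs one extra rename on the remote instead of a
re-upload, and cycles are independent of one another, so they can be run
in any order.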