summaryrefslogtreecommitdiff
path: root/doc/todo/Limit_file_revision_history.mdwn
blob: c00b555b18b4fd43b1a21e89b89846c7a323c4e4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
Hi, I am assuming to use git-annex-assistant for two usecases, but I would like to ask about the options or planed roadmap for dropped/removed files from the repository.

Usecases:

1. sync working directory between laptop, home computer, work komputer
2. archive functionality for my photograps

Both usecases have one common factor. Some files might become obsolate and
in long time frame nobody is interested to keep their revisions. Let's
assume photographs. Usuall workflow I take is to import all photograps to
filesystem, then assess (select) the good ones I want to keep and then
process them what ever way.

Problem with git-annex(-assistant) I have is that it start to revision all
of the files at the time they are added to directory. This is welcome at
first but might be an issue if you are used to put 80% of the size of your
imported files to trash.

I am aware of what git-annex is not. I have been reading documentation for
"git-annex drop" and "unused" options including forums. I do understand
that I am actually able to delete all revisions of the file if I will drop
it, remove it and if I will run git annex unused 1..###. (on all synced
repositories).

I actually miss the option to have above process automated/replicated to the other synced repositories.

I would formulate the 'use case' requirements for git-annex as:

* command to drop an file including revisions from all annex repositories?
  (for example like moving a file to /trash folder) that will schedulle
  it's deletition)
* option to keep like max. 10 last revisions of the file?
* option to keep only previous revisions if younger than 6 months from now?

Finally, how to specify a feature request for git-annex?

> By moving it here ;-) --[[Joey]] 

> So, let's spec out a design.
> 
> * Add preferred content terminal to configure whether a repository wants
>   to hang on to unused content. Simply `unused`.
>   (It cannot include a timestamp, because there's
>   no way repos can agree on about when a key became unused.) **done**
> * In order to quickly match that terminal, the Annex monad will need
>   to keep a Set of unused Keys. This should only be loaded on demand.
>   **done**  
>   NB: There is some potential for a great many unused Keys to cause
>   memory usage to balloon.
> * Client repositories will end their preferred content with
>   `and (not unused)`. Transfer repositories too, because typically
>   only client repos connect to them, and so otherwise unused files
>   would build up there. Backup repos would want unused files. I
>   think that archive repos would too.
> * Make the assistant check for unused files periodically. Exactly
>   how often may need to be tuned, but once per day seems reasonable
>   for most repos. Note that the assistant could also notice on the
>   fly when files are removed and mark their keys as unused if that was
>   the last associated file. (Only currently possible in direct mode.)
> * After scanning for unused files, it makes sense for the
>   assistant to queue transfers of unused files to any remotes that
>   do want them (eg, backup remotes). If the files can successfully be
>   sent to a remote, that will lead to them being dropped locally as
>   they're not wanted.
> * Add a git config setting like annex.expireunused=7d. This causes
>   *deletion* of unused files after the specified time period if they are
>   not able to be moved to a repo that wants them.
>   (The default should be annex.expireunused=false.)
> * How to detect how long a file has been unused? We can't look at the
>   time stamp of the object; we could use the mtime of the .map file,
>   that that's direct mode only and may be replaced with a database
>   later. Seems best to just keep a unused log file with timestamps.
>   **done**
> * After the assistant scans for unused files, if annex.expireunused
>   is not set, and there is some significant quantity of unused files
>   (eg, more than 1000, or more than 1 gb, or more than the amount of
>   remaining free disk space),
>   it can pop up a webapp alert asking to configure it.
> 
> This does not cover every use case that was requested.
> But I don't see a cheap way to ensure it keeps eg the past 10 versions of
> a file. I guess that if you care about that, you leave
> annex.expireunused=false, and set up a backup repository where the unused
> files will be moved to.
> 
> Note that since the assistant uses direct mode by default, old versions
> of modififed files are not guaranteed to be retained. But they very well
> might be. For example, if a file is replicated to 2 clients, and one
> client directly edits it, or deletes it, it loses the old version,
> but the other client will still be storing that old version.
> 
> ## Stability analysis for unused in preferred content expressions
> 
> This is tricky, because two repos that are otherwise entirely
> in sync may have differing opinons about whether a key is unused,
> depending on when each last scanned for unused keys.
> 
> So, this preferred content terminal is *not stable*.
> It may be possible to write preferred content expressions 
> that constantly moved such keys around without reaching a steady state.
> 
> Example:
> 
> A and B are clients directly connected, and both also connected
> to BACKUP.
> 
> A deletes F. B syncs with A, and runs unused check; decides F
> is unused. B sends F to BACKUP. B will then think A doesn't want F,
> and will drop F from A. Next time A runs a full transfer scan, it will
> *not* find F (because the file was deleted!). So it won't get F back from
> BACKUP.
> 
> So, it looks like the fact that unused files are not going to be
> looked for on the full transfer scan seems to make this work out ok.