| Commit message | Author | Age |
|
sometimes not replace pointer files.
The keys database handle needs to be closed after merging, because the
smudge filter, in another process, updates the database. Old cached info
can be read for a while from the open database handle; closing it ensures
that the info written by the smudge filter is available.
This is pretty horribly ad-hoc, and it's especially nasty that the
transferrer closes the database every time.
|
representable in the current locale.
This is a mostly backwards compatible change. I broke backwards
compatibility in the case where a filename starts with double-quote.
That seems likely to be very rare, and v6 unlocked files are a new feature
anyway, and fsck needs to fix missing associated file mappings anyway. So,
I decided that is good enough.
The encoding used is to just show the String when it contains a problem
character. While that adds some overhead to addAssociatedFile and
removeAssociatedFile, those are not called very often. This approach has
minimal decode overhead, because most filenames won't be encoded that way,
and it only has to look for the leading double-quote to skip the expensive
read. So, getAssociatedFiles remains fast.
I did consider using ByteString instead, but getting a FilePath converted
with all chars intact, even surrogates, is difficult, and it looks like
instance PersistField ByteString uses Text, which I don't trust for problem
encoded data. It would probably be slower too, and it would make the
database less easy to inspect manually.
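The encoding described in this commit message can be sketched roughly as follows. This is only an illustration, not the actual git-annex code: the names encodeFileName, decodeFileName and isProblemChar are hypothetical, and the test for a "problem" character (anything outside printable ASCII) is an assumption, since the real check depends on the current locale.

```haskell
import Data.Char (isAscii, isPrint)

-- Assumption: a "problem" character is anything outside printable ASCII.
-- The real check (against the current locale) is not shown in the commit.
isProblemChar :: Char -> Bool
isProblemChar c = not (isAscii c && isPrint c)

-- Encode with show only when needed. A leading double-quote must also be
-- encoded, since the decoder treats it as the marker for encoded names.
encodeFileName :: FilePath -> String
encodeFileName f
    | any isProblemChar f || take 1 f == "\"" = show f
    | otherwise = f

-- Decoding only pays for the expensive read when the leading
-- double-quote marker is present.
decodeFileName :: String -> FilePath
decodeFileName s@('"':_) = case reads s of
    [(f, "")] -> f
    _ -> s -- not a valid show-encoded string; return it unchanged
decodeFileName s = s
```

Most filenames pass through both functions unchanged, which is why the decode side stays cheap.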
|
This lets readonly repos be used. If a repo is readonly, we can ignore the
keys database, because nothing that we can do will change the state of the
repo anyway.
|
This breaks any existing keys database!
IKey serializes more efficiently than SKey, although this limits the
use of its Read/Show instances.
This makes the keys database use less disk space, and so should be a win.
Updated benchmark:
benchmarking keys database/getAssociatedFiles from 1000 (hit)
time 64.04 μs (63.95 μs .. 64.13 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.02 μs (63.96 μs .. 64.08 μs)
std dev 218.2 ns (172.5 ns .. 299.3 ns)
benchmarking keys database/getAssociatedFiles from 1000 (miss)
time 52.53 μs (52.18 μs .. 53.21 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 52.31 μs (52.18 μs .. 52.91 μs)
std dev 734.6 ns (206.2 ns .. 1.623 μs)
benchmarking keys database/getAssociatedKey from 1000 (hit)
time 64.60 μs (64.46 μs .. 64.77 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.74 μs (64.57 μs .. 65.20 μs)
std dev 900.2 ns (389.7 ns .. 1.733 μs)
benchmarking keys database/getAssociatedKey from 1000 (miss)
time 52.46 μs (52.29 μs .. 52.68 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 52.63 μs (52.35 μs .. 53.37 μs)
std dev 1.362 μs (562.7 ns .. 2.608 μs)
variance introduced by outliers: 24% (moderately inflated)
benchmarking keys database/addAssociatedFile to 1000 (old)
time 487.3 μs (484.7 μs .. 490.1 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 490.9 μs (487.8 μs .. 496.5 μs)
std dev 13.95 μs (6.841 μs .. 22.03 μs)
variance introduced by outliers: 20% (moderately inflated)
benchmarking keys database/addAssociatedFile to 1000 (new)
time 6.633 ms (5.741 ms .. 7.751 ms)
0.905 R² (0.850 R² .. 0.965 R²)
mean 8.252 ms (7.803 ms .. 8.602 ms)
std dev 1.126 ms (900.3 μs .. 1.430 ms)
variance introduced by outliers: 72% (severely inflated)
benchmarking keys database/getAssociatedFiles from 10000 (hit)
time 65.36 μs (64.71 μs .. 66.37 μs)
0.998 R² (0.995 R² .. 1.000 R²)
mean 65.28 μs (64.72 μs .. 66.45 μs)
std dev 2.576 μs (920.8 ns .. 4.122 μs)
variance introduced by outliers: 42% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (miss)
time 52.34 μs (52.25 μs .. 52.45 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 52.49 μs (52.42 μs .. 52.59 μs)
std dev 255.4 ns (205.8 ns .. 312.9 ns)
benchmarking keys database/getAssociatedKey from 10000 (hit)
time 64.76 μs (64.67 μs .. 64.84 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.67 μs (64.62 μs .. 64.72 μs)
std dev 177.3 ns (148.1 ns .. 217.1 ns)
benchmarking keys database/getAssociatedKey from 10000 (miss)
time 52.75 μs (52.66 μs .. 52.82 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 52.69 μs (52.63 μs .. 52.75 μs)
std dev 210.6 ns (173.7 ns .. 265.9 ns)
benchmarking keys database/addAssociatedFile to 10000 (old)
time 489.7 μs (488.7 μs .. 490.7 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 490.4 μs (489.6 μs .. 492.2 μs)
std dev 3.990 μs (2.435 μs .. 7.604 μs)
benchmarking keys database/addAssociatedFile to 10000 (new)
time 9.994 ms (9.186 ms .. 10.74 ms)
0.959 R² (0.928 R² .. 0.979 R²)
mean 9.906 ms (9.343 ms .. 10.40 ms)
std dev 1.384 ms (1.051 ms .. 2.100 ms)
variance introduced by outliers: 69% (severely inflated)
|
This is a schema change so will break any existing keys databases. But,
it's not been released yet, so I'm still able to make such changes.
This speeds up the benchmark quite nicely:
benchmarking keys database/getAssociatedKey from 1000 (hit)
time 91.65 μs (91.48 μs .. 91.81 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 91.78 μs (91.66 μs .. 91.94 μs)
std dev 468.3 ns (353.1 ns .. 624.3 ns)
benchmarking keys database/getAssociatedKey from 1000 (miss)
time 53.33 μs (53.23 μs .. 53.40 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 53.43 μs (53.36 μs .. 53.53 μs)
std dev 274.2 ns (211.7 ns .. 361.5 ns)
benchmarking keys database/getAssociatedKey from 10000 (hit)
time 92.99 μs (92.74 μs .. 93.27 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 92.90 μs (92.76 μs .. 93.16 μs)
std dev 608.7 ns (404.1 ns .. 963.5 ns)
benchmarking keys database/getAssociatedKey from 10000 (miss)
time 53.12 μs (52.91 μs .. 53.39 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 52.84 μs (52.68 μs .. 53.16 μs)
std dev 715.4 ns (400.4 ns .. 1.370 μs)
|
The benchmark shows that the database access is quite fast indeed!
And, lookup time stays nearly constant as the number of keys grows from
1000 to 10000, with one exception, getAssociatedKey.
Based on this benchmark, I don't think I need worry about optimising
for cases where all files are locked and the database is mostly empty.
In those cases, database access will be misses, and according to this
benchmark, each miss should add only about 50 microseconds to runtime.
(NB: There may be some overhead to getting the database opened and locking
the handle that this benchmark doesn't see.)
joey@darkstar:~/src/git-annex>./git-annex benchmark
setting up database with 1000
setting up database with 10000
benchmarking keys database/getAssociatedFiles from 1000 (hit)
time 62.77 μs (62.70 μs .. 62.85 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 62.81 μs (62.76 μs .. 62.88 μs)
std dev 201.6 ns (157.5 ns .. 259.5 ns)
benchmarking keys database/getAssociatedFiles from 1000 (miss)
time 50.02 μs (49.97 μs .. 50.07 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.09 μs (50.04 μs .. 50.17 μs)
std dev 206.7 ns (133.8 ns .. 295.3 ns)
benchmarking keys database/getAssociatedKey from 1000 (hit)
time 211.2 μs (210.5 μs .. 212.3 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 211.0 μs (210.7 μs .. 212.0 μs)
std dev 1.685 μs (334.4 ns .. 3.517 μs)
benchmarking keys database/getAssociatedKey from 1000 (miss)
time 173.5 μs (172.7 μs .. 174.2 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 173.7 μs (173.0 μs .. 175.5 μs)
std dev 3.833 μs (1.858 μs .. 6.617 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (hit)
time 64.01 μs (63.84 μs .. 64.18 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.85 μs (64.34 μs .. 66.02 μs)
std dev 2.433 μs (547.6 ns .. 4.652 μs)
variance introduced by outliers: 40% (moderately inflated)
benchmarking keys database/getAssociatedFiles from 10000 (miss)
time 50.33 μs (50.28 μs .. 50.39 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 50.32 μs (50.26 μs .. 50.38 μs)
std dev 202.7 ns (167.6 ns .. 252.0 ns)
benchmarking keys database/getAssociatedKey from 10000 (hit)
time 1.142 ms (1.139 ms .. 1.146 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.142 ms (1.140 ms .. 1.144 ms)
std dev 7.142 μs (4.994 μs .. 10.98 μs)
benchmarking keys database/getAssociatedKey from 10000 (miss)
time 1.094 ms (1.092 ms .. 1.096 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.095 ms (1.095 ms .. 1.097 ms)
std dev 4.277 μs (2.591 μs .. 7.228 μs)
|
The repo path is typically relative, not absolute, so
providing it to absPathFrom doesn't yield an absolute path.
This is not a bug, just unclear documentation.
Indeed, there seems to be no reason to simplifyPath here, which absPathFrom
does, so instead just combine the repo path and the TopFilePath.
Also, removed an export of the TopFilePath constructor; asTopFilePath
is provided to construct one as-is.
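A rough sketch of the idea, for illustration only. Here fromTopFilePath takes the repo path directly as a FilePath, which is a simplifying assumption; the real function's signature is not given in the commit message.

```haskell
import System.FilePath ((</>))

-- A path relative to the top of the repository. In a real module the
-- constructor would not be exported, only asTopFilePath.
newtype TopFilePath = TopFilePath { getTopFilePath :: FilePath }
    deriving (Show, Eq)

-- Construct a TopFilePath as-is, trusting the caller that the path
-- really is relative to the top of the repository.
asTopFilePath :: FilePath -> TopFilePath
asTopFilePath = TopFilePath

-- Hypothetical simplification: just combine the (possibly relative)
-- repo path with the TopFilePath, without simplifying or absolutizing.
fromTopFilePath :: TopFilePath -> FilePath -> FilePath
fromTopFilePath p repoPath = repoPath </> getTopFilePath p
```

With a relative repo path the result is still relative, which matches the point made above about the documentation.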
|
Fixes several bugs with updates of pointer files. When eg, running
git annex drop --from localremote
it was updating the pointer file in the local repository, not the remote.
Also, fixes drop ../foo when run in a subdir, and probably lots of other
problems. Test suite drops from ~30 to 11 failures now.
TopFilePath is used to force thinking about what the filepath is relative
to.
The data stored in the sqlite db is still just a plain string, and
TopFilePath is a newtype, so there's no overhead involved in using it in
Database.Keys.
|
Writes are optimised by queueing up multiple writes when possible.
The queue is flushed after the Annex monad action finishes. That makes it
happen on program termination, and also whenever a nested Annex monad action
finishes.
Reads are optimised by checking once (per AnnexState) if the database
exists. If the database doesn't exist yet, all reads return mempty.
Reads also cause queued writes to be flushed, so reads will always be
consistent with writes (as long as they're made inside the same Annex monad).
A future optimisation path would be to determine when that's not necessary,
which is probably most of the time, and avoid flushing unnecessarily.
Design notes for this commit:
- separate reads from writes
- reuse a handle which is left open until program
exit or until the MVar goes out of scope (and autoclosed then)
- writes are queued
- queue is flushed periodically
- immediate queue flush before any read
- auto-flush queue when database handle is garbage collected
- flush queue on exit from Annex monad
(Note that this may happen repeatedly for a single database connection;
or a connection may be reused for multiple Annex monad actions,
possibly even concurrent ones.)
- if database does not exist (or is empty) the handle
is not opened by reads; reads instead return empty results
- writes open the handle if it was not open previously
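The queue-then-flush-before-read part of these notes can be sketched like this. It is a toy model under stated assumptions: the "database" is an in-memory list held in an MVar rather than sqlite, and the names DbHandle, newDbHandle, queueWrite, flushQueue and getFiles are hypothetical stand-ins, not the actual git-annex API.

```haskell
import Control.Concurrent.MVar

-- Hypothetical stand-in for the database: an in-memory list of
-- (key, file) rows, plus a queue of writes not yet applied to it.
data DbHandle = DbHandle
    { dbRows :: MVar [(String, FilePath)]
    , dbQueue :: MVar [(String, FilePath)]
    }

newDbHandle :: IO DbHandle
newDbHandle = DbHandle <$> newMVar [] <*> newMVar []

-- Writes are only queued (newest first), not applied immediately.
queueWrite :: DbHandle -> String -> FilePath -> IO ()
queueWrite h k f = modifyMVar_ (dbQueue h) (pure . ((k, f) :))

-- Empty the queue and apply its writes in the order they were made.
flushQueue :: DbHandle -> IO ()
flushQueue h = do
    pending <- modifyMVar (dbQueue h) (\q -> pure ([], reverse q))
    modifyMVar_ (dbRows h) (pure . (++ pending))

-- Reads flush the queue first, so they are consistent with earlier
-- writes made through the same handle.
getFiles :: DbHandle -> String -> IO [FilePath]
getFiles h k = do
    flushQueue h
    rows <- readMVar (dbRows h)
    pure [f | (k', f) <- rows, k' == k]
```

The sketch leaves out the periodic flush, the flush on exit from the Annex monad, and the lazy opening of the handle.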
|
Fsck can use the queue for efficiency since it is write-heavy, and only
reads a value before writing it. But, the queue is not suited to the Keys
database.
|
The problem is that shutdown is not always called, particularly in the test
suite. So, a database connection would be opened, possibly some changes
queued, and then not shut down.
One way this can happen is when using Annex.eval or Annex.run with a new
state. A better fix might be to make both of them call Keys.shutdown
(and be sure to do it even if the annex action threw an error).
Complication: Sometimes they're run reusing an existing state, so shutting
down a database connection could cause problems for other users of that
same state. I think this would need a MVar holding the database handle,
so it could be emptied once shut down, and another user of the database
connection could then start up a new one if it got shut down. But, what if
2 threads were concurrently using the same database handle and one shut it
down while the other was writing to it? Urgh.
Might have to go that route eventually to get the database access to run
fast enough. For now, a quick fix to get the test suite happier, at the
expense of speed.
|
If a DbHandle is in use by another thread, it could be queueing changes
while shutdown is running. So, wait for the worker to finish before
flushing the queue, so that any last-minute writes are included. Before
this fix, they would be silently dropped.
Of course, if the other thread continues to try to use a DbHandle once it's
closed, it will block forever as the worker is no longer reading from the
jobs MVar. So, that would crash with
"thread blocked indefinitely in an MVar operation".
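A much simplified sketch of waiting for the worker on shutdown. It assumes a single worker thread that takes jobs from an MVar and applies them in order; the Job type and the names openDb, queueChange and closeDb are hypothetical, and an IORef stands in for the database.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Data.IORef

-- A job is either a write, or a request to stop that carries an MVar
-- the worker fills once it has finished.
data Job = Write String | Close (MVar ())

data DbHandle = DbHandle
    { jobs :: MVar Job
    , written :: IORef [String] -- stand-in for the database contents
    }

openDb :: IO DbHandle
openDb = do
    j <- newEmptyMVar
    w <- newIORef []
    _ <- forkIO (worker j w)
    pure (DbHandle j w)

-- The worker handles jobs one at a time, and exits on Close.
worker :: MVar Job -> IORef [String] -> IO ()
worker j w = do
    job <- takeMVar j
    case job of
        Write s -> modifyIORef' w (s :) >> worker j w
        Close done -> putMVar done ()

queueChange :: DbHandle -> String -> IO ()
queueChange h s = putMVar (jobs h) (Write s)

-- Wait for the worker to acknowledge the Close job before looking at
-- the results, so writes submitted earlier are not dropped.
closeDb :: DbHandle -> IO [String]
closeDb h = do
    done <- newEmptyMVar
    putMVar (jobs h) (Close done)
    takeMVar done
    reverse <$> readIORef (written h)
```

After closeDb, nothing takes from the jobs MVar any more, so a second queueChange on the same handle would block once the MVar is full, which is the situation described above.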
|
I guess this is just as efficient as the getAssociatedFiles query, but I
have not tried to optimise the database yet.
|
The Keys database can hold multiple inode caches for a given key. One for
the annex object, and one for each pointer file, which may not be hard
linked to it.
Inode caches for a key are recorded when its content is added to the annex,
but only if it has known pointer files. This is to avoid the overhead of
maintaining the database when not needed.
When the smudge filter outputs a file's content, the inode cache is not
updated, because git's smudge interface doesn't let us write the file. So,
dropping will fall back to doing an expensive verification then. Ideally,
git's interface would be improved, and then the inode cache could be
updated then too.
|
Renamed the db to keys, since it holds various info about a Key.
Dropping a key will update its pointer files, as long as their content can
be verified to be unmodified. This falls back to checksum verification, but
I want it to use an InodeCache of the key, for speed. But, I have not made
anything populate that cache yet.
|
encoded filenames on stderr when using --incremental.
|
every 5 minutes, whichever comes first.
Previously, commits were made every 1000 files fscked.
Also, improve docs
|
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has its own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent still needs to be converted.
|
The explicit import Prelude after import Control.Applicative is a trick
to avoid a warning.
|
It's a code smell, and can lead to hard-to-diagnose error messages.
|
I think they might be a sqlite bug. In discussions with sqlite devs.
|
written to
Also, moved the database to a subdir, as there are multiple files.
This seems to work well with concurrent fscks, although they still do
redundant work due to the commit granularity. Occasionally two writes will
conflict, and one is then deferred and happens later.
Except, with 3 concurrent fscks, I got failures:
git-annex: user error (SQLite3 returned ErrorBusy while attempting to perform prepare "SELECT \"fscked\".\"key\"\nFROM \"fscked\"\nWHERE \"fscked\".\"key\" = ?\n": database is locked)
Argh!!!
|