1 files changed, 124 insertions, 0 deletions
diff --git a/doc/design/caching_database.mdwn b/doc/design/caching_database.mdwn
new file mode 100644
index 000000000..9d688a9d4
--- /dev/null
+++ b/doc/design/caching_database.mdwn
@@ -0,0 +1,124 @@
+* [[metadata]] for views
+* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
+* [[bugs/incremental_fsck_should_not_use_sticky_bit]]
+* [[todo/wishlist:_pack_metadata_in_direct_mode]]
+* [[todo/cache_key_info]]
+
+What do all these have in common? They could all be improved by
+using some kind of database to locally store the information in an
+efficient way.
+
+The database should only function as a cache. It should be able to be
+generated and updated by looking at the git repository.
+
+* Metadata can be updated by looking at the git-annex branch,
+  either its current state, or the diff between the old and new versions
+* Direct mode mappings can be updated by looking at the current branch,
+  to see which files map to which key. Or the diff between the old
+  and new versions of the branch.
+* Incremental fsck information is not stored in git, but can be
+  "regenerated" by running fsck again.  
+  (Perhaps doesn't quite fit, but let it slide..)
+
+Store in the database the Ref of the branch that was used to construct it.
+(Update in same transaction as cached data.)
+
+## implementation plan
+
+1. Implement for metadata, on a branch, with sqlite.
+2. Make sure that builds on all platforms.
+3. Add associated file mappings support. This is needed to fully
+   use the caching database to construct views.
+4. Store incremental fsck info in db.
+5. Replace .map files with 3. for direct mode.
+
+## case study: persistent with sqllite
+
+Here's a non-normalized database schema in persistent's syntax.
+
+<pre>
+CachedKey
+  key Key
+  associatedFiles [FilePath]
+  lastFscked Int Maybe
+  KeyIndex key
+
+CachedMetaData
+  key Key
+  metaDataField MetaDataField
+  metaDataValue MetaDataValue
+</pre>
+
+Using the above database schema and persistent with sqlite, I made
+a database containing 30k Cache records. This took 5 seconds to create
+and was 7 mb on disk. (Would be rather smaller, if a more packed Key
+show/read instance were used.)
+
+Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
+cache. This was more than halved when all 1000 queries were done inside the
+same `runSqlite` call. (Which could be done using a separate thread and some
+MVars.)
+
+(Note that if the database is a cache, there is no need to perform migrations
+when querying it. My benchmarks skip `runMigration`. Instead, if the query
+fails, the database doesn't exist, or uses an incompatable schema, and the
+cache can be rebuilt then. This avoids the problem that persistent's migrations
+can sometimes fail.)
+
+Doubling the db to 60k scaled linearly in disk and cpu and did not affect
+query time.
+
+----
+
+Here's a normalized schema:
+
+<pre>
+CachedKey
+  key Key
+  KeyIndex key
+  deriving Show
+
+AssociatedFiles
+  keyId CachedKeyId Eq
+  associatedFile FilePath
+  KeyIdIndex keyId associatedFile
+  deriving Show
+
+CachedMetaField
+  field MetaField
+  FieldIndex field
+
+CachedMetaData
+  keyId CachedKeyId Eq
+  fieldId CachedMetaFieldId Eq
+  metaValue String
+
+LastFscked
+  keyId CachedKeyId Eq
+  localFscked Int Maybe
+</pre>
+
+With this, running 1000 joins to get the associated files of 1000
+Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!
+
+Update: This performance was fixed by adding `KeyIdOutdex keyId associatedFile`,
+which adds a uniqueness constraint on the tuple of key and associatedFile.
+With this, 1000 queries takes 0.406s. Note that persistent is probably not
+actually doing a join at the SQL level, so this could be sped up using
+eg, esquelito.
+
+Update2: Using esquelito to do a join got this down to 0.250s.
+
+Code: <http://lpaste.net/101141> <http://lpaste.net/101142>
+
+Compare the above with 1000 calls to `associatedFiles`, which is approximately
+as fast as just opening and reading 1000 files, so will take well under
+0.05s with a **cold** cache.
+
+So, we're looking at nearly an order of magnitude slowdown using sqlite and
+persistent for associated files. OTOH, the normalized schema should
+perform better when adding an associated file to a key that already has many.
+
+For metadata, the story is much nicer. Querying for 30000 keys that all
+have a particular tag in their metadata takes 0.65s. So fast enough to be
+used in views.