path: root/Backend
* Improve SHA*E extension extraction code (Joey Hess, 2018-03-05)

  Do not treat parts of the filename that contain punctuation or other
  non-alphanumeric characters as extensions. Before, such characters were
  filtered out.

  Note that in 38bd7ca3cce455c20edcee656c706939087c6a69 "foo.ba__________r"
  was munged to ".bar" and so incorrectly treated as an extension. That was
  fixed by changing the filter order, but not allowing punctuation seems a
  better fix.

  This assumes that extensions containing punctuation are rare. "_" seems
  the most likely character; I used it in ikiwiki "._comment" files, but I
  can't recall seeing it anywhere else. It certainly seems that no commonly
  used extensions contain punctuation. If git-annex doesn't treat "._comment"
  as an extension, it's not likely to break software that expects to see
  that extension, the way some software expects to see .epub or .mp3.

  This commit was sponsored by Jack Hill on Patreon.
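  A minimal sketch of the rule described above (illustrative helper names
  and a hypothetical length cutoff, not the actual git-annex code): only
  short, purely alphanumeric dot-separated suffixes count as extensions.

      import Data.Char (isAlphaNum)

      -- "foo.tar.gz" -> ".tar.gz"; "foo.ba__________r" -> "" (no extension)
      extractExtensions :: FilePath -> String
      extractExtensions name = concatMap ('.' :) (reverse (take 2 exts))
        where
          exts = takeWhile validExt (reverse (drop 1 (splitc '.' name)))
          validExt e = not (null e) && length e <= 4 && all isAlphaNum e

      -- simple single-character splitter
      splitc :: Char -> String -> [String]
      splitc c s = case break (== c) s of
          (a, _ : rest) -> a : splitc c rest
          (a, [])       -> [a]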
* fold Build/SysConfig.hs into BuildInfo via include (Joey Hess, 2017-12-14)

  This avoids warnings from stack about the module not being listed in the
  cabal file. So, the generated file is also renamed to Build/SysConfig.

  Note that the setup program seems to be cached despite these changes; I
  had to cabal clean to get cabal to update it so that Build/SysConfig was
  written.

  This commit was sponsored by Jochen Bartl on Patreon.
* more lambda-case conversion (Joey Hess, 2017-12-05)
* migrate: WORM keys containing spaces will be migrated to not contain
  spaces anymore (Joey Hess, 2017-08-17)

  To work around the problem that the external special remote protocol does
  not support keys containing spaces.

  This commit was sponsored by Denis Dzyubenko on Patreon.
* stop using MissingH for MD5 (Joey Hess, 2017-05-15)

  Cryptonite is faster and allocates less, and I want to get rid of MissingH
  use. Note that the new dependency on memory is free; it's a dependency of
  cryptonite.

  This commit was supported by the NSF-funded DataLad project.
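  A sketch of what MD5 hashing with cryptonite looks like (an illustration
  of the switch described above, not the exact git-annex code); hashlazy
  consumes the lazy ByteString incrementally:

      import Crypto.Hash (Digest, MD5, hashlazy)
      import qualified Data.ByteString.Lazy as L

      -- hash a file with cryptonite's MD5 and show the hex digest
      md5File :: FilePath -> IO String
      md5File f = show . md5 <$> L.readFile f
        where
          md5 :: L.ByteString -> Digest MD5
          md5 = hashlazy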
* AssociatedFile newtype (Joey Hess, 2017-03-10)

  To prevent any further mistakes like 1a497cefb47557f0b4788c606f9071be422b2511.

  This commit was sponsored by Francois Marier on Patreon.
* Removed support for building with the old cryptohash library. (Joey Hess, 2017-02-24)

  Building with that library made git-annex not support SHA3; it's time for
  that to always be supported, in case SHA2 dominoes.
* add KeyVariety type (Joey Hess, 2017-02-24)

  Where before the "name" of a key and a backend was a string, this makes it
  a concrete data type. This is groundwork for allowing some varieties of
  keys to be disabled in file2key, so git-annex won't use them at all.

  Benchmarks ran in my big repo:

      old git-annex info:  real 0m3.338s  user 0m3.124s  sys 0m0.244s
      new git-annex info:  real 0m3.216s  user 0m3.024s  sys 0m0.220s
      new git-annex find:  real 0m7.138s  user 0m6.924s  sys 0m0.252s
      old git-annex find:  real 0m7.433s  user 0m7.240s  sys 0m0.232s

  Surprising result; I'd have expected it to be slower since it now parses
  all the key varieties. But the parser is very simple, and perhaps sharing
  KeyVarieties uses less memory or something like that.

  This commit was supported by the NSF-funded DataLad project.
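  A simplified sketch of what such a concrete type might look like (the
  constructor and function names here are illustrative; the real KeyVariety
  in git-annex has more constructors and carries more information):

      -- a concrete type instead of a bare String backend name
      data KeyVariety
          = SHA2Key Int Bool    -- bit size, has extension ("E" variant)
          | SHA3Key Int Bool
          | MD5Key Bool
          | WORMKey
          | URLKey
          | OtherKey String     -- anything not recognised
          deriving (Eq, Show)

      parseKeyVariety :: String -> KeyVariety
      parseKeyVariety "SHA256"  = SHA2Key 256 False
      parseKeyVariety "SHA256E" = SHA2Key 256 True
      parseKeyVariety "WORM"    = WORMKey
      parseKeyVariety "URL"     = URLKey
      parseKeyVariety s         = OtherKey s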
* Some optimisations to string splitting code. (Joey Hess, 2017-01-31)

  Turns out that Data.List.Utils.split is slow and makes a lot of
  allocations. Here's a much simpler single character splitter that behaves
  the same (even in wacky corner cases) while running in half the time and
  75% the allocations.

  As well as being an optimisation, this helps move toward eliminating use
  of missingh. (Data.List.Split.splitOn is nearly as slow as
  Data.List.Utils.split and allocates even more.)

  I have not benchmarked the effect on git-annex, but would not be surprised
  to see some parsing of eg, large streams from git commands run twice as
  fast, and possibly in less memory.

  This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
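  The replacement is roughly this kind of single-character splitter (a
  sketch of the idea, not necessarily the exact code that landed):

      -- Split on a single character, keeping empty pieces, so a trailing
      -- separator is preserved: splitc ':' "a:" -> ["a",""].
      splitc :: Char -> String -> [String]
      splitc c s = case break (== c) s of
          (piece, _ : rest) -> piece : splitc c rest
          (piece, [])       -> [piece]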
* Always use filesystem encoding for all file and handle reads and writes. (Joey Hess, 2016-12-24)

  This is a big scary change. I have convinced myself it should be safe. I
  hope!
* Improve SHA*E extension extraction code. (Joey Hess, 2016-05-27)

  Filter out over-long "extensions" before stripping out non-alphanumerics
  from them, so that eg "foo.ba__________r" is not considered a .bar
  extension.
* rename function (Joey Hess, 2016-05-27)
* Support --metadata field<number, --metadata field>number etc to match
  ranges of numeric values. (Joey Hess, 2016-02-27)

  Similarly (well, for free), support preferred content expressions like
  metadata=field<number and metadata=field>number.
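  A small sketch of the kind of numeric comparison such an expression
  implies (illustrative names only, not the git-annex matcher code):

      import Text.Read (readMaybe)

      -- "field<100" / "field>100" boil down to a comparison on the
      -- metadata value read as a number
      data NumCmp = NumLT Double | NumGT Double

      parseNumCmp :: String -> Maybe NumCmp
      parseNumCmp ('<':n) = NumLT <$> readMaybe n
      parseNumCmp ('>':n) = NumGT <$> readMaybe n
      parseNumCmp _       = Nothing

      matches :: NumCmp -> Double -> Bool
      matches (NumLT n) v = v < n
      matches (NumGT n) v = v > n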
* better forcing of hash (Joey Hess, 2016-02-26)
* try again at forcing file read while hashing (Joey Hess, 2016-02-26)
* test revert "force hash to finish with file before returning" (Joey Hess, 2016-02-26)

  This reverts commit d11b032bd86ebe69f1d08e382bd83370db8ea9b9. This seems
  to have caused a memory leak.
* remove 163 lines of code without changing anything except imports (Joey Hess, 2016-01-20)
* force hash to finish with file before returning (Joey Hess, 2016-01-06)

  Fixes a minor fd leak, never more than 1 in normal use, which broke the
  test suite when I tried to write to a file that was still open for a
  previous hashing.
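  A sketch of the general technique (not the exact change made here): fully
  evaluate the hash before the handle is closed, so the lazy read cannot
  keep the file descriptor open after the function returns.

      import Control.Exception (evaluate)
      import qualified Data.ByteString.Lazy as L
      import System.IO (IOMode(ReadMode), withFile)

      -- hash a file and force the resulting hex string before withFile
      -- closes the handle, so no lazy read holds the fd open afterwards
      hashFileStrict :: (L.ByteString -> String) -> FilePath -> IO String
      hashFileStrict hasher f = withFile f ReadMode $ \h -> do
          s <- hasher <$> L.hGetContents h
          _ <- evaluate (length s)  -- forces the whole string, and the read
          return s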
* generalize catchHardwareFault to catchIOErrorType (Joey Hess, 2015-12-06)
* use action, not sideAction (Joey Hess, 2015-10-11)

  sideAction is for things not generally related to the current action being
  performed. And, it adds a newline after the side action. This was not the
  right thing to use for stuff like "checksum", where doing a checksum is
  part of the git annex get process, and indeed we want it to display
  "(checksum...) ok".
* rename fsckKey to verifyKeyContent (Joey Hess, 2015-10-01)

  No behavior changes.
* Added support for SHA3 hashed keys (in 8 varieties), when git-annex is
  built using the cryptonite library. (Joey Hess, 2015-08-06)

  While cryptohash has SHA3 support, it has not been updated for the final
  version of the spec. Note that cryptonite has not been ported to all
  arches that cryptohash builds on yet.
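  A sketch of one of those varieties with cryptonite (illustrative only;
  the real backend covers SHA3_224/256/384/512 plus the "E" variants):

      import Crypto.Hash (Digest, SHA3_256, hashlazy)
      import qualified Data.ByteString.Lazy as L

      -- hash a file with SHA3-256 and show the hex digest
      sha3_256File :: FilePath -> IO String
      sha3_256File f = show . h <$> L.readFile f
        where
          h :: L.ByteString -> Digest SHA3_256
          h = hashlazy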
* fsck: When checksumming a file fails due to a hardware fault, the file is
  now moved to the bad directory, and the fsck proceeds. (Joey Hess, 2015-05-27)

  Before, the fsck immediately failed.
* refactor (Joey Hess, 2015-05-27)
* if external hash command fails for any reason, fall back to internal hashing (Joey Hess, 2015-05-27)

  This way, if a system's sha1sum etc is broken, it will be tried if
  git-annex was built to use it, but at least it will fall back to using
  internal hashing when it fails.

  A side benefit of this is that hashFile consistently throws an IOError if
  the file is unable to be read. In particular, if the disk is failing with
  IO errors, and an external hash command is used, it used to throw a user
  error with the error message from externalSHA. Now, the external hash
  command will fail, that message will be printed as a warning, and it'll
  fall back to the internal hash command. If the disk IO error is not
  intermittent, it will re-occur, and so an IOError will be thrown.

  Of course, this can mean it reads a file twice, but only in edge cases.
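  The fallback logic amounts to something like this (a sketch with
  hypothetical helper names, not the actual git-annex code):

      import Control.Exception (SomeException, try)

      -- try the external hash command first; on any failure, warn and
      -- fall back to the internal implementation, which itself throws an
      -- IOError if the file cannot be read
      hashFileWith :: (FilePath -> IO String)  -- external, e.g. sha256sum
                   -> (FilePath -> IO String)  -- internal, e.g. cryptonite
                   -> FilePath -> IO String
      hashFileWith external internal f = do
          r <- try (external f) :: IO (Either SomeException String)
          case r of
              Right h -> return h
              Left err -> do
                  putStrLn $ "external hash command failed: " ++ show err
                  internal f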
* fromkey, registerurl: Allow urls to be specified instead of keys, and
  generate URL keys. (Joey Hess, 2015-05-22)

  This is especially useful because the caller doesn't need to generate
  valid url keys, which involves some escaping of characters, and may
  involve taking a md5sum of the url if it's too long.
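  A hypothetical sketch of that idea (the escaping rules, length cutoff,
  and use of md5 here are illustrative; the real key generation differs in
  detail):

      import Crypto.Hash (Digest, MD5, hash)
      import qualified Data.ByteString.Char8 as B8

      -- escape characters that are problematic in key names; if the
      -- result would be too long, fall back to an md5sum of the url
      urlKeyName :: String -> String
      urlKeyName u
          | length escaped <= 128 = escaped
          | otherwise = show (hash (B8.pack u) :: Digest MD5)
        where
          escaped = map (\c -> if c `elem` ":?&%/" then '_' else c) u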
* Added MD5 and MD5E backends. (Joey Hess, 2015-02-04)
* Remove support for building without cryptohash. (Joey Hess, 2015-02-04)

  This will prevent backporting to wheezy, but it's time to simplify the
  code.
* update my email address and homepage url (Joey Hess, 2015-01-21)
* add getFileSize, which can get the real size of a large file on Windows (Joey Hess, 2015-01-20)

  Avoid using fileSize, which maxes out at just 2 gb on Windows. Instead,
  use hFileSize, which doesn't have a bounded size. Fixes support for files
  > 2 gb on Windows.

  Note that the InodeCache code only needs to compare a file size, so it
  doesn't matter if the file size wraps. So it has been left as-is. This was
  necessary both to avoid invalidating existing inode caches, and because
  the code passed FileStatus around and would have become more expensive if
  it called getFileSize.

  This commit was sponsored by Christian Dietrich.
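  A minimal sketch of the hFileSize-based approach (not the exact
  implementation, which also keeps a fileSize-based path for other
  platforms):

      import System.IO (IOMode(ReadMode), hFileSize, withFile)

      -- hFileSize returns an unbounded Integer, so it is not limited to
      -- 2 gb the way the FileOffset used by fileSize is on Windows
      getFileSize :: FilePath -> IO Integer
      getFileSize f = withFile f ReadMode hFileSize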
* Generate shorter keys for WORM and URL, avoiding keys that are longer
  than used for SHA256, so as to not break on systems like Windows that
  have very small maximum path length limits. (Joey Hess, 2015-01-06)
* Avoid re-checksumming when migrating from hash to hashE backend. Closes: #774494 (Joey Hess, 2015-01-04)
* fix some mixed space+tab indentation (Joey Hess, 2014-10-09)

  This fixes all instances of " \t" in the code base. Most common case seems
  to be after a "where" line; probably vim copied the two space layout of
  that line.

  Done as a background task while listening to episode 2 of the Type Theory
  podcast.
* WORM backend: Switched to include the relative path to the file inside
  the repository, rather than just the file's base name. (Joey Hess, 2014-09-11)

  Note that if you're relying on such things to keep files separate with
  WORM, you should really be using a better backend.
* WORM backend: When adding a file in a subdirectory, avoid including the
  subdirectory in the key name. (Joey Hess, 2014-08-12)
* testremote: New command to test uploads/downloads to a remote. (Joey Hess, 2014-08-01)

  This only performs some basic tests so far; no testing of chunking or
  resuming. Also, the existing encryption type of the remote is used; it
  would be good later to derive an encrypted and a non-encrypted version of
  the remote and test them both.

  This commit was sponsored by Joseph Liu.
* add key stability checking interface (Joey Hess, 2014-07-27)

  Needed for resuming from chunks. Url keys are considered not stable. I
  considered treating url keys with a known size as stable, but just don't
  feel that is enough information.
* add chunk metadata to Key (Joey Hess, 2014-07-24)

  Added new fields for chunk number, and chunk size. These will not appear
  in normal keys ever, but will be used for chunked data stored on special
  remotes.

  This commit was sponsored by Jouni K Seppanen.
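  A rough sketch of the shape of the change (field names approximate the
  real Key record; illustrative only):

      -- the two chunk fields are Nothing for normal keys and only set on
      -- the per-chunk keys used for data stored on special remotes
      data Key = Key
          { keyName      :: String
          , keyBackend   :: String
          , keySize      :: Maybe Integer
          , keyMtime     :: Maybe Integer
          , keyChunkSize :: Maybe Integer  -- new
          , keyChunkNum  :: Maybe Integer  -- new
          }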
* migrate: Avoid re-checksumming when migrating from hashE to hash backend. (Joey Hess, 2014-07-10)
* bring back the (checksum) when fscking (Joey Hess, 2014-02-20)

  This is useful because it shows users which files it checksums, vs ones
  that are not present, or don't use a hash backend, or --fast.
* import: Add --skip-duplicates option. (Joey Hess, 2013-12-04)

  Note that the hash backends were made to stop printing a (checksum..)
  message as part of this, since it showed up without a file when deciding
  whether to act on a file. Should have probably removed that message a
  while ago anyway, I suppose.
* Better sanitization of problem characters when generating URL and WORM keys. (Joey Hess, 2013-10-05)

  FAT has a lot of characters it does not allow in filenames, like ? and *.
  It's probably the worst offender, but other filesystems also have
  limitations. In 2011, I made keyFile escape : to handle FAT, but missed
  the other characters. It also turns out that when I did that, I was
  living dangerously; any existing keys that contained a : had their object
  location change. Oops.

  So, adding new characters to escape to keyFile is out. Well, it would be
  possible to make keyFile behave differently on a per-filesystem basis,
  but this would be a real nightmare to get right. Consider that a rsync
  special remote uses keyFile to determine the filenames to use, and we
  don't know the underlying filesystem on the rsync server.

  Instead, I have gone for a solution that is backwards compatible and
  simple. Its only downside is that already generated URL and WORM keys
  might not be able to be stored on FAT or some other filesystem that
  dislikes a character used in the key. (In this case, the user can just
  migrate the problem keys to a checksumming backend. If this became a big
  problem, fsck could be made to detect these and suggest a migration.)

  Going forward, new keys that are created will escape all characters that
  are likely to cause problems. And if some filesystem comes along that's
  even worse than FAT (seems unlikely, but here it is 2013, and people are
  still using FAT!), additional characters can be added to the set that are
  escaped without difficulty.

  (Also, made WORM limit the part of the filename that is embedded in the
  key, to deal with filesystem filename length limits. This could have
  already been a problem, but is more likely now, since the escaping of the
  filename can make it longer.)

  This commit was sponsored by Ian Downes.
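  A hypothetical sketch of the "escape everything likely to cause problems"
  idea (the safe set and escape syntax here are illustrative; the actual
  git-annex escaping scheme differs in detail):

      import Data.Char (isAlphaNum, ord)
      import Text.Printf (printf)

      -- keep a conservative safe set and hex-escape everything else, so
      -- the result is valid on FAT and similarly restrictive filesystems
      sanitizeKeyName :: String -> String
      sanitizeKeyName = concatMap esc
        where
          esc c
              | isAlphaNum c || c `elem` ".-_" = [c]
              | otherwise = printf "&%02x" (ord c)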
* allow building w/o cryptohash (Joey Hess, 2013-10-03)

  Mostly for the debian stable autobuilds, which have a too old version to
  use the Crypto.Hash module.
* better name (Joey Hess, 2013-10-01)
* ensure that hash representations don't change in future (Joey Hess, 2013-10-01)
* Added SKEIN256 and SKEIN512 backends (Joey Hess, 2013-10-01)

  SHA3 is still waiting for final standardization. Although this is looking
  less likely given https://www.cdt.org/blogs/joseph-lorenzo-hall/2409-nist-sha-3

  In the meantime, cryptohash implements skein, and it's used by some of
  the haskell ecosystem (for yesod sessions, IIRC), so this implementation
  is likely to continue working. Also, I've talked with the cryptohash
  author and he's a reasonable guy.

  It makes sense to have an alternate high security hash, in case some
  horrible attack is found against SHA2 tomorrow, or in case SHA3 comes out
  and worst fears are realized.

  I'd also like to support using skein for HMAC. But no hurry there, and a
  new version of cryptohash has much nicer HMAC code, so I will probably
  wait until I can use that version.
* hlint (Joey Hess, 2013-09-25)

  test suite still passes
* Use cryptohash rather than SHA for hashing. (Joey Hess, 2013-09-22)

  This is a massive win on OSX, which doesn't have a sha256sum normally.

  Only use external hash commands when the file is > 1 mb, since cryptohash
  is quite close to them in speed.

  SHA is still used to calculate HMACs. I don't quite understand
  cryptohash's API for those.

  Used the following benchmark to arrive at the 1 mb number.

  1 mb file:

      benchmarking sha256/internal
      mean: 13.86696 ms, lb 13.83010 ms, ub 13.93453 ms, ci 0.950
      std dev: 249.3235 us, lb 162.0448 us, ub 458.1744 us, ci 0.950
      found 5 outliers among 100 samples (5.0%)
        4 (4.0%) high mild
        1 (1.0%) high severe
      variance introduced by outliers: 10.415%
      variance is moderately inflated by outliers

      benchmarking sha256/external
      mean: 14.20670 ms, lb 14.17237 ms, ub 14.27004 ms, ci 0.950
      std dev: 230.5448 us, lb 150.7310 us, ub 427.6068 us, ci 0.950
      found 3 outliers among 100 samples (3.0%)
        2 (2.0%) high mild
        1 (1.0%) high severe

  2 mb file:

      benchmarking sha256/internal
      mean: 26.44270 ms, lb 26.23701 ms, ub 26.63414 ms, ci 0.950
      std dev: 1.012303 ms, lb 925.8921 us, ub 1.122267 ms, ci 0.950
      variance introduced by outliers: 35.540%
      variance is moderately inflated by outliers

      benchmarking sha256/external
      mean: 26.84521 ms, lb 26.77644 ms, ub 26.91433 ms, ci 0.950
      std dev: 347.7867 us, lb 210.6283 us, ub 571.3351 us, ci 0.950
      found 6 outliers among 100 samples (6.0%)

  The benchmark program:

      import Crypto.Hash
      import Data.ByteString.Lazy as L
      import Criterion.Main
      import Common

      testfile :: FilePath
      testfile = "/run/shm/data" -- on ram disk

      main = defaultMain
          [ bgroup "sha256"
              [ bench "internal" $ whnfIO internal
              , bench "external" $ whnfIO external
              ]
          ]

      sha256 :: L.ByteString -> Digest SHA256
      sha256 = hashlazy

      internal :: IO String
      internal = show . sha256 <$> L.readFile testfile

      external :: IO String
      external = do
          s <- readProcess "sha256sum" [testfile]
          return $ fst $ separate (== ' ') s
* Fix a few bugs involving filenames that are at or near the filesystem's
  maximum filename length limit. (Joey Hess, 2013-07-30)

  Started with a problem when running addurl on a really long url, because
  the whole url is munged into the filename. Ended up doing a fairly
  extensive review of places where filenames could get too large, although
  it's hard to say I haven't missed any.

  Backend.Url had a 128 character limit, which is fine when the limit is
  255, but not if it's a lot shorter on some systems. So check the
  pathconf() limit. Note that this could result in fromUrl creating
  different keys for the same url, if run on systems with different limits.
  I don't see that this is likely to cause any problems; that can already
  happen when using addurl --fast, or if the content of an url changes.

  Both Command.AddUrl and Backend.Url assumed that urls don't contain a lot
  of multi-byte unicode, and would fail to properly truncate an url that
  did.

  A few places use a filename as the template to make a temp file. While
  that's nice in that the temp file name can be easily related back to the
  original filename, it could lead to `git annex add` failing to add a
  filename that was at or close to the maximum length.

  Note that in Command.Add.lockdown, the template is still derived from the
  filename, just with enough space left to turn it into a temp file. This
  is an important optimisation, because the assistant may lock down a bunch
  of files all at once, and using the same template for all of them would
  cause openTempFile to iterate through the same set of names, looking for
  an unused temp file. I'm not very happy with the relatedTemplate hack,
  but it avoids that slowdown.

  Backend.WORM does not limit the filename stored in the key. I have not
  tried to change that; so git annex add will fail on really long filenames
  when using the WORM backend. It seems better to preserve the invariant
  that a WORM key always contains the complete filename, since the filename
  is the only unique material in the key, other than mtime and size. Since
  nobody has complained about add failing (I think I saw it once?) on WORM,
  probably it's ok, or nobody but me uses it.

  There may be compatibility problems if using git annex addurl --fast or
  the WORM backend on a system with the 255 limit and then trying to use
  that repo on a system with a smaller limit. I have not tried to deal with
  those.

  This commit was sponsored by Alexander Brem. Thanks!
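  A sketch of checking the pathconf() filename limit instead of assuming
  255 (POSIX-only; the real code also has to handle platforms without
  pathconf):

      import System.Posix.Files (PathVar(FileNameLimit), getPathVar)

      -- ask the filesystem containing the given directory how long a
      -- single filename component may be
      fileNameLengthLimit :: FilePath -> IO Int
      fileNameLengthLimit dir = fromIntegral <$> getPathVar dir FileNameLimit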
* fix permission damage (thanks, Windows) (Joey Hess, 2013-05-11)