diff options
author | Joey Hess <joeyh@joeyh.name> | 2015-11-13 14:44:53 -0400 |
---|---|---|
committer | Joey Hess <joeyh@joeyh.name> | 2015-11-13 14:44:53 -0400 |
commit | e39c851c299bd8b1a043da64a5b3ebd5c0f2cb08 (patch) | |
tree | 462adc1f34e2ff7e49925c5ff5710f0d9b66825d | |
parent | 3199555ed02af23b0d38554124c7033a974d8c5c (diff) |
require the side lock be held to take pidlock
This is less portable, since currently sidelocks rely on /dev/shm.
But, I've seen crazy lustre inconsistencies that make me not trust the
link() method at all, so what can you do.
2 files changed, 68 insertions, 15 deletions
diff --git a/Utility/LockFile/PidLock.hs b/Utility/LockFile/PidLock.hs index d367759c3..fb6b0792b 100644 --- a/Utility/LockFile/PidLock.hs +++ b/Utility/LockFile/PidLock.hs @@ -38,7 +38,9 @@ import Data.Hash.MD5 type LockFile = FilePath -data LockHandle = LockHandle FilePath Fd (Maybe Posix.LockHandle) +data LockHandle = LockHandle FilePath Fd SideLockHandle + +type SideLockHandle = Maybe Posix.LockHandle data PidLock = PidLock { lockingPid :: ProcessID @@ -58,7 +60,7 @@ readPidLock lockfile = (readish =<<) <$> catchMaybeIO (readFile lockfile) -- This is a regular posix exclusive lock. The side lock is put in -- /dev/shm. This will work on most any Linux system, even if its whole -- root filesystem doesn't support posix locks. -trySideLock :: LockFile -> (Maybe Posix.LockHandle -> IO a) -> IO a +trySideLock :: LockFile -> (SideLockHandle -> IO a) -> IO a trySideLock lockfile a = do sidelock <- sideLockFile lockfile mlck <- catchDefaultIO Nothing $ @@ -82,8 +84,6 @@ sideLockFile lockfile = do -- | Tries to take a lock; does not block when the lock is already held. -- --- The method used is atomic even on NFS without needing O_EXCL support. --- -- Note that stale locks are automatically detected and broken. -- However, if the lock file is on a networked file system, and was -- created on a different host than the current host (determined by hostname), @@ -99,7 +99,7 @@ tryLock lockfile = trySideLock lockfile $ \sidelock -> do nukeFile tmp return Nothing let tooklock = return $ Just $ LockHandle lockfile fd sidelock - ifM (linkToLock tmp lockfile) + ifM (linkToLock sidelock tmp lockfile) ( do nukeFile tmp tooklock @@ -118,16 +118,27 @@ tryLock lockfile = trySideLock lockfile $ \sidelock -> do _ -> failedlock ) --- Linux man pages recommend linking a pid lock into place, +-- Linux's open(2) man page recommends linking a pid lock into place, -- as the most portable atomic operation that will fail if --- it already exists. However, on some network filesystems, --- link will return success sometimes despite having failed, --- so we have to stat both files to check if it actually worked. -linkToLock :: FilePath -> FilePath -> IO Bool -linkToLock src dest = ifM (isJust <$> catchMaybeIO (createLink src dest)) - ( catchDefaultIO False checklink - , return False - ) +-- it already exists. +-- +-- open(2) suggests that link can sometimes appear to fail +-- on NFS but have actually succeeded, and the way to find out is to stat +-- the file and check its link count etc. +-- +-- On a Lustre filesystem, link has been observed to incorrectly *succeed*, +-- despite the dest already existing. A subsequent stat of the dest +-- looked like it had been replaced with the src. The process proceeded to +-- run and then deleted the dest, and after the process was done, the +-- original file was observed to still be in place. This is horrible and we +-- can't do anything about such a lying filesystem. +-- At least the side lock file will prevent git-annex's running on the same +-- host from running concurrently even on such a lying filesystem. +linkToLock :: SideLockHandle -> FilePath -> FilePath -> IO Bool +linkToLock Nothing _ _ = return False +linkToLock (Just _) src dest = do + _ <- tryIO $ createLink src dest + catchDefaultIO False checklink where checklink = do x <- getSymbolicLinkStatus src @@ -136,13 +147,14 @@ linkToLock src dest = ifM (isJust <$> catchMaybeIO (createLink src dest)) [ deviceID x == deviceID y , fileID x == fileID y , fileMode x == fileMode y - , linkCount x == linkCount y , fileOwner x == fileOwner y , fileGroup x == fileGroup y , specialDeviceID x == specialDeviceID y , fileSize x == fileSize y , modificationTime x == modificationTime y , isRegularFile x == isRegularFile y + , linkCount x == linkCount y + , linkCount x == 2 ] -- | Waits as necessary to take a lock. diff --git a/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment b/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment new file mode 100644 index 000000000..ae9691cf9 --- /dev/null +++ b/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment @@ -0,0 +1,41 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 13""" + date="2015-11-13T18:23:37Z" + content=""" +While testing a git-annex that used pid locks, on the Lustre +system I've been given access to, I observed something most +strange: + + link(".git/annex/locktmp12011", ".git/annex/pidlock") = 0 + lstat64(".git/annex/locktmp12011", {st_mode=S_IFREG|0444, st_size=70, ...}) = 0 + lstat64(".git/annex/pidlock", {st_mode=S_IFREG|0444, st_size=70, ...}) = 0 + ... + unlink(".git/annex/pidlock") = 0 + +Seeing that strace, it would make sense that the pidlock file didn't exist, +since a hard link was successfully made by that name, and link() never, +ever overwrites an existing file. The stats of the 2 files are of course +identical since they're hard links. And, since the pidlock is unlinked at +the end, we'd expect the file to be gone then. + +But, none of that has anything to do with the reality. Which was: +The pidlock file already existed, with size=72, and had existed for some +hours at the point the strace begins. The link didn't replace it +at all, and the unlink didn't delete it. When the program exited, +the pidlock file still existed, with contents unaltered. + +All I can guess is happening is that different processes on a Lustre +filesystem, running on the same host, somehow see inconsistent realities. + +I do think that, despite this being completely insane, the locking will +actually work ok, when all git-annex processes in a given repo on Lustre +are running *on the same computer*. That because git-annex actually will +drop a proper lock into a proper filesystem (/dev/shm), and so avoid all +this Lustre nonsense. + +But in general, I can make no warantee express or implied as to the +suitability of Lustre as a platform to use git-annex. If it's this +inconsistent, and modifications made to files are somehow silently rolled +back, anything could happen. +"""]] |