summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorGravatar Joey Hess <joeyh@joeyh.name>2015-11-13 14:44:53 -0400
committerGravatar Joey Hess <joeyh@joeyh.name>2015-11-13 14:44:53 -0400
commite39c851c299bd8b1a043da64a5b3ebd5c0f2cb08 (patch)
tree462adc1f34e2ff7e49925c5ff5710f0d9b66825d
parent3199555ed02af23b0d38554124c7033a974d8c5c (diff)
require the side lock be held to take pidlock
This is less portable, since currently sidelocks rely on /dev/shm. But, I've seen crazy lustre inconsistencies that make me not trust the link() method at all, so what can you do.
-rw-r--r--Utility/LockFile/PidLock.hs42
-rw-r--r--doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment41
2 files changed, 68 insertions, 15 deletions
diff --git a/Utility/LockFile/PidLock.hs b/Utility/LockFile/PidLock.hs
index d367759c3..fb6b0792b 100644
--- a/Utility/LockFile/PidLock.hs
+++ b/Utility/LockFile/PidLock.hs
@@ -38,7 +38,9 @@ import Data.Hash.MD5
type LockFile = FilePath
-data LockHandle = LockHandle FilePath Fd (Maybe Posix.LockHandle)
+data LockHandle = LockHandle FilePath Fd SideLockHandle
+
+type SideLockHandle = Maybe Posix.LockHandle
data PidLock = PidLock
{ lockingPid :: ProcessID
@@ -58,7 +60,7 @@ readPidLock lockfile = (readish =<<) <$> catchMaybeIO (readFile lockfile)
-- This is a regular posix exclusive lock. The side lock is put in
-- /dev/shm. This will work on most any Linux system, even if its whole
-- root filesystem doesn't support posix locks.
-trySideLock :: LockFile -> (Maybe Posix.LockHandle -> IO a) -> IO a
+trySideLock :: LockFile -> (SideLockHandle -> IO a) -> IO a
trySideLock lockfile a = do
sidelock <- sideLockFile lockfile
mlck <- catchDefaultIO Nothing $
@@ -82,8 +84,6 @@ sideLockFile lockfile = do
-- | Tries to take a lock; does not block when the lock is already held.
--
--- The method used is atomic even on NFS without needing O_EXCL support.
---
-- Note that stale locks are automatically detected and broken.
-- However, if the lock file is on a networked file system, and was
-- created on a different host than the current host (determined by hostname),
@@ -99,7 +99,7 @@ tryLock lockfile = trySideLock lockfile $ \sidelock -> do
nukeFile tmp
return Nothing
let tooklock = return $ Just $ LockHandle lockfile fd sidelock
- ifM (linkToLock tmp lockfile)
+ ifM (linkToLock sidelock tmp lockfile)
( do
nukeFile tmp
tooklock
@@ -118,16 +118,27 @@ tryLock lockfile = trySideLock lockfile $ \sidelock -> do
_ -> failedlock
)
--- Linux man pages recommend linking a pid lock into place,
+-- Linux's open(2) man page recommends linking a pid lock into place,
-- as the most portable atomic operation that will fail if
--- it already exists. However, on some network filesystems,
--- link will return success sometimes despite having failed,
--- so we have to stat both files to check if it actually worked.
-linkToLock :: FilePath -> FilePath -> IO Bool
-linkToLock src dest = ifM (isJust <$> catchMaybeIO (createLink src dest))
- ( catchDefaultIO False checklink
- , return False
- )
+-- it already exists.
+--
+-- open(2) suggests that link can sometimes appear to fail
+-- on NFS but have actually succeeded, and the way to find out is to stat
+-- the file and check its link count etc.
+--
+-- On a Lustre filesystem, link has been observed to incorrectly *succeed*,
+-- despite the dest already existing. A subsequent stat of the dest
+-- looked like it had been replaced with the src. The process proceeded to
+-- run and then deleted the dest, and after the process was done, the
+-- original file was observed to still be in place. This is horrible and we
+-- can't do anything about such a lying filesystem.
+-- At least the side lock file will prevent git-annex's running on the same
+-- host from running concurrently even on such a lying filesystem.
+linkToLock :: SideLockHandle -> FilePath -> FilePath -> IO Bool
+linkToLock Nothing _ _ = return False
+linkToLock (Just _) src dest = do
+ _ <- tryIO $ createLink src dest
+ catchDefaultIO False checklink
where
checklink = do
x <- getSymbolicLinkStatus src
@@ -136,13 +147,14 @@ linkToLock src dest = ifM (isJust <$> catchMaybeIO (createLink src dest))
[ deviceID x == deviceID y
, fileID x == fileID y
, fileMode x == fileMode y
- , linkCount x == linkCount y
, fileOwner x == fileOwner y
, fileGroup x == fileGroup y
, specialDeviceID x == specialDeviceID y
, fileSize x == fileSize y
, modificationTime x == modificationTime y
, isRegularFile x == isRegularFile y
+ , linkCount x == linkCount y
+ , linkCount x == 2
]
-- | Waits as necessary to take a lock.
diff --git a/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment b/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment
new file mode 100644
index 000000000..ae9691cf9
--- /dev/null
+++ b/doc/bugs/git-annex_doesn__39__t_work_on_lustre:_waitToSetLock:_unsupported_operation___40__Function_not_implemented__41__/comment_13_5a1191042b32a85b4299cff3004d29de._comment
@@ -0,0 +1,41 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 13"""
+ date="2015-11-13T18:23:37Z"
+ content="""
+While testing a git-annex that used pid locks, on the Lustre
+system I've been given access to, I observed something most
+strange:
+
+ link(".git/annex/locktmp12011", ".git/annex/pidlock") = 0
+ lstat64(".git/annex/locktmp12011", {st_mode=S_IFREG|0444, st_size=70, ...}) = 0
+ lstat64(".git/annex/pidlock", {st_mode=S_IFREG|0444, st_size=70, ...}) = 0
+ ...
+ unlink(".git/annex/pidlock") = 0
+
+Seeing that strace, it would make sense that the pidlock file didn't exist,
+since a hard link was successfully made by that name, and link() never,
+ever overwrites an existing file. The stats of the 2 files are of course
+identical since they're hard links. And, since the pidlock is unlinked at
+the end, we'd expect the file to be gone then.
+
+But, none of that has anything to do with the reality. Which was:
+The pidlock file already existed, with size=72, and had existed for some
+hours at the point the strace begins. The link didn't replace it
+at all, and the unlink didn't delete it. When the program exited,
+the pidlock file still existed, with contents unaltered.
+
+All I can guess is happening is that different processes on a Lustre
+filesystem, running on the same host, somehow see inconsistent realities.
+
+I do think that, despite this being completely insane, the locking will
+actually work ok, when all git-annex processes in a given repo on Lustre
+are running *on the same computer*. That because git-annex actually will
+drop a proper lock into a proper filesystem (/dev/shm), and so avoid all
+this Lustre nonsense.
+
+But in general, I can make no warantee express or implied as to the
+suitability of Lustre as a platform to use git-annex. If it's this
+inconsistent, and modifications made to files are somehow silently rolled
+back, anything could happen.
+"""]]