doc/kernel.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.

NOTE: everything in this description is greatly simplified

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |

There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.

Scenario 1 -  Simple deadlock
-----------------------------

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow requests to be interrupted while
they are in userspace:

 |      [interrupted by signal]       |
 |    <fuse_unlink()                  |
 |    [release semaphore]             |    [semaphore acquired]
 |  <sys_unlink()                     |
 |                                    |    >fuse_unlink()
 |                                    |      [queue req on fc->pending]
 |                                    |      [wake up fc->waitq]
 |                                    |      [sleep on req->waitq]

If the filesystem daemon was single threaded, this will stop here,
since there's no other thread to dequeue and execute the request.
In this case the solution is to kill the FUSE daemon as well.  If
there are multiple serving threads, you just have to kill them as
long as any remain.

Moral: a filesystem which deadlocks, can soon find itself dead.

Scenario 2 - Tricky deadlock
----------------------------

This one needs a carefully crafted filesystem.  It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

Solution is again to let the the request be interrupted (not
elaborated further).

An additional problem is that while the write buffer is being
copied to the request, the request must not be interrupted.  This
is because the destination address of the copy may not be valid
after the request is interrupted.

This is solved with doing the copy atomically, and allowing
interruption while the page(s) belonging to the write buffer are
faulted with get_user_pages().  The 'req->locked' flag indicates
when the copy is taking place, and interruption is delayed until
this flag is unset.