Start locking docs

2026-01-07 11:35:33 -05:00 · 2025-10-05 12:41:31 +02:00 · 458726f4db
commit 458726f4db
parent 9a71c3b4f3
2 changed files with 28 additions and 0 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -0,0 +1,7 @@
+# Shufflecake - Developer guide
+
+These pages document non-trivial design and implementation choices, mainly in the `dm-sflc` kernel module. They are a work in progress and still very incomplete.
+
+Index:
+
+- [Locking](locking.md): an explanation of the synchronization mechanisms employed for mutual exclusion between I/O requests when accessing the shared position map.
--- a/docs/locking.md
+++ b/docs/locking.md
@@ -0,0 +1,21 @@
+## Locking in dm-sflc
+
+For accesses to the position map (and ancillary data structures) to be thread-safe, we need a locking mechanism, because many I/O requests *will* try to access it concurrently.
+
+The simplest possible mechanism would be a single per-volume lock associated with the position map (acquired at every PosMap access), plus a per-device lock associated with the pre-shuffled array of PSIs (acquired by WRITEs when allocating a new slice).
+Instead, what we use is slightly more complex. Besides the per-device lock, there are **two** locks associated with each volume: a read-write semaphore and a spinlock. The reason lies in the following two requirements:
+
+1. FLUSH requests need to perform potentially-sleeping operations in their critical section(s): while locking the position map, they need to encrypt its dirty blocks into a separate memory area. Encryption can sleep, not just because of memory allocation, but also because the Kernel Crypto API may itself decide to schedule the operation rather than perform it synchronously.
+
+2. We would like the overall locking procedure to be sleepless in the "typical case". That is, *when there are only READs and WRITEs incoming*, and no FLUSHes, we would like the lock(s) governing the READ and WRITE critical sections to be acquirable without sleeping. Otherwise, the overhead of scheduling and context-switching would likely be too much compared to the critical sections themselves (which are very tiny), and would affect the end-to-end bandwidth.
+
+If we went for the simple mechanism with a single per-volume lock, that lock would have to be a sleeping `mutex`, because of point 1. It could never be a spinlock, because a FLUSH would need to acquire it and then potentially sleep on encryption, and sleeping while holding a spinlock is not allowed in the kernel. But then, having a `mutex` govern access to the position map would violate point 2.
+
+The next-simplest solution is to use a `rwsem` and a `spinlock` in conjunction. At a high level:
+
+- READs only take the `spinlock`.
+- WRITEs take the `rwsem` **as readers** first, then the `spinlock`. They also take the per-device `spinlock` if allocating a new slice.
+- FLUSHes take the `rwsem` **as writers**.
+- DISCARDs behave like WRITEs in their critical section, so they also take the `rwsem` as readers and then the `spinlock`. They do not (yet) need to take the per-device `spinlock`.
+
+This respects both points 1 and 2 above: a FLUSH can sleep on encryption while holding the `rwsem`, and, when there are no FLUSHes, READs and WRITEs enter their critical sections without sleeping (READs only spin, and WRITEs find the `rwsem` free of writers). Also, notice that this architecture allows FLUSHes and READs to run concurrently; this is fine because READs only *read* the position map entries in their critical section, and a FLUSH never *writes* to those entries.
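
The locking order described in `docs/locking.md` can be illustrated with a minimal userspace sketch. This is not the `dm-sflc` code: it assumes `pthread_rwlock_t` as a stand-in for the kernel `rwsem` and `pthread_spinlock_t` for the kernel spinlocks, and every struct, field, and function name below (`struct volume`, `volume_write`, `PSI_INVALID`, and so on) is hypothetical. The nesting of the per-device spinlock inside the per-volume spinlock is also an assumption, since the document does not state the order.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLICES 8            /* hypothetical position map size */
#define PSI_INVALID UINT32_MAX  /* hypothetical "unmapped" marker */

/* Per-device state: stand-in for the pre-shuffled array of PSIs. */
struct device {
    pthread_spinlock_t lock;    /* per-device spinlock */
    uint32_t next_free;         /* next unused physical slice index */
};

/* Per-volume state: position map guarded by two locks. */
struct volume {
    pthread_rwlock_t rwsem;     /* stand-in for the kernel rwsem */
    pthread_spinlock_t lock;    /* per-volume spinlock */
    uint32_t posmap[NUM_SLICES];
    struct device *dev;
};

void device_init(struct device *d)
{
    pthread_spin_init(&d->lock, PTHREAD_PROCESS_PRIVATE);
    d->next_free = 0;
}

void volume_init(struct volume *v, struct device *d)
{
    pthread_rwlock_init(&v->rwsem, NULL);
    pthread_spin_init(&v->lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < NUM_SLICES; i++)
        v->posmap[i] = PSI_INVALID;
    v->dev = d;
}

/* READ: takes only the per-volume spinlock. */
uint32_t volume_read(struct volume *v, uint32_t lsi)
{
    pthread_spin_lock(&v->lock);
    uint32_t psi = v->posmap[lsi];
    pthread_spin_unlock(&v->lock);
    return psi;
}

/* WRITE: rwsem as reader, then the spinlock; the per-device spinlock
 * is taken only when a new slice must be allocated. */
uint32_t volume_write(struct volume *v, uint32_t lsi)
{
    pthread_rwlock_rdlock(&v->rwsem);
    pthread_spin_lock(&v->lock);
    uint32_t psi = v->posmap[lsi];
    if (psi == PSI_INVALID) {
        pthread_spin_lock(&v->dev->lock);   /* assumed nesting order */
        psi = v->dev->next_free++;
        pthread_spin_unlock(&v->dev->lock);
        v->posmap[lsi] = psi;
    }
    pthread_spin_unlock(&v->lock);
    pthread_rwlock_unlock(&v->rwsem);
    return psi;
}

/* DISCARD: same locks as WRITE, minus the per-device spinlock. */
void volume_discard(struct volume *v, uint32_t lsi)
{
    pthread_rwlock_rdlock(&v->rwsem);
    pthread_spin_lock(&v->lock);
    v->posmap[lsi] = PSI_INVALID;
    pthread_spin_unlock(&v->lock);
    pthread_rwlock_unlock(&v->rwsem);
}

/* FLUSH: rwsem as writer, which excludes WRITEs and DISCARDs but not
 * READs. The memcpy is a placeholder for encrypting the dirty blocks
 * into a separate memory area, the step that may sleep in the kernel. */
void volume_flush(struct volume *v, uint32_t out[NUM_SLICES])
{
    pthread_rwlock_wrlock(&v->rwsem);
    memcpy(out, v->posmap, sizeof(v->posmap));
    pthread_rwlock_unlock(&v->rwsem);
}
```

In this sketch a FLUSH reads the position map without the spinlock, which is safe only because everything that modifies it (WRITE, DISCARD) must first hold the `rwsem` as a reader and is therefore excluded while the FLUSH holds it as a writer.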