## Locking in dm-sflc

Accesses to the position map and its ancillary data structures must be thread-safe, because there *will* be many I/O requests trying to access them concurrently; we therefore need a locking mechanism.

The simplest mechanism possible would be a single per-volume lock associated with the position map (to be acquired at every PosMap access), plus a per-device lock associated with the pre-shuffled array of PSIs (to be acquired by WRITEs when allocating a new slice).
Instead, what we use is slightly more complex. Besides the per-device lock, there are **two** locks associated with each volume: a read-write semaphore and a spinlock. The reason lies in the following two observations/requirements:

1. FLUSH requests need to perform potentially-sleeping operations in their critical section(s): while locking the position map, they need to encrypt its dirty blocks onto a separate memory area. Encryption can potentially sleep, not just because of memory allocation, but also because of the Kernel Crypto API itself deciding to schedule the operation rather than performing it synchronously.

2. We would like the overall locking procedure to be sleepless in the "typical case". That is, *when there are only READs and WRITEs incoming*, and no FLUSHes, we would like the lock(s) governing the READ and WRITE critical sections to be acquirable without sleeping. Otherwise, the overhead of scheduling and context-switching would likely be too much compared to the critical sections themselves (which are very tiny), and would affect the end-to-end I/O throughput.

If we wanted to go for the simple mechanism and only have a single per-volume lock, this lock would have to be a sleeping `mutex`, because of point 1: it could never be a spinlock, because a FLUSH would need to acquire it and then potentially sleep on encryption (which is a no-no). But then, having a `mutex` govern access to the position map would violate point 2.

The next-simplest solution is to use a `rwsem` and a `spinlock` in conjunction. At a high level:

- READs only take the `spinlock`.
- WRITEs first take the `rwsem` **as readers**, then the `spinlock`. They also take the per-device `spinlock` when allocating a new slice.
- FLUSHes take the `rwsem` **as writers**.
- DISCARDs behave a lot like WRITEs in their critical section, so they also take the `rwsem` as readers and then the `spinlock`. No need (yet) to take the per-device `spinlock`.

This respects both points 1 and 2 above: the FLUSH is able to sleep on encryption under a `rwsem`; READs and WRITEs don't sleep to enter their critical sections, when there are no FLUSHes.
Also, notice that this architecture allows for concurrency between FLUSHes and READs: this is alright because READs only *read* the position map entries in their critical section, while the FLUSH never *writes* to those entries.
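
The rules above can be summarised as a small lookup, sketched here in userspace C. The enum and function names are hypothetical and exist only for illustration; the sketch encodes which locks each operation takes and derives from that which pairs of operations exclude each other.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum sflc_op { OP_READ, OP_WRITE, OP_DISCARD, OP_FLUSH };

enum sflc_lock {
	LK_RWSEM_READ  = 1 << 0,	/* per-volume rwsem, taken as reader */
	LK_RWSEM_WRITE = 1 << 1,	/* per-volume rwsem, taken as writer */
	LK_VOL_SPIN    = 1 << 2,	/* per-volume spinlock */
	LK_DEV_SPIN    = 1 << 3,	/* per-device spinlock */
};

/* Locks taken by an operation; allocates_slice only matters for WRITEs. */
static unsigned int locks_taken(enum sflc_op op, bool allocates_slice)
{
	switch (op) {
	case OP_READ:
		return LK_VOL_SPIN;
	case OP_WRITE:
		return LK_RWSEM_READ | LK_VOL_SPIN |
		       (allocates_slice ? LK_DEV_SPIN : 0);
	case OP_DISCARD:
		return LK_RWSEM_READ | LK_VOL_SPIN;
	case OP_FLUSH:
		return LK_RWSEM_WRITE;
	}
	return 0;
}

/* Two critical sections exclude each other if they share a spinlock, or
 * if one holds the rwsem as writer while the other holds it in any mode. */
static bool mutually_exclusive(unsigned int a, unsigned int b)
{
	unsigned int any_rwsem = LK_RWSEM_READ | LK_RWSEM_WRITE;
	bool vol_spin = (a & LK_VOL_SPIN) && (b & LK_VOL_SPIN);
	bool dev_spin = (a & LK_DEV_SPIN) && (b & LK_DEV_SPIN);
	bool rwsem = ((a & LK_RWSEM_WRITE) && (b & any_rwsem)) ||
		     ((b & LK_RWSEM_WRITE) && (a & any_rwsem));

	return vol_spin || dev_spin || rwsem;
}
```

In this model, a READ and a FLUSH share no lock and can overlap, whereas a WRITE or DISCARD excludes both READs (through the `spinlock`) and FLUSHes (through the `rwsem`).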


### Non-reentrancy: the static "Flush State"

FLUSH operations need a potentially large amount of memory as their contextual state. To avoid allocating and freeing it for each request, we pre-allocate it once, at volume construction time. Access to it is (for the most part) <u>not governed by any locks</u>: the block layer offers the core guarantee that only one FLUSH request can be executing at any given time for each block device, so the state can never be written to concurrently.

This state consists, among other things, of the memory area hosting the encrypted (and serialised) position map; this buffer is read by the FLUSH's in-flight CWBs (Cacheline WriteBack requests), <u>while no lock protects it</u>: it is therefore necessary that no other code concurrently writes to it. This fundamental assumption of **non-reentrancy** of the FLUSH function is guaranteed, aside from the aforementioned property of the block layer, by the fact that its only other callers are the volume constructor and destructor, which can never execute concurrently with I/O on the volume.

In reality, WRITEs and DISCARDs might also need to *read* this "Flush State"; however, it is only *written to* by the FLUSH function. These accesses are therefore protected by the `rwsem`: WRITEs and DISCARDs read the state while holding it as readers, and the FLUSH writes to the state while holding it as writer. When the FLUSH only needs to *read* this state (while the CWBs are in-flight), no lock is needed.


### Data structures

The following picture illustrates the position map and its ancillary fields (including the FLUSH state), allocated once per volume.

![Position Map fields](images/posmap.drawio.svg)

The meaning of the constants and the fields is as follows:

- `NSLICES`: the number of 1-MiB slices in the device/volume. The `entries` array stores LSI => PSI mappings as a simple lookup table: both LSIs and PSIs fit in 32 bits. Unmapped LSIs map to 0xFFFFFFFF.
- `POSMAP_SIZE`: the byte-size of the `entries` array, rounded up to a multiple of the 4-KiB block size. The `crypt_entries` buffer is part of the FLUSH state, and essentially hosts the encrypted (and serialised) position map as it is stored in the disk header.
- `NBLOCKS`: the number of 4-KiB blocks that the position map takes up; this is `ceil(NSLICES/1024)`, or `POSMAP_SIZE/4096`. The position map block is essentially our "cache line" (unit of FLUSH): we keep per-block status information as Sequence Numbers and Bitfields. The `cwb_error` and `flush_pending` bitfields are all 0 "at rest", while no FLUSH is executing; this is not true of `snap_seqnum`, but its value does not matter when no FLUSH is executing.
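
As a worked example of these sizes, the following userspace C sketch computes `NBLOCKS` and `POSMAP_SIZE` from `NSLICES`. The macro and function names are illustrative assumptions, not the module's actual identifiers; the only facts used are the 4-KiB block size and the 32-bit entries stated above.

```c
#include <stddef.h>
#include <stdint.h>

#define SFLC_BLOCK_SIZE      4096u	/* bytes in a PosMap block */
#define SFLC_ENTRY_SIZE      4u		/* each PSI is stored in 32 bits */
#define SFLC_ENTRIES_PER_BLK (SFLC_BLOCK_SIZE / SFLC_ENTRY_SIZE)	/* 1024 */
#define SFLC_UNMAPPED        0xFFFFFFFFu	/* marker for unmapped LSIs */

/* NBLOCKS = ceil(NSLICES / 1024), computed in 64 bits to avoid overflow */
static uint32_t posmap_nblocks(uint32_t nslices)
{
	uint64_t n = (uint64_t)nslices + SFLC_ENTRIES_PER_BLK - 1;

	return (uint32_t)(n / SFLC_ENTRIES_PER_BLK);
}

/* POSMAP_SIZE = NBLOCKS * 4096 bytes */
static size_t posmap_size(uint32_t nslices)
{
	return (size_t)posmap_nblocks(nslices) * SFLC_BLOCK_SIZE;
}
```

For instance, a 1-TiB device has 1048576 slices of 1 MiB, so its position map spans 1024 blocks, i.e. 4 MiB.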


### Pseudo-code

In this section, we give pseudo-code for the critical sections of all four operations: READ, WRITE, DISCARD, FLUSH.

Except for FLUSH, all operations involve exactly one logical slice (LSI) of their volume, and therefore have to read the mapping `PSI <- entries[LSI]` exactly once, under the appropriate locking. WRITEs (with slice allocations) and DISCARDs also have to alter this mapping: therefore, they set the `dirty` bit on the appropriate PosMap block, and increment its corresponding sequence number (more on that later).

The goal of the FLUSH requests is to eventually clear the `dirty` bits, after writing the dirty PosMap blocks back to disk.
The `rwsem` is initially acquired (as writer) to "prepare the CWBs" (i.e. encrypt `entries` onto `crypt_entries`, and set the rest of the FLUSH state), and then released before actually sending them: this way, the pending I/O of the CWBs does not stall concurrent WRITEs.
It is then re-acquired at the end, once the CWBs return, to mark as clean the blocks that were successfully flushed and <u>that were not re-dirtied by a WRITE or a DISCARD while the lock was not held</u>. The sequence number mechanism is there exactly to detect whether a PosMap block has been dirtied again before its flushing was completed.
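
To make the timeline concrete, here is a minimal single-threaded model of one PosMap block going through a FLUSH, written as a userspace C sketch. The struct and function names are hypothetical, locking is omitted, and the 16-bit counter width is an arbitrary assumption; the cap on the sequence number is also left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-block status, mirroring the bitfields and sequence numbers */
struct blk_state {
	bool dirty;
	bool flush_pending;
	uint16_t seqnum;	/* width is an assumption of this sketch */
	uint16_t snap_seqnum;
};

/* A WRITE or DISCARD altering a mapping that lives in this block */
static void redirty(struct blk_state *b)
{
	b->dirty = true;
	b->seqnum++;
}

/* First locked phase of the FLUSH: snapshot the block if it is dirty */
static void snapshot(struct blk_state *b)
{
	if (!b->dirty)
		return;
	b->snap_seqnum = b->seqnum;
	b->flush_pending = true;
}

/* Final locked phase: clean the block only if its writeback succeeded
 * and nobody bumped the sequence number since the snapshot */
static void finish(struct blk_state *b, bool cwb_ok)
{
	if (b->flush_pending && cwb_ok && b->snap_seqnum == b->seqnum)
		b->dirty = false;
	b->flush_pending = false;
}
```

Calling `redirty()` between `snapshot()` and `finish()` leaves the block dirty, so the next FLUSH picks it up again; without that intermediate call, `finish()` clears the `dirty` bit.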


#### FLUSH

Let us start with the FLUSH handler, and its helper functions:

```rust
fn flush() {
	down_write(RWSEM);
	// Populate the FLUSH state, while we are holding the rwsem as writer
	err = prepare_cwbs();
	up_write(RWSEM);
	// After unlocking, WRITEs and DISCARDs are no longer blocked and
	// can go through, potentially re-dirtying the PosMap blocks

	// Locklessly *reads* the FLUSH state; this is fine here
	err = send_cwbs();
	// Send a FLUSH to the underlying device.
	// It is important to first wait for all CWB callbacks to finish.
	err = DEV.flush();

	down_write(RWSEM);
	// Some blocks might be marked clean, some might not; the overall FLUSH
	// operation is successful only if all CWBs were successful
	err = mark_blocks_clean();
	clear(CWB_ERROR);
	clear(FLUSH_PENDING);
	up_write(RWSEM);

	return err;
}


// Under rwsem
fn prepare_cwbs() {
	for_each_set_bit(DIRTY, block) {
		first_lsi = block*1024;
		last_lsi = first_lsi + 1024;
		CRYPT_ENTRIES[first_lsi : last_lsi] = encrypt(ENTRIES[first_lsi : last_lsi]);
		SNAP_SEQNUM[block] = SEQNUM[block];
		FLUSH_PENDING[block] = true;
	}

	// At the end, the whole SNAP_SEQNUM is equal to SEQNUM and
	// the whole FLUSH_PENDING is equal to DIRTY.
	return;
}


// No lock
fn send_cwbs() {
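	// Counts in-flight CWBs. Starts at 1: this extra reference is held by
	// send_cwbs() itself, so compl cannot fire before all CWBs are submitted
	atomic_t pending = 1;
	struct completion compl;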

	// We iterate over FLUSH_PENDING, not DIRTY
	for_each_set_bit(FLUSH_PENDING, block) {
		first_lsi = block*1024;
		last_lsi = first_lsi + 1024;

		atomic_inc(pending);
		// Pass (block, pending, compl) as context to callback
		err = DEV.write(CRYPT_ENTRIES[first_lsi : last_lsi], (block, pending, compl));
	}

	// dm-writecache pattern: drop our own reference; atomic_dec_and_test()
	// returns true when the counter reaches zero
	if (atomic_dec_and_test(pending))
		complete(compl);
	// Wait for all callbacks
	wait_for_completion(compl);

	return err;
}


// Async callback of CWB
fn cwb_callback(err, ctx) {
	if (err) {
		// Many callbacks could be executing concurrently
		spin_lock(CWB_ERROR_LOCK);
		CWB_ERROR[ctx.block] = true;	// Just log the error in the array
		spin_unlock(CWB_ERROR_LOCK);
	}

	// Whoever brings the counter to zero wakes up the waiter
	if (atomic_dec_and_test(ctx.pending))
		complete(ctx.compl);

	return;
}


// Under rwsem again
fn mark_blocks_clean() {
	err = false;

	for_each_set_bit(FLUSH_PENDING, block) {
		if (CWB_ERROR[block])
			err = true;
		else if (SNAP_SEQNUM[block] == SEQNUM[block])
			DIRTY[block] = false;
		// Nothing to do in the else branch
	}

	// Only return err = false if no CWB failed
	return err;
}
```

A few things to notice:

- The error paths always (implicitly) zero-out the `cwb_error` and the `flush_pending` bitfields, as they need to be clean at rest.
- In `mark_blocks_clean()`, the check `SNAP_SEQNUM[block] == SEQNUM[block]` is sufficient to ensure that no WRITE or DISCARD re-dirtied the cache line in the meantime. This is because, although the sequence number is allowed to wrap around, it is never allowed to wrap all the way back to the snapshot value while a FLUSH is pending (see later for details).
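
The `pending` counter used by `send_cwbs()` and `cwb_callback()` deserves a closer look. Below is a userspace C sketch of the same counting scheme, using C11 atomics in place of the kernel's `atomic_t`; all names are hypothetical, and `complete()` is replaced by a plain counter so that the behaviour can be checked.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct cwb_batch {
	atomic_int pending;	/* outstanding references */
	int completions;	/* times the completion was signalled */
};

/* Stand-in for the kernel's atomic_dec_and_test(): decrements the
 * counter and returns true iff it reached zero */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

static void batch_init(struct cwb_batch *b)
{
	atomic_init(&b->pending, 1);	/* the submitter's own reference */
	b->completions = 0;
}

/* Called by the submitter before sending each CWB */
static void batch_submit_one(struct cwb_batch *b)
{
	atomic_fetch_add(&b->pending, 1);
}

/* Called once per CWB callback, and once by the submitter at the end */
static void batch_put(struct cwb_batch *b)
{
	if (dec_and_test(&b->pending))
		b->completions++;	/* stands in for complete() */
}
```

Because the counter starts at 1, a callback that fires before the submission loop has finished cannot bring it to zero: the completion is signalled exactly once, by whichever of the submitter or the last callback drops the final reference. The same holds when there are no dirty blocks at all, in which case the submitter's own `batch_put()` signals it.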


#### WRITE

As was mentioned, the WRITE handler takes both the `rwsem` and the `spinlock`, for mutual exclusion with both FLUSHes and READs.
Here, we only describe the critical section, rather than the whole handler.

```rust
fn write(lba) {
	lsi = lba / 256;	// There are 256 4-KiB blocks in a slice
	block = lsi / 1024;	// The PosMap block this falls into

	// Take both locks
	down_read(RWSEM);
	spin_lock(LOCK);
	psi = ENTRIES[lsi];
	// If LSI is unmapped, sample a new PSI and insert it in the PosMap
	if (psi == 0xFFFFFFFF) {
		// If there's a FLUSH executing, ensure that one more increment would
		// not wrap the block's sequence number back to its snapshot value
		if (FLUSH_PENDING[block] && (SEQNUM[block] + 1 == SNAP_SEQNUM[block])) {
			// Release both locks before bailing out
			spin_unlock(LOCK);
			up_read(RWSEM);
			return -EAGAIN;	// Just try again later, after the FLUSH finished
		}

		// Implicitly takes the per-device spinlock here
		psi = DEV.get_next_random_psi();
		ENTRIES[lsi] = psi;
		DIRTY[block] = true;	// DIRTY is a per-block bitfield
		SEQNUM[block]++;	// Can wrap around
	}
	spin_unlock(LOCK);
	up_read(RWSEM);
}
```

We make sure not to increment the sequence number too many times (16384 times) while a FLUSH is executing; this way, the check `SNAP_SEQNUM[block] == SEQNUM[block]` is, as mentioned, sufficient for `mark_blocks_clean()` to conclude that the block was not re-dirtied. It is anyway overwhelmingly unlikely that the block gets re-dirtied 16384 times before its FLUSH can complete.
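
The capping rule can be sketched in userspace C as follows. This is only a model: the names are hypothetical, and the 14-bit counter width is an assumption inferred from the figure of 16384 quoted above. The idea shown is that an increment is refused when it would make the sequence number wrap back to the snapshot value while a FLUSH is pending.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_BITS 14u	/* assumption: 2^14 = 16384 distinct values */
#define SEQ_MASK ((1u << SEQ_BITS) - 1u)

struct blk_seq {
	bool flush_pending;
	uint16_t seqnum;
	uint16_t snap_seqnum;
};

/* Try to bump the sequence number; returns false where the handler
 * would bail out with -EAGAIN */
static bool try_bump(struct blk_seq *b)
{
	uint16_t next = (uint16_t)((b->seqnum + 1u) & SEQ_MASK);

	/* Refuse the one increment that would alias the snapshot */
	if (b->flush_pending && next == b->snap_seqnum)
		return false;

	b->seqnum = next;	/* wraps around modulo 2^SEQ_BITS */
	return true;
}
```

Under these assumptions, 16383 increments go through while a FLUSH is pending and the 16384th is refused, so `seqnum` can only equal `snap_seqnum` at the end of the FLUSH if the block was never touched.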


#### DISCARD

The DISCARD's critical section is similar to the WRITE's, in that it dirties a position map block and increments its sequence number. However, it does not allocate a slice from the device, so it does not need to take the per-device `spinlock`.
Here, we only describe the critical section, rather than the whole handler.

```rust
fn discard(lsi) {
	block = lsi / 1024;

	down_read(RWSEM);
	spin_lock(LOCK);
	psi = ENTRIES[lsi];
	// Unmap LSI
	if (psi != 0xFFFFFFFF) {
		// If there's a FLUSH executing, ensure that one more increment would
		// not wrap the block's sequence number back to its snapshot value
		if (FLUSH_PENDING[block] && (SEQNUM[block] + 1 == SNAP_SEQNUM[block])) {
			// Release both locks before bailing out
			spin_unlock(LOCK);
			up_read(RWSEM);
			return -EAGAIN;	// Just try again later, after the FLUSH finished
		}

		ENTRIES[lsi] = 0xFFFFFFFF;
		DIRTY[block] = true;	// DIRTY is a per-block bitfield
		SEQNUM[block]++;	// Can wrap around
	}
	spin_unlock(LOCK);
	up_read(RWSEM);
}
```


#### READ

The READ handler only takes the `spinlock`, and not the `rwsem`. This allows for concurrency between the READ's and the FLUSH's critical sections: this is fine because READs only *read* the `entries` without writing anything, and because the FLUSH handler does not *write* to `entries`.
Here, we only describe the critical section, rather than the whole handler.

```rust
fn read(lba) {
	lsi = lba / 256;	// There are 256 4-KiB blocks in a slice
	block = lsi / 1024;	// The PosMap block this falls into

	spin_lock(LOCK);
	psi = ENTRIES[lsi];
	spin_unlock(LOCK);
}
```
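
The index arithmetic shared by the READ and WRITE handlers can be checked with a small userspace C sketch. It assumes, as the comments in the pseudo-code suggest, that `lba` counts 4-KiB blocks; the struct, its field names and the function name are hypothetical.

```c
#include <stdint.h>

#define BLOCKS_PER_SLICE  256u	/* 4-KiB blocks in a 1-MiB slice */
#define ENTRIES_PER_BLOCK 1024u	/* 32-bit entries in a 4-KiB PosMap block */

struct posmap_index {
	uint32_t lsi;		/* logical slice index */
	uint32_t block;		/* PosMap block holding entries[lsi] */
	uint32_t slot;		/* position of the entry within that block */
	uint32_t offset;	/* 4-KiB block offset within the slice */
};

/* Decompose a logical block address (in 4-KiB units) */
static struct posmap_index locate(uint64_t lba)
{
	struct posmap_index idx;

	idx.lsi = (uint32_t)(lba / BLOCKS_PER_SLICE);
	idx.offset = (uint32_t)(lba % BLOCKS_PER_SLICE);
	idx.block = idx.lsi / ENTRIES_PER_BLOCK;
	idx.slot = idx.lsi % ENTRIES_PER_BLOCK;
	return idx;
}
```

For example, `lba = 262401` (that is, `256 * 1024 + 257`) falls in slice 1025, whose entry is the second one (slot 1) of PosMap block 1.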