Designing Locks

Posted Jan 21, 2024 Updated Feb 11, 2024

By Filippo Visconti 6 min read

Large-scale Locking and Remote Direct Memory Access

How to lock thousands or processes?

Question: is that useful given that locks provide mutual exclusivity? Mutual exclusivity means serialization increasing Amdahl’s serial fraction limiting scalability, right?

Locks: Challenges

We need intra- and inter-node topology-awareness
We need to cover arbitrary topologies
We need to distinguish between readers and writers
We need flexible performance for both types of processes

Designing Locks

Ingredient 1 - MCS Locks

Ingredient 2 - Reader-Writer Locks

Remote Memory Access Programming

RDMA support implemented in hardware in NICs in the majority of HPC networks

RMA-RW - Required Operations

Recitation recap: MPI RMA

How to ensure tunable performance? Each DQ: fairness vs throughput of writers DC: a parameter for the latency of readers vs writers A tradeoff parameter for every structure DT: a parameter for the throughput of readers vs writers

DISTRIBUTED MCS QUEUES (DQS) Throughput vs Fairness

Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

\[T_{𝐿,i}\]

Larger $T_{𝐿,i}$: more throughput at level i.
Smaller $T_{𝐿,i}$: more fairness at level i.

DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers

DT: The maximum number of consecutive lock passings within readers ($T_R$). a.k.a.: the number of readers you admit before checking for a writer.

DISTRIBUTED COUNTER (DC) Latency of readers vs writers

DC: every kth compute node hosts a partial counter, all of which constitute the DC. $k = T_{DC}$

Reminder: lock-free vs. wait-free

Explanation from Wikipedia

A locked method

May deadlock (methods may never finish) A lock-free method
Guarantees that infinitely often some method call finishes in a finite number of steps

A wait-free method

Guarantees that each method call finishes in a finite number of steps (implies lock-free)
Strongest guarantee – always nice to have, may be expensive (often in terms of memory) – crucial for critical systems

Recurring theme: helping (faster threads help slower threads, e.g., wait-free queues)

Synchronization instructions are not equally powerful!

e.g., CAS, FADD, Swap, atomic r/w, transactional memory, etc. – which one to use or implement?

Indeed, they form an infinite hierarchy; no instruction (primitive) in level $x$ can be used for lock-/wait-free implementations of primitives in level $z>x$.

If a program is lock-free, it basically means that at least one of its threads is guaranteed to make progress over an arbitrary period of time. If a program deadlocks, none of its threads (and therefore the program as a whole) cannot make progress - we can say it’s not lock-free. Since lock-free programs are guaranteed to make progress, they are guaranteed to complete (assuming finite execution without exceptions).

Wait-free is a stronger condition which means that every thread is guaranteed to make progress over an arbitrary period of time, regardless of the timing/ordering of thread execution; and so we can say that the threads finish independently. All wait-free programs are lock-free.

Concept: Consensus Number

Each level of the hierarchy has a “consensus number” assigned.

It’s the maximum number of threads for which primitives in level x can solve the consensus problem The consensus problem:
Has single function: decide(v)
Each thread calls it at most once, the function returns a value that meets two conditions:
- consistency: all threads get the same value
- validity: the value is some thread’s input
Simplification: binary consensus (inputs in {0,1})

It’s one of the simplest representatives of wait-free algorithms! Still shows all the issues

Understanding Consensus

Can a particular class solve n-thread consensus wait-free?

A class C solves n-thread consensus if there exists a consensus protocol using any number of objects of class C and any number of atomic registers
The protocol must be wait-free (bounded number of steps for each thread)
The consensus number of a class C is the largest n for which that class solves n-thread consensus (may be infinite)
Assume we have a class D whose objects can be constructed from objects out of class C. If class C has consensus number n, what does class D have?

Atomic Registers

Theorem [Herlihy’91]: Atomic registers have consensus number one

i.e., they cannot be used to solve even two-thread consensus! Really? Proof outline:
Assume arbitrary consensus protocol, thread A, B
Run until it reaches critical state where next action determines outcome (show that it must have a critical state first)
Show all options using atomic registers and show that they cannot be used to determine one outcome for all possible executions!
1. Any thread reads (other thread runs solo until end)
2. Threads write to different registers (order doesn’t matter)
3. Threads write to same register (solo thread can start after each write)

Corollary: It is impossible to construct a wait-free implementation of any object with consensus number of >1 using atomic registers

→ We need hardware atomics or Transactional Memory!

Consensus result needs to be valid: a.k.a. return one of the numbers proposed by some thread.

Consensus needs to be wait-free: All threads finish after a finite number of steps, independently of other threads.

Simplification: Binary Consensus

Instead of proposing an integer, every thread now proposes either 0 or 1

Equivalent to “normal” consensus for two threads
How can we proof this?
If we have int_decide(int) as primitive, we can implement bin_decide(bit) and vice-versa

Anatomy of a State (in Two-Thread Consensus)

State Diagrams of Two-thread Consensus Protocols

The Concept of Valency

In binary two-thread consensus, threads either decide zero (0) or one (1)

At some point during the execution (i.e., a state), each thread will “decide” what to return
We call a state where a thread has decided on one 1-valent and a state where a thread has decided on zero 0-valent
Undecided states are called bivalent – decided states are called univalent

Lemma 1: The initial state is bivalent

Proof outline:
- Consider initial state with A has input 0 and B has input 1
- If A finished before B starts, we must decide 0 and if B finishes before A starts, we must decide 1 (because it only knows the thread’s input!)

Thus, the initial state must be bivalent!

Bivalent State
A (bivalent) state is called critical if both child states are univalent!

Lemma 2: Every consensus protocol has a critical state

Proof: From (bivalent) start state, let the threads only move to other bivalent states.

If it runs forever the protocol is not wait free.
If it reaches a position where no moves are possible this state is critical.

Impossibility Proof Setup – Critical State

Impossibility Proof Setup – Possible actions of a thread

Many Cases to check

Impossibility Proof Case I: A reads

What did we just prove?

Impossibility Proof Case I’: B reads

What did we just prove?

Impossibility Proof Case II: A and B write to different registers

What did we just prove?

Impossibility Proof Case III: A and B write to the same register

Conclusion

Compare and Set/Swap Consensus

  
const int first = -1
volatile int thread = -1;
int proposed[n];
int decide(v) {
 proposed[tid] = v;
 if(CAS(thread, first, tid))
  return v; // I won!
 else
  return proposed[thread]; // thread won
}

CAS provides an infinite consensus number

Machines providing CAS are asynchronous computation equivalents of the Turing Machine
I.e., any concurrent object can be implemented in a wait-free manner (not necessarily fast!)

dphpc, lecture-notes

This post is licensed under CC BY 4.0 by the author.

Large-scale Locking and Remote Direct Memory Access

Locks: Challenges

Designing Locks

Ingredient 1 - MCS Locks

Ingredient 2 - Reader-Writer Locks

Remote Memory Access Programming

RMA-RW - Required Operations

Recitation recap: MPI RMA

DISTRIBUTED MCS QUEUES (DQS) Throughput vs Fairness

DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers

DISTRIBUTED COUNTER (DC) Latency of readers vs writers

Reminder: lock-free vs. wait-free

Concept: Consensus Number

Understanding Consensus

Atomic Registers

Simplification: Binary Consensus

Anatomy of a State (in Two-Thread Consensus)

State Diagrams of Two-thread Consensus Protocols

The Concept of Valency

Lemma 1: The initial state is bivalent

Lemma 2: Every consensus protocol has a critical state

Impossibility Proof Setup – Critical State

Impossibility Proof Setup – Possible actions of a thread

Many Cases to check

Impossibility Proof Case I: A reads

Impossibility Proof Case I’: B reads

Impossibility Proof Case II: A and B write to different registers

Impossibility Proof Case III: A and B write to the same register

Conclusion

Compare and Set/Swap Consensus

Trending Tags