Work-stealing

Posted Jan 24, 2024 Updated Feb 11, 2024

By Filippo Visconti 2 min read

Little’s law

Given a stable system (i.e., input rate = output rate), the long-term average number of items in the system, $𝑁$, is equal to the average throughput, $\beta$, multiplied by the average time an item spends in the system, $\alpha$.

\[N = \alpha \cdot \beta\]

Little’s law tells us how much data is in flight.

Example: Memory System

Arithmetic intensity

Given a program P, express number of operations per loaded data.

Arithmetic intensity $A(n)=\frac {W(n)} {L(n)}= \frac {\text{num flops}} {\text{num bytes loaded}}$

Advantage: very simple to calculate!

Quite useful for basic algorithmic optimizations (but very limited)

Disadvantage: does not model the effect of a cache/scratchpad

A cache would be able to serve many loads at highly reduced cost

Operational intensity refines this concept.

Operational intensity

Given a program $P$, assume cold (empty) cache

Operational intensity
\[I(n)=\frac {W(n)} {Q(n)}= \frac {\text{num flops}} {\text{bytes transferred between cache and RAM}}\]

Asymptotic bounds on 𝑰(n)

Assuming only compulsory misses (infinite cache size for simplicity)

Operation	$I(n)$
Vector sum: $y = x + y$	$O(1)$
Matrix-vector product: $y = Ax$	$O(1)$
Fast Fourier transform	$O(\log n)$
Matrix-matrix product: $C = AB +C$	$O(n)$

Balance principles I (Kung 1986)

Roofline model

Different optimizations can be represented as performance ceilings ⇒ you cannot break through a ceiling without performing the associated optimization.

Assume CPU performance $𝜋$ increases by a factor of $a$ (i.e., $a𝜋$), how to rebalance?

Algorithm/program model

Work: $W$ [number of ops]

Data movement: $𝑄_𝑚$ [byte] (mem cache)

\[\frac \pi \beta = \frac W {Q_m}=I\]

How much do we need to increase the cache size, $𝛾$, to balance the increase in CPU performance?

Alpha-Beta model

Time taken to transfer data of size $n$:

\[T(n) = \frac n \beta+\alpha\]

In some literature, $\beta $ describes inverse bandwidth, and the cost of transferring data grows linearly with $n$.

Balance Principles II

Objective: More detailed balance principles for multicores, assessment of hardware trends.

Derivation

Estimate $T_{mem}$ by dividing DAG into levels.

\[Q_{\gamma,\lambda}=\sum^D_{i=1} q_i\]

Then, apply alpha-beta model to each level:

\[T_{mem}\approx \sum^D_{i=1}(\frac {q_i \lambda} \beta + \alpha)=\alpha D+ \frac {Q\lambda} \beta\]

Brent’s lemma:

\[T_{comp} \approx \frac {D+\frac W p}\pi\]

Balance Principle
$T_{mem} \leq T_{comp}$
\[\frac {p \pi} {\beta} (1+\frac {\alpha \beta / \lambda} {Q/D}) \leq \frac {W} {Q\lambda} (1+\frac p {W/D})\]

Matrix multiplication

dphpc, lecture-notes

This post is licensed under CC BY 4.0 by the author.