Post

Work-stealing

Little’s law

Given a stable system (i.e., input rate = output rate), the long-term average number of items in the system, $𝑁$, is equal to the average throughput, $\beta$, multiplied by the average time an item spends in the system, $\alpha$.

\[N = \alpha \cdot \beta\]

Little’s law tells us how much data is in flight.

Example: Memory System

shutup

Arithmetic intensity

Given a program P, express number of operations per loaded data.

Arithmetic intensity \(A(n)=\frac {W(n)} {L(n)}= \frac {\text{num flops}} {\text{num bytes loaded}}\)

Advantage: very simple to calculate!

  • Quite useful for basic algorithmic optimizations (but very limited)

Disadvantage: does not model the effect of a cache/scratchpad

  • A cache would be able to serve many loads at highly reduced cost

Operational intensity refines this concept.

Operational intensity

Given a program $P$, assume cold (empty) cache

Operational intensity

\[I(n)=\frac {W(n)} {Q(n)}= \frac {\text{num flops}} {\text{bytes transferred between cache and RAM}}\]

Asymptotic bounds on 𝑰(n)

Assuming only compulsory misses (infinite cache size for simplicity)

Operation$I(n)$
Vector sum: $y = x + y$$O(1)$
Matrix-vector product: $y = Ax$$O(1)$
Fast Fourier transform$O(\log n)$
Matrix-matrix product: $C = AB +C$$O(n)$

shutup

Balance principles I (Kung 1986)

shutup shutup

Roofline model

shutup

Different optimizations can be represented as performance ceilings ⇒ you cannot break through a ceiling without performing the associated optimization. shutup

shutup

Assume CPU performance $𝜋$ increases by a factor of $a$ (i.e., $a𝜋$), how to rebalance? shutup

Algorithm/program model

Work: $W$ [number of ops]

Data movement: $𝑄_𝑚$ [byte] (mem cache)

\[\frac \pi \beta = \frac W {Q_m}=I\]

How much do we need to increase the cache size, $𝛾$, to balance the increase in CPU performance? shutup

Alpha-Beta model

shutup

Time taken to transfer data of size $n$:

\[T(n) = \frac n \beta+\alpha\]

In some literature, $\beta $ describes inverse bandwidth, and the cost of transferring data grows linearly with $n$.

Balance Principles II

Objective: More detailed balance principles for multicores, assessment of hardware trends. shutup shutup

Derivation

Estimate $T_{mem}$ by dividing DAG into levels.

\[Q_{\gamma,\lambda}=\sum^D_{i=1} q_i\]

shutup

Then, apply alpha-beta model to each level:

\[T_{mem}\approx \sum^D_{i=1}(\frac {q_i \lambda} \beta + \alpha)=\alpha D+ \frac {Q\lambda} \beta\]

Brent’s lemma:

\[T_{comp} \approx \frac {D+\frac W p}\pi\]

Balance Principle

$T_{mem} \leq T_{comp}$

\[\frac {p \pi} {\beta} (1+\frac {\alpha \beta / \lambda} {Q/D}) \leq \frac {W} {Q\lambda} (1+\frac p {W/D})\]

shutup

Matrix multiplication

shutup

This post is licensed under CC BY 4.0 by the author.