Lesson 9 — Tree reduction
Adding N numbers on the GPU is not as simple as it looks. You cannot just let every thread accumulate into the same variable — concurrent writes would race. Instead, GPU reduction uses a tree: in each round, half the active threads each add one pair of values, halving the working set. After log⊂2;(N) rounds only one value remains: the total sum.
The pattern, for a single workgroup of 256 threads:
- Load each element into shared memory (
sdata), initialising out-of-bounds threads to0.0. - Barrier — ensure all loads are complete.
- Set stride
s = 128. Whiles > 0:- Thread
tid < saddssdata.(tid + s)intosdata.(tid). - Barrier after each round.
- Halve the stride:
s := s / 2.
- Thread
- Thread 0 writes
sdata.(0)tooutput.(block_idx_x).
After 7 rounds (128 → 64 → 32 → 16 → 8 → 4 → 2 → 1) the full sum of all
256 elements lives in sdata.(0), then in output.(0).
Your task
The inner accumulation expression is missing. Replace
(* TODO *) — currently the placeholder 0.0
— with the correct expression, then click
Run on my GPU. The auto-checker compares
output.(0) against the expected sum of all 256 input elements.
Fill in the TODO and click "Run on my GPU".
Hint
Each active thread (those with tid < s) should add the
element at distance s from itself. The slot to add is
sdata.(tid + s).
Replace (* TODO *) 0.0 with sdata.(tid + s).
The full accumulation line becomes:
sdata.(tid) <- sdata.(tid) +. sdata.(tid + s)
How the tree works step by step (N=8 example)
With sdata = [3, 1, 4, 1, 5, 9, 2, 6] and starting
stride s = 4:
Round s=4: sdata[0]+=sdata[4] → 3+5=8
sdata[1]+=sdata[5] → 1+9=10
sdata[2]+=sdata[6] → 4+2=6
sdata[3]+=sdata[7] → 1+6=7
sdata = [8, 10, 6, 7, ...]
Round s=2: sdata[0]+=sdata[2] → 8+6=14
sdata[1]+=sdata[3] → 10+7=17
sdata = [14, 17, ...]
Round s=1: sdata[0]+=sdata[1] → 14+17=31
sdata = [31, ...] ✓ (3+1+4+1+5+9+2+6 = 31)