Lesson 9 — Tree reduction

Adding N numbers on the GPU is not as simple as it looks. You cannot just let every thread accumulate into the same variable — concurrent writes would race. Instead, GPU reduction uses a tree: in each round, half the active threads each add one pair of values, halving the working set. After log⊂2;(N) rounds only one value remains: the total sum.

The pattern, for a single workgroup of 256 threads:

  1. Load each element into shared memory (sdata), initialising out-of-bounds threads to 0.0.
  2. Barrier — ensure all loads are complete.
  3. Set stride s = 128. While s > 0:
    • Thread tid < s adds sdata.(tid + s) into sdata.(tid).
    • Barrier after each round.
    • Halve the stride: s := s / 2.
  4. Thread 0 writes sdata.(0) to output.(block_idx_x).

After 7 rounds (128 → 64 → 32 → 16 → 8 → 4 → 2 → 1) the full sum of all 256 elements lives in sdata.(0), then in output.(0).

Your task

The inner accumulation expression is missing. Replace (* TODO *) — currently the placeholder 0.0 — with the correct expression, then click Run on my GPU. The auto-checker compares output.(0) against the expected sum of all 256 input elements.

Fill in the TODO and click "Run on my GPU".
Hint

Each active thread (those with tid < s) should add the element at distance s from itself. The slot to add is sdata.(tid + s).

Replace (* TODO *) 0.0 with sdata.(tid + s). The full accumulation line becomes:

  sdata.(tid) <- sdata.(tid) +. sdata.(tid + s)
How the tree works step by step (N=8 example)

With sdata = [3, 1, 4, 1, 5, 9, 2, 6] and starting stride s = 4:

Round s=4:  sdata[0]+=sdata[4]  →  3+5=8
            sdata[1]+=sdata[5]  →  1+9=10
            sdata[2]+=sdata[6]  →  4+2=6
            sdata[3]+=sdata[7]  →  1+6=7
            sdata = [8, 10, 6, 7, ...]

Round s=2:  sdata[0]+=sdata[2]  →  8+6=14
            sdata[1]+=sdata[3]  →  10+7=17
            sdata = [14, 17, ...]

Round s=1:  sdata[0]+=sdata[1]  →  14+17=31
            sdata = [31, ...]   ✓  (3+1+4+1+5+9+2+6 = 31)