Lesson 8 — Shared memory & barriers
So far every thread has read directly from global memory (the large buffers uploaded from the CPU). Global memory is relatively slow because each access crosses the full GPU memory bus. Most GPUs also expose a much faster on-chip scratchpad: shared memory (called workgroup memory in WebGPU/WGSL). A block of threads (a workgroup) can load data into shared memory once and then read it many times at a far lower cost per access than going back to global memory.
The pattern used in almost every tiled GPU algorithm is:
- All threads in a workgroup load their element from global memory into a shared array.
- Call block_barrier (). This is a synchronisation barrier: every thread in the workgroup stops here until all threads have finished their load. Without it, a thread might read a slot of the shared array before a slower neighbour has written it.
- Now every thread can safely read any element of the shared array, which is guaranteed to be fully populated, and write the result to global memory.
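The three steps can be mimicked on the CPU to see what the barrier buys. The sketch below is plain OCaml, not Sarek: it simulates a single workgroup of four threads in which each thread reads the slot of a different thread (the tile is reversed). The names `workgroup_size`, `run_with_barrier` and `run_without_barrier`, the input values, and the sentinel `-1.0` for "not yet written" are all made up for illustration. The "without barrier" version shows only one possible schedule; a real GPU may interleave threads differently.

```ocaml
(* CPU-only simulation of one workgroup; plain OCaml, not Sarek. *)
let workgroup_size = 4

(* Hypothetical input: one element per thread. *)
let input = [| 10.0; 20.0; 30.0; 40.0 |]

(* Sentinel marking a tile slot that no thread has written yet. *)
let unwritten = -1.0

(* With a barrier: every thread finishes its load before any thread
   reads another thread's slot. *)
let run_with_barrier () =
  let tile = Array.make workgroup_size unwritten in
  let output = Array.make workgroup_size unwritten in
  for t = 0 to workgroup_size - 1 do
    tile.(t) <- input.(t)                        (* step 1: load *)
  done;
  (* step 2: the barrier sits here, all loads above are complete *)
  for t = 0 to workgroup_size - 1 do
    output.(t) <- tile.(workgroup_size - 1 - t)  (* step 3: read a neighbour *)
  done;
  output

(* Without a barrier: one possible schedule in which each thread runs its
   load and its read back to back, so early threads see unwritten slots. *)
let run_without_barrier () =
  let tile = Array.make workgroup_size unwritten in
  let output = Array.make workgroup_size unwritten in
  for t = 0 to workgroup_size - 1 do
    tile.(t) <- input.(t);
    output.(t) <- tile.(workgroup_size - 1 - t)
  done;
  output

let () =
  let show a =
    String.concat " " (Array.to_list (Array.map (Printf.sprintf "%g") a))
  in
  Printf.printf "with barrier:    %s\n" (show (run_with_barrier ()));
  Printf.printf "without barrier: %s\n" (show (run_without_barrier ()))
```

In this simulated schedule the first two threads read slots that have not been written yet, which is exactly the hazard the barrier rules out.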
In Sarek, shared memory is declared with the let%shared binding:
let%shared (tile : float32) = ()
The () means “allocate one slot per thread in the
workgroup” (the size comes from the dispatch configuration). The array
is then indexed by thread_idx_x, the local thread
index within the workgroup.
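To make the difference between the local and the global index concrete, here is a small sketch in plain OCaml (not Sarek). It assumes a 1-D dispatch in which the global index equals the workgroup index times the workgroup size plus the local index; that formula is the usual convention, not something this lesson states. The function `split_index` and the parameter `block_dim` are hypothetical names.

```ocaml
(* Plain OCaml sketch, not Sarek. Assumes the usual 1-D convention:
   global index = workgroup index * workgroup size + local index. *)
let split_index ~block_dim global_id =
  let block_idx = global_id / block_dim in     (* which workgroup *)
  let thread_idx = global_id mod block_dim in  (* slot in that workgroup's tile *)
  (block_idx, thread_idx)

let () =
  let block_dim = 64 in  (* hypothetical workgroup size *)
  List.iter
    (fun g ->
      let block_idx, thread_idx = split_index ~block_dim g in
      Printf.printf "global %3d -> workgroup %d, tile slot %2d\n"
        g block_idx thread_idx)
    [ 0; 63; 64; 130 ]
```

Under that assumption, threads with global indices 0 and 64 both use slot 0, but in different workgroups, so they never touch the same tile.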
Your task
The kernel below implements a tiled copy: load each
element from input into the shared tile, synchronise, then
write it back to output. One line is missing — the
barrier between the load and the store. Replace the placeholder
(* TODO: add a barrier here *) () with
block_barrier (), then click
Run on my GPU.
Hint: avoid naming variables gid in kernels —
the WGSL backend reserves that name for the built-in; use
global_thread_id directly.
Fill in the TODO and click "Run on my GPU".
Hint
Replace (* TODO: add a barrier here *) () with
block_barrier (). The barrier takes a unit argument
(()) and returns unit — it fits directly as a
statement in a semicolon-separated sequence.
After the fix the key lines read:
tile.(thread_idx_x) <- input.(global_thread_id) ;
block_barrier () ;
output.(global_thread_id) <- tile.(thread_idx_x)
Why does this matter?
Without the barrier, thread 0 might reach the store
output.(global_thread_id) <- tile.(thread_idx_x)
before thread 1 has finished writing its slot into tile.
In a more complex tiled algorithm where threads exchange data through
shared memory (e.g. tiled matrix multiplication or tree reduction),
reading a neighbour’s slot before it is written is a data race: the
reader may see an old or uninitialised value. The barrier removes the
race by guaranteeing that every write to shared memory issued before the
barrier has completed before any thread in the workgroup proceeds past it.
In this tiled-copy lesson each thread reads back only its own slot, so the barrier is technically redundant for correctness here — but inserting it is the habit that makes real tiled algorithms safe.
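A tree reduction is a case where the barrier is not optional, because each step reads slots that other threads wrote in the previous step. The sketch below simulates this on the CPU in plain OCaml (not Sarek). Each pass of the inner loop stands in for all threads executing one step, and the boundary between passes stands in for block_barrier (). The function name `tree_reduce_sum` and the power-of-two size restriction are assumptions made for this illustration.

```ocaml
(* CPU-only simulation in plain OCaml, not Sarek. One pass of the inner
   loop models all threads running one step; the boundary between passes
   models block_barrier (). *)
let tree_reduce_sum (input : float array) : float =
  let n = Array.length input in
  (* This sketch assumes a non-empty, power-of-two workgroup size. *)
  assert (n > 0 && n land (n - 1) = 0);
  let tile = Array.copy input in  (* load phase: fill the shared tile *)
  let stride = ref (n / 2) in
  while !stride > 0 do
    (* barrier here: every write from the previous step is complete *)
    for t = 0 to !stride - 1 do
      (* thread t adds the slot last written by thread t + stride *)
      tile.(t) <- tile.(t) +. tile.(t + !stride)
    done;
    stride := !stride / 2
  done;
  tile.(0)

let () =
  Printf.printf "%g\n"
    (tree_reduce_sum [| 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0 |])
  (* prints 36 *)
```

On a GPU, dropping the barrier between steps would let a thread add a partial sum that its neighbour has not finished computing, so the result would depend on scheduling.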