Lesson 8 — Shared memory & barriers

So far every thread has read directly from global memory (the large buffers uploaded from the CPU). Global memory is relatively slow because each access crosses the full GPU memory bus. Most GPUs also expose a much faster on-chip scratchpad: shared memory (called workgroup memory in WebGPU/WGSL). A block of threads (a workgroup) can load data into shared memory once and then read it many times at a fraction of the cost of a global memory access.

The pattern used in almost every tiled GPU algorithm is:

  1. All threads in a workgroup load their element from global memory into a shared array.
  2. Call block_barrier (), a synchronisation barrier. Every thread in the workgroup waits here until all threads have finished their load. Without it, a fast thread might read a slot of the shared array before a slower neighbour has written it, and so see stale or uninitialised data.
  3. Now every thread can safely read any element of the shared array — guaranteed to be fully populated — and write the result to global memory.
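
The three steps can be modelled on the CPU in plain OCaml. This is only a sketch, not Sarek kernel code: the workgroup size of 4, the helper run_workgroup, and the choice to have each thread read its right-hand neighbour's slot are all assumptions made for illustration. The barrier is modelled as the boundary between two loops, so no read starts until every load has finished.

```ocaml
(* CPU model of the load / barrier / read pattern. Plain OCaml, not Sarek. *)
let workgroup_size = 4 (* hypothetical size, chosen for the example *)

(* Simulate one workgroup working on the slice of [input] starting at [base].
   Each simulated thread reads the slot of its right-hand neighbour. *)
let run_workgroup input output base =
  let tile = Array.make workgroup_size 0.0 in
  (* Step 1: every thread loads its own element into the shared tile. *)
  for t = 0 to workgroup_size - 1 do
    tile.(t) <- input.(base + t)
  done;
  (* Step 2: the barrier. In this sequential model it is simply the point
     at which the loading loop has finished for all threads. *)
  (* Step 3: every thread may now read any slot of the tile. *)
  for t = 0 to workgroup_size - 1 do
    let neighbour = (t + 1) mod workgroup_size in
    output.(base + t) <- tile.(neighbour)
  done

let input = [| 10.; 11.; 12.; 13.; 20.; 21.; 22.; 23. |]
let output = Array.make (Array.length input) 0.0

let () =
  (* Two workgroups, each with its own private tile. *)
  run_workgroup input output 0;
  run_workgroup input output workgroup_size
```

Each workgroup gets a fresh tile, which mirrors the fact that shared memory is private to a workgroup: threads in one workgroup never see another workgroup's tile.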

In Sarek, shared memory is declared with the let%shared binding:

let%shared (tile : float32) = ()

The () means “allocate one slot per thread in the workgroup” (the size comes from the dispatch configuration). The array is then indexed by thread_idx_x, the local thread index within the workgroup.
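
To make the two index spaces concrete, the sketch below prints which global element each tile slot corresponds to. It is plain OCaml running on the CPU and assumes the usual 1-D layout in which the global index is the workgroup index times the workgroup size plus the local index. The names block_idx_x and block_dim_x and the dispatch of 3 workgroups of 4 threads are assumptions for illustration, not something this lesson defines.

```ocaml
(* Hypothetical dispatch: 3 workgroups of 4 threads each. *)
let block_dim_x = 4
let num_blocks = 3

(* Assumed 1-D mapping from (workgroup index, local index) to global index. *)
let global_thread_id ~block_idx_x ~thread_idx_x =
  block_idx_x * block_dim_x + thread_idx_x

let () =
  for block_idx_x = 0 to num_blocks - 1 do
    for thread_idx_x = 0 to block_dim_x - 1 do
      Printf.printf "workgroup %d: tile.(%d) <-> global element %d\n"
        block_idx_x thread_idx_x
        (global_thread_id ~block_idx_x ~thread_idx_x)
    done
  done
```

Under this assumption the tile index restarts at 0 in every workgroup while the global index keeps counting, which is why the tile needs only one slot per thread in the workgroup.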

Your task

The kernel below implements a tiled copy: load each element from input into the shared tile, synchronise, then write it back to output. One line is missing — the barrier between the load and the store. Replace the placeholder (* TODO: add a barrier here *) () with block_barrier (), then click Run on my GPU.

Hint: avoid naming variables gid in kernels — the WGSL backend reserves that name for the built-in; use global_thread_id directly.

Hint

Replace (* TODO: add a barrier here *) () with block_barrier (). The barrier takes a unit argument (()) and returns unit — it fits directly as a statement in a semicolon-separated sequence.

After the fix the key lines read:

  tile.(thread_idx_x) <- input.(global_thread_id) ;
  block_barrier () ;
  output.(global_thread_id) <- tile.(thread_idx_x)

Why does this matter?

Without the barrier, thread 0 might reach the store output.(global_thread_id) <- tile.(thread_idx_x) before thread 1 has finished writing its slot into tile. In a more complex tiled algorithm where threads exchange data through shared memory (e.g. tiled matrix multiplication or tree reduction), reading a neighbour’s slot before it is written is a data race: the read may return stale or uninitialised data. The barrier prevents this by guaranteeing that every thread in the workgroup has completed its writes to shared memory before any thread proceeds to read.
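
The effect of a missing barrier can be illustrated with a small CPU model in plain OCaml. This is a sketch under stated assumptions, not Sarek code: the function read_neighbour, the sentinel value -1.0 for an unwritten slot, and the workgroup size of 4 are all hypothetical. Without the barrier, the model runs each thread's load and store back to back, which is one of the many schedules a real GPU is free to choose.

```ocaml
let workgroup_size = 4 (* hypothetical size, chosen for the example *)
let sentinel = -1.0    (* marks a tile slot that has not been written yet *)

(* Each simulated thread loads its own element into the tile, then reads the
   slot of its right-hand neighbour. [with_barrier] selects the schedule. *)
let read_neighbour ~with_barrier input =
  let tile = Array.make workgroup_size sentinel in
  let output = Array.make workgroup_size 0.0 in
  let load t = tile.(t) <- input.(t) in
  let store t = output.(t) <- tile.((t + 1) mod workgroup_size) in
  if with_barrier then begin
    (* All loads complete before any store begins. *)
    for t = 0 to workgroup_size - 1 do load t done;
    for t = 0 to workgroup_size - 1 do store t done
  end else begin
    (* No barrier: a thread may read a slot its neighbour has not written. *)
    for t = 0 to workgroup_size - 1 do load t; store t done
  end;
  output

let input = [| 10.; 11.; 12.; 13. |]
let safe = read_neighbour ~with_barrier:true input
let racy = read_neighbour ~with_barrier:false input
```

In this model safe holds the neighbours' values, while racy contains the sentinel wherever a thread read a slot before it was written. On real hardware the outcome of the racy version depends on scheduling, so it may even appear to work on some runs.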

In this tiled-copy lesson each thread reads back only its own slot, so the barrier is technically redundant for correctness here — but inserting it is the habit that makes real tiled algorithms safe.