Lesson 7 — Composing kernels

A single kernel maps over one array at a time, but many algorithms are naturally expressed as a pipeline of kernels: the output of one kernel becomes the input of the next. This lesson shows how to chain two kernels — square and add — to compute out[i] = a[i]² + b[i] in two GPU passes.
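As a point of comparison, the same two-stage computation can be sketched on the CPU in plain OCaml. This is only a reference model, not the lesson's kernel syntax: the names square, add, and square_then_add are illustrative, and no GPU is involved.

```ocaml
(* CPU reference for the two-stage pipeline. Illustrative names only;
   this is not the kernel syntax used in the lesson's editors. *)

(* Stage A: tmp[i] = a[i] * a[i] *)
let square (a : float array) : float array =
  Array.map (fun x -> x *. x) a

(* Stage B: out[i] = tmp[i] + b[i].
   Array.map2 raises Invalid_argument if the lengths differ. *)
let add (tmp : float array) (b : float array) : float array =
  Array.map2 (fun t y -> t +. y) tmp b

(* The pipeline: the output of stage A is the input of stage B. *)
let square_then_add a b = add (square a) b

let () =
  let out = square_then_add [| 1.0; 2.0; 3.0 |] [| 10.0; 20.0; 30.0 |] in
  Array.iter (fun x -> Printf.printf "%g " x) out;
  print_newline ()
(* prints: 11 24 39 *)
```

A reference like this is handy for checking the values that come back from the GPU run.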

Unlike the earlier lessons, this one has an OCaml host program (compiled to JavaScript with js_of_ocaml) that orchestrates the two GPU dispatches. The host transpiles each kernel source, runs Kernel A to produce a temporary buffer tmp, then runs Kernel B with tmp and b to produce the final result.

Host orchestration (OCaml)

(* Host program (runs in the browser via js_of_ocaml) *)
let run kA_src kB_src ~a ~b ~cb =
  (* Stage 1: transpile Kernel A, then dispatch it to fill tmp. *)
  match of_source_with_abi WGSL kA_src with
  | Error e -> cb_err cb e
  | Ok (wgslA, abiA) ->
    Webgpu_runtime.run ~wgsl:wgslA ~abi_json:abiA
      ~inputs:[("a", a); ("tmp", zeros)]
      ~outputs_wanted:["tmp"]
      ~on_done:(fun [{tmp}] ->
        (* Stage 2: transpile Kernel B, then dispatch it with tmp and b. *)
        match of_source_with_abi WGSL kB_src with
        | Error e -> cb_err cb e
        | Ok (wgslB, abiB) ->
          Webgpu_runtime.run ~wgsl:wgslB ~abi_json:abiB
            ~inputs:[("tmp", tmp); ("b", b); ("out", zeros)]
            ~outputs_wanted:["out"]
            (* Hand the final buffer back to the caller. *)
            ~on_done:(fun [{out}] -> cb_ok cb out) ())
      ()

Note: the two kernels currently round-trip through the host between stages (tmp is read back from the GPU before being re-uploaded for Kernel B). True GPU-side buffer persistence — avoiding the round-trip — is a planned runtime enhancement.
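The host program above nests one callback per stage, and the nesting grows with every stage added. One possible way to keep a longer pipeline flat is a small continuation-passing bind operator. The sketch below is hypothetical: stage, (>>=), square_stage, add_stage, and pipeline are illustrative names that are not part of the lesson's runtime, and the two stages are CPU stand-ins for the GPU dispatches.

```ocaml
(* Hypothetical sketch: none of these names belong to the lesson's runtime. *)

(* A stage hands either a result or an error message to its continuation. *)
type 'a stage = (('a, string) result -> unit) -> unit

(* Run [s]; on success feed its value to [f], on error skip the rest. *)
let ( >>= ) (s : 'a stage) (f : 'a -> 'b stage) : 'b stage =
  fun k ->
    s (function
      | Error e -> k (Error e)
      | Ok v -> f v k)

(* CPU stand-in for the Kernel A dispatch. *)
let square_stage (a : float array) : float array stage =
  fun k -> k (Ok (Array.map (fun x -> x *. x) a))

(* CPU stand-in for the Kernel B dispatch. *)
let add_stage (b : float array) (tmp : float array) : float array stage =
  fun k ->
    if Array.length tmp <> Array.length b then k (Error "length mismatch")
    else k (Ok (Array.map2 ( +. ) tmp b))

(* The two stages chained without nested callbacks. *)
let pipeline a b : float array stage = square_stage a >>= add_stage b

let () =
  pipeline [| 1.0; 2.0 |] [| 3.0; 4.0 |] (function
    | Ok out -> Array.iter (Printf.printf "%g ") out; print_newline ()
    | Error e -> prerr_endline e)
(* prints: 4 8 *)
```

An error in any stage short-circuits the rest, which mirrors the two Error branches in the host program.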

Your task

Fill in the two kernels below and click Run on my GPU. Both editors use the same OCaml kernel syntax you have seen in earlier lessons. The host program above glues them together automatically.

Hint

Kernel A: a.(i) *. a.(i) (OCaml uses *. for float multiplication). Kernel B: tmp.(i) +. b.(i).