Getting Started with Sarek

Sarek is a high-performance framework for GPGPU programming in OCaml. It allows you to write kernels directly in OCaml syntax and execute them on various backends including CUDA, OpenCL, Vulkan, and Metal.

Installation

Prerequisites

OCaml: 5.4.0+ (Sarek relies on the effects and domains support of OCaml 5)
Dune: 3.20+
GPU Drivers (optional):
- CUDA: NVIDIA drivers + CUDA Toolkit
- OpenCL: OpenCL runtime (Intel NEO, ROCm, etc.)
- Vulkan: Vulkan SDK + glslangValidator
- Metal: macOS 10.13+ (included with Xcode)

Installing from Source

Sarek is not yet in the official opam repository. Install from source:

# Clone the repository
git clone https://github.com/mathiasbourgoin/Sarek.git
cd Sarek

# Install dependencies
opam install . --deps-only -y

# Build
dune build

# Optional: install locally in your opam switch
opam install .

GPU backends (CUDA, OpenCL, Vulkan, Metal) are automatically detected and enabled based on available drivers and SDKs on your system.

Your First Kernel: Vector Addition

Here is a complete vector addition example. Sarek uses the %kernel PPX extension (written below as let%kernel) to define code that runs on the GPU.

open Sarek

(* 1. Define the kernel *)
let%kernel vector_add (a : float32 vector) (b : float32 vector) (c : float32 vector) =
  (* Get the global thread ID *)
  let idx = get_global_id 0 in
  
  (* No bounds check: the launch below starts exactly one thread per element *)
  (* Note: Arrays use unsafe access syntax inside kernels for performance *)
  c.(idx) <- a.(idx) +. b.(idx)

let () =
  (* 2. Initialize input data *)
  let n = 1024 in
  let a = Vector.create Float32 n in
  let b = Vector.create Float32 n in
  let c = Vector.create Float32 n in
  
  (* Fill vectors with data *)
  for i = 0 to n - 1 do
    Vector.set a i (float_of_int i);
    Vector.set b i (float_of_int (i * 2));
  done;

  (* 3. Select a device (auto-detects available GPU) *)
  let device = Device.get_default () in
  Printf.printf "Using device: %s\n" (Device.name device);

  (* 4. Execute the kernel *)
  (* Grid: 4 blocks, Block: 256 threads -> 1024 total threads *)
  Execute.run vector_add 
    ~device 
    ~grid:(4, 1, 1) 
    ~block:(256, 1, 1) 
    [Vec a; Vec b; Vec c];

  (* 5. Check results *)
  let result = Vector.get c 10 in
  Printf.printf "c[10] = %f\n" result
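
The launch geometry above works because 1024 is an exact multiple of the block size. As a rough illustration of the general case, here is a plain OCaml sketch that runs on the CPU and does not use Sarek at all: it computes the number of blocks by ceiling division and emulates the per-thread index calculation, including the bounds guard a kernel needs when n is not a multiple of the block size. The names blocks_for and vector_add_cpu are hypothetical helpers introduced for this sketch, not part of the Sarek API.

```ocaml
(* Hypothetical helper (not a Sarek function): smallest number of blocks
   such that blocks * block_size >= n, i.e. ceiling division. *)
let blocks_for ~n ~block_size = (n + block_size - 1) / block_size

(* CPU emulation of the launch: one inner-loop iteration per
   (block, thread) pair. The guard skips the padding threads that exist
   when n is not a multiple of block_size. *)
let vector_add_cpu ~block_size a b =
  let n = Array.length a in
  let blocks = blocks_for ~n ~block_size in
  let c = Array.make n 0.0 in
  for block = 0 to blocks - 1 do
    for thread = 0 to block_size - 1 do
      let idx = (block * block_size) + thread in
      if idx < n then c.(idx) <- a.(idx) +. b.(idx)
    done
  done;
  c

let () =
  let n = 1024 in
  let a = Array.init n float_of_int in
  let b = Array.init n (fun i -> float_of_int (i * 2)) in
  let c = vector_add_cpu ~block_size:256 a b in
  let blocks = blocks_for ~n ~block_size:256 in
  (* Prints: blocks = 4, c[10] = 30.000000 *)
  Printf.printf "blocks = %d, c[10] = %f\n" blocks c.(10)
```

With the input data used in the example, c[10] should be 10 + 20 = 30, which gives a quick sanity check for the GPU result.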

Shared Memory & Synchronization

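Sarek supports advanced GPU features like shared memory and barriers. Here is an example of a parallel reduction (summing a vector). The kernel assumes a block size of 256 threads and writes one partial sum per block to output; the host then adds the partial sums to obtain the total.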

let%kernel reduce_sum (input : float32 vector) (output : float32 vector) (n : int32) =
  (* Allocate shared memory for the thread block *)
  let%shared sdata = Array.create Float32 256 in
  
  let tid = thread_idx_x in
  let gid = get_global_id 0 in
  
  (* Load data into shared memory *)
  sdata.(tid) <- if gid < n then input.(gid) else 0.0;
  
  (* Synchronize all threads in the block *)
  barrier ();

  (* Tree reduction in shared memory *)
  let stride = ref 128 in
  while !stride > 0 do
    if tid < !stride then
      sdata.(tid) <- sdata.(tid) +. sdata.(tid + !stride);
    barrier ();
    stride := !stride / 2
  done;

  (* Write the block result to global memory *)
  if tid = 0 then
    output.(block_idx_x) <- sdata.(0)
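
To make the stride-halving pattern easier to follow, here is a plain OCaml sketch that emulates the kernel on the CPU without using Sarek. An ordinary array stands in for the block's shared memory, the inner loop stands in for the threads that run in parallel between two barriers, and the final fold is the host-side step that combines the per-block partial sums. The names reduce_block and reduce_sum_cpu are hypothetical and exist only for this illustration.

```ocaml
(* CPU emulation of one block's tree reduction. [sdata] plays the role of
   shared memory; its length must be a power of two (256 in the kernel). *)
let reduce_block sdata =
  let stride = ref (Array.length sdata / 2) in
  while !stride > 0 do
    (* On the GPU, these iterations are the threads with tid < stride;
       the end of the loop corresponds to the barrier. *)
    for tid = 0 to !stride - 1 do
      sdata.(tid) <- sdata.(tid) +. sdata.(tid + !stride)
    done;
    stride := !stride / 2
  done;
  sdata.(0)

(* Emulate the whole launch: each block loads its slice (zero-padded past
   the end of the input, like the kernel's bounds check), reduces it, and
   the host sums the resulting partials. *)
let reduce_sum_cpu ~block_size input =
  let n = Array.length input in
  let blocks = (n + block_size - 1) / block_size in
  let partials =
    Array.init blocks (fun block ->
        let sdata =
          Array.init block_size (fun tid ->
              let gid = (block * block_size) + tid in
              if gid < n then input.(gid) else 0.0)
        in
        reduce_block sdata)
  in
  Array.fold_left ( +. ) 0.0 partials

let () =
  (* 1 + 2 + ... + 1000 = 500500 *)
  let input = Array.init 1000 (fun i -> float_of_int (i + 1)) in
  Printf.printf "sum = %f\n" (reduce_sum_cpu ~block_size:256 input)
```

Each pass only reads from the upper half of the active range and writes to the lower half, which is why the sequential loop here gives the same result as the parallel version with a barrier after every pass.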

Compilation

Build your project with dune. Add a stanza like the following to the dune file next to your source:

(executable
 (name my_program)
 (libraries sarek spoc)
 (preprocess (pps sarek.ppx)))

Run it:

dune exec ./my_program.exe

Next Steps

Examples - Learn through practical examples (vector add, matrix multiply, reduction, transpose, mandelbrot)
Concepts - Understand Sarek’s design and programming model
Benchmarks - See performance data across different GPUs and backends
Backends - Learn about CUDA, OpenCL, Vulkan, and Metal support
API Documentation - Complete API reference