Advanced Topics

CUDA Setup

SITU defaults to a CPU-only llama.cpp image. Switching to GPU acceleration requires a one-line change in situ.conf, plus a compatible NVIDIA driver and the NVIDIA Container Toolkit on the host. Nothing else in SITU's configuration changes.

How the sidecar works

When a session starts, SITU spins up a llama.cpp container alongside the agent container inside the same Podman pod. This sidecar loads the model and serves it over the pod-internal network. The image used for that sidecar is controlled by the LLAMA_IMAGE parameter in situ.conf.

The default image uses CPU inference:

LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server
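To see which sidecar image a host is currently configured to use, the parameter can be read straight from the config file. The following is a minimal sketch, not part of SITU: the llama_image helper is hypothetical, it assumes situ.conf consists of plain KEY=value lines, and it assumes the default image shown above applies whenever the key is absent.

```shell
# Hypothetical helper, not part of SITU: print the LLAMA_IMAGE value from a
# situ.conf-style file, falling back to the documented default image.
llama_image() {
    conf="${1:-$HOME/.situ/situ.conf}"
    value=""
    if [ -r "$conf" ]; then
        # Assumes plain KEY=value lines; the last assignment wins.
        value=$(sed -n 's/^LLAMA_IMAGE=//p' "$conf" | tail -n 1)
    fi
    # Assumption: an unset key means the CPU default image is used.
    echo "${value:-ghcr.io/ggml-org/llama.cpp:server}"
}
```

Calling llama_image with no argument reads ~/.situ/situ.conf; passing a path reads that file instead.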

Prerequisites

NVIDIA GPU with CUDA compute capability 5.0 or later.
NVIDIA driver installed on the host (the container does not carry its own driver).
NVIDIA Container Toolkit: required for Podman to pass GPU devices into the container. Follow the official installation guide, then generate the CDI specification for Podman as described in the next section.
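A quick way to see which of these pieces are already present is to check for their command-line tools. This is a rough sketch with hypothetical helper names (check_tool, check_gpu_prereqs). It only confirms that the binaries are on PATH, not that the driver version or GPU is suitable.

```shell
# Hypothetical helpers: report whether each prerequisite's CLI is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

check_gpu_prereqs() {
    check_tool nvidia-smi   # installed with the NVIDIA driver
    check_tool nvidia-ctk   # installed with the NVIDIA Container Toolkit
    check_tool podman
}
```

On a host where nvidia-smi is present, reasonably recent drivers can also report the compute capability directly with nvidia-smi --query-gpu=name,compute_cap --format=csv. Older drivers may not support the compute_cap field.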

Configure Podman for GPU access

Once the NVIDIA Container Toolkit is installed, generate a CDI specification so Podman knows how to expose the GPU to containers:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Verify that Podman can reach the GPU before launching SITU:

podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi

If nvidia-smi prints the GPU table, SITU will pick up the GPU the next time you start a session with a CUDA image.
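If the podman run check fails, one thing worth confirming is that the generated CDI specification is actually visible. The nvidia-ctk cdi list command prints the device names it finds. The sketch below wraps that check in a hypothetical helper, has_all_gpus_device, which is not part of SITU or the toolkit.

```shell
# Hypothetical helper: read `nvidia-ctk cdi list` output on stdin and succeed
# only if the nvidia.com/gpu=all device name appears in it.
has_all_gpus_device() {
    grep -q 'nvidia\.com/gpu=all'
}

# Usage on a host with the toolkit installed:
#   nvidia-ctk cdi list | has_all_gpus_device && echo "CDI spec is visible"
```

If the device name is missing, regenerating the specification with the nvidia-ctk cdi generate command above is a reasonable first step.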

Enabling GPU acceleration

The llama.cpp project publishes a separate image with CUDA support. To use it, open ~/.situ/situ.conf and set:

LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server-cuda

That is the only change required in situ.conf. On the next session start, SITU pulls the CUDA image (first run only) and the sidecar runs inference on the GPU.
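For scripted setups, the same edit can be made without opening an editor. The following is a minimal sketch under stated assumptions: set_llama_image is a hypothetical helper, not a SITU command, and it assumes situ.conf uses plain KEY=value lines and that the image name contains no | or & characters.

```shell
# Hypothetical helper, not part of SITU: set LLAMA_IMAGE in a situ.conf-style
# file, replacing an existing assignment or appending one if none exists.
set_llama_image() {
    image="$1"
    conf="${2:-$HOME/.situ/situ.conf}"
    cp "$conf" "$conf.bak"   # keep a backup of the previous config
    if grep -q '^LLAMA_IMAGE=' "$conf"; then
        # Rewrite from the backup so no in-place sed flag is needed,
        # which keeps the command portable across GNU and BSD sed.
        sed "s|^LLAMA_IMAGE=.*|LLAMA_IMAGE=$image|" "$conf.bak" > "$conf"
    else
        echo "LLAMA_IMAGE=$image" >> "$conf"
    fi
}

# Usage:
#   set_llama_image ghcr.io/ggml-org/llama.cpp:server-cuda
```

After the next session start, podman ps --pod lists the running containers together with their pods, which is one way to confirm that the sidecar is using the CUDA image.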

macOS note

macOS does not support NVIDIA CUDA, so on Apple Silicon SITU runs inference with the default CPU image. Metal/MPS support may come in a future llama.cpp image variant.

Related

Configuration Reference: full reference for LLAMA_IMAGE and all other situ.conf parameters.
Benchmark: hardware and model performance results, including GPU-accelerated local LLM runs.
Tuning: Podman memory limits and other performance adjustments for local LLM inference.
