CUDA Setup
SITU defaults to a CPU-only llama.cpp image. Switching to GPU acceleration requires a one-line change in situ.conf, plus a compatible NVIDIA driver and the NVIDIA Container Toolkit on the host. Nothing else in SITU changes.
How the sidecar works
When a session starts, SITU spins up a llama.cpp container alongside the agent container inside the same Podman pod. This sidecar loads the model and serves it over the pod-internal network. The image used for that sidecar is controlled by the LLAMA_IMAGE parameter in situ.conf.
The default image uses CPU inference:
LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server
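To see which image a given configuration would select, the parameter can be read straight from the file. The sketch below assumes situ.conf holds plain KEY=VALUE lines, as in the example above, and that the last assignment wins. The helper names llama_image_from_conf and image_backend are hypothetical and not part of SITU.

```shell
#!/bin/sh
# Hypothetical helper, not part of SITU: print the LLAMA_IMAGE value from a
# KEY=VALUE style config file. Assumes the last assignment in the file wins.
llama_image_from_conf() {
    conf="$1"
    # Keep only LLAMA_IMAGE lines, strip the key, and take the last one.
    sed -n 's/^LLAMA_IMAGE=//p' "$conf" | tail -n 1
}

# Classify an image reference by its tag (assumption: GPU builds carry
# "cuda" somewhere in the reference, as the server-cuda tag does).
image_backend() {
    case "$1" in
        *cuda*) echo gpu ;;
        *)      echo cpu ;;
    esac
}
```

For example, image_backend "$(llama_image_from_conf ~/.situ/situ.conf)" would print cpu or gpu depending on the configured image.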
Prerequisites
- NVIDIA GPU with CUDA compute capability 5.0 or later.
- NVIDIA driver installed on the host (the container does not carry its own driver).
- NVIDIA Container Toolkit — required for Podman to pass GPU devices into the container. Follow the official installation guide, then configure the CDI device for Podman.
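A quick preflight can confirm that the host-side tools behind these prerequisites are on the PATH. This is a minimal sketch: require_tool and preflight are hypothetical helper names, and the check relies on nvidia-smi shipping with the driver and nvidia-ctk shipping with the NVIDIA Container Toolkit. It does not verify the compute capability.

```shell
#!/bin/sh
# Hypothetical helper, not part of SITU: report whether a command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the tools this page relies on; returns non-zero if any is missing.
preflight() {
    status=0
    for tool in nvidia-smi nvidia-ctk podman; do
        require_tool "$tool" || status=1
    done
    return "$status"
}
```

Run preflight after sourcing the snippet. On recent drivers, nvidia-smi --query-gpu=name,compute_cap --format=csv should also report the compute capability; older drivers may not recognize the compute_cap field.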
Configure Podman for GPU access
Once the NVIDIA Container Toolkit is installed, generate a CDI specification so Podman knows how to expose the GPU to containers:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Verify that Podman can reach the GPU before launching SITU:
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
If nvidia-smi prints the GPU table, SITU will pick up the GPU the next time you start a session with a CUDA image.
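If the check fails, it can help to confirm that the generated CDI specification actually exposes the device name used above. The sketch below filters the output of nvidia-ctk cdi list for a given name; has_cdi_device is a hypothetical helper, and it assumes device names are printed one per line.

```shell
#!/bin/sh
# Hypothetical helper, not part of SITU: succeed if the CDI device name given
# as $1 appears in the listing read from standard input.
has_cdi_device() {
    grep -qF -- "$1"
}

# Usage on a host with the toolkit installed (not executed here):
#   nvidia-ctk cdi list | has_cdi_device nvidia.com/gpu=all && echo "CDI device found"
```

If the name is missing, regenerating the specification with the nvidia-ctk cdi generate command shown earlier is a reasonable first step.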
Enabling GPU acceleration
The llama.cpp project publishes a separate image with CUDA support. To use it, open ~/.situ/situ.conf and set:
LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server-cuda
That is the only change required. On the next session start, SITU pulls the CUDA image (first run only) and the sidecar runs inference on the GPU.
macOS does not support NVIDIA CUDA. On Apple Silicon, keep the default CPU image; inference runs on the CPU. Metal/MPS support may come in a future llama.cpp image variant.
Related
- Configuration Reference — full reference for LLAMA_IMAGE and all other situ.conf parameters.
- Benchmark — hardware and model performance results including GPU-accelerated local LLM runs.
- Tuning — Podman memory limits and other performance adjustments for local LLM inference.