
3× Faster and Sharper Output. Same Model. Same Machine. — 10 Tuning Tips That Supercharge Your LLMs

By Andreas Burner, author of SITU · ~16 min read

Getting the best out of a local AI coding agent on limited hardware is as much about configuration as it is about hardware. The right inference settings — whether in llama.cpp, Ollama, LM Studio, or vLLM — can mean the difference between 0.3 tok/s and 6 tok/s on the same machine, or between hitting an out-of-memory crash mid-task and running a 14B model comfortably inside 32 GB.


This post documents the journey of a client engagement with a software company — rolling out SITU Agent as a local AI coding agent for their development team and systematically tuning llama.cpp parameters against real coding benchmarks to extract maximum speed and output quality from the hardware the developers already had.

Why parameter tuning matters — and why it is easy to get wrong

Most people who run local inference through llama.cpp, Ollama, LM Studio, or a similar tool accept the defaults and conclude that local LLM inference is just slow. In most cases, the bottleneck is configuration, not hardware.

The challenge is that these parameters do not act in isolation. Temperature interacts with sampling strategy and model behaviour. Context size determines KV cache allocation, which interacts with the number of parallel slots, which determines whether a 14B model fits in RAM or spills to disk at 0.3 tok/s. Thread count interacts with CPU topology in ways that can either double throughput or accidentally serialize your entire compute pipeline. Get two of these right and leave the third at a bad default, and you may be leaving most of the performance on the table.

There is also a quality dimension that is easy to overlook. Chasing raw tok/s numbers without understanding a parameter's effect on output can silently degrade the model — producing incorrect, repetitive, or truncated responses. A faster agent that produces worse code is not an improvement.

The tuning documented here was developed against SITU Agent's coding benchmarks: real multi-step tasks like chess implementations and multi-file refactors, not synthetic perplexity scores. Code quality and task completion rate were measured alongside raw speed. The underlying inference engine is llama.cpp running as a local server, which is also the engine that powers Ollama and LM Studio under the hood. The parameters described here apply across all three frontends and to any other llama.cpp-based backend, with per-engine syntax noted in each section.

The hope is that this saves others from the same trial and error. Parameters that look independent are often coupled; settings that work for Llama 3 may actively harm Qwen3; a default that is fine for a chat interface can cripple an agentic coding loop. There is a lot of knowledge in the llama.cpp issues and discussions that takes time to surface. This article is an attempt to distil the most impactful parts of it into a single place.


1. Temperature: stop using the default

What it does

Temperature controls how the model samples from its output distribution. High temperature produces diverse, creative text. Low temperature produces focused, near-deterministic output. Every major inference engine defaults to somewhere between 0.7 and 1.0 — well-suited for creative writing and general-purpose chat, and actively wrong for agentic coding tasks where reproducibility and consistency matter.

How to set it

For agentic coding, set it to 0.1. It produces near-deterministic outputs without crossing into true greedy decoding territory. Do not use 0.0: since llama.cpp PR #9897 (link) (November 2024), setting temperature to exactly zero in the default sampling pipeline no longer guarantees greedy behaviour and has been observed causing infinite repetition loops on some prompts.

llama.cpp: --temp 0.1
Ollama (Modelfile): PARAMETER temperature 0.1
Ollama (API): "options": {"temperature": 0.1}
LM Studio: Temperature slider in the chat preset or Advanced model settings
vLLM / OpenAI API: "temperature": 0.1 in the request body

Model-specific note: Qwen3's official model card (link) recommends 0.6 in thinking mode and explicitly warns against greedy decoding. Gemma-4 was calibrated at 1.0 during training — lowering it below 0.8 has been observed degrading Gemma-4's coding performance (link) specifically. Use 0.1 for reproducible benchmarking across models; align with the model card for production quality.
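The split between benchmarking and production values can be kept in one place. The following is a minimal sketch, not part of SITU Agent: the function name pick_temperature and the lowercase model-name patterns are hypothetical, and the values are the ones quoted in the note above.

```shell
# pick_temperature MODEL PURPOSE
# PURPOSE is "benchmark" (reproducibility) or "production" (model-card value).
# The model-name patterns are illustrative assumptions, not an official list.
pick_temperature() {
  if [ "$2" = "benchmark" ]; then
    echo "0.1"
    return 0
  fi
  case "$1" in
    qwen3*)   echo "0.6" ;;  # Qwen3 model card, thinking mode
    gemma-4*) echo "1.0" ;;  # Gemma-4 was calibrated at 1.0
    *)        echo "0.1" ;;  # this article's default for agentic coding
  esac
}

pick_temperature "qwen3-14b" production
```

Pass the result to whichever engine syntax applies, for example --temp "$(pick_temperature gemma-4-e4b benchmark)".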

What improved

Temperature is applied post-logit as a scalar divide before softmax — it costs essentially zero compute. The gain is not speed but reproducibility. Benchmark results became directly comparable across runs. SITU coding benchmarks at T=0.1 align with published HumanEval leaderboard scores (Gemma-4 HumanEval 82.7% at T=0.1), making external validation straightforward. At T=0.8, the same prompt produces meaningfully different code 30–40% of the time.


2. Batch sizing on Apple Silicon: raise num_batch to 2048

What it does

The batch size (called -b / -ub in llama.cpp, num_batch in Ollama) controls how many tokens are processed in a single compute pass during prompt evaluation (prefill). On Apple Silicon with Metal, the micro-batch size (-ub) determines the Metal kernel dispatch size. At the default of 512, Metal threadgroups run with partial occupancy — most of the GPU sits idle during the prefill phase.

How to set it

The value 2048 maps to Metal SIMD group occupancy patterns and is the community-validated optimum (link) for most Apple Silicon chips. On 16 GB Macs running large models, drop to 1024 to avoid compute buffer out-of-memory errors. GPU layers should always be set to 99 on Apple Silicon — unified memory means leaving layers on CPU gives no memory headroom benefit.

llama.cpp: -b 2048 -ub 2048 --n-gpu-layers 99
Ollama (Modelfile): PARAMETER num_batch 2048 and PARAMETER num_gpu 99
LM Studio: "Batch Size" and "GPU Layers" sliders in Advanced model settings
vLLM: batching is managed internally; --max-num-batched-tokens is the closest analogue but is usually left at auto

One caveat: Qwen2.5-27B was observed running faster at num_batch=64 than at 2048 on some chips. This is unusual but real. Always benchmark your specific model and chip combination rather than blindly applying 2048 to everything.
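One way to act on that advice is a small sweep with llama-bench, the benchmarking tool that ships with llama.cpp. The sketch below only prints the commands so they can be reviewed before running. The function name, the model path, and the choice of candidate sizes are assumptions; a 4096-token prompt with no generation (-p 4096 -n 0) is used here to isolate prefill.

```shell
# batch_sweep_commands MODEL_PATH
# Prints one llama-bench command per candidate batch size instead of running
# them. -p 4096 -n 0 measures prompt processing (prefill) only.
batch_sweep_commands() {
  for b in 64 512 1024 2048; do
    echo "llama-bench -m $1 -p 4096 -n 0 -b $b -ub $b"
  done
}

# ./models/example.gguf is a placeholder path.
batch_sweep_commands ./models/example.gguf
```

Run the printed commands one at a time and keep the batch size with the highest prompt-processing tok/s for that model and chip.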

What improved

A 4K-token prompt that previously took ~8 seconds to process dropped to ~3 seconds — a 2.7× prefill speedup. The gain scales with prompt length; prompts under ~512 tokens see no benefit. For a coding agent that routinely prefills 8K–32K context windows, this compounds significantly across a multi-step task.


3. mlock: prevent mid-decode speed collapse on macOS

What it does

macOS's memory compressor is aggressive. It begins compressing heap pages well before swap is hit, and the KV cache — a regular heap allocation — is vulnerable to silent compression even during active inference. When this happens mid-decode, generation speed drops by an order of magnitude while the system decompresses. mlock pins the KV allocation in physical RAM, preventing this entirely.

This is a macOS-specific concern. Linux with swap disabled, or a system with sufficient free RAM, does not have this failure mode.

How to set it

Use this when (model_size + KV_cache + 4 GB OS overhead) < 70% of total unified memory. If you are near the memory boundary, leave it off — aggressive locking can cause system-wide thrashing if the system runs out of physical pages to lock.

llama.cpp: --mlock
Ollama (Modelfile): PARAMETER use_mlock true (the option is named use_mlock; availability varies by Ollama version)
LM Studio: "Keep model in memory" toggle in Advanced model settings
vLLM: Not applicable. vLLM keeps weights and KV cache in GPU memory, so mlock is irrelevant
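The 70% rule above is simple enough to check in a launch script. This is a minimal sketch with a hypothetical helper name (should_mlock); it takes whole gigabytes because POSIX shell arithmetic is integer-only, so round sizes up.

```shell
# should_mlock MODEL_GB KV_GB TOTAL_GB
# Succeeds (exit 0) when model + KV cache + 4 GB OS overhead is under 70%
# of total unified memory. Integer gigabytes only.
should_mlock() {
  used=$(( $1 + $2 + 4 ))
  [ $(( used * 100 )) -lt $(( $3 * 70 )) ]
}

# Example: 9 GB weights, 7 GB KV cache, 32 GB machine.
if should_mlock 9 7 32; then
  echo "--mlock"
else
  echo "leave mlock off"
fi
```

With the same model on a 16 GB machine the check fails, which matches the advice to leave locking off near the memory boundary.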

What improved

Without mlock, decode speed drops from 20+ tok/s to under 2 tok/s under memory pressure on M2 Max — a confirmed 10× regression. With it enabled, generation speed stays stable for the full output. Long coding tasks — multi-file refactors, 10K+ token outputs — are exactly the workload that triggers this failure mode.


4. CPU thread count: pin to physical cores, not logical CPUs

What it does

Tools like nproc return logical CPUs — physical cores plus hyperthreads. LLM decode is memory-bandwidth bound, not compute bound. Assigning all logical threads causes hyperthreads to compete for the same L3 cache bandwidth, reducing effective bandwidth per physical core. The result is often worse throughput than using half the thread count.

On Intel hybrid CPUs (P-cores + E-cores, everything from Alder Lake onward), the problem is more severe. llama.cpp synchronizes its worker threads with busy-waiting spinlocks: P-cores finish their share of the work early and spin-wait, so every step is rate-limited by the slowest E-core. This is not a subtle difference: it was measured at a 2.4–3× throughput collapse on an i7-12700H.

How to set it

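On Linux, read the physical core count from /proc/cpuinfo directly, rather than relying on nproc or any tool's auto-detect. On hybrid Intel CPUs the "cpu cores" field counts P-cores and E-cores together, so use the P-core count from the CPU's spec sheet instead. On Windows, use the "Cores" value (not "Logical processors") from Task Manager → Performance → CPU.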

phys_cores=$(grep -m1 "cpu cores" /proc/cpuinfo | awk '{print $4}')
-t "${phys_cores}"
llama.cpp: -t <physical_core_count>
Ollama (Modelfile): PARAMETER num_thread 6 (substitute your physical core count)
Ollama (env): OLLAMA_NUM_THREADS=6
LM Studio: "CPU Threads" slider in Advanced settings; set to physical core count, not "Max"
vLLM: Not applicable. vLLM is GPU-native and CPU threading is not user-configurable; apply this tip to CPU-only inference engines
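The "cpu cores" field reports cores per socket, so it undercounts on multi-socket machines. A slightly more defensive sketch, assuming an x86 Linux /proc/cpuinfo that lists "physical id" before "core id" in each processor block, counts unique socket/core pairs instead. The function name is hypothetical, and on hybrid Intel CPUs the result still includes E-cores.

```shell
# count_physical_cores [CPUINFO_FILE]
# Counts unique (physical id, core id) pairs, so hyperthreads are not
# double-counted and multi-socket machines are summed correctly.
count_physical_cores() {
  awk -F: '
    /^physical id/ { socket = $2 }
    /^core id/     { seen[socket "/" $2] = 1 }
    END            { n = 0; for (k in seen) n++; print n }
  ' "${1:-/proc/cpuinfo}"
}

# Usage on Linux:  -t "$(count_physical_cores)"
```

Architectures whose /proc/cpuinfo lacks these fields (many ARM boards, for example) will report 0, so treat a zero as "detect manually".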

What improved

On an Intel i7-12700H (6 P-cores + 8 E-cores): restricting to 6 P-cores produced a 2.4× speedup on LLaMA 7B Q4_0 (2.1 → 5.0 tok/s) and a 3× speedup on LLaMA 65B Q4_0 (0.25 → 0.74 tok/s). AMD CPUs see a more modest 5–15% gain from physical-only threading, but still benefit.

A related data point worth noting: upgrading from single-channel to dual-channel RAM (same total capacity, different slot configuration) improved a 34B model from 1.5 tok/s to 4.0 tok/s — a 2.7× gain from memory bandwidth alone. Thread tuning and memory configuration are complementary; fixing one without the other leaves significant performance on the table.


5. Generation ceiling: raise max_tokens to 16384

What it does

max_tokens (known as n_predict in llama.cpp, num_predict in Ollama) is the per-call generation budget — the maximum number of tokens the model will generate before stopping. The llama.cpp raw default is -1 (unlimited), which sounds generous but is equally undesirable — an uncapped generation can stall an agentic pipeline indefinitely if the model enters a repetition loop, and it makes latency completely unpredictable. Most frontends, agent frameworks, and configuration templates therefore ship with a conservative cap — often 4096. That sounds large until you consider that a complete chess implementation runs approximately 11,000 tokens.

When the budget is too low, the model cuts off mid-generation. Agent frameworks then fall into retry loops: detect incomplete output, re-prompt with context, attempt to continue. Each retry consumes a full prompt evaluation round-trip. One truncation event can multiply the LLM call count for a single task by 3–5×, collapsing effective throughput regardless of how fast the underlying inference is.

How to set it

Unlike context size, the generation limit has no memory cost at all. KV cache allocation is driven by --ctx-size, not by max_tokens. Setting max_tokens=32768 when the model stops after 200 tokens via an EOS signal costs nothing — only a trivial per-token counter check.

llama.cpp: --n-predict 16384
Ollama (Modelfile): PARAMETER num_predict 16384
Ollama (API): "options": {"num_predict": 16384}
LM Studio: "Response Token Limit" in the generation settings panel
vLLM / OpenAI API: "max_tokens": 16384 in the request body

Known bug: in llama.cpp, when generation is cut off by max_tokens, the finish_reason field incorrectly returns "stop" instead of "length" (Issue #8856 (link), open as of mid-2026). Frameworks that check finish_reason to detect truncation cannot rely on this signal. The workaround is to compare usage.completion_tokens directly against the requested max_tokens value.
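The workaround amounts to a single comparison. A minimal sketch, assuming the caller has already extracted usage.completion_tokens from the response (for example with a JSON tool of its choice); the helper name is hypothetical.

```shell
# is_truncated COMPLETION_TOKENS MAX_TOKENS
# Succeeds (exit 0) when the response used the whole budget, which is
# treated here as truncation regardless of what finish_reason says.
is_truncated() {
  [ "$1" -ge "$2" ]
}

if is_truncated 16384 16384; then
  echo "truncated: continue the generation or raise max_tokens"
fi
```

This is a heuristic: a response that happens to end naturally on exactly the last budgeted token is also flagged, which is the safer direction to err in.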

What improved

Raising the limit to 16,384 on the chess benchmark produced a complete, working implementation — 1,054 lines in a single call — where a conservative 4,096 setting had been delivering 30–50 truncated lines. For reference, pi-mono's own documented default (link) for reservedOutputTokens is 16,384, which the framework authors treat as the minimum adequate response budget for non-trivial coding tasks. Anything lower is leaving significant capability on the table.


6. Sync the agent's contextWindow to the server's actual context size

What it does

Agent frameworks maintain their own contextWindow setting — independent of the inference engine — to decide when to compact or summarize the running session. Think of it as the framework's internal view of how much context it has available. If that view does not match what the inference engine actually allocated, the two are working with different assumptions, and things break in both directions.

If contextWindow understates the server's actual context size: the agent compacts aggressively and unnecessarily, discarding in-progress work mid-task. If it overstates: the agent holds more context than the server can handle, and the server returns an error — "request (66,202 tokens) exceeds the available context size (65,536 tokens)" — that most frameworks do not auto-recover from.

How to set it

The formula is simple — and must be honoured by every layer in the stack:

contextWindow = ctx_size / parallel_slots

# single-user (parallel_slots = 1):
contextWindow = ctx_size
llama.cpp: pass --ctx-size N to the server; set the agent's contextWindow to the same N
Ollama (Modelfile): PARAMETER num_ctx N; set the agent's contextWindow to the same N
LM Studio: read "Context Length" from Advanced settings and use that value in the agent config
General rule: derive both values from a single source (one env var or config entry) so they can never drift apart

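If you use Ollama with a Modelfile that sets num_ctx 40960 and your agent's context window is declared at 64000, the agent overstates what the server can hold. Compaction will not fire until roughly 51K tokens (about 80% of the declared window), well after the server's 40,960-token limit has already been exceeded, so the server truncates or rejects the prompt before the agent ever compacts.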
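The single-source rule can be sketched as a launch wrapper that computes the agent's value from the server's settings. The variable and function names below are hypothetical, and how the resulting contextWindow reaches your agent framework depends on that framework.

```shell
# agent_context_window CTX_SIZE PARALLEL_SLOTS
# Applies contextWindow = ctx_size / parallel_slots (integer division).
agent_context_window() {
  echo $(( $1 / $2 ))
}

CTX_SIZE=65536
PARALLEL_SLOTS=1
CONTEXT_WINDOW=$(agent_context_window "$CTX_SIZE" "$PARALLEL_SLOTS")

echo "server: --ctx-size ${CTX_SIZE} --parallel ${PARALLEL_SLOTS}"
echo "agent:  contextWindow=${CONTEXT_WINDOW}"
```

Because both lines are generated from the same two variables, changing the server's context size cannot leave the agent with a stale value.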

What improved

Context compaction summarization discards approximately 60% of session facts and 54% of project constraints (arXiv 2602.22402 (link) — a measured result, not an estimate). For a multi-file coding task, premature compaction means the agent loses its working model of which files exist, what code was already written, and what constraints were established. Fixing the mismatch eliminated mid-task context loss on long refactoring sessions entirely.


7. Context size: auto-detect from model metadata, don't hardcode

What it does

Every GGUF model embeds its training context length (n_ctx_train) in its file metadata. The correct context size to allocate is that value. Overallocating wastes RAM; underallocating truncates the model's effective window and triggers premature compaction as described above.

The waste from overallocation is not trivial. Qwen3-14B has n_ctx_train = 40,960. Running it at ctx=64K wastes approximately 3.6 GB of KV cache. Running it at ctx=180K, a value some configurations recommend for headroom, wastes 21.7 GB: the total allocation is 4.4× what the model can use. Underallocation is just as real: Gemma-4 E4B has n_ctx_train = 131,072, so a hardcoded ctx=64K was leaving half its context window unused.
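Because KV cache size grows linearly with the allocated context, the relative waste follows from the two context lengths alone. A minimal sketch, assuming "64K" and "180K" mean 64,000 and 180,000 tokens; the function name is hypothetical.

```shell
# kv_overalloc_percent CTX_SIZE N_CTX_TRAIN
# Prints how much larger the KV allocation is than needed, in percent
# (integer division, so the result is rounded toward zero).
kv_overalloc_percent() {
  echo $(( ($1 - $2) * 100 / $2 ))
}

kv_overalloc_percent 64000 40960    # ctx=64K against Qwen3-14B's n_ctx_train
kv_overalloc_percent 180000 40960   # ctx=180K against the same model
```

The second call reports several hundred percent of excess allocation, which is where a total of roughly 4.4× the needed KV memory comes from.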

How to set it

For llama.cpp, query the running server's /v1/models endpoint after startup — it returns n_ctx_train in a meta field. Do not rely on --ctx-size 0 to auto-detect: in earlier llama.cpp versions, this silently fell back to 4096 with no warning (Issue #18376 (link)).

CTX_SIZE=$(curl -sf "${LM_SERVER_BASE_URL}/models" | node -e "
    const m = JSON.parse(require('fs').readFileSync('/dev/stdin','utf8'));
    const n = m.data?.[0]?.meta?.n_ctx_train;
    process.stdout.write(n ? String(n) : '65536');
")
llama.cpp: --ctx-size <n_ctx_train> (query via /v1/models → meta.n_ctx_train)
Ollama (Modelfile): PARAMETER num_ctx 40960 (check the model page on ollama.com for the correct value)
LM Studio: "Context Length" in Advanced model settings; model cards list the training context length
vLLM: --max-model-len 40960 at startup; if unset, vLLM derives the limit from the model config (max_position_embeddings)

What improved

Zero-config model switching: changing from Gemma-4 to Qwen3-14B requires no manual edit to any configuration file. Overallocation beyond available VRAM forces CPU offload for the overflow, reducing prefill throughput by approximately 32%. Auto-detection eliminates this penalty without any per-model tuning.


8. Parallel slots: set to 1 for single-user deployment

What it does

Inference servers that support concurrent requests allocate KV cache for all parallel slots upfront at startup — not on demand. This is a fundamental difference from vLLM's paged attention architecture. In llama.cpp and Ollama, the default parallel slot count can be 4 or higher depending on the version. With 4 slots and a 64K context, you are allocating 4× the KV memory you actually need for single-user inference.

The failure scenario on a 32 GB machine running Qwen3-14B at 64K context with 4 slots:

~7 GB KV cache × 4 slots  = 28 GB KV
+ ~9 GB model weights
= 37 GB  →  OS begins disk swapping

Observed throughput in that state: 0.34–0.36 tok/s. Expected throughput in-RAM: 3–6 tok/s. That is not a performance degradation — it is the system being effectively unusable.
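The arithmetic behind that failure scenario is worth scripting as a pre-flight check. A minimal sketch using the rounded figures above; the helper name is hypothetical and the inputs are whole gigabytes.

```shell
# total_memory_gb KV_GB_PER_SLOT SLOTS WEIGHTS_GB
# Upfront memory = per-slot KV cache x slot count + model weights.
total_memory_gb() {
  echo $(( $1 * $2 + $3 ))
}

RAM_GB=32
for slots in 4 1; do
  need=$(total_memory_gb 7 "$slots" 9)
  if [ "$need" -gt "$RAM_GB" ]; then
    echo "${slots} slot(s): ${need} GB needed, exceeds ${RAM_GB} GB"
  else
    echo "${slots} slot(s): ${need} GB needed, fits in ${RAM_GB} GB"
  fi
done
```

The check ignores OS and application overhead, so leave additional headroom beyond a bare "fits".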

How to set it

llama.cpp: --parallel 1 (verify the startup log shows n_slots = 1)
Ollama (env): OLLAMA_NUM_PARALLEL=1
LM Studio: Single-user mode by default; no parallel slot configuration exposed in the UI
vLLM: Not applicable. vLLM uses paged attention, so KV is dynamically allocated, not pre-reserved per slot

Bug warning: on some llama.cpp builds, --parallel 1 still logs n_slots = 4 and allocates 4× KV memory (Issues #17300 (link) and #17989 (link)). Always verify the startup log line reads srv init: initializing slots, n_slots = 1. If it shows 4, the flag was silently ignored.
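That verification can be automated if the server's output is captured to a file. A minimal sketch, assuming the log contains the n_slots line quoted above; the helper name and the log path are hypothetical.

```shell
# check_n_slots LOGFILE EXPECTED
# Succeeds (exit 0) when the log reports exactly EXPECTED slots. The trailing
# group stops "n_slots = 1" from matching a line that says "n_slots = 16".
check_n_slots() {
  grep -Eq "n_slots = ${2}([^0-9]|\$)" "$1"
}

# Usage, with the server log redirected to ./server.log:
#   check_n_slots ./server.log 1 || echo "--parallel 1 was ignored" >&2
```

Failing the launch script on a mismatch is cheaper than discovering the 4× allocation through swapping.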

What improved

Reducing from 4 slots to 1 dropped KV allocation by 4×, bringing a 32 GB system from disk-swap territory into fully in-RAM operation. A Gemma 2 9B benchmark (link) confirmed the scaling is exactly linear: 1 slot = 1,344 MiB, 4 slots = 5,376 MiB. For single-user deployment there is no downside — idle extra slots add no per-token compute cost, only wasted memory.


9. KV cache quantization on CPU: Q8_0 cuts memory by 47%

What it does

The KV cache stores the key/value attention tensors that accumulate as the model generates tokens. At the default F16 precision, each cached element costs 2 bytes, so one token costs 2 × n_kv_heads × head_dim × 2 bytes per layer (keys plus values). At 128K context with a 14B model, the KV cache alone exceeds the model's own weight footprint. Q8_0 quantizes those tensors to 8-bit integers, cutting KV memory by ~47% with negligible quality loss.

This is a parameter where the correct setting differs by hardware — and getting it wrong on GPU can subtly degrade output quality even if the system doesn't crash.

How to set it

Apply Q8_0 on CPU inference only:

llama.cpp (CPU): --cache-type-k q8_0 --cache-type-v q8_0
Ollama (env): OLLAMA_KV_CACHE_TYPE=q8_0 (requires OLLAMA_FLASH_ATTENTION=1). Support varies by Ollama version; check release notes
LM Studio: "KV Cache Type" dropdown in Advanced settings (Q8_0 option, version 0.3.x+)
vLLM (GPU): --kv-cache-dtype fp8. fp8 is vLLM's equivalent; note quality implications differ from Q8_0

On Nvidia CUDA specifically: if VRAM headroom is available, keep the KV cache at F16. Q8_0 quantization introduces small rounding differences that accumulate over long generations and can quietly shift which token the model picks next — enough to change agent behaviour at low temperatures like T=0.1. The memory saving is not worth that risk when VRAM is not the constraint. On other GPU architectures the picture may differ; benchmark your specific setup before applying Q8_0 there.

Why not Q4_0: Q4_0 requires nibble extraction (bit shifts, masks, sign extension) per element during dequantization. At 64K+ context, this cost scales linearly with context length on every generated token and can dominate the bandwidth savings. A DGX Spark benchmark (link) found Q4_0 decode throughput degrading 36.8% vs F16 at 110K tokens even on GPU; the CPU penalty is worse. Q8_0 on x86/AVX2 has no measurable speed regression: the 8-bit dequantize is a single multiply by the block scale, which AVX2 handles near-natively.

What improved

A 128K context window with a 7B model: KV cache drops from ~16 GB (F16) to ~8.5 GB (Q8_0), making large-context CPU inference feasible on 16 GB systems. On CPU specifically, Q8_0 KV improves generation throughput vs F16 because the memory bandwidth reduction outweighs the trivial dequantization cost — the opposite of GPU behaviour at long context. Perplexity impact: approximately 0.005 PPL increase vs F16, confirmed across multiple independent sources as "undetectable in conversational use."
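The 16 GB and 8.5 GB figures can be reproduced from the model shape. The sketch below assumes a hypothetical 7B-class model with 32 layers, 8 KV heads, and a head dimension of 128; check your model's metadata for the real values. Q8_0 stores each block of 32 values as 32 int8 numbers plus one 2-byte scale, 34 bytes in total. The function name is illustrative, and the arithmetic needs a shell with 64-bit integers.

```shell
# kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM N_CTX TYPE
# TYPE is f16 (2 bytes per element) or q8_0 (34 bytes per 32 elements).
kv_cache_mib() {
  elements=$(( 2 * $1 * $2 * $3 * $4 ))   # keys + values, all layers
  case "$5" in
    f16)  bytes=$(( elements * 2 )) ;;
    q8_0) bytes=$(( elements / 32 * 34 )) ;;
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}

kv_cache_mib 32 8 128 131072 f16    # 128K context at F16
kv_cache_mib 32 8 128 131072 q8_0   # same context at Q8_0
```

The ratio between the two results is 34/64, which is the ~47% saving quoted in this section.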


10. Reasoning budget — cap thinking tokens at 25% of the generation limit

What it does

Models like Qwen3 and Gemma-4 support "thinking mode" — an extended reasoning phase inside <think> blocks before the final response. For complex tasks this genuinely improves output quality. The problem: thinking tokens count against the generation budget. Without a cap, the model can spend 8,000 tokens reasoning about a problem and leave only 8,000 tokens for the actual code — insufficient for a large implementation.

Harder to see: a hard budget cutoff with no handoff message produces output that is worse than no thinking at all. A Qwen3-9B HumanEval benchmark (link) measured: full thinking = 94%, no thinking = 88%, hard cutoff with no message = 78%. The model's reasoning is interrupted mid-thought with no signal to start outputting, producing incoherent partial analysis followed by abrupt generation.

How to set it

The 25% figure comes from community benchmarking on Qwen3-9B HumanEval (link): models capture the core of their reasoning in roughly the first quarter of their thinking tokens, with diminishing returns beyond that. With MAX_TOKENS=16384, a 25% cap allocates 4,096 tokens for planning — enough for meaningful reasoning on complex tasks — while leaving 12,288 for output. Always pair the cap with a handoff message:

# llama.cpp server
reasoning_budget=$(( MAX_TOKENS * 25 / 100 ))
--reasoning-budget "${reasoning_budget}" \
--reasoning-budget-message $'\n\nLet me now write the solution.'
llama.cpp: --reasoning-budget N with --reasoning-budget-message "..."
Ollama (API): "options": {"thinking_budget_tokens": N} in the request body; not available as a Modelfile parameter
LM Studio: "Thinking Budget" field in the reasoning model settings panel
vLLM / OpenAI API: "max_thinking_tokens": N in the request body for compatible models

Default reasoning to off. Most coding tasks — file edits, short scripts, straightforward queries — do not benefit from extended thinking. Enabling it by default adds token overhead to every call. Opt in explicitly for complex multi-step tasks.

What improved

With budget + handoff message: 89% HumanEval — nearly matching full-thinking quality at a fraction of the token cost. Hard-capped without a message: 78%, worse than no thinking at all. The handoff message is not optional if you want to use a thinking cap.


Bonus: OpenShift AI and vLLM on Kubernetes

Most of the parameters above apply unchanged when the inference backend is vLLM running on OpenShift AI. A few translate differently at cluster scale — and a few matter more, not less.

What carries over unchanged. Temperature (Tip 1) and max_tokens (Tip 5) are request-level parameters. Any client code or agent configuration that sets these correctly against a local llama.cpp server works identically against a vLLM endpoint behind an OpenShift Route, with no changes required. Context window sync (Tips 6 and 7) remains critical: vLLM's --max-model-len at deployment time is the cluster-wide ceiling, and it must match what agent frameworks declare as their contextWindow. A misconfigured InferenceService silently serves requests that crash mid-task when a long context hits the cap. Reasoning budget (Tip 10) becomes more important in multi-tenant deployments, because unbounded thinking tokens consume slot time across all users. Cap them at the serving layer with vLLM's --max_thinking_tokens or enforce the limit at the gateway.

What changes. The parallel-slot memory problem (Tip 8) disappears entirely. vLLM's paged attention allocates KV cache dynamically per sequence — there is no pre-reserved per-slot overhead, and the 4× memory blowup that kills a local llama.cpp deployment simply does not occur. Scale replica count horizontally instead. KV cache quantization (Tip 9): on H100/A100 with ample VRAM, leave the KV cache at F16. If VRAM is the constraint, vLLM's --kv-cache-dtype fp8 gives a similar footprint reduction to Q8_0, with the same quality caveat at low temperatures — measure before committing. CPU thread tuning (Tip 4) translates to Kubernetes resource requests: set resources.requests.cpu to the physical core count of the node type, not the node's total vCPU capacity. Server-class x86 (Xeon, EPYC) uses uniform SMT without P-core/E-core asymmetry, so the dramatic 3× collapse seen on hybrid Intel desktop CPUs is unlikely — but over-subscribing vCPUs in a pod still introduces scheduling jitter that degrades decode latency under load.


Summary: The combined picture

The striking result from the client engagement was not any single parameter — it was the combination. Tasks that previously required 4–5 LLM calls due to truncation and out-of-memory restarts completed in 1–2 calls on the same hardware. Speed improved by a multiple. Output quality went up. None of it required a hardware upgrade.

Most of these settings take under a minute to change. The payoff is immediate and measurable. Start with the ones that match your hardware, apply them one at a time, and watch the numbers move.

Contribute to this guide. If you have measured results on hardware or model combinations not covered here — AMD ROCm, NVIDIA Pascal-era cards, other Apple Silicon chips, Llama 4, Mistral Small 3 — share them. If any of the Ollama or LM Studio equivalents are wrong or out of date for a newer version, a correction is equally valuable. Leave a comment, or send an email to support@situagent.com. The goal is a community resource that stays current as the tooling evolves.