How we deploy llama.cpp at PhotonSpark
Our llama.cpp deployment is built around llama-cpp-autodeploy: a small Python and web control plane that builds upstream llama.cpp, launches GGUF models, checks GPU memory, streams logs, recovers running servers, and keeps the operational surface readable.
Running llama-server by hand is fine once. It gets old fast.
The hard part is not starting a model. The hard part is keeping the workflow repeatable after you change a quant, rebuild llama.cpp, move a model between GPUs, test a new MTP flag, or restart the control process while a server is still running.
That is why we use llama-cpp-autodeploy. It is our small control plane for llama.cpp. It builds the runtime, launches models, watches GPU pressure, keeps logs, exposes a browser UI, and can recover llama-server processes that outlive the backend.
This is the shape we want for private inference: simple enough to debug from a terminal, structured enough to run more than one model without turning the host into a pile of one-off scripts.
What the repo does
The repo wraps the llama.cpp lifecycle in four layers:
| Layer | Job |
|---|---|
| autodevops.py | Clone and build upstream ggml-org/llama.cpp at a chosen tag, branch, commit, or latest release |
| loadmodel.py | Launch LLM, embedding, and reranker endpoints with GGUF download support |
| memory_utils.py | Inspect GPU/RAM state and estimate placement before launch |
| web/backend + web/frontend | Browser control plane for builds, model library, instances, logs, memory, and benchmarks |
There are CLI/TUI entry points too: autodevops_cli.py, loadmodel_cli.py, loadmodel_dist_cli.py, and rpc_server_cli.py. The browser UI uses the same underlying code paths. That matters. We do not want one workflow for operators and another workflow for the web panel.
The build path
The builder starts from upstream llama.cpp, not a vendored fork. autodevops.py points at https://github.com/ggml-org/llama.cpp and can build a specific ref:
python autodevops.py --ref latest --now
python autodevops.py --ref 764f1e6 --now --distributed
The default runtime layout is outside the repo:
~/llama-runtime/
  llama-builds/    # versioned source/build trees
  llama-current    # symlink to the active build
  bin/             # stable symlinks: llama-server, llama-bench, rpc-server, ...
  models/          # standalone CLI model cache unless overridden
That separation is intentional. The repo stays code. Runtime artifacts live in a runtime directory. The stable bin/ symlinks mean launchers can keep calling the same binary path while the active build changes underneath.
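The stable-symlink idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the repo's code: the function name activate_build and the temporary-link naming are ours for this example. Only the layout (a bin/ directory of symlinks pointing into a versioned build tree) comes from the description above.

```python
import os
from pathlib import Path


def activate_build(build_bin: Path, runtime_bin: Path, names: list[str]) -> None:
    """Point runtime_bin/<name> at build_bin/<name> for every binary that exists.

    Hypothetical helper: each link is created under a temporary name and then
    renamed over the old one, so callers never see a missing binary.
    """
    runtime_bin.mkdir(parents=True, exist_ok=True)
    for name in names:
        target = build_bin / name
        if not target.exists():
            continue  # this build did not produce the binary (e.g. no rpc-server)
        link = runtime_bin / name
        tmp = runtime_bin / f".{name}.tmp"
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()  # clean up after an interrupted earlier run
        os.symlink(target, tmp)
        os.replace(tmp, link)  # rename swaps the link in a single step on POSIX
```

A launcher that always calls ~/llama-runtime/bin/llama-server picks up the new build on its next start, and processes that are already running keep the binary they started with.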
The build script handles the common GPU host details we care about:
- checks for git, cmake, compilers, pkg-config, and NVIDIA tooling
- finds CUDA_HOME or an installed CUDA toolkit
- detects GPU compute capability from nvidia-smi
- builds with GGML_CUDA=ON
- enables MMQ automatically on newer NVIDIA architectures unless overridden
- chooses MKL, OpenBLAS, or no BLAS
- can build GGML RPC support with --distributed
- symlinks built binaries into the runtime bin/
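Two of those checks are easy to picture in code. The sketch below is not taken from autodevops.py. It assumes a driver recent enough that nvidia-smi supports the compute_cap query field, and the helper names (missing_tools, parse_compute_caps, cmake_args) are hypothetical. GGML_CUDA and CMAKE_CUDA_ARCHITECTURES are the real CMake options.

```python
import shutil
import subprocess


def missing_tools(tools=("git", "cmake", "pkg-config", "nvidia-smi")) -> list[str]:
    """Return the tools from the list that are not on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def parse_compute_caps(smi_output: str) -> list[str]:
    """Turn lines like '8.6' into CMake-style architecture strings like '86'."""
    archs: list[str] = []
    for line in smi_output.splitlines():
        major, _, minor = line.strip().partition(".")
        try:
            arch = f"{int(major)}{int(minor or 0)}"
        except ValueError:
            continue  # blank line or a value such as "[N/A]"
        if arch not in archs:
            archs.append(arch)  # keep one entry per distinct architecture
    return archs


def detect_cuda_archs() -> list[str]:
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_caps(out)


def cmake_args(archs: list[str]) -> list[str]:
    """Build the CUDA-related CMake arguments for the detected architectures."""
    args = ["-DGGML_CUDA=ON"]
    if archs:
        args.append("-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(archs))
    return args
```

Keeping the parsing separate from the subprocess call means the logic can be tested on a machine without a GPU.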
There is also a CUDA iterator compatibility patch path for one upstream CUDA source issue. That is exactly the kind of practical scar tissue we want in a deployment tool: small, visible, and close to the build that needs it.
The launch path
loadmodel.py is the terminal launcher. It supports three modes:
python loadmodel.py --llm ./models/model.gguf --port 45540
python loadmodel.py --embed Qwen/Qwen3-Embedding-8B-GGUF:Q8_0 --port 45541
python loadmodel.py --rerank Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 45542
For LLM and embedding mode it runs llama-server. For reranking it runs a small Transformers-backed HTTP service shaped like an embeddings/rerank endpoint.
Model input can be a local GGUF path or a Hugging Face shorthand like:
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL
If the file is not local, the launcher lists repo files, chooses the matching GGUF, downloads it with huggingface_hub, and then starts the server.
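A rough sketch of that resolution step, assuming the shorthand is always repo:quant and that the quant tag appears in the GGUF filename. The function names are hypothetical and the matching rule is a guess at the behavior, not a copy of loadmodel.py. list_repo_files and hf_hub_download are the real huggingface_hub calls.

```python
from __future__ import annotations


def split_model_ref(ref: str) -> tuple[str, str | None]:
    """'org/repo:QUANT' -> ('org/repo', 'QUANT'); without a colon -> (ref, None)."""
    repo, sep, quant = ref.partition(":")
    return repo, (quant if sep else None)


def pick_gguf(files: list[str], quant: str | None) -> str:
    """Choose a .gguf file whose name contains the quant tag (case-insensitive)."""
    ggufs = sorted(f for f in files if f.lower().endswith(".gguf"))
    if quant:
        ggufs = [f for f in ggufs if quant.lower() in f.lower()]
    if not ggufs:
        raise FileNotFoundError(f"no GGUF file matching {quant!r}")
    return ggufs[0]  # with sharded models, the first part sorts first


def fetch(ref: str, cache_dir: str) -> str:
    """Resolve a shorthand reference, download the file, return the local path."""
    from huggingface_hub import hf_hub_download, list_repo_files  # third-party

    repo, quant = split_model_ref(ref)
    filename = pick_gguf(list_repo_files(repo), quant)
    return hf_hub_download(repo_id=repo, filename=filename, local_dir=cache_dir)
```

This sketch downloads a single file. A sharded model would need every part fetched, which is left out here.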
The launcher exposes the llama.cpp knobs we actually use in production:
- --host and --port
- --n-gpu-layers
- --tensor-split
- --split-mode
- --ctx-size
- --cpu-moe and --n-cpu-moe
- --mmproj for multimodal projector files
- --jinja and --reasoning-format
- extra passthrough flags for newer llama.cpp options
It also checks binary capability before using newer features. If the current llama-server build does not support MoE offload or MTP speculative decoding flags, it fails with a rebuild hint instead of starting a broken command.
That sounds small, but it saves time. A launch failure should tell the operator what is missing, not leave them reading --help output at midnight.
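One way to implement such a check is to read the binary's own --help output before building the command. The sketch below assumes that approach. We have not confirmed it is how loadmodel.py does it, and the function names and error wording are illustrative.

```python
import re
import subprocess


def supported_flags(help_text: str) -> set[str]:
    """Collect every long option (--like-this) mentioned in the help text."""
    return set(re.findall(r"(?<![\w-])--[a-z0-9][a-z0-9-]*", help_text))


def missing_flags(help_text: str, required: list[str]) -> list[str]:
    """Return the required flags that the help text never mentions."""
    have = supported_flags(help_text)
    return [flag for flag in required if flag not in have]


def check_binary(server_bin: str, required: list[str]) -> None:
    """Raise with a rebuild hint if the binary lacks any required flag."""
    proc = subprocess.run([server_bin, "--help"], capture_output=True, text=True)
    missing = missing_flags(proc.stdout + proc.stderr, required)
    if missing:
        raise RuntimeError(
            f"{server_bin} does not support {', '.join(missing)}; "
            "rebuild llama.cpp from a newer ref"
        )
```

Matching whole option names matters here: a build that only knows --n-cpu-moe must not be reported as supporting --cpu-moe.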
MTP and Qwen-style launches
The repo already carries structured support for --spec-type draft-mtp, including fields for draft model path, Hugging Face draft repo, draft GPU layers, draft KV cache types, CPU/MoE placement, and draft thread controls.
The smoke-test shape in the README is the kind of command we want the tool to preserve:
./bin/llama-server \
--model models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
--host 127.0.0.1 --port 45650 \
--ctx-size 1024 --parallel 1 \
--n-gpu-layers 999 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--flash-attn on
The control plane does not guess a sibling draft file. You either point at the draft GGUF, download it through the Library page, or pass a Hugging Face draft reference. That is the right bias. Speculative decoding is too easy to make silently wrong if the launcher guesses.
The web control plane
The browser path starts like this:
python web_cli.py --init
python web_cli.py
--init creates .web_config.json and prints a bearer token. The backend defaults to 0.0.0.0:8787, serves the built Vite frontend if web/frontend/dist/ exists, and exposes a FastAPI API under /api.
The unauthenticated endpoint is narrow:
GET /api/health
Everything else requires the bearer token. WebSockets pass it as a ?token= query parameter because the browser WebSocket API cannot set an Authorization header on the upgrade request.
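From a client's point of view the two auth styles look like this. A minimal sketch using only the standard library: /api/health is the endpoint named above, while any other path passed to these helpers (for example a log-stream WebSocket path) is a placeholder, not a documented route.

```python
from __future__ import annotations

import json
import urllib.request
from urllib.parse import urlencode, urlsplit, urlunsplit


def auth_headers(token: str) -> dict[str, str]:
    """HTTP requests carry the token in a standard bearer header."""
    return {"Authorization": f"Bearer {token}"}


def ws_url(base_url: str, path: str, token: str) -> str:
    """Build a ws:// or wss:// URL that carries the token as a query parameter."""
    parts = urlsplit(base_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, path, urlencode({"token": token}), ""))


def get_json(base_url: str, path: str, token: str | None = None):
    """GET a JSON document; pass token=None for the unauthenticated health check."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers=auth_headers(token) if token else {},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

Calling get_json("http://127.0.0.1:8787", "/api/health") needs no token. Because a query-string token can end up in proxy access logs, it is worth checking what the reverse proxy records when the UI is exposed remotely.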
The UI gives us the pages we need on an inference host:
- Dashboard: backend health, host load, GPU pressure, fleet state
- Instances: create, start, stop, restart, delete, and recover llama-server processes
- Logs: live stdout per instance, build, or benchmark
- Builds: run autodevops.py from the browser and keep build history
- Memory: plan VRAM/RAM placement before launch
- Library: scan local GGUFs and download models from Hugging Face
- Benchmarks: run llama-bench, store logs, and parse throughput rows
The point is not to hide the terminal. The point is to make the host inspectable without SSHing in for every question.
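As an illustration of the "parse throughput rows" step on the Benchmarks page: llama-bench prints a Markdown table by default, with a t/s column in the form "mean ± stddev". The parser below is a sketch under that assumption and is not the backend's parser. The function names are ours.

```python
def parse_bench_table(text: str) -> list[dict[str, str]]:
    """Parse a Markdown table into one dict per data row, keyed by header cell."""
    rows: list[dict[str, str]] = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # log noise around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(c) <= set("-: ") for c in cells):
            continue  # the |---|---:| separator row
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def throughput(row: dict[str, str]) -> float:
    """Return the mean tokens per second from a 't/s' cell, dropping the stddev."""
    return float(row["t/s"].split("±")[0])
```

Storing the raw log next to the parsed rows, as the Benchmarks page does, keeps the original output available if the table layout changes between llama.cpp versions.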
Process state and recovery
The backend persists state in .web_state.json. It tracks instances, builds, benchmarks, PIDs, process groups, command lines, status, ports, and log file paths. Logs go under web/logs/.
The recovery behavior is the most important operational feature. On startup, the process manager checks the persisted records. If a managed process is still alive and its command line matches, it reattaches. It also scans /proc for recoverable llama-server processes launched from this repo and adopts them.
When the backend starts a managed instance, it marks the process environment with recovery metadata:
LLAMA_AUTODEPLOY_MANAGED=1
LLAMA_AUTODEPLOY_INSTANCE_ID=...
LLAMA_AUTODEPLOY_INSTANCE_NAME=...
LLAMA_AUTODEPLOY_LOG_FILE=...
That lets a restarted backend find the process again, attach the old log file, restore the instance record, and put the server back into the UI.
This is the difference between a dashboard and a useful control plane. The model server should not die just because the web backend restarted.
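The environment markers make the adoption scan straightforward on Linux, where /proc/<pid>/environ holds the process's initial environment as NUL-separated KEY=VALUE entries. The sketch below shows only that part. It is not the backend's process manager, the function names and record fields are hypothetical, and it omits the command-line match described above.

```python
from __future__ import annotations

from pathlib import Path

PREFIX = "LLAMA_AUTODEPLOY_"


def parse_environ(raw: bytes) -> dict[str, str]:
    """Decode the NUL-separated KEY=VALUE blob from /proc/<pid>/environ."""
    env: dict[str, str] = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def recovery_record(pid: int, env: dict[str, str]) -> dict | None:
    """Return the fields needed to re-adopt a process, or None if unmanaged."""
    if env.get(PREFIX + "MANAGED") != "1":
        return None
    return {
        "pid": pid,
        "instance_id": env.get(PREFIX + "INSTANCE_ID"),
        "name": env.get(PREFIX + "INSTANCE_NAME"),
        "log_file": env.get(PREFIX + "LOG_FILE"),
    }


def scan_proc(proc_root: Path = Path("/proc")) -> list[dict]:
    """Walk numeric /proc entries and collect every managed process."""
    found = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "environ").read_bytes()
        except OSError:
            continue  # process exited, or it belongs to another user
        record = recovery_record(int(entry.name), parse_environ(raw))
        if record:
            found.append(record)
    return found
```

Reading another process's environ requires the same user or root, so the scan only finds servers the backend could have launched itself.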
How we deploy it
Our normal host flow is:
git clone https://github.com/CesarPetrescu/llama-cpp-autodeploy.git
cd llama-cpp-autodeploy
python3 -m venv venv
./venv/bin/pip install -U pip
./venv/bin/pip install -r requirements.txt
cd web/frontend
npm install
npm run build
cd ../..
./start web --init
./start web
Then we build llama.cpp from the UI or CLI:
./start build
# or
./venv/bin/python autodevops.py --ref latest --now --distributed
Then we create model instances from the UI or call the API. A typical llama-server instance is just a structured version of this:
./venv/bin/python loadmodel.py \
--llm unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
--host 127.0.0.1 \
--port 45540 \
--ctx-size 32768 \
--n-gpu-layers 999 \
--tensor-split auto \
--extra --parallel 4 --flash-attn on
In front of that, we put the normal production pieces: systemd or a process supervisor for the web backend, a reverse proxy with TLS if the UI is remote, firewall rules around model ports, and a separate routing layer such as LiteLLM when the model should appear behind one OpenAI-compatible endpoint.
The repo does not try to be the whole platform. It manages the llama.cpp host. That boundary is healthy.
What we keep outside the tool
A few things should stay outside this repo:
- public TLS and domain routing
- team authentication beyond the single bearer token
- long-term metrics storage
- alerting
- LiteLLM routing and tenant controls
- OS patching and NVIDIA driver lifecycle
- backup policy for models, state, and logs
That is not a criticism. A small llama.cpp host controller should not become a half-built Kubernetes distribution. It should build, launch, inspect, recover, and benchmark reliably.
Why this pattern works
Raw llama.cpp is excellent because it is direct. One binary, one model file, one server. The moment you operate more than one model, that directness needs a little structure.
llama-cpp-autodeploy adds that structure without burying the operator:
- builds are reproducible by upstream ref
- binaries land in stable runtime paths
- models can come from local disk or Hugging Face
- GPU placement is explicit
- logs are preserved
- benchmark runs are tracked
- backend restarts do not orphan the fleet
- every generated command is still recognizable as llama.cpp
That last point is the design test. If the tool fails, an engineer can still read the command line and run it by hand.
That is how we deploy llama.cpp: keep the runtime close to upstream, wrap the repetitive work, and make the host easy to inspect before it becomes clever.