AI Engineering · June 1, 2026 · 8 min read · Cesar Petrescu, Co-Founder

How we deploy llama.cpp at PhotonSpark

Our llama.cpp deployment is built around llama-cpp-autodeploy: a small Python and web control plane that builds upstream llama.cpp, launches GGUF models, checks GPU memory, streams logs, recovers running servers, and keeps the operational surface readable.

Running llama-server by hand is fine once. It gets old fast.

The hard part is not starting a model. The hard part is keeping the workflow repeatable after you change a quant, rebuild llama.cpp, move a model between GPUs, test a new MTP flag, or restart the control process while a server is still running.

That is why we use llama-cpp-autodeploy. It is our small control plane for llama.cpp. It builds the runtime, launches models, watches GPU pressure, keeps logs, exposes a browser UI, and can recover llama-server processes that outlive the backend.

This is the shape we want for private inference: simple enough to debug from a terminal, structured enough to run more than one model without turning the host into a pile of one-off scripts.

What the repo does

The repo wraps the llama.cpp lifecycle in four layers:

Layer                        Job
---------------------------  ------------------------------------------------------------
autodevops.py                Clone and build upstream ggml-org/llama.cpp at a chosen tag, branch, commit, or latest release
loadmodel.py                 Launch LLM, embedding, and reranker endpoints with GGUF download support
memory_utils.py              Inspect GPU/RAM state and estimate placement before launch
web/backend + web/frontend   Browser control plane for builds, model library, instances, logs, memory, and benchmarks

There are CLI/TUI entry points too: autodevops_cli.py, loadmodel_cli.py, loadmodel_dist_cli.py, and rpc_server_cli.py. The browser UI uses the same underlying code paths. That matters. We do not want one workflow for operators and another workflow for the web panel.

The build path

The builder starts from upstream llama.cpp, not a vendored fork. autodevops.py points at https://github.com/ggml-org/llama.cpp and can build a specific ref:

bash
python autodevops.py --ref latest --now
python autodevops.py --ref 764f1e6 --now --distributed

The default runtime layout is outside the repo:

text
~/llama-runtime/
  llama-builds/      # versioned source/build trees
  llama-current      # symlink to the active build
  bin/               # stable symlinks: llama-server, llama-bench, rpc-server, ...
  models/            # standalone CLI model cache unless overridden

That separation is intentional. The repo stays code. Runtime artifacts live in a runtime directory. The stable bin/ symlinks mean launchers can keep calling the same binary path while the active build changes underneath.
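The symlink switch is what makes this layout safe to update while launchers are running. Below is a minimal sketch of that idea, not the repo's actual code: the function name `switch_current` and the temporary link name are assumptions for illustration. It creates the new link under a temporary name and renames it over the old one, so there is no moment where `llama-current` is missing.

```python
import os
from pathlib import Path


def switch_current(runtime: Path, build_dir: Path,
                   link_name: str = "llama-current") -> Path:
    """Point runtime/link_name at build_dir without a gap where the link is absent.

    Hypothetical helper: the real builder may do this differently.
    """
    link = runtime / link_name
    tmp = runtime / f".{link_name}.tmp-{os.getpid()}"
    # Clean up a leftover temporary link from an interrupted run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(build_dir, tmp, target_is_directory=True)
    # On POSIX, rename replaces the existing symlink itself in one step.
    os.replace(tmp, link)
    return link
```

A process that already opened the old binary keeps running it. Only new launches resolve the link to the new build.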

The build script handles the common GPU host details we care about:

  • checks for git, cmake, compilers, pkg-config, and NVIDIA tooling
  • finds CUDA_HOME or an installed CUDA toolkit
  • detects GPU compute capability from nvidia-smi
  • builds with GGML_CUDA=ON
  • enables MMQ automatically on newer NVIDIA architectures unless overridden
  • chooses MKL, OpenBLAS, or no BLAS
  • can build GGML RPC support with --distributed
  • symlinks built binaries into the runtime bin/
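As an illustration of the compute capability step, here is a hedged sketch of how the `nvidia-smi` output could be turned into CMake arguments. The helper names are hypothetical, and it assumes a driver recent enough to support the `compute_cap` query field. The real script may detect and pass architectures differently.

```python
import subprocess
from typing import List


def parse_compute_caps(smi_output: str) -> List[str]:
    """Turn lines like '8.9' into unique CMake arch strings like '89'."""
    archs = set()
    for line in smi_output.splitlines():
        major, _, minor = line.strip().partition(".")
        if major.isdigit() and minor.isdigit():
            archs.add(f"{major}{minor}")
    return sorted(archs, key=int)


def cmake_cuda_args(archs: List[str]) -> List[str]:
    """Build the CUDA-related CMake flags for the detected architectures."""
    args = ["-DGGML_CUDA=ON"]
    if archs:
        args.append("-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(archs))
    return args


def query_compute_caps() -> List[str]:
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_caps(out)
```

On a host with GPUs, `cmake_cuda_args(query_compute_caps())` yields the flags to append to the configure step. Keeping the parsing separate from the subprocess call makes it easy to test without a GPU.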

There is also a CUDA iterator compatibility patch path for one upstream CUDA source issue. That is exactly the kind of practical scar tissue we want in a deployment tool: small, visible, and close to the build that needs it.

The launch path

loadmodel.py is the terminal launcher. It supports three modes:

bash
python loadmodel.py --llm ./models/model.gguf --port 45540
python loadmodel.py --embed Qwen/Qwen3-Embedding-8B-GGUF:Q8_0 --port 45541
python loadmodel.py --rerank Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 45542

For LLM and embedding mode it runs llama-server. For reranking it runs a small Transformers-backed HTTP service shaped like an embeddings/rerank endpoint.

Model input can be a local GGUF path or a Hugging Face shorthand like:

text
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

If the file is not local, the launcher lists repo files, chooses the matching GGUF, downloads it with huggingface_hub, and then starts the server.
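The resolution step can be sketched roughly as follows. This is an assumption about the shape of the logic, not the launcher's code: `parse_model_ref` and `pick_gguf` are hypothetical names, and the matching rule (case-insensitive substring, first shard wins) is a simplification.

```python
from typing import Iterable, Optional, Tuple


def parse_model_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split 'org/repo:QUANT' into (repo, quant); quant is None if absent."""
    repo, sep, quant = ref.partition(":")
    return repo, (quant if sep and quant else None)


def pick_gguf(files: Iterable[str], quant: Optional[str]) -> str:
    """Choose the GGUF file matching the quant tag from a repo file listing."""
    ggufs = sorted(
        f for f in files
        if f.lower().endswith(".gguf") and "mmproj" not in f.lower()
    )
    if quant:
        ggufs = [f for f in ggufs if quant.lower() in f.lower()]
    if not ggufs:
        raise FileNotFoundError(f"no GGUF file matches quant {quant!r}")
    # Sorting puts the first shard of a split model (…-00001-of-…) first.
    return ggufs[0]


# With huggingface_hub installed, the file listing and download would be:
#   from huggingface_hub import list_repo_files, hf_hub_download
#   repo, quant = parse_model_ref("unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL")
#   name = pick_gguf(list_repo_files(repo), quant)
#   path = hf_hub_download(repo_id=repo, filename=name)
```

Substring matching is ambiguous for short tags: `Q4_K` would match both `Q4_K_M` and `UD-Q4_K_XL`. A stricter launcher would refuse to choose when more than one distinct quant matches.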

The launcher exposes the llama.cpp knobs we actually use in production:

  • --host and --port
  • --n-gpu-layers
  • --tensor-split
  • --split-mode
  • --ctx-size
  • --cpu-moe and --n-cpu-moe
  • --mmproj for multimodal projector files
  • --jinja and --reasoning-format
  • extra passthrough flags for newer llama.cpp options

It also checks binary capability before using newer features. If the current llama-server build does not support MoE offload or MTP speculative decoding flags, it fails with a rebuild hint instead of starting a broken command.

That sounds small, but it saves time. A launch failure should tell the operator what is missing, not leave them reading --help output at midnight.
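One way to implement such a capability check is to read the binary's own `--help` output and look for each flag as a whole token. The sketch below assumes that approach. The function names and the exact error text are hypothetical, and the repo may probe capabilities differently.

```python
import re
import subprocess
from typing import Iterable, List


def supports_flag(help_text: str, flag: str) -> bool:
    """True if flag appears in the help text as a complete option name."""
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def missing_flags(help_text: str, flags: Iterable[str]) -> List[str]:
    """Return the requested flags that this build does not advertise."""
    return [f for f in flags if not supports_flag(help_text, f)]


def require_flags(server_bin: str, flags: Iterable[str]) -> None:
    """Fail early, with a rebuild hint, if the binary lacks a requested flag."""
    proc = subprocess.run([server_bin, "--help"], capture_output=True, text=True)
    missing = missing_flags(proc.stdout + proc.stderr, flags)
    if missing:
        raise RuntimeError(
            f"{server_bin} does not support {', '.join(missing)}; "
            "rebuild, e.g. python autodevops.py --ref latest --now"
        )
```

Calling `require_flags("./bin/llama-server", ["--spec-type"])` before assembling the command turns a cryptic argument-parsing failure into an actionable message.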

MTP and Qwen-style launches

The repo already carries structured support for --spec-type draft-mtp, including fields for draft model path, Hugging Face draft repo, draft GPU layers, draft KV cache types, CPU/MoE placement, and draft thread controls.

The smoke-test shape in the README is the kind of command we want the tool to preserve:

bash
./bin/llama-server \
  --model models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
  --host 127.0.0.1 --port 45650 \
  --ctx-size 1024 --parallel 1 \
  --n-gpu-layers 999 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --flash-attn on

The control plane does not guess a sibling draft file. You either point at the draft GGUF, download it through the Library page, or pass a Hugging Face draft reference. That is the right bias. Speculative decoding is too easy to make silently wrong if the launcher guesses.

The web control plane

The browser path starts like this:

bash
python web_cli.py --init
python web_cli.py

--init creates .web_config.json and prints a bearer token. The backend listens on 0.0.0.0:8787 by default (all interfaces), serves the built Vite frontend if web/frontend/dist/ exists, and exposes a FastAPI API under /api.

The unauthenticated endpoint is narrow:

text
GET /api/health

Everything else requires the bearer token. WebSockets pass it as a ?token= query parameter because the browser WebSocket API cannot attach an Authorization header to the upgrade request.
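From a client's point of view, the two auth styles look roughly like this. It is a sketch using only the standard library. The paths `/api/instances` and `/api/ws/logs` are hypothetical placeholders, since only `/api/health` is named here, and the helper names are ours.

```python
import urllib.request
from urllib.parse import urlencode, urlsplit, urlunsplit


def api_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build an HTTP request that carries the bearer token in a header."""
    req = urllib.request.Request(base_url.rstrip("/") + path)
    req.add_header("Authorization", f"Bearer {token}")
    return req


def ws_url(base_url: str, path: str, token: str) -> str:
    """Build a WebSocket URL that carries the token as a query parameter.

    Assumes base_url has no path prefix of its own.
    """
    parts = urlsplit(base_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, path, urlencode({"token": token}), ""))


# Usage (endpoint paths are illustrative only):
#   req = api_request("http://127.0.0.1:8787", "/api/instances", token)
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       print(resp.status)
```

Tokens in query strings tend to end up in proxy and access logs, which is one more reason to put TLS and careful log handling in front of a remote UI.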

The UI gives us the pages we need on an inference host:

  • Dashboard: backend health, host load, GPU pressure, fleet state
  • Instances: create, start, stop, restart, delete, and recover llama-server processes
  • Logs: live stdout per instance, build, or benchmark
  • Builds: run autodevops.py from the browser and keep build history
  • Memory: plan VRAM/RAM placement before launch
  • Library: scan local GGUFs and download models from Hugging Face
  • Benchmarks: run llama-bench, store logs, and parse throughput rows
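For the Memory page, the planning arithmetic can be approximated by hand. The sketch below is a rough estimate under stated assumptions, not what memory_utils.py computes: it uses the standard full-attention KV cache formula, treats the GGUF file size as the weight footprint, and adds a flat overhead guess. It ignores compute buffers, sliding-window or hybrid attention layers, and quantized KV cache types.

```python
from typing import List


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_size: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors for every layer across the whole context (f16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem


def fits(model_bytes: int, kv_bytes: int, free_vram_bytes: int,
         overhead_bytes: int = 1 << 30) -> bool:
    """Coarse go/no-go check; overhead_bytes is a guess, not a measurement."""
    return model_bytes + kv_bytes + overhead_bytes <= free_vram_bytes


def tensor_split(free_bytes_per_gpu: List[int]) -> str:
    """Proportions for --tensor-split, weighted by free memory per GPU."""
    total = sum(free_bytes_per_gpu)
    return ",".join(f"{free / total:.2f}" for free in free_bytes_per_gpu)


GIB = 1024 ** 3
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 32k context.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_size=32768)
print(kv / GIB)                         # 4.0
print(fits(18 * GIB, kv, 24 * GIB))     # True: 18 + 4 + 1 <= 24
print(tensor_split([24 * GIB, 8 * GIB]))  # 0.75,0.25
```

The useful part is the order of magnitude: at long contexts the KV cache can rival the weights, which is why the context size belongs in the placement plan and not just the model file size.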

The point is not to hide the terminal. The point is to make the host inspectable without SSHing in for every question.

Process state and recovery

The backend persists state in .web_state.json. It tracks instances, builds, benchmarks, PIDs, process groups, command lines, status, ports, and log file paths. Logs go under web/logs/.

The recovery behavior is the most important operational feature. On startup, the process manager checks the persisted records. If a managed process is still alive and its command line matches, it reattaches. It also scans /proc for recoverable llama-server processes launched from this repo and adopts them.

When the backend starts a managed instance, it marks the process environment with recovery metadata:

text
LLAMA_AUTODEPLOY_MANAGED=1
LLAMA_AUTODEPLOY_INSTANCE_ID=...
LLAMA_AUTODEPLOY_INSTANCE_NAME=...
LLAMA_AUTODEPLOY_LOG_FILE=...

That lets a restarted backend find the process again, attach the old log file, restore the instance record, and put the server back into the UI.

This is the difference between a dashboard and a useful control plane. The model server should not die just because the web backend restarted.
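The environment markers are what make the /proc scan possible. Here is a minimal sketch of that scan, assuming Linux and read access to the processes' `environ` files. The function names and the returned dictionary shape are illustrative, and the real process manager also checks command lines before adopting anything.

```python
from pathlib import Path
from typing import Dict, List


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Decode the NUL-separated KEY=VALUE block from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def find_managed(proc_root: Path = Path("/proc")) -> List[dict]:
    """List processes whose environment carries the managed marker."""
    found = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # skip /proc/self, /proc/meminfo, and similar
        try:
            env = parse_environ((entry / "environ").read_bytes())
        except OSError:
            continue  # process exited, or we lack permission to read it
        if env.get("LLAMA_AUTODEPLOY_MANAGED") == "1":
            found.append({
                "pid": int(entry.name),
                "instance_id": env.get("LLAMA_AUTODEPLOY_INSTANCE_ID"),
                "name": env.get("LLAMA_AUTODEPLOY_INSTANCE_NAME"),
                "log_file": env.get("LLAMA_AUTODEPLOY_LOG_FILE"),
            })
    return found
```

Child processes inherit the environment, so a marker match alone can also pick up helpers spawned by the server. Matching the command line as well, as described above, narrows the result to the actual llama-server process.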

How we deploy it

Our normal host flow is:

bash
git clone https://github.com/CesarPetrescu/llama-cpp-autodeploy.git
cd llama-cpp-autodeploy
python3 -m venv venv
./venv/bin/pip install -U pip
./venv/bin/pip install -r requirements.txt

cd web/frontend
npm install
npm run build
cd ../..

./start web --init
./start web

Then we build llama.cpp from the UI or CLI:

bash
./start build
# or
./venv/bin/python autodevops.py --ref latest --now --distributed

Then we create model instances from the UI or call the API. A typical llama-server instance is just a structured version of this:

bash
./venv/bin/python loadmodel.py \
  --llm unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
  --host 127.0.0.1 \
  --port 45540 \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --tensor-split auto \
  --extra --parallel 4 --flash-attn on

In front of that, we put the normal production pieces: systemd or a process supervisor for the web backend, a reverse proxy with TLS if the UI is remote, firewall rules around model ports, and a separate routing layer such as LiteLLM when the model should appear behind one OpenAI-compatible endpoint.

The repo does not try to be the whole platform. It manages the llama.cpp host. That boundary is healthy.
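One thing the surrounding pieces usually need from the host is a readiness signal. llama-server exposes a `/health` endpoint that answers 503 while the model loads and 200 once it can serve, so a supervisor or routing layer can poll it before sending traffic. A sketch of such a poller follows. The function names, timeouts, and injectable hooks are our assumptions, not part of the repo.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, reset


def wait_ready(url: str, deadline_s: float = 300.0, interval_s: float = 1.0,
               probe: Callable[[str], Optional[int]] = http_status,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the endpoint returns 200 or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe(url) == 200:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


# Usage against the instance launched above:
#   ok = wait_ready("http://127.0.0.1:45540/health")
```

The probe, sleep, and clock are parameters so the loop can be tested without a running server or real delays.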

What we keep outside the tool

A few things should stay outside this repo:

  • public TLS and domain routing
  • team authentication beyond the single bearer token
  • long-term metrics storage
  • alerting
  • LiteLLM routing and tenant controls
  • OS patching and NVIDIA driver lifecycle
  • backup policy for models, state, and logs

That is not a criticism. A small llama.cpp host controller should not become a half-built Kubernetes distribution. It should build, launch, inspect, recover, and benchmark reliably.

Why this pattern works

Raw llama.cpp is excellent because it is direct. One binary, one model file, one server. The moment you operate more than one model, that directness needs a little structure.

llama-cpp-autodeploy adds that structure without burying the operator:

  • builds are reproducible by upstream ref
  • binaries land in stable runtime paths
  • models can come from local disk or Hugging Face
  • GPU placement is explicit
  • logs are preserved
  • benchmark runs are tracked
  • backend restarts do not orphan the fleet
  • every generated command is still recognizable as llama.cpp

That last point is the design test. If the tool fails, an engineer can still read the command line and run it by hand.

That is how we deploy llama.cpp: keep the runtime close to upstream, wrap the repetitive work, and make the host easy to inspect before it becomes clever.
