AI Engineering · June 1, 2026 · 5 min read · PhotonSpark

Serving Qwen on two RTX 4090s: vLLM for traffic, llama.cpp for context

A dual RTX 4090 box can serve a useful private Qwen endpoint, but the serving stack should split the work: vLLM for concurrent traffic, llama.cpp for GGUF models, experiments, and long-context jobs.

Two RTX 4090s look generous until the first serious model lands on the box. Then the limits become visible: 48 GB of total VRAM split across two 24 GB cards, and no NVLink between them. With the right quantization that is enough memory for a useful 27B or 32B-class Qwen setup, but still not enough room to ignore the KV cache.

For this kind of machine, we would not choose a single inference engine and force every job through it. We would run vLLM and llama.cpp behind LiteLLM. vLLM gets the main chat and agent endpoint. llama.cpp stays available for GGUF models, long-context work, and quick experiments.

That split is deliberate. It matches the hardware.

What vLLM is good at

vLLM is built for serving traffic. Its best-known idea is PagedAttention, where the KV cache is managed in blocks instead of as one large contiguous allocation per request. In real deployments, that matters because concurrent requests do not arrive politely. Some are short, some are long, and some sit around while the model generates.

PagedAttention helps the server use memory more efficiently under that mix. The practical result is better batching and better throughput when more than one user is talking to the model.
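
A rough way to see the effect is to count reserved token slots for a mixed batch. The sketch below is illustrative arithmetic only, not vLLM's allocator: the 16-token block size, the 4096-token contiguous reservation, and the request lengths are all assumptions chosen for the example.

```bash
#!/usr/bin/env bash
# Illustrative arithmetic only; this is not how vLLM is implemented internally.
BLOCK_TOKENS=16   # assumed tokens per KV-cache block
MAX_LEN=4096      # assumed per-request reservation in the contiguous case

# Blocks needed to hold N tokens, rounding up.
blocks_for() { echo $(( ($1 + BLOCK_TOKENS - 1) / BLOCK_TOKENS )); }

# Token slots reserved when each request only holds the blocks it needs.
paged_slots() {
  local total=0 n
  for n in "$@"; do
    total=$(( total + $(blocks_for "$n") * BLOCK_TOKENS ))
  done
  echo "$total"
}

# Token slots reserved when every request gets a full max-length region.
contiguous_slots() { echo $(( $# * MAX_LEN )); }

requests=(120 900 35 2048)   # hypothetical current lengths of four active requests
echo "paged:      $(paged_slots "${requests[@]}") token slots"
echo "contiguous: $(contiguous_slots "${requests[@]}") token slots"
```

With these assumed numbers the paged layout reserves 3136 slots against 16384 for the contiguous one. The gap is memory that can hold more concurrent requests.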

That is why vLLM is the natural front door for interactive chat, agent traffic, and API workloads where concurrency matters.

What llama.cpp is good at

llama.cpp wins a different lane. It is lean, practical, and very good at running GGUF models. It starts quickly, supports a wide range of hardware, and gives you a comfortable place to test quants that are not ready for the main serving path.

On a dual-4090 box, this is useful even if vLLM is faster under traffic. Sometimes the job is a long codebase read. Sometimes the model you want is available as a GGUF first. Sometimes you are testing a quant and do not want to disturb production chat. llama.cpp is the right tool for those jobs.

What public benchmarks suggest

The pattern across public tests is not surprising. vLLM tends to pull away when concurrency rises. llama.cpp remains attractive when the workload is smaller, unusual, or GGUF-based.

Red Hat's 2025 comparison showed vLLM far ahead of llama.cpp at peak load on their H200 test setup. Do not copy those numbers onto a 4090 box, because the hardware and model matter. The lesson is still useful: once many requests are active, scheduling and KV-cache handling dominate.

LLMKube's Qwen3.6-27B bake-off had a different lesson. Their test hardware was not a pair of 4090s, so again, the raw numbers are not portable. But the shape was familiar: vLLM handled short-prompt concurrency better, while llama.cpp handled a long prompt that the tested vLLM configuration rejected because of the context cap.

That is exactly why we keep both engines available.

The 4090 constraint

A pair of 4090s does not behave like one 48 GB card. The missing piece is NVLink. The cards coordinate over PCIe, and that coordination has a cost.

For models that need both GPUs, benchmark tensor parallelism and pipeline parallelism instead of assuming one is correct. vLLM's own parallelism documentation notes that pipeline parallelism can be a better fit when GPUs do not have fast interconnect.

There is also a simpler option: if a quant fits comfortably on one 4090, run one model server per GPU and load-balance at LiteLLM. Do not make two cards coordinate just because both cards are installed.
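
The one-server-per-GPU layout can be sketched as a dry run that only prints the launch commands. The model name is a hypothetical placeholder for a quant that fits in 24 GB, and the port numbering is an assumption for the example.

```bash
#!/usr/bin/env bash
# Dry run: prints one launch command per GPU instead of starting servers.
MODEL="${MODEL:-your-org/your-quant-that-fits-24gb}"   # hypothetical placeholder
BASE_PORT=8000                                          # assumed starting port

# Build the command for a single GPU index; each server sees only its own card.
launch_cmd() {
  local gpu=$1
  echo "CUDA_VISIBLE_DEVICES=${gpu} vllm serve ${MODEL} --served-model-name qwen-chat --port $(( BASE_PORT + gpu ))"
}

for gpu in 0 1; do
  launch_cmd "$gpu"
done
```

Listing both ports in LiteLLM under the same model_name should let the router spread requests across them, with no cross-GPU traffic over PCIe.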

Quantization choices

For vLLM, stick to formats it serves well: FP8 where supported, AWQ, GPTQ, and the Marlin-backed paths when the model and GPU line up. The point is not to chase the smallest file. The point is to serve the model reliably under the prompts customers actually send.

For llama.cpp, GGUF is the home field. Use it without apology. The only rule is that every quant needs a smoke test before traffic reaches it. A quant can look fine on a quick prompt and then fall apart on a longer, messier real one.

We also pin builds for production. Inference bugs can come from the model, the quant, CUDA, drivers, the serving engine, or the prompt. Pinning does not remove the risk, but it makes failures easier to reproduce.
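
Pinning is easier to audit if the versions in play are written down next to each benchmark run. A minimal sketch, assuming vLLM is installed as a Python package and that llama-server accepts --version on your build; the probe helper is a hypothetical name introduced here.

```bash
#!/usr/bin/env bash
# Print "name: first line of output", or "name: unavailable" if the tool is
# missing or the command fails.
probe() {
  local name=$1 out=""
  shift
  if command -v "$1" >/dev/null 2>&1 && out=$("$@" 2>&1); then
    out=$(printf '%s\n' "$out" | head -n 1)
  else
    out=""
  fi
  echo "${name}: ${out:-unavailable}"
}

probe driver nvidia-smi --query-gpu=driver_version --format=csv,noheader
probe vllm python3 -c 'import importlib.metadata as m; print(m.version("vllm"))'
probe llama.cpp llama-server --version
```

Saving this output alongside the results makes it possible to tell a model regression from a driver or engine upgrade.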

A practical routing shape

LiteLLM is useful here because it can present one API surface while routing to different backends.

yaml

model_list:
  - model_name: qwen-chat
    litellm_params:
      model: hosted_vllm/qwen-chat
      api_base: http://vllm-qwen:8000/v1
      api_key: local

  - model_name: qwen-long
    litellm_params:
      model: openai/qwen-long
      api_base: http://llamacpp-qwen:8080/v1
      api_key: local

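router_settings:
  routing_strategy: least-busy

# Background health checks are configured under general_settings in LiteLLM.
general_settings:
  background_health_checks: true
  health_check_interval: 30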

Interactive chat, tool calls, and concurrent API traffic go to qwen-chat. Long repository reads, overnight jobs, GGUF tests, and odd context-heavy requests go to qwen-long.
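
Because LiteLLM routes on the model name in the request, the caller decides which lane a job takes. One way a client might make that choice is sketched below; the 32768-token cap on the vLLM lane and the four-characters-per-token estimate are assumptions, not measured values.

```bash
#!/usr/bin/env bash
VLLM_MAX_TOKENS=32768   # assumed context cap configured on the vLLM lane

# Rough token estimate: about 4 characters per token (a heuristic, not a tokenizer).
estimate_tokens() { echo $(( (${#1} + 3) / 4 )); }

# Pick the LiteLLM model name for a prompt based on its estimated size.
pick_lane() {
  local tokens
  tokens=$(estimate_tokens "$1")
  if (( tokens > VLLM_MAX_TOKENS )); then
    echo "qwen-long"
  else
    echo "qwen-chat"
  fi
}

pick_lane "Write a 5-line deployment checklist."
```

A real client would count tokens with the model's tokenizer and leave headroom for the response, but the shape of the decision is the same.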

Starting point for vLLM

For a main endpoint, we would start with vLLM and test both parallelism modes:

bash

CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3.6-27B-FP8 \
  --served-model-name qwen-chat \
  --tensor-parallel-size 2 \
  --language-model-only \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90

Then repeat with pipeline parallelism:

bash

CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3.6-27B-FP8 \
  --served-model-name qwen-chat \
  --pipeline-parallel-size 2 \
  --language-model-only \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90

The winning command is the one that behaves better with your prompts, not the one that looks cleaner in a blog post.
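
Comparing the two modes needs the same concurrent load against each. A minimal sketch of such a harness follows, assuming the endpoint and model name from the commands above. The function names are hypothetical, and the fixed prompt should be swapped for your own.

```bash
#!/usr/bin/env bash
ENDPOINT="${ENDPOINT:-http://localhost:8000/v1/chat/completions}"
MODEL_NAME="${MODEL_NAME:-qwen-chat}"

# Send one request and print its total wall time in seconds.
one_request() {
  curl -s -o /dev/null -w '%{time_total}\n' "$ENDPOINT" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${MODEL_NAME}\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a 5-line deployment checklist.\"}], \"max_tokens\": 160}"
}

# Read one latency per line on stdin; print count, mean, and max.
summarize() {
  awk '{ s += $1; if ($1 + 0 > m + 0) m = $1 + 0 }
       END { if (NR) printf "n=%d mean=%.3f max=%.3f\n", NR, s / NR, m }'
}

# Fire N requests concurrently, wait for all of them, then summarize.
run_batch() {
  local n=$1 i
  { for (( i = 0; i < n; i++ )); do one_request & done; wait; } | summarize
}
```

Running `run_batch 8` once against the tensor-parallel server and once against the pipeline-parallel server gives two comparable lines. Total request time is a coarse measure; it does not separate time to first token from generation speed.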

Starting point for llama.cpp

The secondary endpoint can run a known-good GGUF:

bash

CUDA_VISIBLE_DEVICES=0,1 llama-server \
  -m /models/qwen3.6-27b-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --parallel 4 \
  --metrics

The flags will move with the model, context target, CUDA version, and build. The operating rule stays the same: keep vLLM on the traffic lane and llama.cpp on the flexible lane.
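
One detail worth checking on the long-context lane: in many llama-server builds the --ctx-size budget is divided across the --parallel slots, so each concurrent request sees only a fraction of it. Treat that as an assumption to verify against your pinned build, since slot and KV-cache behavior has changed between releases. Under that assumption, the arithmetic for the command above is:

```bash
#!/usr/bin/env bash
# Context tokens per slot, assuming --ctx-size is split evenly across --parallel slots.
per_slot_ctx() { echo $(( $1 / $2 )); }

echo "per-slot context: $(per_slot_ctx 65536 4) tokens"
```

If a repository read needs the whole window, lowering --parallel or raising --ctx-size is the lever, at the cost of concurrency or VRAM.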

What would change this recommendation

A larger single-GPU box would change the memory story. NVLink-capable GPUs would change the multi-GPU story. A faster and boring GGUF path in vLLM would consolidate more traffic there. A major llama.cpp serving improvement under concurrency would push more live traffic its way.

For this box, today, we would keep both engines and measure them with real prompts before promising anything to customers.

Try the split yourself

Start by proving the vLLM path works before adding routing layers:

bash

nvidia-smi
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3.6-27B-FP8 \
  --served-model-name qwen-chat \
  --tensor-parallel-size 2 \
  --language-model-only \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90

Smoke-test the OpenAI-compatible endpoint:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-chat",
    "messages": [{"role": "user", "content": "Write a 5-line deployment checklist."}],
    "max_tokens": 160
  }'

Then test the llama.cpp lane with the exact GGUF you intend to use:

bash

CUDA_VISIBLE_DEVICES=0,1 llama-server \
  -m /models/qwen3.6-27b-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --parallel 4 \
  --metrics

Smoke-test that endpoint too:

bash

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-long",
    "messages": [{"role": "user", "content": "Summarize this codebase migration plan."}],
    "max_tokens": 160
  }'

Record the boring numbers before making a recommendation:

bash

watch -n 1 nvidia-smi

Check	vLLM	llama.cpp
Cold start time
Peak VRAM
TTFT on short prompt
Tokens/sec on long prompt
Max stable parallel requests
First failure mode
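
For the Peak VRAM row, watching the screen is easy to get wrong. A small sketch that logs memory use to a file and reads the peak back afterwards is shown below. The function names and the vram.csv default are hypothetical; the nvidia-smi query fields are standard.

```bash
#!/usr/bin/env bash
# Sample per-GPU memory use once per second into a CSV log (stop with Ctrl-C).
log_vram() {
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits -l 1 > "${1:-vram.csv}"
}

# Print the peak memory.used value (MiB) seen for each GPU index in the log.
peak_vram() {
  awk -F', *' '{ if ($2 + 0 > peak[$1] + 0) peak[$1] = $2 + 0 }
               END { for (g in peak) print "gpu" g " peak_mib=" peak[g] }' "$1" | sort
}
```

Start `log_vram run1.csv` in a second terminal before the benchmark, stop it when the run ends, then call `peak_vram run1.csv` to fill in the table.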

If a model fits on one 4090, repeat the vLLM test as one process per GPU before assuming tensor parallelism is better.
