AI Engineering · May 11, 2026 · 5 min read · PhotonSpark

Private AI: what it actually costs

Private AI can beat hosted APIs when usage is steady, data has to stay controlled, or model access is part of the product. The real budget is hardware, power, hosting, and the person who owns the stack.

"Can we just run the model ourselves?" is usually the right question. It is also the start of a longer spreadsheet.

Private AI is not free because the model weights are free. You still need GPUs, a host server, storage, networking, power, cooling, monitoring, backups, patching, and someone who knows what to do when CUDA, drivers, kernels, or model servers start fighting each other.

That does not make private AI a bad idea. It means you should buy it for the right reasons.

The cost buckets

We usually split the budget into four buckets.

Hardware is the visible one: GPUs, CPU, RAM, NVMe, NICs, chassis, rails, spare parts. The GPU gets the attention, but a weak host around an expensive card is a common way to waste money.

Power and cooling are easier to forget. A high-end data-center GPU can draw hundreds of watts under load. A multi-GPU node running all day needs a real power budget, and the cooling side is not optional. Consumer cards can be attractive on price, but they add their own problems around cooling, support, and multi-GPU behavior.

Hosting means rack space, remote hands, networking, IP space, DDoS posture, backups, physical access rules, and the boring contract details. If the machine sits in an office closet, the hosting cost is just hidden in a different line item.

Operations is the line item people undercount. Someone has to deploy models, test quantization, rotate keys, watch disk and VRAM, handle abuse, read logs, patch the host, and decide when a model update is safe.
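
Of the four buckets, power is the easiest to put a rough number on. The sketch below is one way to do it, not a billing model: the wattages, the electricity price, and the 1.5 overhead multiplier for cooling (PUE) are all assumed values for illustration, and many colocation contracts bill for committed power rather than metered kWh.

```python
def monthly_power_cost(avg_watts: float, price_per_kwh: float,
                       pue: float = 1.5, hours: float = 730.0) -> float:
    """Estimate monthly electricity cost for one node.

    avg_watts: average wall draw of the whole node, not just the GPUs.
    pue: assumed multiplier for cooling and facility overhead.
    hours: hours per month (730 is roughly 24 * 365 / 12).
    """
    kwh = avg_watts / 1000.0 * hours * pue
    return kwh * price_per_kwh


# Hypothetical node: 4 GPUs at 350 W each plus 400 W for the host.
node_watts = 4 * 350 + 400
print(round(monthly_power_cost(node_watts, price_per_kwh=0.25), 2))
```

Swap in the measured average draw once you have it. Nameplate wattage tends to overstate a node that idles part of the day and understate the facility overhead around it.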

The hosted API comparison

Hosted APIs are simple to start with because you pay by use. That is perfect for prototypes, bursty workloads, and teams still deciding which model family fits the product.

The bill starts to hurt when the workload becomes steady. If your app sends a predictable stream of prompts every hour, private capacity can make sense. You are trading a variable per-token bill for a fixed operating cost. That trade is only attractive when the machine stays busy enough to justify itself.
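
The trade between a per-token bill and a fixed cost has a break-even point, and it is worth computing before any hardware conversation. This is a minimal sketch under simplifying assumptions: a single blended hosted price per 1,000 tokens (real price lists usually charge input and output differently), and a private box that can actually serve the volume. The function names and example prices are ours, not from any provider.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float,
                                hosted_price_per_1k_tokens: float) -> float:
    """Monthly token volume at which the hosted bill equals the fixed cost."""
    return fixed_monthly_cost / hosted_price_per_1k_tokens * 1000.0


def private_is_cheaper(monthly_tokens: float, fixed_monthly_cost: float,
                       hosted_price_per_1k_tokens: float) -> bool:
    """Compare the hosted bill at this volume against the fixed private cost."""
    hosted_bill = monthly_tokens / 1000.0 * hosted_price_per_1k_tokens
    return fixed_monthly_cost < hosted_bill


# Hypothetical inputs: 3,000 per month all-in, 0.002 per 1k blended tokens.
print(f"{break_even_tokens_per_month(3000, 0.002):,.0f} tokens per month")
print(private_is_cheaper(2_000_000_000, 3000, 0.002))
```

Below the break-even volume the machine is a more expensive way to buy the same tokens, unless one of the control reasons applies.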

There is also a control question. Some teams need a private endpoint because of data residency, contractual limits, or customer trust. Others want to run a specific open-weight model, keep a fine-tune close, or avoid sending internal documents to an external API. Those are valid reasons, but they should be written down before buying hardware.

Hardware choices in plain language

A large enterprise GPU gives you more VRAM, better reliability, and vendor support. It costs more, but it is usually calmer in production.

A consumer GPU can be a smart development box or a cost-effective small inference node. It can also become painful when you need remote maintenance, dense cooling, predictable replacement parts, or clean multi-GPU scaling.

The model size drives the choice. A small embedding model and a 70B chat model are not the same project. Context length matters too, because the KV cache eats memory as conversations get longer. Quantization helps, but every quant should be tested with the actual prompts your application will send.
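
A back-of-envelope memory estimate makes the model size and context length points concrete. The sketch below uses the standard approximation for a transformer KV cache (one key and one value vector per layer, per KV head, per token). The layer count, KV head count, and head dimension in the example are assumptions for a 70B-class model with grouped-query attention; read the real values from your model's config. It ignores activations, framework overhead, and the extra bytes that quantization formats spend on scales.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights at a nominal bit width."""
    return params_billion * 1e9 * bits_per_weight / 8.0


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, concurrent_sequences: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size when every sequence fills its context."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_sequences


GIB = 1024 ** 3

# Assumed shape: 80 layers, 8 KV heads, head_dim 128, 16-bit cache values.
weights = weight_bytes(70, bits_per_weight=4)
cache = kv_cache_bytes(80, 8, 128, context_tokens=8192, concurrent_sequences=4)
print(f"weights ~{weights / GIB:.1f} GiB, KV cache ~{cache / GIB:.1f} GiB")
```

The cache term scales linearly with both context length and concurrency, which is why a setup that fits a single short chat can run out of VRAM under real traffic.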

The safest procurement rule is boring: choose the smallest setup that can run the target workload with headroom, then test it under traffic before calling it production.

When private AI wins

Private AI tends to make sense when one of these is true:

Usage is steady enough that owned or dedicated capacity stays busy.
Data cannot leave a controlled environment.
The product depends on a specific model, adapter, or inference behavior.
Latency matters and the application sits close to the model endpoint.
The team wants predictable monthly infrastructure cost instead of a per-token curve.

The strongest cases are usually boring business cases: call-center analysis, clinical or legal document workflows, internal search over sensitive files, batch processing, and products that call a model on almost every user action.

When hosted APIs still win

Hosted APIs are still the better answer when usage is spiky, the product is early, or the team changes models every week. They also win when nobody on the team can own the stack. A private model endpoint with no operator is just another production system waiting for a bad night.

For early projects, we often recommend starting hosted, measuring real usage, then moving the stable workload to private infrastructure once the shape is known. The measurements matter more than opinions.

How PhotonSpark fits

Our Private AI service is for teams that want the private endpoint without hiring a full infrastructure team around it. We provision the hardware, deploy the model stack, monitor it, and keep the deployment inside the agreed environment. Depending on the project, that can mean EU data-center hosting, dedicated GPUs, or an on-prem installation.

The useful conversation starts with five numbers: expected requests per day, average input tokens, average output tokens, target latency, and retention requirements for logs and prompts. With those, we can size the system without guessing.

If you already have those numbers, send them. If you do not, we can help measure them during discovery.

Build a quick sizing worksheet

Before buying GPUs, measure the workload you already have.

On an existing inference box, start with hardware facts:

bash

nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,power.draw --format=csv -l 1
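
If you redirect the output of the second command to a file, a few lines of Python turn it into the numbers you need for sizing. This is a sketch that assumes the four columns are in the order queried above and that values carry unit suffixes such as `%`, `MiB`, and `W`. Header lines and rows with unavailable values are skipped. On a multi-GPU box every GPU contributes a row per sample, so add `index` to the query and split by it before trusting the averages.

```python
def summarize_gpu_log(text: str) -> dict:
    """Summarize CSV samples of gpu util, mem util, memory used, power draw."""
    rows = []
    for line in text.splitlines():
        # Keep the numeric part of each field, e.g. "30000 MiB" -> "30000".
        fields = [f.strip().split(" ")[0] for f in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            continue  # header line or a field reported as unavailable
    if not rows:
        raise ValueError("no samples parsed")
    util, _mem_util, mem_used, power = zip(*rows)
    return {
        "samples": len(rows),
        "avg_gpu_util_pct": sum(util) / len(util),
        "peak_memory_used_mib": max(mem_used),
        "avg_power_w": sum(power) / len(power),
        "peak_power_w": max(power),
    }


sample = """utilization.gpu [%], utilization.memory [%], memory.used [MiB], power.draw [W]
80 %, 40 %, 30000 MiB, 300.00 W
90 %, 50 %, 32000 MiB, 320.00 W
"""
print(summarize_gpu_log(sample))
```

Peak memory and average power feed the sizing and power estimates directly. Average utilization tells you whether the current box is busy enough to justify a bigger one.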

Measure model storage and cache pressure:

bash

du -sh /models/*
watch -n 1 nvidia-smi

Track tokens and requests for one normal day:

text

requests_per_day =
average_input_tokens =
average_output_tokens =
peak_concurrent_requests =
target_ttft_seconds =
target_tokens_per_second =
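
Most of those fields can be filled from a request log. The sketch below assumes you can export one record per request with a start time, an end time, and token counts. The `Request` shape is hypothetical, so map your gateway or model server logs onto it. Peak concurrency is computed with a simple sweep over start and end events.

```python
from typing import NamedTuple


class Request(NamedTuple):
    start_s: float       # request start, seconds since any fixed origin
    end_s: float         # request end, same clock
    input_tokens: int
    output_tokens: int


def summarize_requests(requests: list[Request]) -> dict:
    """Compute request count, average token sizes, and peak concurrency."""
    n = len(requests)
    if n == 0:
        raise ValueError("empty request log")
    events = []
    for r in requests:
        events.append((r.start_s, 1))
        events.append((r.end_s, -1))
    # Tuple sort puts -1 before +1 at equal timestamps, so back-to-back
    # requests are not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return {
        "requests": n,
        "average_input_tokens": sum(r.input_tokens for r in requests) / n,
        "average_output_tokens": sum(r.output_tokens for r in requests) / n,
        "peak_concurrent_requests": peak,
    }


log = [Request(0, 10, 100, 50), Request(5, 12, 200, 100), Request(10, 20, 300, 150)]
print(summarize_requests(log))
```

Run it over a log that covers one normal day and the request count is your `requests_per_day`. The latency targets still have to come from the product, not from the log.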

Then estimate monthly cost with a plain formula:

text

monthly_cost = hardware_amortization + colo_or_hosting + power + support_time
daily_tokens = requests_per_day * (average_input_tokens + average_output_tokens)
monthly_tokens = daily_tokens * active_days_per_month
cost_per_1k_tokens = (monthly_cost / monthly_tokens) * 1000
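
The same formula as a small function, so the inputs can be varied quickly. Two things here are our assumptions rather than part of the formula: hardware is amortized in a straight line over a chosen number of months, and daily tokens are taken as requests per day times the sum of average input and output tokens. All example figures are made up.

```python
def cost_per_1k_tokens(hardware_price: float, amortization_months: int,
                       colo_or_hosting: float, power: float, support_time: float,
                       requests_per_day: float, average_input_tokens: float,
                       average_output_tokens: float,
                       active_days_per_month: int = 30) -> float:
    """Blended private cost per 1,000 tokens, following the worksheet formula."""
    # Straight-line amortization is an assumption; use your own schedule.
    hardware_amortization = hardware_price / amortization_months
    monthly_cost = hardware_amortization + colo_or_hosting + power + support_time
    daily_tokens = requests_per_day * (average_input_tokens + average_output_tokens)
    monthly_tokens = daily_tokens * active_days_per_month
    return monthly_cost / monthly_tokens * 1000


# Hypothetical inputs: 36,000 of hardware over 36 months, 500 hosting,
# 300 power, 1,200 of support time, 50,000 requests a day at 800 in / 200 out.
print(cost_per_1k_tokens(36000, 36, 500, 300, 1200, 50_000, 800, 200))
```

The result is only as good as the `support_time` figure. If that line is a guess, run the function with a pessimistic value as well and see whether the case survives.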

Compare that to hosted API spend only after you include operations. A private endpoint that saves token cost but needs constant senior attention may still be expensive.

The minimum useful benchmark table is:

Model	Quant	GPU	Context	Concurrency	TTFT	Tokens/sec	Peak VRAM

If you cannot fill that table, you are not ready to buy hardware yet.
