Self-Host an Open Model or Call an API? A Cost-and-Risk Decision Tree
"Should we run our own model or use an API?" is really four questions wearing one coat: a data question, a cost question, a latency question, and an operations question. Answer them in order and the decision usually makes itself.
Start with data, because it can end the discussion
If you are processing data that contractually or legally cannot leave your infrastructure — regulated health or financial records, or anything covered by a data-processing agreement that excludes third-party sub-processors — self-hosting is not a preference, it is a requirement. If your data is not sensitive, keep going; the other three questions are about economics, not compliance.
Then volume, because it flips the cost math
Hosted APIs price per token, which is excellent at low and spiky volume: you pay nothing when idle and avoid all setup. Self-hosting front-loads a fixed cost (a GPU you rent or own, plus the engineering to run it) that only pays off above a break-even throughput. The honest way to compare is to estimate monthly tokens, price the API at current rates, and price the equivalent GPU-hours — then add a realistic engineering cost to the self-host side. Teams routinely forget that last term, and it is often the largest. Below the break-even, the API is cheaper *and* less work; above it, self-hosting can save real money.
Then latency and control
Self-hosting can lower tail latency (no network hop, no shared-tenant queueing) and removes a dependency on someone else's uptime and rate limits. It also gives you version stability: a hosted model can change behavior under you, which matters if you have built evaluations around specific outputs. Against that, hosted providers ship frontier-capability models you cannot match locally and absorb the work of scaling, patching, and capacity.
Finally, operations — the question teams underweight
Self-hosting means you now own GPU drivers, inference-server upgrades, autoscaling, monitoring, and the 2 a.m. page when throughput collapses. If you do not have someone who wants to own that, the "cheaper" option is more expensive in the way that matters.
The hidden costs people forget
When you price the two options, the API side is usually quoted honestly (tokens times rate) and the self-host side is quoted dishonestly, because the GPU bill is visible and the human bill is not. Put real numbers on all of it: GPU rental or amortized hardware, yes, but also the engineer-days to stand up an inference server, the ongoing patching and driver upkeep, monitoring and on-call, and the opportunity cost of that person not building product. A useful rule of thumb: if self-hosting only wins on paper after you ignore engineering time, it does not actually win.
A worked sketch (use your own current numbers): at low volume, an API might cost a few hundred dollars a month with zero ops; a self-hosted GPU large enough to matter often starts higher *before* a single token is served, and still needs an owner. The crossover is real, but it sits further out than most teams guess — which is why "start on the API, move the heavy or sensitive paths later" is the default that ages well.
A decision tree you can actually follow
- Sensitive data that cannot leave your walls? → Self-host (or a vendor with a signed DPA and in-region, no-training guarantees).
- Not sensitive, but low or spiky volume? → API.
- High, steady volume *and* an owner for the ops? → Price both at your real token estimate; self-host if it clears break-even with margin.
- High volume but no ops owner? → API, and revisit when you can staff it.
A common and sensible end state is hybrid: prototype on an API, route sensitive or high-volume paths to a self-hosted open model once the numbers and the team justify it. Whatever you choose, write down the assumption that drove it — token volume, a compliance clause, a latency target — so you know what change should make you revisit.
The review point after the first month
Whichever path you choose, schedule a one-month review before the decision hardens into infrastructure. For an API path, compare actual token use with the estimate, count rate-limit incidents, and check whether the data terms still match your use case. For a self-hosted path, count GPU idle time, failed deploys, model reloads, and the number of hours spent on operations instead of product work. The right decision can change when volume becomes steady, when a privacy requirement appears, or when the team loses the person who understood the inference stack. A good LLM architecture decision includes the date when it should be challenged again.