A cost model for patient-level healthcare AI: $1M for locally deployed Medical LLM vs. $13M to $30M via frontier APIs

Jun 20, 2026

Healthcare AI budgets break on one assumption: that per-token API pricing works at patient-population scale. Model a real workload (building a de-identified oncology real-world evidence dataset at 1 million tokens per patient, with four processing passes) and the gap between API and local deployment passes an order of magnitude at population scale and keeps growing with volume. At 10,000 patients, a locally deployed medical LLM costs about $95,000 all-in against $128,000 to $300,000 in API fees. At 1 million patients, it is roughly $1 million against $13 million to $30 million. This post walks through the model, every assumption behind it, and why the gap widens as you grow.

The workload: building a real-world evidence dataset from a full patient history

Most cost conversations about LLMs start from the wrong unit. Pricing pages quote dollars per million tokens, so teams estimate a few prompts, multiply, and conclude the API route is cheap. That arithmetic holds for a chatbot. It collapses for patient-level data work, because the unit of work in healthcare is not a prompt. It is a patient, and a patient is a large object.

A typical cancer patient’s record runs to thousands of pages: clinical notes, pathology reports, radiology reports, surgical notes, genomic tests, treatment records, often spanning years. For the model below I use 1 million tokens of data per patient, which is less than a typical cancer patient generates and more than a typical chronic disease patient does. It is a deliberately middle-of-the-road figure, chosen so the model neither flatters nor punishes either deployment option.

The workload itself is building an oncology real-world evidence (RWE) dataset, the kind used for external control arms, treatment-pattern and outcomes studies, and regulatory submissions. I presented this work in detail at PHUSE US Connect 2026, where our paper on automating it won the RWE Catalyst Challenge. Building an RWE dataset is a useful benchmark workload for cost modeling for two reasons. First, it’s real: pharma and health systems spend heavily to curate these datasets, manual chart abstraction runs about two hours per case, and the lag from raw records to a research-ready dataset is measured in months. Second, it’s representative: the same pattern of reading everything, extracting facts, reasoning across documents, and producing audited output describes most serious clinical data projects. Clinical trial matching reads the full record to test eligibility criteria. Question answering for clinicians reads the full record to answer reliably. Cancer registry abstraction, quality measures, risk adjustment, and referral determination all start the same way: from the complete patient story, not from a summary of it. Model the RWE workload and you have modeled the cost of much of the patient-level data roadmap.

Four passes over every document

The model assumes a four-step pipeline, with each step visiting all of a patient’s documents:

1. De-identification. Masking PHI to create a research-ready dataset.
2. Extraction. Structuring biomarkers, staging, histology, and treatments from raw text.
3. Summarization and reasoning. Synthesizing the patient journey, with the clinical reasoning behind the timeline written out.
4. Conflict resolution. Resolving discrepancies between documents, with an explicit chain-of-thought explanation for each final chosen value.

Step four is not optional. A regulatory-grade dataset must be explainable: a reviewer or auditor needs to see why the system chose July 2002 as the diagnosis date when a later note says September. Producing that reasoning costs tokens. The model assumes output tokens equal to 10% of input tokens at each step, which in my experience is conservative for reasoning-heavy clinical work.

So the total demand is 1 million tokens per patient, times four passes, plus 10% output on each pass, times however many patients you have. The model runs that demand at three volumes: 10,000 patients (a pilot or a single service line), 100,000 (a small health system or a focused research cohort), and 1,000,000 (a midsize healthcare system, a payer, or a multi-site research network).
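The demand arithmetic above can be written out as a short sketch. The constants come from this post; the function name `token_demand` is hypothetical and used only for illustration.

```python
# Workload constants as stated in the post.
TOKENS_PER_PATIENT = 1_000_000  # input tokens in one patient's record
PASSES = 4                      # de-identify, extract, summarize, resolve conflicts
OUTPUT_PERCENT = 10             # output tokens as a percentage of input, per pass


def token_demand(patients: int) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for a cohort of the given size."""
    input_tokens = patients * TOKENS_PER_PATIENT * PASSES
    output_tokens = input_tokens * OUTPUT_PERCENT // 100
    return input_tokens, output_tokens


for patients in (10_000, 100_000, 1_000_000):
    input_tokens, output_tokens = token_demand(patients)
    print(f"{patients:>9,} patients: {input_tokens:,} input + {output_tokens:,} output tokens")
```

At 10,000 patients this gives 40 billion input tokens and 4 billion output tokens, and each tenfold step in patients multiplies both by ten.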

The assumptions: mid-range data volumes, list prices, every API discount granted

A cost model is only as good as its stated assumptions, so here are the rest of mine.

On the API side, I used current list prices for GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.6 as of the 2026 Applied Healthcare AI Summit (April 2026). Two adjustments roughly cancel each other, and the model treats them as a wash: enterprise agreements for healthcare (HIPAA compliance, a BAA, zero-retention terms) typically raise the price, while volume commitments at these token counts earn discounts. I also assume competent pipeline engineering that chunks documents to avoid the long-context surcharge several frontier providers apply to large-window queries, which would otherwise double parts of the bill. In other words, the API numbers below assume you do everything right.

On the local side, the John Snow Labs figures are all-in: model licensing plus the cloud infrastructure to run it, sized at one node for 10,000 patients, five nodes for 100,000, and fifteen nodes for 1,000,000. There is no per-token component, because John Snow Labs’ Medical LLM is licensed per server per year. A node that is paid for processes its millionth token at the same marginal cost as its first: zero.

One assumption deliberately favors the API side. The model has every step visiting every document with a large model. In production, our pipelines put small, task-specific clinical language models in front of the LLM to filter the roughly 96% of documents that are noise for a given task, which cuts the expensive token volume by an order of magnitude. The model below skips that optimization. Even paying full freight on every page, the conclusion holds.
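As a rough sketch of what such a filter would do to a single pass, assume (hypothetically) that tokens are spread evenly across documents, so that dropping 96% of documents drops 96% of tokens. The function name is illustrative and not part of any pipeline described here.

```python
def tokens_after_filter(tokens_per_pass: int, noise_percent: int = 96) -> int:
    """Tokens left for the large model once noise documents are dropped.

    Assumes every document carries the same number of tokens, which real
    patient records will not satisfy exactly.
    """
    return tokens_per_pass * (100 - noise_percent) // 100


remaining = tokens_after_filter(1_000_000)
print(f"{remaining:,} of 1,000,000 tokens reach the large model")
print(f"reduction factor: {1_000_000 / remaining:.0f}x")
```

Under that even-spread assumption, a 96% filter leaves 40,000 of the 1,000,000 tokens in a pass, a 25x reduction.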

The results: 1.35x to 3.17x at pilot scale, 12.27x to 28.75x at a million patients

At 10,000 patients, the local deployment runs on a single node at $94,766 all-in. The same workload costs $160,000 on GPT-5.4 (1.69x), $128,000 on Gemini-3.1-Pro (1.35x), and $300,000 on Claude-Opus-4.6 (3.17x). At pilot scale, the API route is competitive. A 1.35x premium buys you zero infrastructure work, and for a short-lived proof of concept that trade can be rational.
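Each API figure is token demand multiplied by a per-token price. The sketch below shows that arithmetic. The per-million-token prices in it are assumptions for illustration, not quoted anywhere in this post; they were picked because they reproduce the three pilot-scale totals above. Check current provider price lists before reusing them.

```python
# ASSUMED list prices in USD per million tokens: (input, output).
# Illustrative values only; they are not stated in the post.
ASSUMED_PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "Gemini-3.1-Pro": (2.00, 12.00),
    "Claude-Opus-4.6": (5.00, 25.00),
}


def api_cost(
    patients: int,
    input_price: float,
    output_price: float,
    tokens_per_patient: int = 1_000_000,
    passes: int = 4,
    output_ratio: float = 0.10,
) -> float:
    """Total API cost in USD for the four-pass workload."""
    input_millions = patients * tokens_per_patient * passes / 1_000_000
    output_millions = input_millions * output_ratio
    return input_millions * input_price + output_millions * output_price


for model, (input_price, output_price) in ASSUMED_PRICES.items():
    print(f"{model}: ${api_cost(10_000, input_price, output_price):,.0f}")
```

Because the function is linear in `patients`, the 100,000 and 1,000,000 patient figures are these totals multiplied by 10 and by 100.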

At 100,000 patients, the picture changes. Five nodes cost $389,830. The API equivalents are $1.6 million on GPT-5.4 (4.10x), $1.28 million on Gemini-3.1-Pro (3.28x), and $3 million on Claude-Opus-4.6 (7.70x). The cheapest API option now costs about $890,000 more than local deployment, for one workload, in one year.

At 1,000,000 patients, the divergence is no longer a premium. It is a different category of spending. Fifteen nodes cost $1,043,490. The same tokens through the APIs cost $16 million on GPT-5.4 (15.33x), $12.8 million on Gemini-3.1-Pro (12.27x), and $30 million on Claude-Opus-4.6 (28.75x). Averaged across providers, that is the pattern I summarized in my keynote: about 2x cheaper for a pilot, 5x for a small system, and 18x for a midsize healthcare system.
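Each multiple quoted in this section is an API cost divided by the local cost at the same volume. A minimal sketch that recomputes them from the dollar figures quoted above (the dictionary layout and the function name `multiple` are illustrative):

```python
# All-in local cost in USD per patient volume, as quoted in the post.
LOCAL_COST = {10_000: 94_766, 100_000: 389_830, 1_000_000: 1_043_490}

# API cost in USD per model and patient volume, as quoted in the post.
API_COST = {
    "GPT-5.4": {10_000: 160_000, 100_000: 1_600_000, 1_000_000: 16_000_000},
    "Gemini-3.1-Pro": {10_000: 128_000, 100_000: 1_280_000, 1_000_000: 12_800_000},
    "Claude-Opus-4.6": {10_000: 300_000, 100_000: 3_000_000, 1_000_000: 30_000_000},
}


def multiple(model: str, patients: int) -> float:
    """How many times the local cost the API route costs at this volume."""
    return API_COST[model][patients] / LOCAL_COST[patients]


for patients in LOCAL_COST:
    row = ", ".join(f"{model} {multiple(model, patients):.2f}x" for model in API_COST)
    print(f"{patients:>9,} patients: {row}")
```

Keeping the inputs in two small tables makes it easy to swap in updated prices or node costs and rerun the comparison.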

Two things are worth reading off these numbers beyond the headline multiples. First, the local cost grows sublinearly, for two reasons: a 100x increase in patients needs only 15x the hardware, and cost per node falls from about $94,800 at one node to about $69,600 at fifteen as licensing scales with volume. Second, the API cost grows exactly linearly, because that is what per-token pricing means. Those two curves can only spread apart.
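The two growth rates can be checked directly from the node counts and local costs quoted above. This is a small sketch; the names `LOCAL`, `cost_per_node`, and `growth` are illustrative.

```python
# Patient volume -> (nodes, all-in local cost in USD), as quoted in the post.
LOCAL = {
    10_000: (1, 94_766),
    100_000: (5, 389_830),
    1_000_000: (15, 1_043_490),
}


def cost_per_node(patients: int) -> float:
    """Average all-in cost of one node at the given patient volume."""
    nodes, cost = LOCAL[patients]
    return cost / nodes


def growth(patients: int, baseline: int = 10_000) -> tuple[float, float]:
    """Return (patient growth, local cost growth) relative to the baseline.

    Under per-token pricing, API cost growth equals patient growth.
    """
    return patients / baseline, LOCAL[patients][1] / LOCAL[baseline][1]


for patients in LOCAL:
    patient_x, local_x = growth(patients)
    print(
        f"{patients:>9,} patients: ${cost_per_node(patients):,.0f} per node, "
        f"API cost x{patient_x:.0f}, local cost x{local_x:.1f}"
    )
```

From pilot to a million patients, the patient count, and therefore the API bill, grows 100x, while the local all-in cost grows about 11x.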

Why the gap widens: token pricing is linear, infrastructure is not

The structural point matters more than any single number, because list prices will change and the multiples will move. Per-token pricing makes cost a linear function of data volume. Healthcare data volume is enormous and growing: more documents per patient every year, more modalities, more passes as workflows add reasoning and verification steps. A pricing model that charges per unit of data places its worst-case cost exactly where healthcare AI creates the most value, which is processing everything rather than sampling.

That last distinction deserves a sentence of its own. When the marginal token costs money, teams ration. They process the discharge summaries but not the nursing notes, the last two years but not the full history, a sample of the population but not all of it. Every one of those rationing decisions degrades the output: studies miss cases, cohorts miss patients, and timelines miss the event that explains the outcome. When the marginal token is free, the rational behavior flips. You process every page of every record for every patient, every time the pipeline improves, and rerun whenever a model or guideline updates. Fixed-cost infrastructure does not just lower the bill. It changes what the team is willing to do with the data.

There is also a budgeting argument that CFOs tend to appreciate more than data scientists do. A per-server cost is a number you can put in next year’s budget. A per-token cost is a forecast, and forecasts of token consumption have a way of being wrong by multiples once a project succeeds and other teams want to use it. Predictability is worth something independent of the average price, and at these volumes it is worth a lot.

What this model does not claim

The model is about cost, and cost is the second question, accuracy being the first. A cheap model that produces a dataset clinical and regulatory reviewers will not sign off on is worth nothing. That argument is made separately, with evidence: our Medical LLM currently ranks first or tied for first across 13 clinical and biomedical benchmarks against GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.6, and the PHUSE study reached 98.4% accuracy on primary site against certified tumor registrar ground truth. The cost case stands on top of the accuracy case, in that order.

The model also excludes engineering time on both sides, on the grounds that both routes need pipeline work and the difference is smaller than commonly assumed: a well-packaged local deployment installs from a cloud marketplace, while a well-engineered API pipeline needs chunking, retry, and rate-limit logic of its own. It excludes the cost of data egress reviews, privacy assessments, and BAA negotiations that API routes trigger and local deployment inside your own environment largely avoids; counting those would widen the gap further. And it freezes prices at a point in time. API prices fall, GPU prices fall, and anyone using this model a year from now should rerun it with current numbers. The assumptions are stated precisely so that you can.

What it means for planning an AI budget

If you are budgeting a healthcare AI initiative, the practical guidance falls out of the three scales. At pilot volume, choose on accuracy, privacy, and speed to start; the cost difference is real but not decisive. At anything resembling production volume, the deployment model is the cost decision, and it dwarfs the choice between API vendors. And if your roadmap ends at population scale, per-token pricing is the line item that will eventually force a re-architecture, so it is cheaper to model that now than to discover it in year two.

The full cost model, with the table and the workload it comes from, is in my keynote from the 2026 Applied Healthcare AI Summit.

FAQ

Why does the cost gap widen with scale?

Per-token pricing is linear in data volume, while per-server licensing plus infrastructure grows sublinearly: a 100x increase in patients required only 15x the nodes in this model. Two curves with those shapes diverge as volume grows, so the multiple for the cheapest API grows from 1.35x at pilot scale to 12.27x at a million patients, and for the most expensive from 3.17x to 28.75x.

Wouldn’t filtering documents reduce API costs too?

Yes, and well-built API pipelines should filter aggressively. This model deliberately skips filtering on both sides to keep the comparison clean. Filtering helps the fixed-cost deployment less, because its marginal token already costs nothing, so adding it narrows the gap somewhat at small scale and barely at all at large scale.

What about cheaper API models instead of the flagships?

Smaller API models cut the per-token price but give up accuracy, and in regulated clinical work accuracy is the constraint: a dataset that a clinical reviewer will not validate has no value at any price. The relevant comparison is between options that clear the accuracy bar, and the benchmark data shows which those are.

Is 1 million tokens per patient realistic?

It is a middle estimate: below a typical cancer patient, above a typical chronic disease patient. If your population averages 200,000 tokens per patient, divide the API figures by five. The local figure also falls, since a smaller token volume needs fewer nodes, but it falls in node-sized steps and not in strict proportion, so the multiples should be recomputed at your own volume.

What would change the conclusion?

A structural change in API pricing, such as flat-rate enterprise tiers with unmetered tokens at these volumes, would change it. Price cuts alone do not: a 50% cut at the million-patient scale turns $16 million into $8 million against $1 million, and the linear-versus-sublinear geometry remains.
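The price-cut arithmetic generalizes to any discount. A minimal sketch using the GPT-5.4 and fifteen-node figures quoted in this post (the function names are illustrative):

```python
LOCAL_COST = 1_043_490  # fifteen nodes, all-in, as quoted in the post
API_COST = 16_000_000   # GPT-5.4 at one million patients, as quoted in the post


def multiple_after_cut(price_cut: float) -> float:
    """API-to-local cost multiple after cutting API prices by price_cut (0 to 1)."""
    return API_COST * (1 - price_cut) / LOCAL_COST


def break_even_cut() -> float:
    """Price cut at which the API bill would equal the local bill."""
    return 1 - LOCAL_COST / API_COST


print(f"multiple after a 50% cut: {multiple_after_cut(0.5):.2f}x")
print(f"cut needed for parity: {break_even_cut():.1%}")
```

With these figures a 50% cut still leaves the API route at about 7.7x the local cost, and parity would need a cut of roughly 93%.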