When smaller wins: the size calculus for generative AI in regulated work
Originally published October 2024 in CIO.
The default assumption about generative AI, that bigger models are better models, is a legacy of the first two years of LLM scaling. For general-purpose conversational work, the assumption still mostly holds. For the specialized, high-volume, high-accuracy work that regulated enterprises actually run, the assumption has been flipping since at least 2023, and the 2024 evidence makes the flip operational rather than theoretical. Smaller, domain-specific, task-oriented language models now routinely outperform frontier LLMs on the tasks those enterprises care about, at a fraction of the inference cost, while clearing compliance requirements that the large models cannot. Company size shapes the picture: large enterprises have different priorities than mid-size or small companies, and the right model strategy looks different for each.
The default assumption, restated
The scaling hypothesis that drove model sizes from hundreds of millions of parameters to hundreds of billions was straightforward: more parameters plus more training data plus more compute produces better general-purpose language capability. For open-ended conversation, creative writing, and broad question-answering, that hypothesis has held up. Frontier general-purpose models are genuinely better at these tasks than smaller predecessors, and the gap shows up on the benchmarks that measure general-purpose capability.
The problem is that most enterprise workloads are not general-purpose capability problems. They are narrow, repetitive, high-accuracy tasks: extracting structured information from documents, classifying records, matching entities across systems, generating standardized outputs. On these, the scaling hypothesis is not the right mental model. What matters is how well the model handles the specific task on the specific data the enterprise runs, at the cost and latency the production system can absorb. And on those metrics, the evidence that accumulated through 2023 and 2024 points consistently toward specialization.
What the evidence says
Three classes of evidence are worth separating.
Peer-reviewed extraction benchmarks. A 2024 JAMIA study on the 2010 i2b2 clinical-concept extraction benchmark measured GPT-4 at F1 0.804 with baseline prompting and 0.861 with a carefully engineered four-component prompt framework. BioClinicalBERT, a 110-million-parameter domain-specific model released years earlier, reached 0.901 on the same benchmark with no prompt engineering. On the VAERS adverse-event corpus, GPT-4 with careful prompting reached 0.736; BioClinicalBERT reached 0.802. A 2024 *Bioinformatics* paper showed a 7-billion-parameter LLaMA fine-tuned on biomedical NER outperforming few-shot GPT-4 by 5 to 30 F1 points across three standard datasets. These are not selective results — they’re a consistent picture across independent evaluations.
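For readers less familiar with the metric, the F1 figures quoted above are harmonic means of precision and recall over extracted entities. The following is a minimal sketch of exact-match span scoring, not the scoring script used by any of the cited studies. The `Span` representation and the sample spans are invented for illustration and are not drawn from i2b2 or VAERS.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, label)
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1 over entity spans: a span counts only if offsets and label all agree."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(gold)          # share of gold entities that were found
    return 2 * precision * recall / (precision + recall)


# Invented example: four gold entities; the model finds three of them plus one spurious span.
gold = {(0, 12, "problem"), (20, 29, "treatment"), (35, 41, "test"), (50, 62, "problem")}
predicted = {(0, 12, "problem"), (20, 29, "treatment"), (35, 41, "test"), (70, 75, "test")}

score = span_f1(gold, predicted)
print(f"F1 = {score:.3f}")
```

Published benchmarks sometimes use relaxed (overlap-based) matching instead of exact matching, so scores are only comparable when the matching rule is the same.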
Blind clinician preference evaluations. A 2025 JMIR AI paper (Kocaman et al., “CLEVER”) reported a blind, randomized, preference-based evaluation by practicing medical doctors comparing GPT-4o against healthcare-specific LLMs (8-billion-parameter and 70-billion-parameter variants) on clinical text summarization, clinical information extraction, and biomedical question answering. On each of three dimensions (factuality, clinical relevance, and conciseness) the medical doctors preferred the smaller medical LLM between 45% and 92% more often than GPT-4o. The 8B-parameter variant is roughly two orders of magnitude smaller than the frontier model it was compared against. It was preferred by clinicians anyway.
Practitioner behavior. The 2024 Generative AI in Healthcare Survey (Gradient Flow, 304 respondents) showed 36% of respondents using healthcare-specific small models and another 21% using general open-source small models — a combined 57% running on small, often specialized models. Frontier general-purpose LLMs were not the default choice. A follow-up in the same survey series showed 54% of large-company respondents specifically preferring healthcare-specific task-oriented models over general-purpose LLMs. Practitioners deploying these systems are voting with their pipelines.
The pattern is not that large models are bad. It’s that for the specialized work regulated enterprises run (entity extraction, classification, terminology mapping, structured summarization, narrow question-answering over curated knowledge) specialized smaller models are often a better choice on the metrics that decide whether the system ships: accuracy on the specific task, inference cost per record at production volume, latency inside the operational SLA, and the ability to run inside the customer’s environment under regulatory constraints.
Why specialization wins on regulated work
Five reasons, each measurable.
Domain data is a structural advantage. Medical language is specialty-dependent in ways that general-web training data underrepresents. “RA” is rheumatoid arthritis to a rheumatologist and right atrium to a cardiologist. “MS” is multiple sclerosis in neurology and mitral stenosis in cardiology. Domain-tuned models have seen these distinctions at scale with labeled context; general models have seen them diluted among everything else. The 2024 *Bioinformatics* paper on biomedical NER traced its fine-tuned open models’ advantage over GPT-4 directly to this: specialized training teaches the model which interpretation is meant in which specialty, which is exactly what clinical extraction requires.
Sequence-labeling is a structural mismatch for generation-first models. Many high-value regulated tasks (named entity recognition, assertion status classification, relation extraction, de-identification) are fundamentally sequence-labeling problems. LLMs trained primarily as text generators solve these awkwardly, because the objective mismatch between “generate fluent text” and “label spans in existing text” shows up as over-confident labeling of non-entity spans and under-recall on unusual entities. Encoder models trained with labeling objectives handle the same tasks natively. This is a well-documented pattern in the literature; the 2023 “GPT-NER” paper and subsequent work on biomedical NER consistently find the same structural gap.
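To make “label spans in existing text” concrete, here is a small sketch of BIO tagging, a common encoding for token-level labeling targets. The sentence, the span positions, and the label names are invented for illustration. The function `spans_to_bio` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert token-index spans (start, end_exclusive, label) into one BIO tag per token."""
    tags = ["O"] * len(tokens)           # "O" marks tokens outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # continuation tokens of the same entity
    return tags


# Invented example sentence with two entities.
tokens = ["Patient", "denies", "chest", "pain", "after", "starting", "metoprolol"]
spans = [(2, 4, "PROBLEM"), (6, 7, "TREATMENT")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:12s}{tag}")
```

An encoder trained with a labeling objective predicts one such tag per input token, so its output is tied to the source text by construction. A generative model has to reproduce the text and the boundaries as free-form output, which then has to be parsed and aligned back to the original.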
Inference cost tilts sharply at production volume. Running a frontier LLM over every clinical note in a hospital’s daily ingest is economically unattractive even when it’s technically possible. A domain-tuned 100-million-parameter model runs on a single commodity GPU at thousands of records per minute. A 70-billion-parameter model running over an API bills per token and introduces per-request latency that can turn a three-hour batch job into a three-day one. For hospital systems processing 500,000 notes per quarter, or pharma safety functions processing millions of adverse-event records annually, the cost gap is not a marginal 10x; it can reach 1,000x, which is the difference between a system that gets deployed and one that stays an experiment.
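The three-hour versus three-day contrast is easy to reproduce with back-of-envelope arithmetic. The sketch below is illustrative only: the batch size, local throughput, API latency, and concurrency limit are assumed placeholder values, not measurements from any vendor or hospital.

```python
def batch_hours(records: int, records_per_minute: float) -> float:
    """Wall-clock hours to process a batch at a steady throughput."""
    return records / records_per_minute / 60.0


def api_records_per_minute(latency_seconds: float, concurrent_requests: int) -> float:
    """Effective throughput when each request handles one record and concurrency is capped."""
    return concurrent_requests * 60.0 / latency_seconds


# All four values below are assumptions chosen for illustration.
RECORDS = 360_000          # size of the batch
LOCAL_RPM = 2_000.0        # "thousands of records per minute" on a single GPU
API_LATENCY_S = 6.0        # seconds per API request
API_CONCURRENCY = 8        # parallel requests allowed by the rate limit

local_hours = batch_hours(RECORDS, LOCAL_RPM)
api_hours = batch_hours(RECORDS, api_records_per_minute(API_LATENCY_S, API_CONCURRENCY))

print(f"local model: {local_hours:.1f} hours")
print(f"API model:   {api_hours:.1f} hours ({api_hours / 24:.1f} days)")
```

Under these assumptions the local batch finishes in 3 hours and the API batch takes 75 hours, a little over three days. Higher rate limits or shorter notes would narrow the gap, so the inputs should be replaced with measured values before drawing conclusions.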
In-environment deployment is a compliance and data-sovereignty requirement. For most regulated buyers, sending clinical notes, contract text, or financial records to a third-party cloud API is a non-starter regardless of accuracy. HIPAA, GDPR, and the US state-privacy-law patchwork have made on-premises or private-cloud-in-customer-tenant deployment a hard procurement requirement. Smaller specialized models are designed to run in this architecture. Frontier models typically are not, and even when they can be privately deployed, the compute costs change the economics decisively.
Freshness and update cadence are manageable on specialized models. Domain terminologies, clinical guidelines, regulatory requirements, and compliance rules change continuously. A specialized model that can be fine-tuned weekly with new annotated data and redeployed in hours is a different operational animal than a frontier model where the customer has no control over training cadence and has to adapt prompts as the vendor ships updates. For high-compliance workflows where traceability matters, the control over update cadence is the governance mechanism.
How company size shapes the right strategy
The 2024 survey data showed distinct patterns by company size, and the patterns map to real differences in where the biggest returns sit.
Large companies (5,000+ employees). The pattern is substantial budget, serious investment, and a preference for healthcare-specific task-oriented models (54% in the survey) combined with heavy use of proprietary LLMs via SaaS APIs for the exploratory and conversational layer. The right play for a large organization is composition: specialized small models for the high-volume production work inside the firewall, frontier LLMs for reasoning and conversation on top, with strict governance about what data flows where. Large companies also have the scale to justify building or licensing domain-specific models tuned on their own data, an expensive capability that pays back at their volume. The testing priorities that matter most for large companies are fairness and private-data leakage, the failure modes that create the largest reputational and regulatory exposure at scale.
Mid-size companies (501–5,000 employees). These organizations were the most experimental in the survey, with 24% actively developing AI models and 36% reporting 50–100% budget increases. They typically have enough volume to justify serious AI investment but not enough to build foundational models from scratch. The pragmatic path is picking the right specialized models off the shelf, investing in the internal harmonization and pre-processing layers that make those models work on their specific data, and using frontier LLMs selectively for tasks where the per-record cost can be justified. Mid-size companies benefit disproportionately from open-source small models because the total cost of ownership is predictable and the models run inside their own infrastructure.
Smaller companies (under 500 employees). The right strategy looks different. The investment that pays back for a smaller organization is usually a vertically integrated tool, a specialized product that solves one specific problem end-to-end, rather than a build-your-own-pipeline effort. Smaller companies’ testing priorities in the survey tilted toward bias and freshness, which reflects an operational reality: stale models and bias failures are what a smaller team notices first. Frontier LLMs via API often make sense for smaller companies on the exploratory side, because the volume doesn’t yet justify the fixed-cost investment in self-hosted specialized infrastructure.
None of these patterns is universal. The point is that the right size-of-model question depends on the size-of-company question, because the returns differ. The mistake is assuming the same architecture fits all three.
What the next 12 months look like
Three directional predictions follow from the 2024 evidence and the 2025 industry behavior already visible.
Specialized small models keep widening the task-specific gap. As domain-tuned models are fine-tuned on more operational data under human-in-the-loop feedback, their accuracy on the narrow tasks they handle continues to improve. Frontier general-purpose models improve too, but on a trajectory that optimizes general capability, not narrow-task accuracy on specialized data. The cross-over point has passed on most regulated extraction tasks. It’s not moving back.
Composition becomes the default architecture. The systems that ship are compositions: specialized models doing the high-volume work, frontier LLMs doing the reasoning on top, with clean interfaces between them. Neither pure-frontier-LLM nor pure-small-model architectures dominate. The architectural question is how to orchestrate both, which is an engineering problem with known answers.
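One way to picture the “clean interfaces” in a composed system is a small dispatcher that routes each task to a registered handler and refuses to send regulated data outside the environment. This is a sketch under stated assumptions: the task names, the `Route` type, and both handler functions are hypothetical stubs standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Route:
    handler: Callable[[str], str]
    in_environment: bool  # True if the handler runs inside the customer's own infrastructure


def local_ner(text: str) -> str:
    """Stub for a specialized model served inside the firewall."""
    return f"[local-ner] {len(text.split())} tokens tagged"


def frontier_reasoning(text: str) -> str:
    """Stub for a call to an external frontier LLM API."""
    return f"[frontier-llm] answered: {text[:30]}"


# Hypothetical task names mapped to the component that handles them.
ROUTES: Dict[str, Route] = {
    "entity_extraction": Route(local_ner, in_environment=True),
    "open_ended_question": Route(frontier_reasoning, in_environment=False),
}


def dispatch(task: str, text: str, contains_regulated_data: bool) -> str:
    """Route a request, blocking regulated data from leaving the environment."""
    route = ROUTES[task]
    if contains_regulated_data and not route.in_environment:
        raise PermissionError(f"task '{task}' would send regulated data outside the environment")
    return route.handler(text)


print(dispatch("entity_extraction", "Patient denies chest pain", contains_regulated_data=True))
print(dispatch("open_ended_question", "Summarize our policy options", contains_regulated_data=False))
```

The design choice worth noting is that the data-flow rule lives in the dispatcher, not in each model, so the governance check is applied uniformly no matter how many models are added.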
Governance shifts from monolithic to modular. Governing one frontier LLM that does everything is actually harder than governing a system of specialized models, each with its own scope, validation set, and audit trail. Regulators are moving in this direction too: the EU AI Act’s risk-classification framework effectively requires organizations to know what each model in their system is doing, on what data, with what validation. Systems built as compositions of well-scoped specialized models produce the governance artifacts regulators ask for more naturally than systems built as one giant model.
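The per-model scope, validation set, and audit trail described above can be captured in a very small data structure. The sketch below is one hypothetical shape for such a record. The field names, the example model name, and the file path are all invented for illustration and do not reflect any regulator's required format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ModelRecord:
    name: str                 # identifier of the deployed model
    scope: str                # the one task this model is approved for
    validation_set: str       # pointer to the dataset used to validate it
    audit_trail: List[str] = field(default_factory=list)

    def log(self, event: str) -> None:
        """Append a UTC-timestamped entry to this model's audit trail."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.audit_trail.append(f"{stamp} {event}")


# Hypothetical registry entry for one specialized model.
record = ModelRecord(
    name="clinical-ner-v3",
    scope="entity extraction on discharge summaries",
    validation_set="validation/discharge_ner_holdout.jsonl",
)
record.log("validated against holdout set")
record.log("deployed to production")

for entry in record.audit_trail:
    print(entry)
```

A monolithic model used for many tasks would need one such record per use, which is part of why the modular arrangement tends to produce cleaner artifacts.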
What to do differently
For enterprises planning 2024 and 2025 AI investments, four changes to the procurement conversation.
First, stop starting with “which frontier LLM should we use?” Start with “what is the task, at what volume, on what data, inside what compliance envelope?” The answer to that question usually nominates a specialized model for the core of the work, with frontier LLMs used where they genuinely fit.
Second, demand per-task accuracy benchmarks with peer-reviewed methodology. A single “99% accuracy” claim means nothing. Task-specific F1 scores, per-task inference cost, per-task latency, and per-task compliance posture are what decide whether a system works in production.
Third, budget for the harmonization layer alongside the model. Most of the accuracy in a regulated AI workflow comes from the pre-processing, terminology mapping, and human-in-the-loop feedback infrastructure around the model, not from the model itself. Under-investing in this layer is the most common reason pilot systems fail to generalize to production.
Fourth, match the architecture to your company size. Large enterprises should be building composition systems with governance-by-design. Mid-size companies should be buying specialized models and investing in internal harmonization. Smaller companies should be buying vertically integrated tools. The worst outcome for any of the three is pretending you’re one of the others.
The bigger-is-better heuristic was a useful shortcut when general-purpose capability was the scarce resource. It isn’t anymore. For regulated work, specialized is better, smaller is cheaper, and in-environment is table stakes. The organizations whose AI strategy reflects that reality are the ones whose AI budgets show returns.
FAQ
Are large language models actually getting less useful over time?
No. Frontier LLMs continue to improve on general-purpose tasks and on reasoning and summarization benchmarks. The claim is narrower: on the specialized, high-volume, high-accuracy tasks regulated enterprises run, smaller domain-tuned models consistently outperform them. Both things are true simultaneously, and the right system uses each for what it’s good at.
Is the small-model advantage limited to healthcare?
No, though healthcare is where the evidence base is thickest because of the peer-reviewed literature. The same pattern shows up in legal (contract NER and clause extraction), financial services (transaction classification and AML signal detection), and industrial settings (domain-specific document processing). Any domain with specialized vocabulary, high-volume extraction needs, and regulatory constraints shows the same structural advantage for specialized models.
How much cheaper is a small specialized model at production volume?
Usually 50x to 1,000x on inference cost, depending on the task and the comparison. A 100-million-parameter model running on a single GPU processes thousands of records per minute at fixed hardware cost. A 70-billion-parameter model over an API bills per token at prices that, multiplied by production volume, put the per-record cost two to three orders of magnitude higher. The exact multiplier depends on the workload; the order-of-magnitude is consistent.
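The multiplier can be estimated for a specific workload with a few lines of arithmetic. In the sketch below, every input is an assumed placeholder (GPU price, throughput, tokens per record, API price), chosen only to show the calculation. None of it is a quoted vendor price.

```python
def self_hosted_cost_per_record(gpu_dollars_per_hour: float, records_per_minute: float) -> float:
    """Hardware cost per record for a model running at a steady throughput."""
    return gpu_dollars_per_hour / (records_per_minute * 60.0)


def api_cost_per_record(tokens_per_record: int, dollars_per_million_tokens: float) -> float:
    """Per-record cost when an API bills by token."""
    return tokens_per_record * dollars_per_million_tokens / 1_000_000.0


# Illustrative assumptions, not real prices.
local = self_hosted_cost_per_record(gpu_dollars_per_hour=1.00, records_per_minute=2_000)
api = api_cost_per_record(tokens_per_record=1_500, dollars_per_million_tokens=5.00)
multiplier = api / local

print(f"self-hosted: ${local:.6f} per record")
print(f"API:         ${api:.4f} per record")
print(f"ratio:       {multiplier:.0f}x")
```

With these placeholder inputs the ratio comes out at about 900x. The self-hosted figure leaves out staffing, idle capacity, and fine-tuning costs, so a full total-cost-of-ownership comparison would produce a smaller multiplier.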
What does “specialized” actually mean in practice?
Two things, and the best systems combine them. The first is domain-specific pre-training (or fine-tuning from a base model) on a corpus relevant to the field: biomedical literature, clinical notes, legal contracts, financial filings. The second is task-specific fine-tuning on labeled examples of the exact task the model will perform in production: clinical NER, contract-clause extraction, AML classification. A model that’s both domain-specific and task-specific outperforms a model that is only one or the other.
Will frontier models eventually close the gap on specialized tasks?
On some tasks, probably. On others, the structural mismatch between a generation-first architecture and a sequence-labeling objective suggests the gap will persist. The practical question for a 2024–2025 investment decision isn’t what will be true in 2027; it’s what works now. Right now, for regulated extraction and classification work, specialized smaller models win, and the governance, cost, and compliance advantages they bring are additive, not dependent on the pure-accuracy comparison.



