What 304 healthcare AI practitioners said about their 2024 budgets, models, and worries
Originally published April 2024 in Hospital & Healthcare Management and Holistic Pulse, based on the 2024 Generative AI in Healthcare Survey conducted by Gradient Flow.
Early in 2024, 304 healthcare and life sciences practitioners filled out a detailed survey on how they were actually using generative AI: what budgets they had, what models they were picking, how they were evaluating vendors, where they were stuck. The results ran against the dominant narrative of the year. The story in the trade press was “healthcare is cautious about generative AI.” The story in the data was that healthcare was spending aggressively, was building with healthcare-specific models rather than frontier LLMs, and was weighting accuracy and privacy well above cost when evaluating options. For anyone deciding how to invest in 2024 and 2025, the survey is a useful check against the conference-stage version of the market.
The budget picture: not cautious
The survey’s headline number was a 300%+ year-over-year budget increase reported by nearly one-fifth of technical leaders. That is not a cautious industry. When the underlying distribution is laid out, the picture sharpens:
- 34% of all respondents reported a 10–50% increase in generative AI budgets versus 2023.
- 22% reported a 50–100% increase.
- 18% of technical leaders specifically reported a budget increase of more than 300%.
- Another 16% of technical leaders reported increases in the 100–300% range.
Company size shaped the pattern. Medium-sized companies were most likely to report 50–100% increases (36% of medium-sized respondents). Large companies were most likely to report the very large increases, with 12% seeing more than 300%, compared to 7% of medium-sized and 6% of small companies.
The pattern to read out of this is that the organizations with the most operational experience in healthcare AI are also the ones making the biggest bets. That’s a different signal than “healthcare is cautious.” It’s healthcare saying the tooling is finally good enough to justify the investment, and the teams that have been running small pilots for a while are now moving to scale.
The model picture: specialized beats general
The clearest finding from the survey on model choice was a pronounced preference for healthcare-specific models over general-purpose LLMs. Asked what kinds of language models they were using, 36% of respondents reported using healthcare-specific small models. Open-source LLMs came second at 24%, and open-source small models at 21%. Frontier general-purpose LLMs were not the default choice for this audience.
The scoring of evaluation criteria reinforced the pattern. Asked to rate the importance of each factor on a 1-to-5 scale, respondents gave these mean scores:
- Tuned specifically for healthcare: 4.03 mean
- Reproducibility: 3.91
- Legal and reputational risk: 3.89
- Explainability and transparency: 3.83
- Cost: 3.80
The noteworthy line in that list is the last one. Cost was the least important factor. The practitioners in this survey were willing to invest in high-quality, reliable models rather than cut corners on price, which is consistent with an industry that has absorbed what wrong answers actually cost in clinical or regulatory settings.
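One way to put these scores to work is as weights in a simple model-selection scorecard. The sketch below is illustrative only: the criterion means are copied from the list above, while `rank_criteria`, `weighted_score`, and the candidate ratings are hypothetical and not part of the survey.

```python
# Mean importance scores (1-5 scale) copied from the survey list above.
CRITERIA_MEANS = {
    "Tuned specifically for healthcare": 4.03,
    "Reproducibility": 3.91,
    "Legal and reputational risk": 3.89,
    "Explainability and transparency": 3.83,
    "Cost": 3.80,
}


def rank_criteria(means: dict[str, float]) -> list[str]:
    """Return criterion names ordered from most to least important."""
    return sorted(means, key=means.get, reverse=True)


def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of a candidate's 1-5 ratings, weighted by importance."""
    total_weight = sum(weights.values())
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


# Hypothetical ratings for one made-up candidate model (illustrative only).
candidate = {
    "Tuned specifically for healthcare": 5,
    "Reproducibility": 4,
    "Legal and reputational risk": 4,
    "Explainability and transparency": 3,
    "Cost": 2,
}

print(rank_criteria(CRITERIA_MEANS))
print(round(weighted_score(candidate, CRITERIA_MEANS), 2))
```

A buying team would substitute its own criteria and weights. The survey means are used here only as a starting point.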
The explicit preference for healthcare-specific models sits on top of an accumulating evidence base. By early 2024, peer-reviewed evaluations had consistently found that domain-tuned models outperformed general-purpose LLMs on clinical extraction tasks. A *JAMIA* study published in January 2024 measured GPT-4 at F1 0.804 on the 2010 i2b2 concept-extraction benchmark with baseline prompts, versus BioClinicalBERT at 0.901. A 2024 *Bioinformatics* paper found that fine-tuned open models outperformed few-shot GPT-4 on biomedical NER by 5 to 30 F1 points depending on the dataset. The practitioners in the survey were weighting their choices in line with where the evidence was actually pointing.
What practitioners were building
The use-case mix in the survey was skewed toward externally facing applications and information-extraction work, the places where generative AI reaches real volume in healthcare operations.
- Answering patient questions: 21%
- Medical chatbots: 20%
- Information extraction and data abstraction: 19%
What the mix doesn’t show, but what comes through in the open-ended responses, is how these systems are being built. Patient-facing Q&A systems and medical chatbots, done well, are not single-LLM deployments. They’re compositions: pre-processing pipelines that section and normalize clinical text, task-specific extraction models that pull structured findings, a longitudinal patient record that assembles the findings into a timeline, and a reasoning layer (which is where the LLM finally earns its place) that answers questions over the timeline with citations to sources. Information extraction follows the same pattern: specialized models at the bottom, with LLMs used for reasoning or summarization on top of clean inputs.
That architecture is what closes the gap between the accuracy healthcare practitioners need and the accuracy a frontier LLM delivers on raw clinical text. It’s also what the 36% of respondents using healthcare-specific small models are deploying in practice.
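The composition described above can be sketched as four small stages wired together. This is a toy illustration under stated assumptions, not any respondent's implementation: every function name is hypothetical, the notes are synthetic, a keyword rule stands in for the task-specific extraction model, and a string lookup stands in for the LLM reasoning layer.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    date: str       # ISO date of the source note
    section: str    # section heading the finding came from
    text: str       # normalized finding text
    source_id: str  # note identifier, kept so answers can cite their source


def split_sections(note: str) -> dict[str, str]:
    """Pre-processing stand-in: split 'HEADING: body' lines into sections."""
    sections = {}
    for line in note.splitlines():
        heading, sep, body = line.partition(":")
        if sep and body.strip():
            sections[heading.strip().lower()] = body.strip()
    return sections


def extract_findings(note_id: str, date: str, sections: dict[str, str]) -> list[Finding]:
    """Extraction stand-in: a keyword rule where a task-specific model would sit."""
    keywords = re.compile(r"\b(hypertension|metformin|lisinopril)\b", re.IGNORECASE)
    return [
        Finding(date, name, match.group(0).lower(), note_id)
        for name, body in sections.items()
        for match in keywords.finditer(body)
    ]


def build_timeline(findings: list[Finding]) -> list[Finding]:
    """Longitudinal record stand-in: order findings by note date."""
    return sorted(findings, key=lambda f: f.date)


def answer(question: str, timeline: list[Finding]) -> str:
    """Reasoning-layer stand-in: an LLM would answer over the timeline here."""
    terms = set(re.findall(r"[a-z]+", question.lower()))
    hits = [f for f in timeline if f.text in terms]
    if not hits:
        return "No matching findings in the timeline."
    return "; ".join(f"{f.text} ({f.date}, source {f.source_id})" for f in hits)


# Synthetic notes, deliberately supplied out of date order.
NOTES = [
    ("note-2", "2024-02-10", "ASSESSMENT: Hypertension, stable.\nPLAN: Continue lisinopril."),
    ("note-1", "2023-11-02", "PLAN: Start metformin."),
]

timeline = build_timeline(
    [
        finding
        for note_id, date, text in NOTES
        for finding in extract_findings(note_id, date, split_sections(text))
    ]
)
print(answer("When was metformin started?", timeline))
```

The point of the structure is that the final stage only ever sees clean, dated, source-tagged findings, so each answer can carry a citation back to the note it came from.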
What practitioners are worried about
Adoption roadblocks in the survey clustered around three themes, in roughly this order:
Accuracy and reliability. The dominant worry, and the one that the composition-architecture work described above is specifically aimed at. Frontier LLMs called on raw clinical text hallucinate at rates that regulated workflows cannot absorb; systems that compose specialized models with LLMs close the gap.
Legal and reputational risk. Third among the evaluation criteria, after healthcare-specific tuning and reproducibility. Behind this is the recognition that wrong AI answers in a clinical context can harm patients, trigger regulatory action, and damage the brand. Responsible-AI testing for robustness, fairness, bias, truthfulness, and data leakage has moved from optional to expected.
Alignment with industry-specific needs. The survey asked practitioners whether the technology options on the market actually fit the regulated, high-accuracy, high-privacy demands of healthcare work. The preference for healthcare-specific models is partly an answer to this: the options that don’t fit the industry’s needs get filtered out at the evaluation stage.
Human oversight is the common thread running through the mitigations. Asked how they test and improve LLMs, respondents most often named “human in the loop.” This is not a compliance concession; it’s an engineering pattern that lets specialized models run at high throughput on the records they can handle, with domain experts reviewing the flagged records where the AI is least confident. Well-calibrated systems that route low-confidence records to human reviewers consistently clear the accuracy bars that pure-automation or pure-manual approaches cannot.
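A minimal sketch of that routing pattern follows, assuming the model returns a confidence score with each prediction. The `Prediction` type, the `route` function, the 0.9 threshold, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    record_id: str
    label: str
    confidence: float  # model's score for its predicted label, 0.0 to 1.0


def route(predictions: list[Prediction], threshold: float = 0.9):
    """Split predictions into auto-accepted and human-review queues."""
    auto_accept, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            auto_accept.append(prediction)
        else:
            needs_review.append(prediction)
    return auto_accept, needs_review


# Synthetic predictions (illustrative only).
predictions = [
    Prediction("rec-001", "diabetes_type_2", 0.98),
    Prediction("rec-002", "hypertension", 0.71),
    Prediction("rec-003", "asthma", 0.93),
]

auto_accept, needs_review = route(predictions, threshold=0.9)
print([p.record_id for p in auto_accept])
print([p.record_id for p in needs_review])
```

The routing is only as good as the confidence scores behind it, which is why the calibration of those scores matters as much as the threshold itself.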
Testing priorities varied by company size. Large companies prioritized fairness and private-data leakage. Smaller companies prioritized bias and freshness (how up-to-date the model is relative to changing clinical guidelines and terminology). Both sets of priorities reflect real regulatory and operational concerns: fairness and leakage are what a large organization can be sued over; bias and freshness are what a smaller team notices first when the model is wrong.
What the survey implies for 2024 and 2025
A few practical takeaways for healthcare organizations planning generative AI investments in the twelve months after this survey shipped.
Budget is not the binding constraint anymore. The organizations investing seriously in healthcare AI are doing so in large increments and in ways that reflect real operational deployment. Underfunding a generative AI initiative in 2024 is no longer a defensible strategy; it’s a decision to fall behind competitors who are moving faster.
Model choice should be informed by task, not by hype. Healthcare-specific small models are winning a material share of the market because they work better for the work healthcare actually needs done: high-volume, high-accuracy extraction and classification. Frontier LLMs have a role (summarization, conversational interfaces, reasoning over already-clean inputs) but they are not the default choice for clinical NLP workloads. The 36% of respondents using healthcare-specific small models are voting with their pipelines.
Accuracy, privacy, and industry-specificity beat cost. The survey’s most striking finding is that cost came last among the evaluation criteria. That’s the right answer for an industry where wrong answers have outsized consequences, and it should shape how vendors pitch and how buyers buy. Organizations evaluating vendors should weight accuracy and privacy heavily, and should discount vendor claims that have not been substantiated by peer-reviewed benchmarks or public case studies.
Human-in-the-loop is how the economics work. No single model deployed alone hits the accuracy bars healthcare workflows need. Systems that combine AI throughput with targeted expert review, with feedback flowing back into the next model version, are what reach production, and do so in a form that satisfies the regulatory requirements for human oversight.
In-environment deployment is table stakes. The survey’s privacy findings line up with what every procurement review in healthcare ends up concluding: systems that cannot run inside the customer’s environment are eliminated before they reach accuracy evaluation. Organizations building or buying generative AI for healthcare should treat on-premises or private-cloud deployment as a hard requirement, not a premium feature.
The practitioners represented in this survey are building the next generation of healthcare AI quietly, while the public conversation is still stuck on exam-score headlines. Their choices (healthcare-specific models, compositions rather than single models, humans in the loop, in-environment deployment, accuracy weighted above cost) are a more reliable guide to what works than the pitch deck of any frontier-model vendor. The 2024 survey was the first annual edition; subsequent editions will reveal how much further the production bar has shifted. On the evidence of this first one, the gap between the operator view and the media view of healthcare AI was significant, and the operator view was the one worth listening to.
FAQ
How representative are the 304 respondents?
The survey was conducted by Gradient Flow over 33 days in early 2024, with 304 participants of whom 196 were actively engaged in evaluating, using, or deploying generative AI in healthcare or life sciences. Respondents were recruited through online channels including the Gradient Flow newsletter, social media, and industry partners. As with any voluntary survey, respondents self-select, but the sample size and the mix of technical leaders, data scientists, and practitioners make the distributions reasonably informative about the population of actively building organizations.
Does “healthcare-specific small models” mean models trained from scratch for healthcare, or fine-tuned general models?
Both. The category in the survey covers models in the roughly 100M–10B parameter range that have either been trained from scratch on healthcare data or fine-tuned from a general base on healthcare data. The operational distinction from frontier LLMs is that they can be run on a single GPU (or CPU for the smaller ones), in the customer’s environment, at fixed cost.
Why was cost rated lowest in evaluation priority?
Because in healthcare, the cost of a wrong answer typically exceeds the cost of the model. A missed adverse event, a mis-coded diagnosis, a leaked patient record, or a failed regulatory audit has consequences (clinical, financial, and reputational) that dwarf the per-record cost of inference. Practitioners who have absorbed those consequences rate accuracy and privacy above cost, because they know the downstream numbers.
What is the practical threshold for “high accuracy” in healthcare AI?
It depends on the task and on whether the workflow includes human review. For tasks where automation is the point (de-identification, PHI detection, high-volume clinical coding) the practical threshold is above 99% on the first pass, because below that every record still needs human review. For tasks with a designed human-in-the-loop review layer, the AI-only accuracy can be lower (90–96% is routine) as long as confidence calibration routes the uncertain records to reviewers reliably.
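The trade-off between AI-only accuracy and review workload can be made concrete on a labeled validation set. The sketch below is a hedged illustration: `review_tradeoff` and `lowest_threshold_meeting` are hypothetical helpers, the validation data is synthetic, and the 0.99 target simply mirrors the first-pass threshold mentioned above.

```python
def review_tradeoff(scored, threshold):
    """scored: (confidence, is_correct) pairs from a labeled validation set.

    Returns (fraction of records routed to review, accuracy on the
    auto-accepted records, or None if nothing is auto-accepted).
    """
    auto = [is_correct for confidence, is_correct in scored if confidence >= threshold]
    review_fraction = 1 - len(auto) / len(scored)
    auto_accuracy = sum(auto) / len(auto) if auto else None
    return review_fraction, auto_accuracy


def lowest_threshold_meeting(scored, target, candidates):
    """Lowest candidate threshold whose auto-accepted accuracy reaches the target."""
    for threshold in sorted(candidates):
        _, accuracy = review_tradeoff(scored, threshold)
        if accuracy is not None and accuracy >= target:
            return threshold
    return None


# Synthetic validation results (illustrative only): overall accuracy is 70%.
scored = [
    (0.99, True), (0.98, True), (0.97, True), (0.95, True), (0.92, True),
    (0.88, False), (0.85, True), (0.70, False), (0.60, True), (0.55, False),
]

threshold = lowest_threshold_meeting(scored, target=0.99, candidates=[0.5, 0.6, 0.7, 0.8, 0.9])
print(threshold, review_tradeoff(scored, threshold))
```

On this toy data, a model that is right only 70% of the time overall still yields an auto-accepted subset that meets the target, at the price of sending half the records to reviewers. Real calibration work would use far more data and a held-out set for the final check.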
Is the budget growth seen in the 2024 survey sustainable?
Two years later, the answer appears to be yes: subsequent surveys and the evidence from public healthcare AI deployments show continued investment, broader adoption beyond early-adopter organizations, and a shift from pilot projects to operational workloads. The organizations that bet on the space in 2024 largely kept investing in 2025 and 2026. Organizations that stayed on the sidelines are now doing the catch-up work.



