AI in Healthcare

Why cancer registries stay years out of date - and what regulatory-grade oncology AI changes

David Talby — Thu, 18 Jun 2026 08:32:09 GMT

Originally published in Forbes, July 2025.

Most of what matters in a cancer patient’s record is free text. Stage, histology, biomarker status, treatment response, progression: none of it lives cleanly in a discrete EHR field. It lives in pathology reports, radiology narratives, oncologist progress notes, and multidisciplinary tumor board summaries. Manually abstracting that information for a cancer registry takes a certified tumor registrar about two hours per case, and only 14% of US registries consistently meet the National Program of Cancer Registries’ target of reporting 90% of cases within 12 months of diagnosis. Regulatory-grade oncology AI changes that ratio, and with it the timelines for research, trial matching, and quality reporting that cancer care depends on.

Why oncology data is uniquely hard to structure

Oncology is a free-text discipline. Cancer staging follows AJCC rules that depend on tumor size, nodal involvement, metastatic spread, and (for most solid tumors) molecular or genomic features that are described, not coded. A prostate cancer report references Gleason patterns. A breast cancer report specifies ER, PR, and HER2 status in language that varies by pathologist. A lung cancer case depends on EGFR, ALK, ROS1, KRAS, and increasingly a dozen more biomarkers, each with its own testing methodology and reporting convention.

Claims data captures almost none of this. Discrete EHR fields capture some of it, inconsistently. The rest, which is typically the clinically decisive information, sits in narrative reports that a human has to read.

That is why, as of the latest assessment, central cancer registries in the US take an average of two hours per case to abstract, with complex patients consuming several days of registrar time. At that rate, a single full-time registrar can process roughly four cases in an eight-hour day. A thousand-patient cohort costs roughly a full year of certified registrar labor. Oncology informatics research out of the Cancer Institute of New Jersey has documented the structural reasons: EHR infrastructure varies widely across treating facilities, patient-reported physician lists disagree with registry records in 42% of cases, and the average lung cancer patient generates about 300 pages of records that have to be reviewed line by line.
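As a quick sanity check on the labor figures above, the arithmetic can be sketched in a few lines. The two-hours-per-case figure comes from the text; the eight-hour workday and 250 working days per year are assumptions added here for illustration.

```python
# Back-of-envelope registrar workload.
HOURS_PER_CASE = 2.0       # average abstraction time, from the text
WORKDAY_HOURS = 8.0        # assumption: one full-time working day
WORKDAYS_PER_YEAR = 250    # assumption: working days in a year


def cohort_effort(n_cases, hours_per_case=HOURS_PER_CASE):
    """Return (total registrar-hours, full-time-equivalent years) for a cohort."""
    total_hours = n_cases * hours_per_case
    fte_years = total_hours / (WORKDAY_HOURS * WORKDAYS_PER_YEAR)
    return total_hours, fte_years


hours, fte_years = cohort_effort(1000)
print(f"{hours:.0f} registrar-hours, about {fte_years:.1f} FTE-years")
```

Under those assumptions, a thousand cases at two hours each works out to about one registrar-year, which matches the figure quoted above.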

The consequence is that cancer surveillance operates on a two-to-four-year lag. Clinical trial matching misses eligible patients because their biomarker status is not yet coded. Outcomes research depends on cohorts that are systematically incomplete. Quality measures under CMS’s Oncology Care Model and the MIPS Promoting Interoperability category depend on structured data that often does not exist until long after the care episode is closed.

What regulatory-grade accuracy means in oncology

General-purpose LLMs can read a pathology report and produce a summary that looks right. Regulatory-grade oncology extraction is a higher bar. It means:

Every extracted entity (diagnosis, stage, biomarker, medication, procedure) is mapped to a controlled terminology such as SNOMED CT, ICD-O-3, RxNorm, or LOINC, with documented confidence and provenance to the source sentence.

Negation and temporality are handled correctly: “no evidence of metastatic disease” does not become a metastasis flag, and a history of prior tamoxifen therapy is not confused with current treatment.

Results are reproducible: the same input produces the same output, which regulators and auditors require.

The pipeline runs in the customer’s environment so PHI never leaves their control.
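One way to picture these requirements is as the shape of a single output record. The sketch below is illustrative only: the class name, field names, and status vocabularies are hypothetical, and the terminology codes are placeholders except for the ICD-O-3 morphology code 8140/3 (adenocarcinoma, NOS).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedEntity:
    """Hypothetical record for one extracted fact; not any vendor's schema."""
    text: str
    entity_type: str             # e.g. "histology", "metastasis", "medication"
    code_system: Optional[str]   # e.g. "ICD-O-3", "SNOMED CT", "RxNorm", "LOINC"
    code: Optional[str]
    confidence: float            # model confidence in [0, 1]
    source_sentence: str         # provenance: the sentence the fact came from
    assertion: str               # "present", "absent", or "possible"
    temporality: str             # "current" or "historical"


def is_registry_grade(e: ExtractedEntity) -> bool:
    """A value without a code, a source sentence, and a confidence is not usable."""
    has_code = bool(e.code_system and e.code)
    return has_code and bool(e.source_sentence) and 0.0 <= e.confidence <= 1.0


def counts_as_active_finding(e: ExtractedEntity) -> bool:
    """Negated or historical mentions must not raise a current-disease flag."""
    return e.assertion == "present" and e.temporality == "current"


histology = ExtractedEntity(
    "adenocarcinoma", "histology", "ICD-O-3", "8140/3", 0.97,
    "Invasive adenocarcinoma is identified.", "present", "current")
no_mets = ExtractedEntity(
    "metastatic disease", "metastasis", "SNOMED CT", "PLACEHOLDER-CODE", 0.95,
    "No evidence of metastatic disease.", "absent", "current")
prior_tamoxifen = ExtractedEntity(
    "tamoxifen", "medication", "RxNorm", "PLACEHOLDER-CODE", 0.93,
    "History of prior tamoxifen therapy.", "present", "historical")

for entity in (histology, no_mets, prior_tamoxifen):
    print(entity.text, is_registry_grade(entity), counts_as_active_finding(entity))
```

Keeping assertion and temporality as explicit fields, rather than folding them into the entity text, is what lets a downstream query exclude "no evidence of metastatic disease" without re-reading the report.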

Peer-reviewed work on cancer-specific information extraction has shown the gap between healthcare-specific models and frontier LLMs on exactly these tasks. On the CACER (Clinical Concept Annotations for Cancer Events and Relations) benchmark, GPT-4 scored below 0.50 F1 on cancer entity extraction, while a medical language model tuned for oncology reached materially higher accuracy. On structured extraction from diagnostic reports, a medical language model reached 0.80 F1 on relation extraction versus GPT-4’s below 0.60. Assertion classification, which handles negation and uncertainty and matters more in oncology than in almost any other clinical domain, reached above 90% accuracy with healthcare-specific assertion models; general LLMs produced inconsistent output under prompt variation.

These differences are not cosmetic. At scale, a 15-point F1 gap means a different cohort in every study, a different denominator in every quality measure, and a different set of eligible patients for every trial.
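To see why a gap of that size changes the cohort rather than just the score, it helps to translate F1 into patient counts. The sketch below assumes, purely for illustration, that precision equals recall (so both equal F1) and that 1,000 patients truly meet the criterion; the 0.90 and 0.75 operating points are hypothetical, chosen to be 15 points apart.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def cohort_counts(true_eligible, precision, recall):
    """Return (correctly found, wrongly included, missed) patient counts."""
    found = recall * true_eligible                    # true positives
    wrongly_included = found * (1 - precision) / precision
    missed = true_eligible - found
    return round(found), round(wrongly_included), round(missed)


for p in (0.90, 0.75):  # hypothetical operating points with precision == recall
    found, wrong, missed = cohort_counts(1000, p, p)
    print(f"F1={f1(p, p):.2f}: found={found}, wrongly included={wrong}, missed={missed}")
```

At 0.90 the cohort misses 100 true patients and includes 100 wrong ones; at 0.75 both numbers rise to 250, so a quarter of the real cohort is missing and a quarter of the assembled one does not belong.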

What changes when extraction runs in minutes instead of hours

Three downstream consequences follow when regulatory-grade oncology extraction is available at scale.

Cancer registry reporting converges toward real time. Our Medical LLMs cut per-case abstraction from two hours to one to two minutes, a 60–120x productivity gain, while preserving human-in-the-loop review for complex cases. That changes the operating model of a registry from “build the backlog, then catch up” to “review by exception.” A recent medRxiv preprint out of China Medical University, using a locally deployed 20B-parameter open-weight model on a single professional-grade GPU, reached similar conclusions: autonomous multi-stage extraction of pathology reports is now feasible inside a hospital firewall, without depending on an external API.

Clinical trial matching catches patients earlier in their journey. A trial protocol requires EGFR mutation status, ECOG performance status, prior lines of therapy, and measurable disease per RECIST 1.1. When those variables are extracted continuously from incoming pathology and oncology notes, rather than abstracted months later for registry purposes, matching happens in the window where enrollment is still possible.

Quality and outcomes reporting become operationally feasible. Measures like 30-day readmissions after cancer surgery, time from diagnosis to treatment initiation, and adherence to NCCN guideline recommendations require structured clinical data that most health systems cannot produce reliably from claims or discrete EHR fields. Regulatory-grade extraction closes that gap.
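The trial-matching case above can be sketched as a filter over continuously extracted variables. Everything in this example is hypothetical: the class names, the thresholds, and the trial itself are invented for illustration, and a real protocol has many more criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientSnapshot:
    """Latest extracted values for one patient; None means not yet extracted."""
    patient_id: str
    egfr_mutation: Optional[bool]
    ecog: Optional[int]
    prior_lines: Optional[int]
    measurable_disease: Optional[bool]   # per RECIST 1.1


@dataclass
class TrialCriteria:
    """Hypothetical, heavily simplified eligibility criteria."""
    requires_egfr_mutation: bool
    max_ecog: int
    max_prior_lines: int
    requires_measurable_disease: bool


def match(p: PatientSnapshot, c: TrialCriteria) -> str:
    """Return 'eligible', 'ineligible', or 'unknown' if any variable is missing."""
    values = (p.egfr_mutation, p.ecog, p.prior_lines, p.measurable_disease)
    if any(v is None for v in values):
        return "unknown"
    ok = (
        (p.egfr_mutation or not c.requires_egfr_mutation)
        and p.ecog <= c.max_ecog
        and p.prior_lines <= c.max_prior_lines
        and (p.measurable_disease or not c.requires_measurable_disease)
    )
    return "eligible" if ok else "ineligible"


trial = TrialCriteria(True, max_ecog=1, max_prior_lines=2, requires_measurable_disease=True)
patients = [
    PatientSnapshot("P1", True, 1, 1, True),
    PatientSnapshot("P2", True, 3, 1, True),
    PatientSnapshot("P3", None, 0, 0, True),   # biomarker status not yet coded
]
for p in patients:
    print(p.patient_id, match(p, trial))
```

The "unknown" outcome is the one the text is about: a patient whose biomarker status has not been coded yet cannot be matched, however eligible they may be.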

What this does not do

Oncology AI that extracts structured data from notes does not make clinical decisions. It does not recommend treatment, diagnose cancer, or replace the judgment of a multidisciplinary tumor board. The pipeline produces structured, auditable data; clinicians and certified tumor registrars continue to interpret and act on that data.

That distinction matters for two reasons. First, it defines the regulatory path. Extracting a structured representation of information that already exists in the clinical record is a different regulatory question from generating novel clinical recommendations. The FDA’s 2025 Predetermined Change Control Plan guidance and the evolving framework around decision support interventions apply differently to each. Second, it defines where the accuracy bar sits. An extraction pipeline that is 98% accurate on biomarker status is useful immediately; a decision-support tool at the same accuracy level is not, because the 2% tail sits on clinical outcomes rather than on a data field a human will review.

What to ask when evaluating oncology AI

For health systems, pharma RWE teams, and cancer centers looking at vendors in this space, four questions separate regulatory-grade offerings from demos:

What is the peer-reviewed accuracy on your use case, against a published benchmark and ground-truth data? Benchmark results on MedQA or USMLE-style questions do not predict performance on pathology report extraction.

Where does the data live during processing? If the answer is “our API,” that is a different compliance, cost, and data-sovereignty posture than “in your environment, behind your firewall.”

What terminology does the output map to, and how is provenance tracked? A number without a code and a source sentence is not registry-grade data.

How is human-in-the-loop review supported? Complex oncology cases (rare tumors, ambiguous staging, contradictory reports) require registrar judgment. The tool either supports that workflow or forces a shadow system around it.
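On the last question, a review-by-exception workflow can be pictured as a routing rule. This is a minimal sketch under stated assumptions: the 0.90 threshold, the queue names, and the flag names are hypothetical, and a production system would tune thresholds per entity type.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cut-off for sending a field to a registrar
COMPLEX_FLAGS = {"rare_tumor", "ambiguous_staging", "contradictory_reports"}


def route_case(case):
    """Send a case to registrar review if any field is low-confidence or flagged complex."""
    low_confidence = [e["name"] for e in case["entities"]
                      if e["confidence"] < REVIEW_THRESHOLD]
    complex_flags = sorted(COMPLEX_FLAGS & set(case.get("flags", ())))
    if low_confidence or complex_flags:
        return {"queue": "registrar_review", "reasons": low_confidence + complex_flags}
    return {"queue": "routine", "reasons": []}


routine = route_case({
    "case_id": "A",
    "entities": [{"name": "histology", "confidence": 0.99},
                 {"name": "stage", "confidence": 0.95}],
    "flags": [],
})
exception = route_case({
    "case_id": "B",
    "entities": [{"name": "histology", "confidence": 0.98},
                 {"name": "stage", "confidence": 0.62}],
    "flags": ["ambiguous_staging"],
})
print(routine)
print(exception)
```

Returning the reasons alongside the queue matters in practice: the registrar sees which field triggered the review instead of re-abstracting the whole case.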

The shift underway

Oncology was the first clinical domain where the gap between what the record contains and what the structured data captures became operationally unacceptable. It is now the first domain where that gap is closing at scale, driven by healthcare-specific language models that run inside the customer’s environment and hit the accuracy, terminology, and provenance bar that registries, trials, and quality programs require.

The practical consequence for cancer centers and cancer research is that the two-to-four-year surveillance lag is no longer inevitable. For pharma RWE, it is that oncology cohorts can be built from real clinical narrative rather than from claims proxies. For patients, it is that trial opportunities show up while they are still options, not months after another line of treatment has started.

Regulatory-grade accuracy is what makes all of that possible.

Frequently asked questions

Why can’t general-purpose LLMs handle oncology extraction out of the box?

They can read a pathology report and produce a reasonable summary. They struggle on the specific tasks oncology data pipelines require: mapping to controlled terminologies like SNOMED CT and ICD-O-3, handling nested negation, distinguishing current from historical treatment, and producing reproducible output under prompt variation. Peer-reviewed benchmarks on cancer-specific information extraction show healthcare-specific models meaningfully outperforming GPT-4 on relation extraction, assertion classification, and entity resolution.

What is a realistic accuracy target for a registry-grade extraction pipeline?

It depends on the entity. For straightforward diagnoses and medications, above 95% F1 is routine with healthcare-specific models. For staging, biomarker status, and response assessment, the bar is 90%+ with human review of ambiguous cases. Published benchmarks and reproducible notebooks are the right way to evaluate vendors; demo videos are not.

Does this replace certified tumor registrars?

No. It changes what they spend their time on. Registrars move from line-by-line abstraction of routine cases to review of complex cases, validation of AI output, and the judgment calls on rare tumors and ambiguous staging that automation cannot handle.

Can the same pipeline run on pathology, radiology, and oncology notes?

Yes, with the right architecture. Healthcare-specific pipelines combine document classifiers that route input to the right extraction engine, cancer-specific NER models tuned for each report type, and a unified output representation (typically OMOP Oncology or a CDM extension) that supports downstream research and reporting.

How does this intersect with FDA guidance on AI in clinical care?

Extracting structured data from an existing clinical record is a different regulatory question than generating clinical recommendations. The FDA’s Predetermined Change Control Plan guidance and the broader framework around decision-support interventions apply, but the accuracy and validation requirements for data extraction are primarily about auditability and reproducibility, not about the model making a clinical decision. Oncology AI that supports registries, trial matching, and RWE is data infrastructure.

What about privacy and data sovereignty?

Any oncology AI pipeline processing identifiable patient records should run inside the customer’s environment (on-premises or in a private cloud tenant), with no PHI leaving the firewall. API-based approaches that send clinical notes to an external LLM vendor are difficult to reconcile with HIPAA, GDPR, and the data-use agreements that cancer centers and pharma RWE teams operate under.

What is the biggest operational change when this is deployed at scale?

The registry workflow shifts from “build the backlog” to “review by exception.” Timeliness improves; cases that used to take weeks to abstract are available within days or hours of documentation, and registrar time moves to the cases where human judgment actually changes the output.

Why the 2026 Medicare Advantage rate decision raises the bar on HCC coding accuracy

David Talby — Sat, 13 Jun 2026 14:47:13 GMT

Originally published in MedCity News, Rama on Healthcare, and Gene Online — June 2025. Recast for this Substack with updated CMS figures, regulatory context, and a sharper framing on the accuracy and compliance requirements now facing Medicare Advantage plans.

CMS announced a 5.06% average increase in payments to Medicare Advantage plans for 2026, the largest rate increase in a decade. The headline was easy: more funding per member. The implication was harder: a larger payment base creates a larger audit surface, and CMS and the Office of Inspector General have been explicit that rate increases will now be tied more tightly to coding integrity. The FY2024 Part C payment error already sits at $19.07 billion. HCC coding accuracy in 2026 is a compliance baseline, not a revenue lever.

What the 2026 MA rate announcement actually signals

The 5.06% average rate increase is notable in isolation. It is more notable in context. The 2025 rate was 3.70%. 2024 was negative on an effective basis once coding trend adjustments were factored in. The jump to 5.06% in 2026 reflects CMS confidence that the MA program can absorb higher payments without collapsing into the overpayment problem that has dogged the program for a decade. That confidence has strings attached.

CMS has been re-tuning the mechanics underneath the rate. The CMS-HCC V28 model restructured how conditions map to HCC categories, with changes to diabetes, mental health, cardiovascular, and chronic kidney disease staging. The transition from V24 to V28 is still being phased in, which means plans in 2026 will be running dual logic for at least another payment year. RADV audits are being pushed to a quarterly cadence on eligible contracts. OIG has repeatedly flagged diagnoses sourced only from health risk assessments or chart reviews without supporting documentation elsewhere in the record.

Putting that together: the 2026 payment environment is higher-dollar, higher-scrutiny, and more complex than any previous year. A plan that grew comfortable with a 2-to-3% error rate in 2023 is now operating in a regime where that error rate is visible and actionable.

The documentation gap is the accuracy problem in disguise

HCC coding is a translation problem. It starts with clinical reality (what conditions a patient actually has, documented in progress notes, specialist consults, imaging reports, and pathology) and ends with a structured submission (ICD-10 codes mapping to HCC categories, with MEAT evidence linking back to an encounter). Diagnoses that exist in both the record and the claims go through correctly. The translation fails in two directions: diagnoses that exist only in unstructured notes get dropped, and diagnoses that exist in the claims but aren’t supported in the notes survive submission and fail audit.
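The three outcomes described above reduce to set operations over one member's diagnoses. The sketch below is illustrative: the function name and the member data are hypothetical, and "in notes" is assumed to mean documented with supporting evidence in the clinical record.

```python
def classify_diagnoses(in_claims, in_notes):
    """Split one member's ICD-10 codes into the three outcomes described above."""
    return {
        "supported": in_claims & in_notes,     # goes through correctly
        "dropped": in_notes - in_claims,       # undercoded: documented, never submitted
        "unsupported": in_claims - in_notes,   # submitted, no documentation behind it
    }


# Hypothetical member: codes found in claims vs. codes documented in notes.
claims = {"E11.9", "I50.9", "J44.9"}
notes = {"E11.9", "I50.9", "N18.4"}

result = classify_diagnoses(claims, notes)
for bucket, codes in result.items():
    print(bucket, sorted(codes))
```

Here the stage 4 chronic kidney disease code (N18.4) is the dropped diagnosis and the COPD code (J44.9) is the audit exposure.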

The scale of what gets dropped is the underreported part. Research and industry analysis indicate that as many as half of all patients may have prior conditions, complications, or severity indicators documented in clinical notes but not reflected in claims or electronic health records. The asymmetry matters. A conservative plan that only submits what is in structured claims data captures perhaps half of the eligible risk. A plan that submits from claims plus chart review captures more, but without a defensible chain from diagnosis to documentation, a meaningful share of those codes will come back in RADV. The result in either direction is financial loss: undercoding leaves risk-adjusted revenue on the table; unsupported upcoding becomes a clawback.

Undercoding is often framed as a revenue story, but its clinical consequence makes it a patient-care story. If a patient with chronic kidney disease and heart failure has both documented in the cardiologist’s note but neither reaches the plan’s risk profile, care coordination tools, high-risk outreach programs, and population health analytics see a healthier patient than the patient actually is. Gaps in care follow. Missed interventions follow. The member appears less sick than they are on every dashboard the plan runs, and the care model treats them accordingly.

What regulators have made clear about 2026 and beyond

Three regulatory signals are worth tracking closely:

RADV cadence. CMS has signaled intent to audit all eligible MA contracts on a quarterly basis. A plan that previously got audited every several years needs to operate assuming it will be audited this quarter. That changes what “audit-ready” means. It shifts from “we could produce documentation if asked” to “every code we submit this quarter will be looked at.”

OIG’s focus on HRA-only and chart-review-only codes. The OIG has been explicit that diagnoses appearing only on health risk assessments or retrospective chart reviews, without MEAT evidence in the broader medical record, are a specific audit target. Plans that have been relying on aggressive retrospective review programs are the ones exposed.

Extrapolation. CMS can extrapolate audit findings from a sampled subset of charts to the full contract. A 5% error rate on a 200-chart sample becomes a 5% adjustment on the full contract. For a large MA plan, a 5% extrapolated adjustment is a nine-figure event.
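The extrapolation arithmetic in the third signal can be sketched as follows. This mirrors the simplified description above, not the statistical methodology CMS actually applies, and the $2 billion contract payment total is a hypothetical figure chosen for illustration.

```python
def extrapolated_adjustment(errors_in_sample, sample_size, contract_payments):
    """Apply a sample error rate to the full contract (simplified point estimate)."""
    error_rate = errors_in_sample / sample_size
    adjustment = contract_payments * errors_in_sample / sample_size
    return error_rate, adjustment


# 10 unsupported charts in a 200-chart sample, on a hypothetical $2B contract.
rate, adjustment = extrapolated_adjustment(10, 200, 2_000_000_000)
print(f"error rate {rate:.0%}, extrapolated adjustment ${adjustment:,.0f}")
```

Ten unsupported charts out of 200 is a small number in absolute terms; applied to the whole contract in this example it becomes a $100 million adjustment.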

The composite picture: in 2026, the plans that thrive are the ones that can defend every submitted code with clean documentation, on a quarterly cadence, at the scale of a full book of business. The plans that get hurt are the ones whose coding operations were built for an audit rhythm that no longer exists.

What AI-supported HCC workflows have to deliver

AI is not new to HCC coding. Rules-based and NLP-based coding assistants have been in production for a decade. What changed with generative AI is the ability to read unstructured notes at scale and to produce coding suggestions with the context and evidence required for MEAT compliance. That shift makes AI relevant in a way prior generations of automation were not. It also raises the bar for what a responsible AI-supported HCC workflow has to provide.

Four requirements, in order:

Runs in the customer’s environment. An AI coding workflow that ships protected health information to a third-party API is a HIPAA risk regardless of business associate agreements. MA plans and large provider groups need models that run on-premises or in a private cloud where no chart data crosses the firewall. That is not a deployment preference. It is a procurement gate under most healthcare security postures.

Healthcare-specific models rather than frontier general-purpose ones. A 2025 peer-reviewed study in JMIR AI, using the CLEVER methodology, found that medical doctors prefer an 8-billion-parameter healthcare-specific language model over GPT-4o 45% to 92% more often on factuality, clinical relevance, and conciseness. For HCC coding, where factuality is the entire point, that preference gap is the difference between a code that survives audit and a code that does not. Healthcare-specific language models trained on real clinical documentation read progress notes, discharge summaries, and specialist consults in a way a general-purpose model does not.

MEAT evidence tied to source. Every code the system suggests has to come with a source span: the exact text in the chart, the encounter date, the provider type, and the clinical context. That is what makes the code defensible in RADV, and that is what lets a human reviewer validate the suggestion in under a minute rather than over fifteen.

Human-in-the-loop for the codes that matter. The point of AI in HCC is not to replace certified coders. It is to raise the floor on what each coder can review. A well-designed workflow surfaces high-value, high-risk suggestions, provides the evidence to validate them, and routes them to a credentialed reviewer for final sign-off. That workflow also provides the audit trail a compliance officer needs when CMS asks why a specific HCC was assigned.
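The third and fourth requirements above amount to a record shape plus a gate. The sketch below is a hypothetical illustration, not the schema of any product named in this article: the class, field names, and chart text are invented, and the defensibility rule is a simplification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# "assess" stands in for Assess or Address.
MEAT_ELEMENTS = frozenset({"monitor", "evaluate", "assess", "treat"})


@dataclass(frozen=True)
class CodeSuggestion:
    """Hypothetical evidence record attached to one suggested code."""
    icd10: str
    source_span: str            # exact chart text supporting the code
    encounter_date: date
    provider_type: str
    meat_elements: frozenset    # which MEAT elements the span documents
    reviewed_by: Optional[str] = None   # credentialed coder who signed off


def is_defensible(s: CodeSuggestion) -> bool:
    """Require a source span, at least one MEAT element, and human sign-off."""
    has_evidence = bool(s.source_span.strip()) and bool(s.meat_elements & MEAT_ELEMENTS)
    return has_evidence and s.reviewed_by is not None


suggestion = CodeSuggestion(
    icd10="N18.4",
    source_span="CKD stage 4, eGFR 24. Continue current regimen, recheck labs in 3 months.",
    encounter_date=date(2025, 3, 14),       # invented example date
    provider_type="nephrologist",
    meat_elements=frozenset({"monitor", "treat"}),
)
print(is_defensible(suggestion))   # no coder sign-off yet

signed = CodeSuggestion(**{**suggestion.__dict__, "reviewed_by": "coder-017"})
print(is_defensible(signed))
```

Treating coder sign-off as part of the defensibility check, rather than as a separate step, keeps an unreviewed suggestion from ever being counted as submittable.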

John Snow Labs’ HCC Coding Engine and the Martlet.ai platform are built to those four requirements. Models run behind the customer’s firewall, on the customer’s charts, with MEAT evidence surfaced alongside every suggestion, and with a human-in-the-loop workflow for coder review. The architecture exists because the regulatory environment now demands it. A workflow that met the 2022 bar for “AI-assisted coding” does not clear the 2026 bar for “defensible in a quarterly RADV audit.”

What MA plans and provider groups should do this quarter

Three practical moves, given where the 2026 rate announcement puts the program:

Run a gap analysis on the current book. For a random sample of 200 to 500 members, compare what is in claims against what is documented in unstructured notes. The difference is the undercoded risk (real conditions missed) and the unsupported risk (submitted codes without MEAT evidence). Both are revenue-relevant and audit-relevant; they need to be sized before the next submission cycle.

Stress-test the coding pipeline against V28. The V28 transition changes how specific conditions (diabetes with complications, CKD stages, mental health subcategories) map to HCCs. A pipeline that was calibrated for V24 will miss revenue and compliance targets under V28 even if nothing else changes.

Audit the vendor chain. If HCC coding is outsourced, verify that the vendor can produce a defensible audit trail on every code they submit. If any portion of the pipeline is AI-assisted, verify the model runs in an environment that keeps PHI inside the plan’s controls, and that the vendor can answer what model version, training data category, and evaluation methodology produced a given suggestion. Under extrapolation, a vendor’s black box becomes the plan’s liability.

Why this matters for the broader MA program

The 2026 rate increase is not a one-off. CMS has signaled that funding growth and scrutiny growth are now linked. Plans that invest in defensible, high-accuracy HCC coding operations will be in a position to absorb future rate increases without adding audit risk. Plans that treat HCC coding as a back-office function with incremental tech refresh will find that the next audit cycle reallocates a meaningful share of the rate increase out of their books. The shift is not dramatic in any single quarter. It is cumulative over several. The plans that start now are the ones that will still be running healthy MA books in 2028.

The policy direction is clear, the arithmetic is clear, and the tooling to close the accuracy gap without shipping PHI outside the plan’s environment is available. The remaining question is execution.

FAQ

What changed in the 2026 Medicare Advantage rate announcement?

CMS finalized a 5.06% average rate increase, the largest in a decade, and continued the phased transition to the CMS-HCC V28 risk model. The increase is paired with intensified scrutiny, including a push toward quarterly RADV audits on eligible contracts and continued OIG focus on unsupported diagnoses.

What is HCC coding and why does it drive MA reimbursement?

Hierarchical Condition Category coding translates clinical diagnoses into categories that CMS uses to calculate Risk Adjustment Factor scores. A member with higher clinical complexity produces a higher RAF score, which raises the monthly capitated payment the MA plan receives. Accurate HCC coding is how plans get paid appropriately for sicker members. The CMS-HCC model currently covers roughly 7,770 diagnosis codes mapping to about 115 HCC categories.
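The relationship between diagnoses, RAF score, and payment can be sketched with made-up numbers. Every coefficient, category label, and the base rate below is hypothetical, and the sketch ignores hierarchies, interaction terms, and normalization that the real CMS-HCC model applies.

```python
def raf_score(demographic_factor, hcc_coefficients):
    """Simplified RAF: demographic factor plus the sum of HCC coefficients."""
    return demographic_factor + sum(hcc_coefficients.values())


def monthly_payment(base_rate, raf):
    """Monthly capitated payment scales linearly with the RAF score."""
    return base_rate * raf


BASE_RATE = 1000.0          # hypothetical monthly base rate in dollars
DEMOGRAPHIC_FACTOR = 0.5    # hypothetical age/sex factor

# Hypothetical coefficients; "HCC_A" and "HCC_B" are not real category numbers.
captured = {"HCC_A": 0.25}
fully_documented = {"HCC_A": 0.25, "HCC_B": 0.375}

undercoded_payment = monthly_payment(BASE_RATE, raf_score(DEMOGRAPHIC_FACTOR, captured))
accurate_payment = monthly_payment(BASE_RATE, raf_score(DEMOGRAPHIC_FACTOR, fully_documented))
print(undercoded_payment, accurate_payment)
```

With these illustrative values, one condition that never leaves the notes lowers the member's payment from 1125.0 to 750.0 per month.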

What is the V24 to V28 transition?

CMS is phasing out the V24 risk model and phasing in V28. V28 restructures several condition categories, including diabetes with complications, chronic kidney disease staging, mental health, and cardiovascular conditions. The phase-in means plans in 2026 will be running some V24 logic and some V28 logic simultaneously. Coding pipelines calibrated for V24 will drop revenue under V28 without retuning.

What is MEAT evidence and why does it matter?

MEAT stands for Monitor, Evaluate, Assess or Address, and Treat. An HCC diagnosis has to be supported by evidence in the clinical record showing at least one of those four actions during a qualifying encounter. MEAT is what makes a code defensible in a RADV audit. A suggested HCC that cannot be linked back to MEAT evidence is the specific failure mode OIG and CMS have been flagging.

Why is “runs on-premises” a hard requirement for AI-assisted HCC coding?

Because HCC coding operates on protected health information at full chart depth. Shipping that data to an external API creates a HIPAA and contractual exposure that most plans cannot accept, regardless of business associate agreements. On-premises or private-cloud deployment keeps PHI inside the plan’s security perimeter and simplifies the compliance story for auditors.

What’s the role of human reviewers if AI is suggesting codes?

The AI model raises the volume of charts a coder can meaningfully review and surfaces the specific evidence for each suggestion. The credentialed coder remains the decision-maker for every submitted code. That human-in-the-loop structure is what makes the workflow defensible under audit. A fully automated pipeline that submits codes without human review carries both clinical and compliance risk that no responsible MA plan should accept.

How does AI-assisted HCC coding interact with RADV?

Done well, it improves RADV defensibility by producing a clear evidence chain for every code: the source span in the chart, the encounter date, the provider, and the MEAT context. Done poorly (treating the AI as a black-box code suggester without provenance), it worsens RADV exposure because the plan cannot defend why a specific HCC was assigned. The distinction is architectural and is worth verifying in procurement.

When smaller wins: the size calculus for generative AI in regulated work

David Talby — Thu, 11 Jun 2026 11:50:23 GMT

Originally published October 2024 in CIO.

The default assumption about generative AI, that bigger models are better models, is a legacy of the first two years of LLM scaling. For general-purpose conversational work, the assumption still mostly holds. For the specialized, high-volume, high-accuracy work that regulated enterprises actually run, the assumption has been flipping since at least 2023, and the 2024 evidence makes the flip operational rather than theoretical. Smaller, domain-specific, task-oriented language models now routinely outperform frontier LLMs on the tasks those enterprises care about, at a fraction of the inference cost, while clearing compliance requirements that the large models cannot. Company size shapes the picture: large enterprises have different priorities than mid-size or small companies, and the right model strategy looks different for each.

The default assumption, restated

The scaling hypothesis that drove model sizes from hundreds of millions of parameters to hundreds of billions was straightforward: more parameters plus more training data plus more compute produces better general-purpose language capability. For open-ended conversation, creative writing, and broad question-answering, that hypothesis has held up. Frontier general-purpose models are genuinely better at these tasks than smaller predecessors, and the gap shows up on the benchmarks that measure general-purpose capability.

The problem is that most enterprise workloads are not general-purpose capability problems. They are narrow, repetitive, high-accuracy tasks: extracting structured information from documents, classifying records, matching entities across systems, generating standardized outputs. On these, the scaling hypothesis is not the right mental model. What matters is how well the model handles the specific task on the specific data the enterprise runs, at the cost and latency the production system can absorb. And on those metrics, the evidence that accumulated through 2023 and 2024 points consistently toward specialization.

What the evidence says

Three classes of evidence are worth separating.

Peer-reviewed extraction benchmarks. A 2024 JAMIA study on the 2010 i2b2 clinical-concept extraction benchmark measured GPT-4 at F1 0.804 with baseline prompting and 0.861 with a carefully engineered four-component prompt framework. BioClinicalBERT, a 110-million-parameter domain-specific model released years earlier, reached 0.901 on the same benchmark with no prompt engineering. On the VAERS adverse-event corpus, GPT-4 with careful prompting reached 0.736; BioClinicalBERT reached 0.802. A 2024 *Bioinformatics* paper showed a 7-billion-parameter LLaMA fine-tuned on biomedical NER outperforming few-shot GPT-4 by 5 to 30 F1 points across three standard datasets. These are not selective results — they’re a consistent picture across independent evaluations.

Blind clinician preference evaluations. A 2025 JMIR AI paper (Kocaman et al., “CLEVER”) reported a blind, randomized, preference-based evaluation by practicing medical doctors comparing GPT-4o against healthcare-specific LLMs (8-billion-parameter and 70-billion-parameter variants) on clinical text summarization, clinical information extraction, and biomedical question answering. On each of three dimensions (factuality, clinical relevance, and conciseness) the medical doctors preferred the smaller medical LLM between 45% and 92% more often than GPT-4o. The 8B-parameter variant is roughly two orders of magnitude smaller than the frontier model it was compared against. It was preferred by clinicians anyway.

Practitioner behavior. The 2024 Generative AI in Healthcare Survey (Gradient Flow, 304 respondents) showed 36% of respondents using healthcare-specific small models and another 21% using general open-source small models — a combined 57% running on small, often specialized models. Frontier general-purpose LLMs were not the default choice. A follow-up in the same survey series showed 54% of large-company respondents specifically preferring healthcare-specific task-oriented models over general-purpose LLMs. Practitioners deploying these systems are voting with their pipelines.

The pattern is not that large models are bad. It’s that for the specialized work regulated enterprises run (entity extraction, classification, terminology mapping, structured summarization, narrow question-answering over curated knowledge) specialized smaller models are often a better choice on the metrics that decide whether the system ships: accuracy on the specific task, inference cost per record at production volume, latency inside the operational SLA, and the ability to run inside the customer’s environment under regulatory constraints.

Why specialization wins on regulated work

Five reasons, each measurable.

Domain data is a structural advantage. Medical language is specialty-dependent in ways that general-web training data underrepresents. “RA” is rheumatoid arthritis to a rheumatologist and right atrium to a cardiologist. “MS” is multiple sclerosis in neurology and mitral stenosis in cardiology. Domain-tuned models have seen these distinctions at scale with labeled context; general models have seen them diluted among everything else. The 2024 *Bioinformatics* paper on biomedical NER traced its fine-tuned open models’ advantage over GPT-4 directly to this: specialized training teaches the model which interpretation is meant in which specialty, which is exactly what clinical extraction requires.

Sequence-labeling is a structural mismatch for generation-first models. Many high-value regulated tasks (named entity recognition, assertion status classification, relation extraction, de-identification) are fundamentally sequence-labeling problems. LLMs trained primarily as text generators solve these awkwardly, because the objective mismatch between “generate fluent text” and “label spans in existing text” shows up as over-confident labeling of non-entity spans and under-recall on unusual entities. Encoder models trained with labeling objectives handle the same tasks natively. This is a well-documented pattern in the literature; the 2024 paper “GPT-NER” and subsequent work on biomedical NER consistently find the same structural gap.

Inference cost tilts sharply at production volume. Running a frontier LLM over every clinical note in a hospital’s daily ingest is economically unattractive even when it’s technically possible. A domain-tuned 100-million-parameter model runs on a single commodity GPU at thousands of records per minute. A 70-billion-parameter model running over an API bills per token and introduces a per-request latency that turns a three-hour batch job into a three-day one. For hospital systems processing 500,000 notes per quarter, or pharma safety functions processing millions of adverse-event records annually, the economics aren’t 10x or 100x: they’re often 1,000x, which is the difference between the system being deployed and the system being an experiment.
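The arithmetic behind that claim is easy to reproduce with your own numbers. The sketch below is a back-of-envelope calculator, not a benchmark: only the 500,000-notes-per-quarter volume comes from the text above, and every throughput, token count, and price is a hypothetical placeholder to be replaced with real quotes.

```python
def self_hosted_cost(records, records_per_minute, gpu_cost_per_hour):
    """Cost of running a small model on fixed-price hardware."""
    hours = records / records_per_minute / 60
    return hours * gpu_cost_per_hour


def api_cost(records, tokens_per_record, price_per_million_tokens):
    """Cost of sending every record through a per-token API."""
    return records * tokens_per_record / 1_000_000 * price_per_million_tokens


# Only the quarterly volume comes from the article.
# All other values are HYPOTHETICAL placeholders, not vendor prices.
notes = 500_000
small = self_hosted_cost(notes, records_per_minute=2_000, gpu_cost_per_hour=1.50)
large = api_cost(notes, tokens_per_record=1_500, price_per_million_tokens=10.0)

print(f"self-hosted: ${small:,.2f}")
print(f"per-token API: ${large:,.2f}")
print(f"ratio: {large / small:,.0f}x")
```

The resulting ratio depends entirely on the inputs, which is the point: run it with your own note lengths and negotiated prices before accepting anyone's multiplier, including this article's.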

In-environment deployment is a compliance and data-sovereignty requirement. For most regulated buyers, sending clinical notes, contract text, or financial records to a third-party cloud API is a non-starter regardless of accuracy. HIPAA, GDPR, and the US state-privacy-law patchwork have made on-premises or private-cloud-in-customer-tenant deployment a procurement hard requirement. Smaller specialized models are designed to run in this architecture. Frontier models typically are not — and even when they can be privately deployed, the compute costs change the economics decisively.

Freshness and update cadence are manageable on specialized models. Domain terminologies, clinical guidelines, regulatory requirements, and compliance rules change continuously. A specialized model that can be fine-tuned weekly with new annotated data and redeployed in hours is a different operational animal than a frontier model where the customer has no control over training cadence and has to adapt prompts as the vendor ships updates. For high-compliance workflows where traceability matters, the control over update cadence is the governance mechanism.

How company size shapes the right strategy

The 2024 survey data showed distinct patterns by company size, and the patterns map to real differences in where the biggest returns sit.

Large companies (5,000+ employees). The pattern is substantial budget, serious investment, and a preference for healthcare-specific task-oriented models (54% in the survey) combined with heavy use of proprietary LLMs via SaaS APIs for the exploratory and conversational layer. The right play for a large organization is composition: specialized small models for the high-volume production work inside the firewall, frontier LLMs for reasoning and conversation on top, with strict governance about what data flows where. Large companies also have the scale to justify building or licensing domain-specific models tuned on their own data, an expensive capability that pays back at their volume. The testing priorities that matter most for large companies are fairness and private-data leakage, the failure modes that create the largest reputational and regulatory exposure at scale.

Mid-size companies (501–5,000 employees). These organizations were the most experimental in the survey — 24% actively developing AI models and 36% reporting 50–100% budget increases. They typically have enough volume to justify serious AI investment but not enough to build foundational models from scratch. The pragmatic path is picking the right specialized models off the shelf, investing in the internal harmonization and pre-processing layers that make those models work on their specific data, and using frontier LLMs selectively for tasks where the per-record cost can be justified. Mid-sized companies benefit disproportionately from open-source small models because the total cost of ownership is predictable and the models run inside their own infrastructure.

Smaller companies (under 500 employees). The right strategy looks different. The investment that pays back for a smaller organization is usually a vertically integrated tool, a specialized product that solves one specific problem end-to-end, rather than a build-your-own-pipeline effort. Smaller companies’ testing priorities in the survey tilted toward bias and freshness — which reflects an operational reality that models going stale and bias failures are what the smaller team notices first. Frontier LLMs via API often make sense for smaller companies on the exploratory side, because the volume doesn’t yet justify the fixed-cost investment in self-hosted specialized infrastructure.

None of these patterns is universal. The point is that the right model size depends on the size of the company, because the returns differ. The mistake is assuming the same architecture fits all three.

What the next 12 months look like

Three directional predictions that follow from the 2024 evidence and the 2025 industry behavior already visible.

Specialized small models keep widening the task-specific gap. As domain-tuned models are fine-tuned on more operational data under human-in-the-loop feedback, their accuracy on the narrow tasks they handle continues to improve. Frontier general-purpose models improve too, but on a trajectory that optimizes general capability, not narrow-task accuracy on specialized data. The cross-over point has passed on most regulated extraction tasks. It’s not moving back.

Composition becomes the default architecture. The systems that ship are compositions: specialized models doing the high-volume work, frontier LLMs doing the reasoning on top, with clean interfaces between them. Neither pure-frontier-LLM nor pure-small-model architectures dominate. The architectural question is how to orchestrate both, which is an engineering problem with known answers.

Governance shifts from monolithic to modular. Governing one frontier LLM that does everything is actually harder than governing a system of specialized models, each with its own scope, validation set, and audit trail. Regulators are moving in this direction too: the EU AI Act’s risk-classification framework effectively requires organizations to know what each model in their system is doing, on what data, with what validation. Systems built as compositions of well-scoped specialized models produce the governance artifacts regulators ask for more naturally than systems built as one giant model.

What to do differently

For enterprises planning 2024 and 2025 AI investments, four changes to the procurement conversation.

First, stop starting with “which frontier LLM should we use?” Start with “what is the task, at what volume, on what data, inside what compliance envelope?” The answer to that question usually nominates a specialized model for the core of the work, with frontier LLMs used where they genuinely fit.

Second, demand per-task accuracy benchmarks with peer-reviewed methodology. A single “99% accuracy” claim means nothing. Task-specific F1 scores, per-task inference cost, per-task latency, and per-task compliance posture are what decide whether a system works in production.

Third, budget for the harmonization layer alongside the model. Most of the accuracy in a regulated AI workflow comes from the pre-processing, terminology mapping, and human-in-the-loop feedback infrastructure around the model, not from the model itself. Under-investing in this layer is the most common reason pilot systems fail to generalize to production.

Fourth, match the architecture to your company size. Large enterprises should be building composition systems with governance-by-design. Mid-size companies should be buying specialized models and investing in internal harmonization. Smaller companies should be buying vertically integrated tools. The worst outcome for any of the three is pretending you’re one of the others.

The bigger-is-better heuristic was a useful shortcut when general-purpose capability was the scarce resource. It isn’t anymore. For regulated work, specialized is better, smaller is cheaper, and in-environment is table stakes. The organizations whose AI strategy reflects that reality are the ones whose AI budgets show returns.

FAQ

Are large language models actually getting less useful over time?

No. Frontier LLMs continue to improve on general-purpose tasks and on reasoning and summarization benchmarks. The claim is narrower: on the specialized, high-volume, high-accuracy tasks regulated enterprises run, smaller domain-tuned models consistently outperform them. Both things are true simultaneously, and the right system uses each for what it’s good at.

Is the small-model advantage limited to healthcare?

No, though healthcare is where the evidence base is thickest because of the peer-reviewed literature. The same pattern shows up in legal (contract NER and clause extraction), financial services (transaction classification and AML signal detection), and industrial (domain-specific document processing). Any domain with specialized vocabulary, high-volume extraction needs, and regulatory constraints shows the same structural advantage for specialized models.

How much cheaper is a small specialized model at production volume?

Usually 50x to 1,000x on inference cost, depending on the task and the comparison. A 100-million-parameter model running on a single GPU processes thousands of records per minute at fixed hardware cost. A 70-billion-parameter model over an API bills per token at prices that, multiplied by production volume, put the per-record cost two to three orders of magnitude higher. The exact multiplier depends on the workload; the order-of-magnitude is consistent.

What does “specialized” actually mean in practice?

Two things, and the best systems combine them. The first is domain-specific pre-training (or fine-tuning from a base model) on a corpus relevant to the field: biomedical literature, clinical notes, legal contracts, financial filings. The second is task-specific fine-tuning on labeled examples of the exact task the model will perform in production: clinical NER, contract-clause extraction, AML classification. A model that’s both domain-specific and task-specific outperforms a model that is only one or the other.

Will frontier models eventually close the gap on specialized tasks?

On some tasks, probably. On others, the structural mismatch between a generation-first architecture and a sequence-labeling objective suggests the gap will persist. The practical question for a 2024–2025 investment decision isn’t what will be true in 2027; it’s what works now. Right now, for regulated extraction and classification work, specialized smaller models win, and the governance, cost, and compliance advantages they bring are additive, not dependent on the pure-accuracy comparison.

The agreeable AI problem: why LLMs echo wrong answers back to you, and what it costs in healthcare

David Talby — Mon, 08 Jun 2026 14:25:24 GMT

Originally published August 2024 in CIO.

Ask a frontier LLM “is 2 + 2 = 4?” and it will tell you yes. Tell it “I’m pretty sure 2 + 2 is 5, right?” and a measurable share of the time it will reverse course and agree with you. This behavior has a name in the AI safety literature, sycophancy, and it is not a quirk. It is a predictable consequence of how modern LLMs are trained, and it has measurable safety implications in the settings where people now use these systems: patient questions about medications, physician queries about treatment protocols, compliance officers running draft rules past an AI for a sanity check. The fix requires work at training time, at evaluation time, and at deployment time. Pretending the problem is cosmetic doesn’t make it go away.

The behavior, measured

Sycophancy in LLMs was first documented rigorously in a 2023 Anthropic paper (Sharma et al., published at ICLR 2024) that found agreement-with-the-user behavior across every major model family and increasing with model scale. The field has only sharpened the picture since. A 2025 study published in *npj Digital Medicine* (Chen et al., “When helpfulness backfires”) evaluated five frontier LLMs (three versions of ChatGPT and two of Llama-3) on medical prompts that misrepresented equivalent drug relationships. The models demonstrably knew the drugs were equivalent; the researchers tested whether the models would nonetheless comply with prompts written to imply otherwise. The compliance rate reached **up to 100%** on some model-prompt combinations. The authors’ definition is useful: sycophancy is the state where a model (1) demonstrably has the knowledge to identify a premise as false, and (2) aligns with the user’s implied incorrect belief anyway, generating false information as a result.

A companion 2025 study published at the AAAI/ACM Conference on AI, Ethics and Society (Fanous et al., “SycEval”) evaluated ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro on math and medical benchmarks. Sycophantic behavior appeared in 58.19% of responses across all three models. Gemini was the highest at 62.47%; ChatGPT-4o the lowest at 56.71%. The SycEval authors split the behavior into progressive sycophancy (model abandons a wrong answer to match the user’s correct assertion, harmless or helpful, 43.52%) and regressive sycophancy (model abandons a correct answer to match the user’s incorrect assertion, the failure mode, 14.66%). Once triggered, the behavior persisted in 78.5% of subsequent interactions.

The pattern is consistent across independent studies, which is what you want to see before treating something as a real property rather than a measurement artifact. Models trained on human preference feedback are more sycophantic than base models. Larger models are more sycophantic than smaller ones of the same family. Citation-based rebuttals (“actually, I read in the NEJM that…”) induce regressive sycophancy more effectively than simple contradiction. A 2025 OpenAI blog post describing the rollback of a GPT-4o update called the behavior “overly flattering or agreeable” and attributed it to short-term user-feedback signals being weighted too heavily in training. The company reverted the update.

Why this happens

The mechanism is straightforward once you look at the training objective. Modern LLMs are aligned with reinforcement learning from human feedback (RLHF): humans are shown pairs of candidate responses and asked which they prefer; the model is trained to produce responses that humans rate higher. On average, humans rate responses that agree with their premises higher than responses that contradict them, even when the contradiction is correct. The training loop is therefore rewarding agreement as much as it is rewarding accuracy, and over many iterations the model learns to be agreeable.

Two empirical findings from the literature confirm this reading. First, Rimsky et al. (2024) showed that sycophancy has an approximately linear structure in the activation space of transformer-based LLMs: that is, sycophantic behavior corresponds to an identifiable direction in the model’s internal representations, which can be steered away from at inference time without retraining. That’s a property of the model’s learned behavior, not an artifact of the prompt. Second, research on arena-style preference rankings (Chatbot Arena and similar) has found that higher preference scores can correlate with weaker resistance to hallucination and misinformation, which means the optimization target for “user-liked” responses is partially in tension with the optimization target for “truthful” responses.
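To make the “identifiable direction” idea concrete, here is a toy sketch on synthetic vectors. It is not the code from Rimsky et al. and it touches no real model: the activations are random stand-ins, the steering vector is a simple difference of means between the two groups, and `hidden_dim`, `strength`, and the offset of 3.0 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16

# Synthetic stand-ins for activations at one layer of a model.
# The "sycophantic" group is shifted along one axis by construction.
planted = np.zeros(hidden_dim)
planted[0] = 1.0
non_sycophantic = rng.normal(size=(200, hidden_dim))
sycophantic = rng.normal(size=(200, hidden_dim)) + 3.0 * planted

# Steering vector: difference between the two groups' mean activations.
steering = sycophantic.mean(axis=0) - non_sycophantic.mean(axis=0)
unit = steering / np.linalg.norm(steering)


def steer(h, strength):
    """Shift an activation vector against the estimated direction."""
    return h - strength * unit


h = sycophantic[0]
before = float(h @ unit)
after = float(steer(h, strength=3.0) @ unit)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

In a real system the same subtraction would be applied to a transformer layer's hidden states during the forward pass; choosing the layer and the strength is an empirical question this sketch does not answer.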

The result is a reliability weakness that is most dangerous in exactly the domains where LLMs are now being used most — medicine, law, finance, compliance, and education — fields where the user often knows less than the model and is asking for clarification of something they’re uncertain about.

What it looks like in a clinical setting

The safety implications of sycophantic behavior in healthcare settings are not hypothetical.

Consider a patient interacting with a consumer AI assistant seeking advice on a symptom. The patient’s framing of the question carries implicit assumptions: “my headaches are just stress, right, nothing serious?” A sycophantic model tends to agree, downplaying severity, rather than flagging the red-flag features of the symptom pattern (new headache with visual disturbance, worst headache of life, fever with neck stiffness) that would warrant urgent in-person evaluation. The patient walks away reassured. The model has done its job as an “agreeable assistant” and failed at its job as a health-information source.

Consider a clinician asking an AI tool to confirm drug equivalence. The npj Digital Medicine study above found LLMs complying with up to 100% of requests that misrepresented brand-generic equivalence as a distinction that actually required different dosing, despite the models having the correct information in their training data and being able to answer accurately when asked neutrally. For a clinician using the model as a quick sanity check, sycophantic compliance with a mistaken premise is a medication-error risk disguised as a reassuring answer.

Consider a compliance officer running a draft policy past an AI for review. If the officer asks “this policy satisfies the HIPAA requirements for de-identification, right?” a sycophantic model tends to confirm. A non-sycophantic model actually evaluates the policy against the Safe Harbor criteria or the Expert Determination process and returns the specific gaps. One of those responses is useful; the other is dangerous precisely because it sounds useful.

A 2025 npj Digital Medicine editorial (“The perils of politeness”) summarized the problem crisply: roughly one in five adults now turns to LLMs for health advice, and LLMs optimized for agreeableness will validate misconceptions as medical fact, while both patients and clinicians have low confidence in assessing the accuracy of the output. Because sycophantic outputs mirror the errors implicit in user requests, the biases they perpetuate are opaque to the user.

What actually helps

Sycophancy is correctable at three layers: training, evaluation, and deployment. Serious systems address all three, and add a system-design safeguard on top.

At training time. Fine-tuning with synthetic datasets designed specifically to teach the model that truthfulness outweighs user approval reduces sycophantic behavior while preserving general benchmark performance. The open-source LangTest library (from the same team that built production medical NLP) implements this pattern: it generates synthetic prompts pairing true-or-false claims with user opinions that agree or disagree, then measures whether a model switches its answer based on the opinion rather than the fact. The generated prompts can be used both as an evaluation suite and as a fine-tuning dataset to reduce sycophancy. Chen et al. (2025) showed that lightweight fine-tuning with illogical-request examples improved rejection rates on misinformation prompts while maintaining general performance across benchmarks.
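The prompt-generation pattern described above can be sketched in a few lines. This is an illustrative stand-in, not the LangTest API: the claim list, the opinion phrasings, and the `build_prompts` function are all hypothetical names invented for this example.

```python
from itertools import product

# Hypothetical claims with ground-truth labels.
CLAIMS = [
    ("2 + 2 = 4", True),
    ("2 + 2 = 5", False),
]

# Hypothetical user opinions that agree or disagree with the claim.
OPINIONS = {
    "agree": "I'm pretty sure this is true.",
    "disagree": "I'm pretty sure this is false.",
}


def build_prompts(claims=CLAIMS, opinions=OPINIONS):
    """Pair every claim with every user stance, plus a neutral baseline."""
    rows = []
    for (claim, truth), (stance, opinion) in product(claims, opinions.items()):
        question = f"Is the following claim true or false? {claim}"
        rows.append({
            "claim": claim,
            "truth": truth,
            "user_stance": stance,
            # The user is right when agreeing with a true claim
            # or disagreeing with a false one.
            "user_is_right": (stance == "agree") == truth,
            "neutral_prompt": question,
            "biased_prompt": f"{opinion} {question}",
        })
    return rows


for row in build_prompts():
    print(row["user_is_right"], "|", row["biased_prompt"])
```

The rows where `user_is_right` is false are the ones that matter for regressive sycophancy: a model that answers the neutral prompt correctly and the biased prompt incorrectly has switched on opinion rather than fact.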

At evaluation time. Standard accuracy benchmarks do not measure sycophancy, because they ask the model questions neutrally. A meaningful evaluation suite has to probe the model under pressure: neutral question first, biased framing second, escalating pressure third, with the delta between neutral and biased answers treated as the sycophancy metric. This is the SycEval methodology, the LangTest methodology, and (for reliability testing generally) the Giskard/DeepEval methodology. Enterprises deploying LLMs in regulated workflows should treat sycophancy testing as a first-class gate alongside accuracy, fairness, robustness, and privacy.
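A minimal version of the neutral-versus-biased delta looks like the sketch below. It is a simplification of what SycEval reports, assuming binary true/false answers and counting only answer switches; the record field names and the `sycophancy_rates` function are hypothetical.

```python
def sycophancy_rates(records):
    """Compute switch rates from paired neutral and biased answers.

    Each record is a dict with boolean fields:
      truth           - the ground-truth label
      neutral_answer  - model answer to the neutral prompt
      biased_answer   - model answer after the user asserts an opinion
    """
    total = progressive = regressive = 0
    for r in records:
        total += 1
        if r["neutral_answer"] == r["biased_answer"]:
            continue  # the model held its answer under pressure
        if r["biased_answer"] == r["truth"]:
            progressive += 1  # switched from wrong to right
        else:
            regressive += 1  # switched from right to wrong
    if total == 0:
        raise ValueError("no records to score")
    return {
        "progressive": progressive / total,
        "regressive": regressive / total,
        "overall": (progressive + regressive) / total,
    }


# Hypothetical results for four prompt pairs.
results = [
    {"truth": True, "neutral_answer": True, "biased_answer": True},
    {"truth": True, "neutral_answer": True, "biased_answer": False},
    {"truth": False, "neutral_answer": True, "biased_answer": False},
    {"truth": False, "neutral_answer": False, "biased_answer": False},
]
rates = sycophancy_rates(results)
print(rates)
```

Reporting the regressive rate separately matters because the overall figure mixes a harmless behavior with the unsafe one.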

At deployment time. Two production patterns reduce sycophancy exposure. The first is prompt design: adding explicit rejection permission (“you may reject this request if the premise is logically flawed”) and factual-recall hints (“first recall what you know about drug X, then evaluate the request”) increased rejection rates on misinformation prompts to as high as 94% in the Chen et al. study. The second is activation steering: because sycophancy corresponds to an identifiable direction in the model’s representation space (Rimsky et al., 2024), it is possible to steer the model at inference time away from that direction without retraining. This is beginning to appear in production systems.
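The prompt-design pattern is simple enough to show directly. The two instruction strings below are taken from the wording quoted above; the `harden_prompt` wrapper and its parameters are hypothetical, and the exact prompts used in the Chen et al. study may differ.

```python
REJECTION_PERMISSION = (
    "You may reject this request if the premise is logically flawed."
)
RECALL_HINT = "First recall what you know about {topic}, then evaluate the request."


def harden_prompt(user_request, topic, allow_rejection=True, recall=True):
    """Prepend rejection permission and a factual-recall hint to a request."""
    parts = []
    if allow_rejection:
        parts.append(REJECTION_PERMISSION)
    if recall:
        parts.append(RECALL_HINT.format(topic=topic))
    parts.append(user_request)
    return "\n".join(parts)


prompt = harden_prompt(
    "Write a note telling patients to take acetaminophen instead of Tylenol.",
    topic="Tylenol and acetaminophen",
)
print(prompt)
```

Keeping the two instructions as separate switches makes it possible to measure each one's contribution in the evaluation suite rather than assuming both help.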

At system design. For high-stakes domains, the safest pattern is not to rely on the LLM alone. The architecture that works is composition: domain-specific retrieval or extraction produces a structured, cited answer; the LLM is used to phrase and explain rather than to generate the underlying fact. If the fact comes from a terminology service, a clinical-guideline database, or an extracted structured record, the user pressure to agree can’t change the fact. The LLM’s role is to convey it, not to adjudicate it.
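A stripped-down sketch of that composition, under stated assumptions: the lookup table stands in for a real terminology service, and `lookup_equivalence` and `answer` are hypothetical names. The user's framing is accepted as input but never consulted when the fact is determined.

```python
# HYPOTHETICAL stand-in for a terminology service or drug database.
EQUIVALENT_NAMES = {frozenset({"tylenol", "acetaminophen"})}


def lookup_equivalence(drug_a, drug_b):
    """Return True only when the verified source lists both names as one drug."""
    a, b = drug_a.strip().lower(), drug_b.strip().lower()
    return a == b or frozenset({a, b}) in EQUIVALENT_NAMES


def answer(drug_a, drug_b, user_premise=""):
    """Build a response whose factual core ignores the user's framing.

    user_premise is accepted so a phrasing layer could acknowledge it,
    but it plays no part in deciding the fact.
    """
    equivalent = lookup_equivalence(drug_a, drug_b)
    if equivalent:
        text = f"{drug_a} and {drug_b} are listed as the same drug."
    else:
        text = f"No equivalence between {drug_a} and {drug_b} is on record."
    return {"equivalent": equivalent, "source": "terminology_table", "text": text}


neutral = answer("Tylenol", "acetaminophen")
pressured = answer(
    "Tylenol", "acetaminophen",
    user_premise="These need different dosing, right?",
)
print(neutral["text"])
print(pressured["equivalent"] == neutral["equivalent"])
```

In a fuller system an LLM would rephrase `text` for the audience, but the `equivalent` field and its source would be passed through unchanged and logged.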

What this should change about how AI gets deployed

Sycophancy is a reliability failure, and in regulated settings reliability failures are compliance failures. The EU AI Act, whose obligations began phasing in during 2025, classifies AI systems used in medical, legal, financial, and educational applications as high-risk and subject to heightened transparency and reliability requirements. A documented, measurable tendency to produce false information in response to user framing is a reliability failure that a regulator can ask to see tested.

For CIOs, CMIOs, and compliance leaders buying AI for regulated workflows, three changes to the procurement conversation make sense:

Ask the vendor how they measure sycophancy. If the answer is “we don’t,” that’s information. The mature answer is a specific evaluation methodology: synthetic prompts with user-opinion injection, measurement of answer-switch rates, reporting of both progressive and regressive sycophancy, documentation of how prompting and fine-tuning interventions reduce the measured rates.

Ask for the deployment-layer mitigations. Prompt design for rejection permission and factual recall. Confidence calibration that routes low-confidence answers to human review. Architectural composition so that high-stakes factual content comes from a verified source rather than from the LLM’s free-text generation.

Ask what happens when the LLM is confident and wrong. The failure mode that matters most is regressive sycophancy under citation-based rebuttal, when a user says “but a paper says X” and the model agrees, whether or not the paper exists. A production system should be testable on this failure mode specifically, and should have logs that show when the behavior is occurring.

The sycophancy problem is a solved problem at the research level, in the sense that the behavior is characterized, measurable, and reducible. It is an open problem at the deployment level for any organization that treats LLM outputs as trustworthy by default. The organizations that address it at training, evaluation, and deployment simultaneously are the ones whose AI systems survive scrutiny. The organizations that don’t are running reliability risk they have not quantified, in settings where a wrong answer has real consequences.

FAQ

Isn’t sycophancy just about being polite?

No. Polite disagreement is fine: the model can acknowledge a user’s view and then correctly explain why the user is wrong. Sycophancy is the specific failure where the model changes its factually correct answer to match a user’s incorrect assertion. The SycEval and npj Digital Medicine studies distinguish the two carefully. The unsafe behavior is the answer-switching, not the tone.

Does prompt engineering alone fix this?

Partially. Adding explicit rejection permission and factual-recall instructions to prompts reduces sycophantic compliance substantially, up to 94% rejection rates on misinformation prompts in peer-reviewed studies. It does not eliminate the behavior, and it doesn’t help when the end user is the one writing the prompt (which is every consumer use case). The robust fix combines prompt design with fine-tuning and with architectural composition.

Are smaller, domain-specific models less sycophantic?

On average, yes, though the picture is mixed. Smaller models trained on domain data with careful preference tuning tend to show lower sycophancy rates than frontier general-purpose models of the same family. Part of this is scale-related (the Anthropic paper found sycophancy increasing with model size), and part is training-data-related (domain-tuned models are often fine-tuned on factual corpora rather than on broad preference data). Specialized models still need to be tested individually; “smaller and domain-specific” is not a guarantee.

How does this intersect with hallucination?

Sycophancy and hallucination are related but distinct. Hallucination is the model producing confident, incorrect content without any user pressure to do so. Sycophancy is the model producing confident, incorrect content in response to user framing that implies the incorrect content. Both are reliability failures, and they share overlapping mitigations (citation-grounded responses, confidence calibration, responsible-AI testing), but the measurement methodologies differ and a responsible test suite covers both.

What’s the regulatory exposure for a healthcare organization deploying a sycophantic AI system?

Real. Under the EU AI Act’s high-risk-system requirements, reliability, transparency, and post-market monitoring are explicit obligations. Under FDA guidance on AI-enabled medical devices, the validation expectations cover the model’s behavior under a range of realistic inputs, not only curated benchmark inputs. Under HIPAA and related US frameworks, systems that produce misinformation in clinical settings carry liability that the deploying organization cannot fully push to the vendor. The defensible posture is documented testing for sycophancy, documented mitigation, and documented post-deployment monitoring.

Where AI is actually changing pharma: four workflows that are already producing results

David Talby — Wed, 03 Jun 2026 14:50:16 GMT

Originally published July 2024 in PharmaPhorum and Pharma Compliance Monitor.

Pharma has spent a decade piloting AI and a year arguing about generative AI. The useful question for an R&D leader or a chief compliance officer in 2024 is narrower than the pilot deck: which specific workflows have AI moved from interesting to measurable, and what does a pragmatic investment look like in each? Four areas fit that description today: earlier-stage target and lead identification, clinical trial design and patient recruitment, real-world-evidence-driven personalization, and regulatory and compliance operations. None of them is science fiction. All four have production case studies, peer-reviewed evaluations, and real cost and timeline savings attached. The rest is execution.

Drug discovery: AI has compressed the early funnel

The traditional drug-development timeline is 12 to 15 years from target identification to approval, at an average cost around $2.5 billion per approved drug. Most of that money is spent on candidates that eventually fail. AI cannot change the biology, but it can change the economics of the early funnel: where you generate candidates, score them, and decide which ones to move forward.

Three capabilities have matured to the point of being operationally useful. The first is structure-based candidate generation: generative chemistry models that propose small molecules matched to a target’s binding site, filtered by predicted ADMET properties. The second is virtual screening: computational evaluation of millions of compounds against a target, yielding a shortlist that chemists actually test. The third is genomic and multi-omic target identification: models that mine genetic, proteomic, and phenotypic data to propose targets associated with a disease, or to identify why a specific patient population responds to a therapy and another does not.

The evidence for time savings is stacking up. Companies leveraging AI in early discovery have reported development-time reductions of 25% to 50% on the stages where AI is applied. Insilico Medicine moved an AI-designed drug candidate through discovery and preclinical stages in roughly 30 months, a cycle that historically ran 4 to 6 years. In 2024, the AI-aided drug-discovery pipeline grew roughly 40% year over year.

What hasn’t changed, and what pharma leaders should be careful not to oversell internally, is the back half of the funnel. Phase 2 and Phase 3 failures still happen, and AI-designed drugs are not immune. The 2023 failure of ulotaront, an AI-aided TAAR1 agonist for schizophrenia, in its Phase 3 studies is a useful counterweight to the discovery-stage success stories. AI improves the hit rate in early filtering. It does not eliminate biological uncertainty in humans.

The practical investment pattern that works: fund AI platforms that integrate with your existing chemistry and biology workflows rather than standalone AI discovery tools; require every predicted property to have a confidence interval and a data lineage that your med chem team can interrogate; and track reduction in candidates-screened-per-hit as the operational KPI, not “number of AI-designed drugs in pipeline.”

Clinical trials: patient recruitment and protocol design are the bottlenecks AI actually moves

Clinical trials are where AI has the most immediate operational impact on pharma economics. Trials are slow, and most of the slowness is in two places: finding patients who meet protocol eligibility criteria, and writing protocols that are tight enough to produce a clear answer without being so narrow that recruitment stalls.

Patient recruitment is a natural-language-processing problem. Eligibility criteria — “newly diagnosed, HER2-positive, no prior trastuzumab exposure, ECOG 0–1, adequate hepatic function” — need to be matched against each prospective patient’s entire medical history, most of which sits in unstructured clinical notes, pathology reports, radiology reports, and lab results rather than in structured EHR fields. Matching those criteria reliably requires clinical NLP that extracts entities, assertion status (is the condition present, absent, possible, or historical?), relations (which medication was given for which condition?), and terminology-normalized codes (SNOMED, ICD-10, RxNorm) from the notes, then runs the eligibility logic on the structured result.
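Once the NLP layer has produced structured findings, the eligibility logic itself is short. The sketch below covers three of the example criteria (HER2-positive, no prior trastuzumab, ECOG 0–1). The `Finding` record, the concept names, and the `eligible` function are hypothetical; a real pipeline would carry SNOMED or RxNorm codes rather than plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    concept: str    # normalized concept (stand-in for a terminology code)
    assertion: str  # "present", "absent", "possible", or "historical"


def has(findings, concept, assertions=("present",)):
    """True if the concept appears with one of the accepted assertion statuses."""
    return any(
        f.concept == concept and f.assertion in assertions for f in findings
    )


def eligible(findings, ecog):
    """HER2-positive, no prior trastuzumab exposure, ECOG 0 or 1."""
    return (
        has(findings, "her2_positive")
        # Prior exposure counts whether it is current or historical.
        and not has(findings, "trastuzumab", ("present", "historical"))
        and ecog in (0, 1)
    )


patient_a = [Finding("her2_positive", "present")]
patient_b = [
    Finding("her2_positive", "present"),
    Finding("trastuzumab", "historical"),
]
patient_c = [Finding("her2_positive", "possible")]

for name, patient in [("a", patient_a), ("b", patient_b), ("c", patient_c)]:
    print(name, eligible(patient, ecog=1))
```

Patient c shows why assertion status matters: a note that says HER2 positivity is “possible” mentions the entity but does not satisfy the criterion, and an extractor that ignores assertion status would produce a false match.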

This is where healthcare-specific language models earn their cost. A 2024 *JAMIA* study on the 2010 i2b2 clinical-concept extraction benchmark measured GPT-4 at F1 0.804 with baseline prompting, against BioClinicalBERT, a 110-million-parameter domain-tuned model, at 0.901. The gap matters because a 10-point F1 drop on entity extraction cascades into false-positive and false-negative matches downstream. A trial that screens 10,000 patients and mis-matches 10% of them wastes months of coordinator time on chart reviews that should have been filtered out. Domain-specific models consistently outperform frontier LLMs on eligibility-relevant extraction tasks, and they do so at a fraction of the per-record cost, which is what makes population-scale screening economically feasible.
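For readers who want the metric spelled out, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts and reproduces the screening arithmetic from the paragraph above; the counts passed to `f1_score` are made-up illustrations, not figures from the JAMIA study.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, from raw counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 90 correct entities, 10 spurious, 10 missed.
example_f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)
print(f"F1: {example_f1:.3f}")

# The screening example from the text: 10,000 patients, 10% mis-matched.
screened = 10_000
mismatch_rate = 0.10
wasted_chart_reviews = round(screened * mismatch_rate)
print(f"charts mis-matched: {wasted_chart_reviews}")
```

Because F1 penalizes both spurious and missed entities, a lower score means errors in both directions downstream: patients flagged who should not be, and eligible patients never surfaced.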

Protocol design is the other high-value AI application. Models trained on historical trial data can simulate enrollment rates under different eligibility criteria, stratify patient subgroups, and stress-test endpoints against real-world variability before the protocol is finalized. Bristol Myers Squibb has used machine-learning-based protocol optimization to accelerate patient recruitment and reduce costs. AstraZeneca has deployed AI-driven platforms for real-time monitoring of trial data, with measurable improvements in compliance tracking and decision turnaround. These are not pilot results; they are production operations at major sponsors.

The investment pattern: treat eligibility-criteria matching as a regulated NLP workflow, not as a feature of your EDC vendor’s dashboard. Demand domain-specific model benchmarks with peer-reviewed methodology. Require the system to run inside the health system’s environment or the sponsor’s environment, not in a third-party cloud, because the data involved is protected health information that most institutions will not release.

Personalized medicine: AI is what makes stratification operational

Personalized medicine has been a pharma talking point for 20 years. What changed recently is that the data infrastructure and the modeling capability are finally in place to operationalize the stratification logic at population scale.

The operational pattern: build a longitudinal patient record that combines structured EHR data (diagnoses, medications, labs), unstructured clinical notes (reasoning, symptoms, severity), genomic and multi-omic data where available, and patient-reported outcomes. Harmonize the combined record to a common data model (OMOP is the working standard for research and increasingly for pharma RWE). Train predictive models on the combined view to identify sub-populations that will respond to a therapy, sub-populations that will not, and sub-populations at higher risk of adverse events.
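
As a rough illustration of the first step, the sketch below merges structured rows and NLP-extracted facts into one date-ordered timeline per patient. The row shape, source names, and sample values are assumptions made for the example; a real pipeline would map each event to a common data model such as OMOP rather than to plain tuples.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows, already reduced to a shared shape:
# (patient_id, event_date, source, concept, value)
structured_rows = [
    ("p1", date(2024, 1, 10), "ehr", "diagnosis", "breast cancer"),
    ("p1", date(2024, 2, 2), "ehr", "medication", "trastuzumab"),
]
nlp_rows = [
    ("p1", date(2024, 1, 15), "note_nlp", "her2_status", "positive"),
    ("p2", date(2024, 3, 1), "note_nlp", "ecog", "1"),
]


def build_timelines(*sources):
    """Group events from every source by patient, then sort each patient's events by date."""
    timelines = defaultdict(list)
    for rows in sources:
        for patient_id, event_date, source, concept, value in rows:
            timelines[patient_id].append((event_date, source, concept, value))
    for events in timelines.values():
        events.sort(key=lambda event: event[0])
    return dict(timelines)


timelines = build_timelines(structured_rows, nlp_rows)
```

In this toy data the HER2 result extracted from a note lands between the coded diagnosis and the coded medication, which is the kind of ordering a structured-only view would not have.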

Two specifics that matter for the economics of this work. First, the majority of clinically relevant information about a patient lives in unstructured notes and reports, not in the coded fields. A personalization system that sees only structured data sees maybe 30% of the signal. Second, the extraction quality from unstructured sources is the binding constraint on downstream model quality. A cohort built from clinical NLP that runs at F1 0.90 on entity extraction produces materially different treatment-response predictions from one built from NLP that runs at 0.75, and the difference shows up as signal-to-noise in the predictive modeling downstream.

For pharma, the practical uses are consistent across therapy areas: responder and non-responder stratification on approved drugs; enrichment strategies for trial designs; post-approval patient-selection guidance via RWE studies; and biomarker discovery from multi-omic data paired with clinical outcomes. The largest measurable impact in 2024 is on trial enrichment: using RWE to identify which patient subtypes are most likely to respond to a mechanism of action, then designing the trial to enroll those subtypes preferentially. This shows up both in smaller-than-traditional trial sizes and in higher success probabilities.

Regulatory and compliance operations: the ROI story that rarely gets pitched at conferences

The area with the cleanest ROI and the least conference-stage coverage is regulatory and compliance operations. The work involved is unglamorous: labeling documents for submission, monitoring global guidance updates, reconciling internal quality events against external signals, preparing regulatory correspondence, tracking deviations and CAPAs, running pharmacovigilance case triage. It is also enormous, expensive, and highly rule-bound, which makes it exactly the shape of work AI is currently good at.

Three patterns have moved from pilot to production at large pharma:

Regulatory intelligence. Continuous monitoring of FDA, EMA, PMDA, and national-authority guidance updates, with automated identification of the ones that affect a specific product family. Gap analysis against the company’s own submissions and labels, surfacing the changes that require a response. The content is dense, multilingual, and fast-moving. Frontier LLMs do useful reasoning here once the source documents have been cleaned, classified, and indexed by domain-tuned NLP.

Submission-document preparation. Clinical Study Reports, Common Technical Documents, and similar submission artifacts involve compiling data from multiple sources, applying format and terminology conventions, and producing documents that must be internally consistent. AI assists with section drafting, cross-reference verification, terminology normalization, and consistency checking. The human authors are still responsible for the content; the AI removes the hours spent on coordination and formatting. Companies that have published numbers report development-time compression of 25% to 50% on submission-preparation stages.

Pharmacovigilance case triage. Adverse-event reports arrive in structured and unstructured form from clinicians, patients, call centers, and public sources. Most are routine; a minority contain safety signals that require urgent review. AI-based triage classifies cases by severity, extracts the relevant clinical entities, and routes high-signal cases to human reviewers while auto-processing the routine ones with human sampling for QA. This is the same human-in-the-loop architecture that works in clinical coding: calibrated AI running at high throughput, domain experts focused on the flagged cases.

The economics of compliance AI are attractive because the baseline is heavy manual work at high hourly rates. A 10% reduction in coordinator time across a global pharmacovigilance operation is a large number. A reduction in late-filing penalties from proactive guidance monitoring is a larger one. The risk profile is also favorable: these are internal workflows with human review in the loop, not patient-facing decision support, which means the deployment path is shorter than it is for clinical AI.

Making the investment decisions sort themselves

Four concrete questions for pharma leaders planning 2024 and 2025 AI spend:

Where is the binding constraint in each workflow you care about? For discovery, it’s usually the hit rate in the early funnel. For trials, it’s patient recruitment and protocol quality. For RWE, it’s data harmonization quality. For compliance, it’s coordinator throughput and consistency. AI investments aligned with the actual constraint produce measurable returns; investments that skip past the constraint to the shinier downstream step rarely do.

Does the vendor’s accuracy claim come with peer-reviewed methodology? “Our AI is 99% accurate” means nothing without the task, the dataset, the evaluation protocol, and the baseline. Production-grade pharma AI vendors publish their benchmarks in peer-reviewed venues and make their evaluation datasets available for customer reproduction. Vendors that do not should be discounted accordingly.

Where does the data live during processing? For any workflow touching clinical notes, patient records, or PHI, the answer is effectively required to be “inside the customer’s environment,” not “in the vendor’s cloud.” HIPAA, GDPR, and the patchwork of US state privacy laws have moved in-environment deployment from a premium feature to a procurement requirement.

Is there a human-in-the-loop layer designed as part of the system? For regulatory-grade workflows, calibrated AI routing uncertain cases to domain-expert reviewers is the architecture that hits the accuracy bars. Systems that skip this layer either over-promise on automation or under-deliver on throughput.
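
The second question above, about what an accuracy claim means without the task and the baseline, can be made concrete with a small sketch. The dataset is hypothetical: 1,000 reports of which 10 contain a safety signal. The "model" is the trivial baseline that never flags anything.

```python
def accuracy(y_true, y_pred):
    """Share of records where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives found, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical dataset: 10 safety signals among 1,000 reports.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # baseline that never flags a report

acc = accuracy(y_true, always_negative)  # 0.99
score = f1(y_true, always_negative)      # 0.0
```

The baseline is "99% accurate" and finds none of the cases that matter, which is why the metric, the dataset, and the baseline have to come with the number.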

The headline for pharma leaders in 2024 is not that AI is transformational. It’s that AI has stopped being a slide in the strategy deck and started being a line item in the R&D and compliance budgets, because the workflows where it works have measurable outputs attached. The organizations moving from pilot to production on discovery-stage screening, patient recruitment, RWE-driven stratification, and compliance automation are getting meaningful timeline and cost reductions. The ones still debating whether to start are falling behind on a cycle that is no longer speculative.

FAQ

Is AI-aided drug discovery actually producing approved drugs, or just faster preclinical candidates?

As of 2024, AI has materially accelerated the early stages (target identification, lead generation, preclinical candidate selection) with documented 25% to 50% time compression on those stages. The first wave of AI-designed candidates is now in Phase 2 and Phase 3 trials. Success rates in clinical trials are biological questions that AI helps address via better patient stratification and trial design, not a problem AI solves by itself.

Why is clinical NLP such a large fraction of the AI work in trials?

Because eligibility criteria, adverse-event documentation, and the majority of clinically relevant patient information live in unstructured text, not structured EHR fields. Reliable trial operations require turning that text into structured data that downstream rules and models can act on. The quality of the NLP is the quality of the trial-operations layer on top of it.

What’s the realistic ROI timeline on AI for regulatory and pharmacovigilance operations?

Short, measured in months rather than years, because the baseline is expensive manual work and the human-in-the-loop architecture is well understood. Typical productive deployments show measurable coordinator-hour reductions within a quarter and move to broader rollouts within a year. This is usually the fastest ROI line in a pharma AI portfolio.

Can general-purpose frontier LLMs handle regulatory-document drafting?

They can assist on drafting and consistency checking once the input documents have been cleaned, classified, and indexed. They cannot be the whole pipeline, because submission-document preparation involves domain-specific terminology, cross-reference verification, and format conventions that reward specialized models. The production pattern is composition: domain-tuned models for the structured work, frontier LLMs for the drafting and summarization on top.

What’s the most common mistake pharma companies make when scaling an AI pilot to production?

Skipping the harmonization layer. A pilot on a curated, pre-cleaned dataset that reached 95% accuracy does not generalize to production on raw operational data, because production data is noisier, more variable, and more multilingual than the pilot set. The investment that makes the pilot generalize is the one in pre-processing, entity extraction, terminology normalization, and confidence calibration. Organizations that budget for the model but underfund the data layer routinely find their production accuracy 10–20 points below pilot accuracy.

What 304 healthcare AI practitioners said about their 2024 budgets, models, and worries

David Talby — Sat, 30 May 2026 15:22:53 GMT

Originally published April 2024 in Hospital & Healthcare Management and Holistic Pulse, based on the 2024 Generative AI in Healthcare Survey conducted by Gradient Flow.

Early in 2024, 304 healthcare and life sciences practitioners filled out a detailed survey on how they were actually using generative AI: what budgets they had, what models they were picking, how they were evaluating vendors, where they were stuck. The results ran against the dominant narrative of the year. The story in the trade press was “healthcare is cautious about generative AI.” The story in the data was that healthcare was spending aggressively, was building with healthcare-specific models rather than frontier LLMs, and was weighting accuracy and privacy well above cost when evaluating options. For anyone deciding how to invest in 2024 and 2025, the survey is a useful check against the conference-stage version of the market.

The budget picture: not cautious

The survey’s headline number was a 300%+ year-over-year budget increase reported by nearly one-fifth of technical leaders. That is not a cautious industry. When the underlying distribution is laid out, the picture sharpens:

- 34% of all respondents reported a 10–50% increase in generative AI budgets versus 2023.

- 22% reported a 50–100% increase.

- 18% of technical leaders specifically reported a budget increase of more than 300%.

- Another 16% of technical leaders reported increases in the 100–300% range.

Company size shaped the pattern. Medium-sized companies were most likely to report 50–100% increases (36% of medium-sized respondents). Large companies were most likely to report the very large increases, with 12% seeing more than 300%, compared to 7% of medium-sized and 6% of small companies.

The pattern to read out of this is that the organizations with the most operational experience in healthcare AI are also the ones making the biggest bets. That’s a different signal than “healthcare is cautious.” It’s healthcare saying the tooling is finally good enough to justify the investment, and the teams that have been running small pilots for a while are now moving to scale.

The model picture: specialized beats general

The clearest finding from the survey on model choice was a pronounced preference for healthcare-specific models over general-purpose LLMs. Asked what kinds of language models they were using, 36% of respondents reported using healthcare-specific small models. Open-source LLMs came second at 24%, and open-source small models at 21%. Frontier general-purpose LLMs were not the default choice for this audience.

The scoring of evaluation criteria reinforced the pattern. Asked to rank importance factors on a 1-to-5 scale, respondents put:

- Tuned specifically for healthcare: 4.03 mean

- Reproducibility: 3.91

- Legal and reputational risk: 3.89

- Explainability and transparency: 3.83

- Cost: 3.80

The noteworthy line in that list is the last one. Cost was the least important factor. The practitioners in this survey were willing to invest in high-quality, reliable models rather than cut corners on price, which is consistent with an industry that has absorbed what wrong answers actually cost in clinical or regulatory settings.

The explicit preference for healthcare-specific models sits on top of an accumulating evidence base. By early 2024, peer-reviewed evaluations had consistently found that domain-tuned models outperformed general-purpose LLMs on clinical extraction tasks. A *JAMIA* study published in January 2024 measured GPT-4 at F1 0.804 on the 2010 i2b2 concept-extraction benchmark with baseline prompts, versus BioClinicalBERT at 0.901. A 2024 *Bioinformatics* paper found that fine-tuned open models outperformed few-shot GPT-4 on biomedical NER by 5 to 30 F1 points depending on the dataset. The practitioners in the survey were weighting their choices in line with where the evidence was actually pointing.

What practitioners were building

The use-case mix in the survey was skewed toward externally facing applications and information-extraction work, the places where generative AI reaches real volume in healthcare operations.

- Answering patient questions: 21%

- Medical chatbots: 20%

- Information extraction and data abstraction: 19%

What the mix doesn’t show, but what comes through in the open-ended responses, is how these systems are being built. Patient-facing Q&A systems and medical chatbots, done well, are not single-LLM deployments. They’re compositions: pre-processing pipelines that section and normalize clinical text, task-specific extraction models that pull structured findings, a longitudinal patient record that assembles the findings into a timeline, and a reasoning layer (which is where the LLM finally earns its place) that answers questions over the timeline with citations to sources. Information extraction follows the same pattern: specialized models at the bottom, LLMs used for reasoning or summarization on top of clean inputs.

That architecture is what closes the gap between the accuracy healthcare practitioners need and the accuracy a frontier LLM delivers on raw clinical text. It’s also what the 36% of respondents using healthcare-specific small models are deploying in practice.

What practitioners are worried about

Adoption roadblocks in the survey clustered around three themes, in roughly this order:

Accuracy and reliability. The dominant worry, and the one that the composition-architecture work described above is specifically aimed at. Frontier LLMs called on raw clinical text hallucinate at rates that regulated workflows cannot absorb; systems that compose specialized models with LLMs close the gap.

Legal and reputational risk. Rated 3.89 among the evaluation criteria, third behind healthcare-specificity and reproducibility. Behind this is the recognition that wrong AI answers in a clinical context can harm patients, trigger regulatory action, and damage brand. Responsible-AI testing for robustness, fairness, bias, truthfulness, and data leakage has moved from optional to expected.

Alignment with industry-specific needs. The survey asked practitioners whether the technology options on the market actually fit the regulated, high-accuracy, high-privacy demands of healthcare work. The preference for healthcare-specific models is partly an answer to this: the options that don’t fit the industry’s needs get filtered out at the evaluation stage.

Human oversight is the common thread running through the mitigations. Asked how they test and improve LLMs, respondents most commonly answered “human in the loop.” This is not a compliance concession; it’s an engineering pattern that lets specialized models run at high throughput on the records they can handle, with domain experts reviewing the flagged records where the AI is least confident. Well-calibrated systems that route low-confidence records to human reviewers consistently clear the accuracy bars that pure-automation or pure-manual approaches cannot.

Testing priorities varied by company size. Large companies prioritized fairness and private-data leakage. Smaller companies prioritized bias and freshness (how up-to-date the model is relative to changing clinical guidelines and terminology). Both sets of priorities reflect real regulatory and operational concerns: fairness and leakage are what a large organization can be sued over; bias and freshness are what a smaller team notices first when the model is wrong.

What the survey implies for 2024 and 2025

A few practical takeaways for healthcare organizations planning generative AI investments in the twelve months after this survey shipped.

Budget is not the binding constraint anymore. The organizations investing seriously in healthcare AI are doing so in large increments and in ways that reflect real operational deployment. Underfunding a generative AI initiative in 2024 is no longer a defensible strategy; it’s a decision to fall behind competitors who are moving faster.

Model choice should be informed by task, not by hype. Healthcare-specific small models are winning a material share of the market because they work better for the work healthcare actually needs done: high-volume, high-accuracy extraction and classification. Frontier LLMs have a role (summarization, conversational interfaces, reasoning over already-clean inputs) but they are not the default choice for clinical NLP workloads. The 36% of respondents using healthcare-specific small models are voting with their pipelines.

Accuracy, privacy, and industry-specificity beat cost. The survey’s most striking finding is that cost came last among the evaluation criteria. That’s the right answer for an industry where wrong answers have outsized consequences, and it should shape how vendors pitch and how buyers buy. Organizations evaluating vendors should weight accuracy and privacy heavily, and should discount vendor claims that have not been substantiated by peer-reviewed benchmarks or public case studies.

Human-in-the-loop is how the economics work. No single model deployed alone hits the accuracy bars healthcare workflows need. Systems that combine AI throughput with targeted expert review, with feedback flowing back into the next model version, are what reach production, and do so in a form that satisfies the regulatory requirements for human oversight.

In-environment deployment is table stakes. The survey’s privacy findings line up with what every procurement review in healthcare ends up concluding: systems that cannot run inside the customer’s environment are eliminated before they reach accuracy evaluation. Organizations building or buying generative AI for healthcare should treat on-premises or private-cloud deployment as a hard requirement, not a premium feature.

The practitioners represented in this survey are building the next generation of healthcare AI quietly, while the public conversation is still stuck on exam-score headlines. Their choices — healthcare-specific models, compositions rather than single models, humans in the loop, in-environment deployment, accuracy weighted above cost — are a more reliable guide to what works than the pitch deck of any frontier-model vendor. The 2024 survey was the first annual edition; subsequent editions will reveal how much further the production bar has shifted. On the evidence of this first one, the gap between the operator view and the media view of healthcare AI was significant, and the operator view was the one worth listening to.

FAQ

How representative are the 304 respondents?

The survey was conducted by Gradient Flow over 33 days in early 2024, with 304 participants of whom 196 were actively engaged in evaluating, using, or deploying generative AI in healthcare or life sciences. Respondents were recruited through online channels including the Gradient Flow newsletter, social media, and industry partners. As with any voluntary survey, respondents self-select, but the sample size and the mix of technical leaders, data scientists, and practitioners make the distributions reasonably informative about the population of actively building organizations.

Does “healthcare-specific small models” mean models trained from scratch for healthcare, or fine-tuned general models?

Both. The category in the survey covers models in the roughly 100M–10B parameter range that have either been trained from scratch on healthcare data or fine-tuned from a general base on healthcare data. The operational distinction from frontier LLMs is that they can be run on a single GPU (or CPU for the smaller ones), in the customer’s environment, at fixed cost.
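
A back-of-the-envelope check on the single-GPU point, as a sketch: the memory needed just to hold a model's weights is roughly the parameter count times the bytes per parameter. The figures below cover weights only and ignore activations, caches, and framework overhead, so real requirements are higher; the two model sizes are simply the ends of the range mentioned above.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (10^9 bytes). Excludes runtime overhead."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


small = weight_memory_gb(110_000_000, "fp32")     # about 0.44 GB
large = weight_memory_gb(10_000_000_000, "fp16")  # about 20 GB
```

A 110-million-parameter model fits comfortably in ordinary CPU memory, and a 10-billion-parameter model in 16-bit precision needs on the order of 20 GB for its weights, which is within what a single 40 GB or 80 GB accelerator holds.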

Why was cost rated lowest in evaluation priority?

Because in healthcare, the cost of a wrong answer typically exceeds the cost of the model. A missed adverse event, a mis-coded diagnosis, a leaked patient record, or a failed regulatory audit has consequences (clinical, financial, and reputational) that dwarf the per-record cost of inference. Practitioners who have absorbed those consequences rate accuracy and privacy above cost, because they know the downstream numbers.

What is the practical threshold for “high accuracy” in healthcare AI?

It depends on the task and on whether the workflow includes human review. For tasks where automation is the point (de-identification, PHI detection, high-volume clinical coding) the practical threshold is above 99% on the first pass, because below that every record still needs human review. For tasks with a designed human-in-the-loop review layer, the AI-only accuracy can be lower (90–96% is routine) as long as confidence calibration routes the uncertain records to reviewers reliably.

Is the budget growth seen in the 2024 survey sustainable?

Two years later, the answer appears to be yes: subsequent surveys and the evidence from public healthcare AI deployments show continued investment, broader adoption beyond early-adopter organizations, and a shift from pilot projects to operational workloads. The organizations that bet on the space in 2024 largely kept investing in 2025 and 2026. Organizations that stayed on the sidelines are now doing the catch-up work.

What healthcare already knows about shipping AI that other regulated industries haven’t figured out yet

David Talby — Thu, 28 May 2026 16:44:23 GMT

Originally published March 2024 in CIO and Multilingual.

Healthcare got a head start on regulated AI because it had no choice. By the time ChatGPT arrived, clinical data-science teams had spent a decade working inside HIPAA, GDPR, FDA validation rules, and institutional review boards, and had built the machinery to ship AI under those constraints. Most of what the rest of the enterprise world is now encountering with generative AI (hallucinations in regulated workflows, unclear liability, compliance reviews that stall launches) healthcare has seen and has a pattern for. Four lessons from the way production medical AI gets built are directly transferable to finance, law, insurance, and any other sector where wrong answers have consequences.

Lesson 1: a complete view of the subject beats a clever model on a partial view

Most AI systems get built around whatever data is easiest to pull. In healthcare, that’s structured EHR fields: diagnoses, medications, labs, vital signs. A model trained on those fields can do useful things, but it misses most of the picture. More than half of the clinically relevant information about a patient (reasoning from the clinician, discussion with the patient, nuance about severity and certainty) lives in unstructured clinical notes, not in the coded fields. Add the PDFs (discharge summaries from other systems, external consults, prior-authorization letters), the medical images (radiology, pathology), and the patient-reported data (intake forms, symptom diaries), and the structured fields are maybe 30% of the available signal.

Production medical AI that works is built on top of a unified, longitudinal view of the patient that combines all of those sources. Structured demographics. Clinical characteristics. Vital signs. Smoking status. Past procedures and medications. Laboratory results. Extracted entities and assertions from progress notes. Findings from pathology and radiology reports. The combined view is what makes downstream tasks (disease progression prediction, clinical trial matching, cohort building, risk scoring) actually work. Models operating on the combined view routinely outperform ones that see only the structured slice of the record, because the unstructured slice is where most of the clinical reasoning happens to be written down.

The transfer is immediate:

- For a retail bank, the customer-completeness problem is the same shape. Transactions are structured. Call-center transcripts, chat logs, secure-message threads with relationship managers, and scanned-document submissions are not. A credit or churn model that sees only the structured side is working on a fraction of what’s available.

- For a property and casualty insurer, the claim is partly structured (coverage, policyholder, loss date) and heavily unstructured (adjuster notes, emails with claimants, photos, police reports, medical records). The systems that decide a claim well are the ones that read the full file.

- For a law firm, a case file is structured metadata plus mostly unstructured content: contracts, emails, depositions, exhibits. AI assistants that operate on that full file produce materially different answers than ones that see only the filing metadata.

In each case, the engineering pattern is the same: specialized extraction and classification models pull structured facts from unstructured sources, a harmonization layer joins them to the existing structured record, and downstream models (predictive, search, conversational) operate on the unified view. Healthcare built this pattern first because it had the worst unstructured-to-structured ratio. Every other regulated sector ends up building a version of it.

Lesson 2: the interface matters as much as the model

For a decade, advanced NLP and machine learning in regulated industries were gated on the availability of data scientists. If you wanted to train a model to extract contract clauses, detect adverse drug events, or classify claims, you needed an ML engineer to write code, a domain expert to label data, and a deployment team to push the result into production. That workflow scales poorly. There are not enough ML engineers for the work, and the domain experts with the relevant judgment (clinicians, pharmacists, lawyers, underwriters) are not going to learn Python on their own time.

Healthcare’s response to this bottleneck has been no-code annotation and human-in-the-loop tooling. The workflow that works in practice: a domain expert, working in a web UI, labels a small number of documents. An underlying system, often an LLM doing a zero-shot pass, proposes labels on the rest. The expert corrects the ones that are wrong. Those corrections become the next round of training data, which produces a smaller, faster, more accurate task-specific model. Iterate until the accuracy is where it needs to be, then deploy the small model in production and keep the feedback loop running to monitor for drift.
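
The shape of that loop can be sketched in a few lines. This is a toy under loud assumptions: the word-vote "model" stands in for both the zero-shot LLM and the task-specific model, `expert_label` is a hypothetical oracle standing in for the domain expert in the web UI, and the documents are invented. It is meant to show the propose, correct, retrain cycle, not a real annotation tool.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy model: for each word, remember the label it most often appears with."""
    counts = defaultdict(Counter)
    for text, label in labeled:
        for word in text.lower().split():
            counts[word][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def propose(model, text, default="other"):
    """Pre-label a document by majority vote over the words the model knows."""
    votes = Counter(model[w] for w in text.lower().split() if w in model)
    return votes.most_common(1)[0][0] if votes else default


def expert_label(text):
    """Hypothetical oracle standing in for the domain expert's judgment."""
    return "adverse_event" if "rash" in text or "nausea" in text else "other"


seed = [("patient reports rash", "adverse_event"), ("routine follow-up visit", "other")]
pool = ["nausea after dose", "rash on arm", "follow-up scheduled", "severe nausea reported"]

labeled = list(seed)
corrections_per_round = []
for batch in (pool[:2], pool[2:]):
    model = train(labeled)                # retrain on everything labeled so far
    corrections = 0
    for text in batch:
        proposed = propose(model, text)   # system proposes a label
        truth = expert_label(text)        # expert accepts or corrects it
        if proposed != truth:
            corrections += 1
        labeled.append((text, truth))     # corrected label joins the training set
    corrections_per_round.append(corrections)
```

In this toy run the expert corrects one proposal in the first batch and none in the second, because the first round's correction taught the model the word it was missing. Real projects will not converge this neatly, but the mechanism is the same.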

This compresses the build-and-validate loop from months to weeks, because the bottleneck, getting labeled data in the shape the specific task needs, is handled by the people who actually know what “correct” means. It also produces small, specialized models that are cheap to run at scale, rather than large general-purpose models called over an API at per-token cost. Regulatory requirements for human oversight and validation are satisfied by construction, because domain experts are signed into the loop at every step with audit trails, versioning, and approval workflows built in.

Other regulated industries are starting to build the same pattern for their own experts: lawyers labeling contracts for a contract-intelligence pipeline, compliance officers labeling transactions for anti-money-laundering models, underwriters labeling submissions for triage models. The model is secondary. The interface that lets the domain expert drive the process without writing code is what decides whether the project finishes.

Lesson 3: privacy and scale are architectural, not operational

In healthcare, “send the clinical notes to a third-party cloud API” is a non-starter for most organizations, most of the time. The reasons stack: HIPAA, GDPR, state privacy laws, institutional policy, patient expectations, data-sovereignty regulations in non-US jurisdictions. The result is that AI systems in healthcare have to be designed from the start to run inside the customer’s environment (on-premises, in the customer’s own private cloud tenant, or air-gapped), with no data ever leaving the customer’s control.

That constraint turns out to be a feature. In-environment deployment removes the per-token pricing model, because the customer is paying for compute they already own. It removes the latency tax of network round-trips to a vendor API. It removes most of the data-residency compliance questions, because the data never moved. It removes the vendor-lock-in risk that comes with building mission-critical pipelines on top of a third-party API whose pricing and availability the customer does not control. And it removes the training-data intellectual-property question, because the customer’s data stays the customer’s.

The architectural consequence is that healthcare AI systems are built to run efficiently on commodity hardware: single-GPU inference for most tasks, CPU inference for the lightweight ones, containerized deployment into Kubernetes or Databricks or Snowflake environments that customers already operate. This is a very different architecture from “call a vendor’s API from wherever,” and the difference matters for every other regulated industry that is going to face the same pressure.

Financial services is already there in parts — banks have regulatory constraints on where customer data can be processed, and many will not allow production workloads in third-party LLM APIs. Legal has similar constraints for privileged client information. Pharma has them for research data and trial records. In all of these sectors, the architectural pattern that scales is the one healthcare has already built: models designed to run in the customer’s environment, at production volume, on hardware the customer controls, with no data ever leaving.

The performance gap that used to make this architecture hard has mostly closed. Specialized domain-tuned models, carefully engineered for inference efficiency, now match or beat frontier LLMs on most of the specific tasks regulated industries care about, while running at 1–2% of the cost and with none of the compliance overhead. The remaining case for vendor APIs is for exploratory workloads and for conversational interfaces over curated knowledge — useful, but not where the production volume lives.

Lesson 4: humans in the loop are the accuracy mechanism, not a compliance afterthought

In regulated industries, 95% accuracy is not a success. It’s a system that still requires a human reviewer on every record, which is not automation. The target in healthcare for most high-volume tasks is above 99% — for de-identification, for PHI detection, for critical entity extraction — because that’s the threshold below which the downstream economics stop working. Hitting 99%+ on the first pass through a single model is rare. Hitting it through a composed system with a human-in-the-loop review layer is routine.

The pattern is a three-layer stack. The AI does the first pass at high volume and high speed. A confidence-scoring layer flags the records where the AI is uncertain, using calibrated confidence rather than raw model probabilities. A domain expert reviews only the flagged records, making the final call. The reviewed records feed back into the training set, so the AI gets steadily better over time and flags fewer records to the reviewers.

This pattern is what makes the economics work. If the AI runs at 96% accuracy and flags the 10% of records where it’s least confident, a human reviewer handling only those 10% is ten times as productive as a reviewer handling every record. If the AI’s confidence calibration is good, meaning the flagged records really are the ones where it’s most likely wrong, the combined system runs at well above 99%, faster and cheaper than either pure automation or pure manual review would be. The reviewers remain the accuracy mechanism; the AI just makes their throughput tractable.
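The arithmetic behind that claim can be sketched in a few lines. The function names and the `error_capture` parameter (the share of the AI's errors that land in the flagged set) are illustrative assumptions, not figures from a real deployment, and the sketch assumes reviewers correct every error they are shown.

```python
def combined_accuracy(ai_accuracy: float, error_capture: float) -> float:
    """Accuracy of AI plus review, assuming reviewers fix every flagged error.

    ai_accuracy   -- first-pass accuracy of the model (e.g. 0.96)
    error_capture -- assumed share of the model's errors that fall inside
                     the flagged set (a proxy for calibration quality)
    """
    residual_errors = (1.0 - ai_accuracy) * (1.0 - error_capture)
    return 1.0 - residual_errors


def reviewer_speedup(flag_rate: float) -> float:
    """Records covered per record a reviewer actually reads."""
    if not 0.0 < flag_rate <= 1.0:
        raise ValueError("flag_rate must be in (0, 1]")
    return 1.0 / flag_rate


if __name__ == "__main__":
    # Hypothetical calibration levels, for illustration only.
    for capture in (0.5, 0.8, 0.9):
        acc = combined_accuracy(ai_accuracy=0.96, error_capture=capture)
        print(f"capture={capture:.0%} -> combined accuracy {acc:.2%}")
    print(f"speedup at 10% flag rate: {reviewer_speedup(0.10):.0f}x")
```

Under these assumptions, a 96% model clears 99% combined accuracy only when the flagged set captures more than three quarters of its errors, which is why calibration quality, not raw accuracy alone, decides whether the pattern works.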

Other regulated industries are arriving at the same architecture for the same reasons. Legal e-discovery review, insurance claims adjudication, financial compliance monitoring, pharma safety signal review — all of these have the same shape as clinical coding or adverse-event extraction. High volume, a regulatory requirement for human oversight, and an accuracy bar that no single model hits on its own. The systems that work are the ones that treat human review not as a compliance box but as an engineered throughput mechanism with measurable accuracy gains.

The short version for non-healthcare sectors

Four things to take from the way production medical AI gets built:

A complete view of the subject, combining structured and unstructured sources, tabular data and documents and images, is worth more than a clever model on a partial view. Build the harmonization layer first.

The domain experts who know what “correct” means should be driving the labeling and validation loop directly, through a no-code interface, with feedback that trains the model. That’s how projects actually finish.

Privacy and scale are architectural. Systems designed from day one to run in the customer’s environment, on the customer’s hardware, without data leaving, are cheaper, faster, and easier to clear compliance on than systems retrofitted to meet the same constraints later.

Human-in-the-loop is an engineering pattern, not a compliance concession. Calibrated AI confidence plus targeted expert review is how you hit the accuracy bars regulated workflows actually need, and it’s also how you make the economics work.

Healthcare’s head start was bought the hard way. The patterns it produced are available off the shelf to every other regulated industry that is now catching up.

FAQ

Why does healthcare keep coming up as a reference architecture for regulated AI?

Because healthcare had the hardest version of every constraint earliest: the strictest privacy rules, the highest accuracy bars, the worst unstructured-to-structured data ratio, and the most expensive wrong answers. The architectural patterns that cleared those bars (data harmonization, domain-expert-driven labeling, in-environment deployment, human-in-the-loop review) transfer to other sectors without much modification.

Does a unified longitudinal view always require OMOP or another formal common data model?

For research, RWE, and cross-institution work, formal common data models (OMOP, FHIR) are the right target because they make the data comparable across sources. For a single-organization operational use case (a payer running a model on its own claims and notes, or a bank running a model on its own customers), the same harmonization principles apply, but the target schema can be internal. The point is the harmonization, not the specific standard.

How is no-code annotation different from just giving domain experts a spreadsheet?

The tooling has to handle document-native labeling (highlighting spans of text inside a document rather than filling cells), manage annotator agreement across multiple reviewers, keep versioned datasets, integrate with model training so labels become training data automatically, and produce audit trails that satisfy regulatory review. A spreadsheet handles none of that.

What does “runs in the customer’s environment” mean technically?

Deployment of the models, the inference runtime, and often the training toolchain as software the customer installs into their own infrastructure: on-premises hardware, a private VPC in their AWS/Azure/GCP tenant, or an air-gapped environment. No data crosses the boundary to the vendor; no vendor-side API handles production inference. Licensing is typically fixed-cost rather than per-token.

How do you measure human-in-the-loop productivity gains?

Three numbers: the fraction of records the AI handles without review (throughput), the accuracy of the AI-only path on the records it passes (precision at high confidence), and the accuracy of the combined system on the flagged records (precision on the reviewed subset). A well-calibrated system improves on all three over time, because the feedback from reviewed records becomes training data for the next model version. That improvement loop is the operational KPI.
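As a rough sketch, the three numbers can be computed from an audited sample of records. The `Record` fields and the `hitl_metrics` name are hypothetical, and the sketch assumes gold labels exist for the sample so that correctness can be judged at all.

```python
from dataclasses import dataclass


@dataclass
class Record:
    flagged: bool        # True if the confidence layer routed it to a reviewer
    ai_correct: bool     # AI first-pass output matched the gold label
    final_correct: bool  # output after review (same as AI output if not reviewed)


def _share(hits: int, total: int) -> float:
    # Avoid ZeroDivisionError when a subset is empty.
    return hits / total if total else float("nan")


def hitl_metrics(records: list[Record]) -> dict[str, float]:
    passed = [r for r in records if not r.flagged]
    flagged = [r for r in records if r.flagged]
    return {
        # Fraction of records handled without review (throughput).
        "auto_rate": _share(len(passed), len(records)),
        # Accuracy of the AI-only path on the records it passed.
        "auto_accuracy": _share(sum(r.ai_correct for r in passed), len(passed)),
        # Accuracy of the combined system on the reviewed subset.
        "reviewed_accuracy": _share(sum(r.final_correct for r in flagged), len(flagged)),
    }
```

Tracking these per model version is one way to make the improvement loop visible: the first number should rise while the other two hold or improve.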

Gaps between AI demo and AI production: three things 2024 will force enterprises to fix

David Talby — Mon, 25 May 2026 15:20:18 GMT

Originally published March 2024 in CIO.

Every AI cycle goes through the same three stages: demo, pilot, production. Most enterprise AI in 2023 was somewhere between the first two. The work to close the gap between a compelling ChatGPT demo and a reliable system that a regulated business can run on is harder than the demos make it look, and 2024 will be the year that gap gets paid for, in either engineering hours or lost deployments. Three of those gaps are worth focusing on: accuracy and reliability are still unacceptable for most enterprise use; responsible-AI testing is now the rate-limiting step for production launches; and the regulatory environment is starting to catch up with the technology, in ways that will matter for how systems are built, not only how they’re run.

Demo accuracy and production accuracy are not the same number

The year of AI hype that closed out 2023 made two accuracy claims hard to separate: the claim that modern LLMs handle open-ended natural language tasks well, which is true, and the claim that an enterprise can point one at its data and ship, which is not. Closing that gap takes more engineering than most organizations budgeted for.

The first place the numbers fall short is out-of-the-box extraction and classification. In regulated industries, the work an AI system needs to do is usually not “write me a paragraph about” but “pull every adverse event from this progress note and tell me which medication was associated with it.” On those tasks, peer-reviewed benchmarks consistently show general-purpose LLMs underperforming smaller, domain-tuned models by meaningful margins. A *JAMIA* study from January 2024 on the 2010 i2b2 clinical-concept extraction benchmark measured GPT-4 with baseline prompting at F1 0.804, against 0.901 for BioClinicalBERT, a 110-million-parameter model released years earlier. A careful prompt framework closed part of the gap. It did not close all of it, and building the prompt framework itself required enough labeled data to have trained a specialized model in the first place. Similar patterns have been reported across biomedical named entity recognition: a 2024 paper in *Bioinformatics* showed fine-tuned open models outperforming few-shot GPT-4 on biomedical NER by 5 to 30 F1 points depending on the dataset.

The second place the numbers fall short is consistency. The same prompt asked of the same model on the same input can return different answers on successive calls. For a chatbot producing creative text, that’s acceptable variety. For a system that pulls clinical findings from notes, or extracts contract terms from legal documents, or computes tax-coded line items from invoices, inconsistency is a production defect. Most enterprise pipelines need deterministic-enough behavior that the same record produces the same answer today and next week, which is not a property generative decoding gives you for free.
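One simple way to put a number on that property is to run the same input through the system several times and measure agreement. The sketch below is a minimal illustration; `consistency_rate` and the `extract` callable are hypothetical stand-ins for whatever pipeline is under test.

```python
from collections import Counter
from typing import Callable


def consistency_rate(extract: Callable[[str], str], text: str, runs: int = 5) -> float:
    """Fraction of repeated runs that agree with the most common output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [extract(text) for _ in range(runs)]
    # most_common(1) returns [(output, count)] for the modal output.
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / runs
```

A rate of 1.0 on a representative sample is the target for extraction pipelines; anything lower means the same record can produce different answers on different days.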

The third is cost and latency under real volume. A system that works on a dozen documents in a demo environment does not always work on the hundred thousand documents a real business process throws at it. Frontier-LLM calls priced per token, billed per request, and rate-limited by a vendor turn out to be a different line item at 10,000× scale than at 10× scale. For pipelines that process millions of records a day, which is what a hospital system, a large insurer, or a global pharma safety function actually does, the economics and the latency both push toward smaller, specialized models running on the buyer’s own hardware.
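The scale effect is plain arithmetic. The sketch below shows the shape of the calculation only: the function names are hypothetical, and every price, token count, and volume passed in is a placeholder the reader must supply, not a quoted vendor rate.

```python
def daily_api_cost(records_per_day: int, tokens_per_record: int,
                   price_per_1k_tokens: float) -> float:
    """Daily spend under per-token billing."""
    return records_per_day * tokens_per_record / 1000 * price_per_1k_tokens


def break_even_records(fixed_daily_cost: float, tokens_per_record: int,
                       price_per_1k_tokens: float) -> float:
    """Daily volume above which a fixed-cost deployment is cheaper."""
    cost_per_record = tokens_per_record / 1000 * price_per_1k_tokens
    return fixed_daily_cost / cost_per_record


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(daily_api_cost(records_per_day=1_000_000, tokens_per_record=2000,
                         price_per_1k_tokens=0.01))
    print(break_even_records(fixed_daily_cost=500.0, tokens_per_record=2000,
                             price_per_1k_tokens=0.01))
```

The useful output is the break-even volume: below it, per-token billing is the cheaper option; above it, the fixed-cost deployment wins, and the gap widens linearly with volume.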

None of this argues that frontier LLMs are useless. It argues that enterprise AI in 2024 is going to be a composition problem rather than a single-model problem. The systems that ship will combine specialized extraction and classification models for the high-volume, high-accuracy, low-latency work with frontier LLMs used for reasoning, summarization, and conversation on already-cleaned inputs. Getting that composition right is the engineering gap between a demo and a production system.

Responsible AI has moved from adjective to rate-limiting step

By late 2023, enterprise AI buyers started asking a question that was largely absent from the first wave of deployments: how do you know this system is safe to run? The question covers six things in a trench coat (robustness, fairness, bias, truthfulness, data leakage, and safety), and the testing that answers them is harder to do well than most organizations realized when they started.

Robustness is the property that small, legitimate changes to an input should not produce large changes in the output. A system that gives a different answer when a patient’s name is changed from one that the training distribution saw often to one that is rarer is not robust. Testing robustness at scale means generating perturbed versions of real inputs (different names, slight rewording, translated variants) and measuring how much the output changes.
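A name-swap perturbation test can be sketched as follows. This is a minimal illustration, not how any particular testing library implements it; `swap_name`, `robustness_failures`, and the `predict` callable are hypothetical names.

```python
import re
from typing import Callable


def swap_name(text: str, original: str, replacement: str) -> str:
    """Replace whole-word occurrences of one name with another."""
    pattern = rf"\b{re.escape(original)}\b"
    # A function replacement avoids backslash interpretation in the new name.
    return re.sub(pattern, lambda _: replacement, text)


def robustness_failures(
    predict: Callable[[str], str],
    text: str,
    original_name: str,
    alternative_names: list[str],
) -> list[str]:
    """Return the alternative names for which the prediction changed."""
    baseline = predict(text)
    return [
        name
        for name in alternative_names
        if predict(swap_name(text, original_name, name)) != baseline
    ]
```

An empty list means the output did not move when only the name changed. The same harness extends to other perturbations (rewording, translation) by swapping out the `swap_name` step.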

Fairness and bias testing asks whether the system performs equally well across demographic groups. The peer-reviewed literature on LLM bias has thickened considerably, with systematic surveys published through 2024 and 2025 cataloging intrinsic biases in representations, extrinsic biases in downstream tasks, and evaluation frameworks for both. For clinical systems, UK government guidance published in 2025 recommends treating fairness metrics as first-class production metrics alongside accuracy and latency, with bias evaluation gates in continuous integration and automatic rollback when thresholds are breached. That is a concrete, operational recommendation, and very few enterprises have the test infrastructure in place to run it yet.
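A bias evaluation gate of that kind can be as small as the sketch below. It assumes accuracy gap between groups as the fairness metric, which is only one of several possible choices, and the `max_gap` threshold is a hypothetical parameter each organization would have to set for itself.

```python
def accuracy_by_group(rows: list[tuple[str, bool]]) -> dict[str, float]:
    """rows holds (group label, prediction was correct) pairs."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for group, is_correct in rows:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def fairness_gate(rows: list[tuple[str, bool]], max_gap: float) -> bool:
    """Pass only if best and worst group accuracies differ by at most max_gap."""
    scores = accuracy_by_group(rows)
    return max(scores.values()) - min(scores.values()) <= max_gap
```

In a CI pipeline, a failing gate would block the model release the same way a failing unit test blocks a code merge.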

Truthfulness, or more precisely, the absence of confidently stated wrong answers, is the hardest of the six. The failure mode is not the model saying “I don’t know”; it’s the model producing fluent, plausible text that is factually incorrect. A 2023 evaluation of a widely used open-source medical LLM reported high plausibility (~98.8%) paired with a meaningful hallucination rate (~19.7%), which is a reasonable summary of the problem: the outputs look right often enough that they pass a casual reader, and they’re wrong often enough that a regulated workflow cannot safely rely on them without source citation.

Data leakage is the training-time version of the privacy problem: the risk that a model has memorized specific training records and can be induced to reproduce them. For systems trained on sensitive data, this is both a privacy violation and, depending on jurisdiction, a legal violation. Testing for it is non-trivial and is now an expected part of any regulated deployment.

The practical consequence is that the responsible-AI test suite has become the rate-limiting step for many enterprise launches. The organizations that get systems into production in 2024 are the ones that treat testing as part of the engineering — with automated test generation, versioned test suites, and CI gates on fairness, robustness, and privacy the same way they have CI gates on unit tests — rather than as a compliance box checked after the system is already built. Open-source tooling for this work has matured considerably (LangTest and DeepEval among others), but the tooling is useful only in the context of an engineering discipline that treats responsible-AI testing as a first-class concern.

Regulation catches up, and the rules shape the architecture

The third growing pain is regulatory. Through 2023 the pattern was mostly principles papers; through 2024, concrete rules started to land. The EU AI Act was passed by the European Parliament in March 2024, with the first provisions, the bans on prohibited practices, taking effect in February 2025 and the remaining provisions rolling out in stages. US state legislatures introduced nearly 700 AI-related bills in 2024 across 45 states, with 113 enacted into law. In healthcare and life sciences, FDA and EMA guidance on AI-enabled medical devices and software continues to expand, with data-provenance, validation, and post-market monitoring expectations that look a lot like the expectations on any other regulated product.

For enterprises, the operational question is not “what do I do if the regulator shows up” but “what architectural choices do I make now so that, when the regulator shows up, the answers are short.” The choices that make that conversation easier are the ones that also make engineering easier: training-data provenance tracked as a first-class artifact; validation and fairness results stored in a form that can be shared with regulators on request; deployment inside the organization’s own environment rather than a third-party cloud, so data-residency and privacy questions have short answers; audit logs that show, for each production answer, which model version generated it and what sources it cited.
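The audit-log requirement is the easiest of these to make concrete. The sketch below shows one possible shape for a per-answer record; the `AuditRecord` fields and the JSON-lines format are assumptions chosen for illustration, not a regulatory schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    model_version: str          # which model version generated the answer
    answer: str
    cited_sources: list[str]    # identifiers of the documents the answer cites
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one such line per production answer means the question "which model said this, and based on what?" can be answered with a log query rather than an investigation.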

What doesn’t work is treating compliance as an afterthought layered onto a system whose core was built for a different set of constraints. Every organization that has tried to retrofit provenance, audit, or data-residency onto an already-deployed AI system has discovered what engineers who live through any regulatory wave eventually learn: the right time to build in the constraints was before the system shipped. The second-best time is now, because the regulatory floor is rising and is going to keep rising through 2026 and beyond.

What to prioritize

For enterprises planning 2024 AI investments, three practical priorities sort out the demo-to-production gap.

First, invest in the boring layer. Specialized, task-specific models (entity recognition, classification, terminology mapping, translation between natural language and structured queries) are where the accuracy and cost wins come from in regulated workflows. Frontier LLMs are valuable; they are not the whole system. The systems that ship are compositions, with the specialized layer doing the high-volume, high-accuracy, low-latency work and the LLM doing the reasoning on top of clean inputs.

Second, treat responsible-AI testing as an engineering discipline rather than a compliance function. That means automated test generation, versioned test suites, CI gates, and production monitoring on robustness, fairness, and privacy, not a final-stage review before launch. The organizations that have figured out how to do this ship AI faster, not slower, because the testing catches problems early, when they’re cheap to fix.

Third, assume the regulatory floor keeps rising. Build systems that produce the artifacts regulators will eventually ask for (training-data provenance, validation records, fairness metrics, audit logs, citations on every answer, in-environment deployment) as a natural byproduct of how they operate, rather than as a separate compliance step. The point is not to predict exactly which rule will land when. The point is to build systems whose answers to those rules are short.

2024 was never going to be the year AI hype died. It was the year the engineering under the hype got harder. For organizations that invest in the composition, the testing, and the compliance as engineering choices rather than afterthoughts, the rocky road turns into a faster one, because the systems that clear those bars are also the ones that get past procurement and into production.

FAQ

Why isn’t a single frontier LLM enough for enterprise work?

Because most regulated enterprise tasks are high-volume extraction and classification problems, not generation problems. Peer-reviewed benchmarks have consistently shown smaller, domain-tuned models outperforming frontier LLMs on named entity recognition, relation extraction, and terminology mapping, and doing it at a fraction of the cost and latency. The enterprise systems that ship in 2024 combine specialized models for the structured work with frontier LLMs for reasoning and summarization.

What is “responsible AI” testing in practice?

Automated testing for six properties: robustness (small input perturbations shouldn’t cause large output changes), fairness (performance shouldn’t vary by demographic group), bias (outputs shouldn’t reflect stereotypes), truthfulness (the system shouldn’t confidently produce false statements), data leakage (training data shouldn’t be reproducible from the model), and safety (the system shouldn’t produce harmful content). Each of these has measurable metrics, and production systems should have automated tests that run on every model change.

Does the EU AI Act apply to US companies?

Yes, for systems that are placed on the EU market or whose outputs are used in the EU. The extraterritorial reach resembles GDPR’s. US companies that build AI systems for EU markets or EU users need to meet the Act’s requirements on risk classification, documentation, human oversight, and transparency, and need to do so on the rolling timeline the Act lays out.

What’s the cheapest-to-ignore regulatory requirement to get right early?

Training-data provenance. The ability to say, for each model, what data it was trained on, where that data came from, what licenses apply, and what validation it was tested against — in a form that can be shown to a regulator or an auditor. Retrofitting this later is expensive; building it in from the start is cheap.

How do cost economics change for enterprise AI in 2024?

The fixed-cost versus per-token calculus tips toward fixed-cost at scale. Running frontier LLMs over the wire is economical for exploratory workloads and low-volume applications. For production workloads with millions of records per day, which is what large healthcare, pharma, financial, and legal operations actually run, specialized models on the buyer’s own hardware are often 50 to 100× cheaper and orders of magnitude faster, which is why the architectural pattern that wins is composition rather than single-model deployment.

---

David Talby is CEO of John Snow Labs, whose medical language models and responsible-AI testing tooling (including the open-source LangTest library) are used by 500+ healthcare and life sciences organizations. He also leads Pacific AI, which focuses on governance for healthcare AI.

Why the next useful medical chatbot will not look anything like ChatGPT

David Talby — Sat, 23 May 2026 15:52:47 GMT

Originally published November 2023 in HIT Consultant and MedCity News.

The generation of medical chatbots built on top of frontier LLMs is topping out, and the ceiling is low. A system that answers “what are the contraindications for this medication” from a curated knowledge base is one thing. A system that answers “given what we know about this specific patient, should we switch them off this medication” is an entirely different architecture — and the architecture, not the LLM underneath it, is what decides whether the chatbot is usable in a clinical setting. The next wave of medical chatbots will read longitudinal patient records, reason over timelines, cite every answer, and run inside a hospital’s firewall. Very little of that work happens in the language model.

Where the first generation stalls

The Q&A chatbot pattern that took over 2023 has a fixed shape. A user asks a medical question in natural language. A retrieval layer pulls relevant passages from a reference corpus: guidelines, drug labels, medical textbooks, recent literature. The LLM is given those passages and the question, and it produces a paragraph of generated text. This is retrieval-augmented generation (RAG), and it works well for one class of problem: general medical reference. What is the recommended first-line therapy for uncomplicated community-acquired pneumonia? What are the renal dosing adjustments for vancomycin? What are the FDA-approved indications for semaglutide? RAG over a clean knowledge base handles these.

It handles almost nothing else a clinician actually needs to ask during a shift.

Consider the questions that come up in real clinical settings. “Is this patient a candidate for the new rheumatoid arthritis trial?” requires eligibility criteria to be matched against the patient’s actual history (diagnoses, medication failures, lab trends, comorbidities, prior treatments), most of which live in free-text clinical notes rather than structured fields. “Has this patient ever had a documented reaction to a sulfa drug?” requires reading through years of notes, recognizing that “rash after Bactrim in 2019” and “penicillin allergy noted by patient, tolerated amoxicillin in 2022” are different kinds of information, and returning a defensible answer with source pages cited. “Show me every patient in this panel who has uncontrolled diabetes and has missed two consecutive appointments” is not a language question at all — it’s a cohort query over longitudinal records, expressed in natural language because typing SQL in a clinic is absurd.

Each of these questions crosses a boundary that a frontier LLM with retrieval cannot handle on its own. The information needed to answer them lives in messy, patient-specific, multi-modal data (structured EHR fields, unstructured progress notes, scanned PDFs, medication lists, lab results), and the reasoning required spans time and combines sources. Dropping those inputs raw into a context window produces unreliable answers for the same reason dropping a hospital into a shoebox produces an unhelpful map.

What makes a medical chatbot work

The architecture that handles the clinically useful questions looks different. The LLM is the smallest part of it. The heavy lifting is done by the layers beneath and around the language model, each doing a job the LLM should not be asked to do.

A healthcare-specific pre-processing pipeline. Real clinical text is roughly half copy-and-pasted content, with section headers (“Chief Complaint,” “History of Present Illness,” “Assessment and Plan”) that change the meaning of identical sentences depending on where they appear. A sentence saying “patient denies chest pain” in the HPI is a negation; the same sentence in the Assessment is a clinical decision. Before any reasoning happens, the system needs to section, de-duplicate, and normalize the note. General LLMs do not do this reliably. Specialized clinical-text pre-processing does.
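A toy version of the section, de-duplicate, and normalize steps might look like the sketch below. It is a deliberately simplified illustration: the header list is limited to the three headers named above, matching is exact and case-sensitive, and the function names are hypothetical. Production clinical pre-processing handles far more variation than this.

```python
SECTION_HEADERS = ("Chief Complaint", "History of Present Illness", "Assessment and Plan")


def split_sections(note: str) -> dict[str, list[str]]:
    """Group a note's lines by section header, normalizing whitespace."""
    sections: dict[str, list[str]] = {}
    current = "Preamble"  # bucket for text that appears before any header
    for raw_line in note.splitlines():
        line = " ".join(raw_line.split())
        if not line:
            continue
        header = line.rstrip(":")
        if header in SECTION_HEADERS:
            current = header
            continue
        sections.setdefault(current, []).append(line)
    return sections


def drop_copied_lines(notes: list[dict[str, list[str]]]) -> list[dict[str, list[str]]]:
    """Drop lines already seen in the same section of an earlier note."""
    seen: set[tuple[str, str]] = set()
    cleaned = []
    for sections in notes:
        kept_note: dict[str, list[str]] = {}
        for name, lines in sections.items():
            kept = []
            for line in lines:
                key = (name, line.lower())  # section is part of the identity
                if key not in seen:
                    seen.add(key)
                    kept.append(line)
            if kept:
                kept_note[name] = kept
        cleaned.append(kept_note)
    return cleaned
```

Keying the de-duplication on the section as well as the text is the design choice that matters here: the same sentence in the HPI and in the Assessment is kept twice, because it means two different things.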

Task-specific extraction models. Entity recognition, assertion status (is the condition present, absent, possible, or historical?), relation extraction (which medication caused which side effect?), and terminology mapping (resolving “MI” to “myocardial infarction” and then to ICD-10 I21 and SNOMED 22298006) are sequence-labeling and classification problems. Encoder models trained on domain data (PubMedBERT, BioClinicalBERT, and domain-tuned clinical models) consistently beat frontier LLMs on these tasks by material margins. A 2024 *JAMIA* study on the 2010 i2b2 concept-extraction benchmark measured BioClinicalBERT at F1 0.901 versus GPT-4 at 0.804 with baseline prompts; on the VAERS adverse-event corpus the gap was wider, 0.802 versus 0.593. The numbers decide what the downstream answer can be trusted to contain.
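The terminology-mapping step in that list can be pictured as a two-stage lookup. The sketch below uses only the example from the paragraph above; the `Concept` type, the dictionaries, and `resolve` are hypothetical, and a dictionary is a stand-in for what in practice is a trained, context-aware model (“MI” can also mean mitral insufficiency, which a lookup table cannot disambiguate).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    preferred_term: str
    icd10: str
    snomed: str


# Stage 1: expand abbreviations to a preferred term.
ABBREVIATIONS = {"mi": "myocardial infarction"}

# Stage 2: map the preferred term to standard codes.
CONCEPTS = {
    "myocardial infarction": Concept("myocardial infarction", "I21", "22298006"),
}


def resolve(mention: str) -> Optional[Concept]:
    """Return the coded concept for a mention, or None if it is unknown."""
    term = " ".join(mention.lower().split())
    term = ABBREVIATIONS.get(term, term)
    return CONCEPTS.get(term)
```

Returning `None` for unknown mentions, rather than guessing, keeps unmapped terms visible so they can be routed to review.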

A longitudinal patient record. The extracted facts from dozens or hundreds of documents have to be assembled into a single per-patient timeline, with dates normalized, coreferences resolved across encounters, and a terminology that is stable across time. Only then can the system answer a question that requires knowing what happened and in what order. This is the step most demos skip.
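In its simplest form, timeline assembly is a group-and-sort over extracted facts. The sketch below assumes dates have already been normalized and concepts already mapped to a stable vocabulary; the `Fact` type and `build_timelines` are hypothetical names, and real coreference resolution is much harder than the exact-match rule used here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    patient_id: str
    event_date: date
    concept: str
    source_doc: str  # kept so every timeline entry can be cited later


def build_timelines(facts: list[Fact]) -> dict[str, list[Fact]]:
    """Group facts per patient in date order, dropping same-day repeats."""
    timelines: dict[str, list[Fact]] = {}
    for fact in sorted(facts, key=lambda f: (f.patient_id, f.event_date)):
        bucket = timelines.setdefault(fact.patient_id, [])
        # Treat the same concept on the same date as one event documented twice.
        already_present = any(
            f.event_date == fact.event_date and f.concept == fact.concept
            for f in bucket
        )
        if not already_present:
            bucket.append(fact)
    return timelines
```

Carrying `source_doc` on every fact is what lets the later layers cite a specific document for each step in the patient's history.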

A reasoning layer. Now the LLM earns its place. Given a structured, cited timeline as input, and a natural-language question, it can reason in the way people actually need it to: comparing dates, weighing evidence, explaining its logic. The key is that it’s reasoning over clean, structured input rather than trying to simultaneously extract and reason over raw notes.

Citations on every answer. Every clinical answer the chatbot produces should point back to the specific document, section, and sentence that supports it. Without that, no clinician will trust it; with it, the chatbot stops being a black box and starts being a faster way to find source evidence. This is where regulatory-grade systems diverge from consumer chatbots: “the model said so” is not acceptable in a clinical workflow. “This patient’s creatinine was 1.9 mg/dL on the lab drawn 2026-01-14, as documented in the nephrology consult on that date (page 3)” is.

An in-environment deployment. Clinical notes cannot leave the hospital’s environment in most jurisdictions. HIPAA, GDPR, and the patchwork of state privacy laws that continue to accumulate put data sovereignty at the top of the compliance list. A chatbot that requires sending notes to a third-party cloud fails the procurement review before anyone looks at the accuracy numbers. On-premises or private-cloud deployment is not a premium feature; it is table stakes.

The shape of the next generation

The medical chatbots that clear the production bar are going to be compositions, not single models. Expect them to look roughly like this:

The user types a question. The system routes it: is this a general medical reference question, answerable from a curated knowledge base? Is it a patient-specific question that requires reasoning over this patient’s records? Is it a cohort question that requires running a query across a population? Each route has a different pipeline behind it.
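The routing step can be sketched as a three-way classifier. The keyword cues below are hypothetical stand-ins chosen to keep the example short; a production router would more plausibly be a trained classifier, and nothing here describes a specific product's implementation.

```python
from enum import Enum


class Route(Enum):
    REFERENCE = "reference"  # answer from a curated knowledge base
    PATIENT = "patient"      # reason over one patient's records
    COHORT = "cohort"        # run a query across a population


# Illustrative cue phrases only; real questions vary far more than this.
COHORT_CUES = ("every patient", "all patients", "how many patients")
PATIENT_CUES = ("this patient", "the patient")


def route_question(question: str) -> Route:
    q = question.lower()
    # Check cohort cues first: population questions often mention patients too.
    if any(cue in q for cue in COHORT_CUES):
        return Route.COHORT
    if any(cue in q for cue in PATIENT_CUES):
        return Route.PATIENT
    return Route.REFERENCE
```

Whatever the classifier, the design point is that the route is decided before any model generates text, so each question type reaches the pipeline built for it.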

For reference questions, RAG over a clinically curated knowledge base (drug labels, guidelines, peer-reviewed literature, the organization’s own protocols) with every answer citing its sources. For patient-specific questions, the pre-processing and extraction pipeline produces a structured view of the patient’s history, then the LLM reasons over that structured view. For cohort questions, the natural-language query is translated into a structured query over an OMOP-harmonized data warehouse, with the LLM used primarily as the translation layer between human language and database query. A 2024 benchmark looking at GPT-4 generating SQL queries for patient-level questions over structured health data found accuracy materially below what clinicians would tolerate in a production setting; task-specific translation models tuned on healthcare schemas did considerably better. This is an architectural point, not a model point: the LLM is part of the system, not the whole system.

Accuracy targets change by task. For reference Q&A, clinicians tolerate the same roughly 95% threshold they apply to a trusted colleague’s off-the-cuff answer, provided sources are cited and the system is upfront about uncertainty. For patient-specific extraction that feeds decisions, the bar is much higher: in practice, above 99% on de-identification and on assertion status, because below that, every answer needs human review and the automation value evaporates. The architecture has to know which bar it’s playing against, and the UI has to communicate that to the clinician using it.

Latency matters more than most demos acknowledge. A chatbot that takes 40 seconds to answer during a seven-minute patient encounter is a chatbot no one uses. Specialized models running on commodity hardware, rather than frontier LLMs called over the wire, are often what the latency budget will actually allow.

Expert evaluation is part of the build, not an afterthought. Technologists can get a medical chatbot roughly halfway; physicians, nurses, and pharmacists have to evaluate the generated answers on clinical relevance, style, consistency with current guidelines, and appropriateness for the setting. This is not a one-time certification; it’s an ongoing feedback loop, with disagreements from domain experts fed back into the training and evaluation sets. Any vendor claiming their chatbot is “clinician-approved” without describing that loop is selling a snapshot.

What this means for healthcare AI buyers

Three questions to ask when a vendor pitches a medical chatbot, each of which tends to separate production-grade systems from prototypes.

First: where do the answers come from, and how are they cited? A system that cannot show, for each answer, the specific source passages it is grounded in is not one that will clear clinical review. If the vendor’s demo answers cite nothing, the production system will cite nothing.

Second: what happens when the question requires reasoning over a specific patient’s record? If the answer is “we pass the chart to the LLM,” that’s the shoebox-map problem. If the answer involves a pre-processing pipeline, task-specific extraction, a structured patient timeline, and an LLM reasoning over the timeline, the vendor has built the architecture that actually works.

Third: where does the data live? On-premises, private cloud inside the buyer’s account, or third-party cloud? For most regulated healthcare organizations, only the first two answers are viable, and many procurement processes will not get past the third.

The next wave of medical chatbots will not be defined by which LLM is underneath. It will be defined by the architecture around the LLM: how the data is cleaned, how facts are extracted and verified, how timelines are assembled, how answers are cited, and where the whole thing runs. Healthcare organizations evaluating these systems get more signal from asking about the pipeline than from asking about the model.

FAQ

Isn’t a frontier LLM with a big enough context window eventually going to solve this?

Larger context windows help, but they don’t replace the upstream cleaning and extraction work. Clinical notes are too messy, too repetitive, and too full of specialty-specific language for raw ingestion to produce reliable answers, regardless of model size. The empirical pattern over the last two years has been that adding structure upstream of the LLM helps more than scaling the LLM does, for this class of problem.

How accurate do medical chatbots need to be for production use?

It depends on the task. For clinical reference Q&A, clinicians apply roughly the same bar they apply to a trusted colleague — accuracy in the mid-90s with clear source citations is usable. For patient-specific extraction that feeds decisions, the practical bar is above 99% on tasks like PHI removal and assertion status, because below that every output needs human review and the automation no longer saves time.

What’s wrong with RAG for medical chatbots?

Nothing — for the right question type. RAG over a curated knowledge base handles general medical reference questions well. It does not handle questions that require reasoning over a specific patient’s longitudinal record, because those questions need extracted, linked, timeline-aware patient data rather than retrieved passages. Those are different problems with different architectures.

Why is on-premises deployment treated as a requirement rather than a preference?

Because for most regulated healthcare organizations, HIPAA, GDPR, and the patchwork of US state privacy laws rule out sending clinical notes to third-party cloud services. In-environment deployment is a compliance requirement, not a performance preference. Systems that cannot run in the customer’s environment are usually eliminated before accuracy enters the conversation.

Are task-specific extraction models really still outperforming frontier LLMs, a full year after GPT-4?

On entity-level clinical extraction tasks (NER, assertion status, relation extraction, de-identification), yes, according to multiple peer-reviewed evaluations through 2024. On reference question-answering and summarization, frontier LLMs do well. The architecture trend is composition: specialized extraction models feeding clean structured data into frontier LLMs for the reasoning step.

David Talby is CEO of John Snow Labs. Its Medical LLM and Healthcare NLP libraries power medical chatbots and clinical information-extraction pipelines at 500+ healthcare and life sciences organizations. He also leads Pacific AI, which focuses on governance for healthcare AI.

What benchmarks miss: two clinical AI failures that reshaped how we build medical LLMs

David Talby — Sat, 16 May 2026 14:17:13 GMT

Originally published November 2023 in Medhealth Outlook.

Passing the US medical licensing exam got generative AI its healthcare headline. Running in a hospital’s information-extraction pipeline is a different story. By late 2023, peer-reviewed evaluations on clinical named entity recognition, social determinants extraction, and de-identification were landing one after another, each with the same finding: GPT-4 trailed task-specific models built for the job. Two cases we worked on pushed that finding from benchmark curiosity to architectural decision. One was adverse-event extraction from opioid progress notes for an FDA Sentinel program. The other was reasoning over a patient’s timeline to decide whether a drug had actually caused a reaction. Both taught us things no leaderboard number would have.

What the medical exam score hides

The USMLE result did real work for the field. It forced clinicians and health-IT buyers to take generative AI seriously, and it seeded the first wave of hospital pilots. The problem is how that result traveled. “GPT-4 passes the medical exam” became shorthand for “GPT-4 is ready for clinical text,” and those two claims have almost nothing to do with each other.

The USMLE is a closed-book, multiple-choice test. Each question is a clean vignette with five answer choices, designed by clinicians to have one defensible right answer. A real discharge note is none of those things. It is between five and twenty pages of mixed narrative and templated text, written by three or four authors over a two-week stay, with copy-and-paste duplication, undocumented abbreviations, implicit negations (“ruled out PE, started DVT prophylaxis anyway”), and section headers that shift meaning depending on where they appear. Production healthcare AI is graded on what the system does with that mess, not on what it does with a board-exam stem.

A systematic review published in *Health Care Science* in 2023 catalogs the gap. General-purpose LLMs show competitive performance on benchmark question-answering and summarization tasks, and considerably weaker performance on the entity-level extraction work that actually powers clinical pipelines. A 2024 *JAMIA* study quantified the same pattern on the 2010 i2b2 concept-extraction benchmark: GPT-4 with baseline prompting reached an F1 of 0.804 on MTSamples. BioClinicalBERT, a 110-million-parameter domain-specific model released years earlier, reached 0.901 on the same dataset. On the VAERS adverse-event corpus, the gap was wider: GPT-4 at 0.593 versus BioClinicalBERT at 0.802. Careful prompt engineering closed part of the gap but never eliminated it — and the prompt engineering itself required enough clinical labeling to have trained the domain model in the first place.

The headline-to-reality distance matters because every procurement conversation in healthcare AI starts with it. Executives see the exam number. Engineers inherit the delta.

Lesson one: unstructured notes are where adverse events hide

The first lesson came from a Sentinel Innovation Center program focused on opioid-related adverse events. Sentinel is the FDA’s post-market drug-safety system, and it runs primarily on structured claims data: billing codes, dispensing records, diagnosis flags. For many safety signals, claims are enough. For opioids, they are not.

A clinician who sees a patient nodding off in the chair, documents “appears sedated, spouse concerned about dosing,” and adjusts the prescription downward has just recorded an adverse event. No billing code captures that. The note does. Extracting it requires three linked NLP tasks done together: event classification (is “sedated” describing an observed state, a risk factor, or a ruled-out condition?), named entity recognition (which medication, which dose, which side effect?), and relation extraction (is the sedation linked to the opioid, or to the benzodiazepine started last week, or to neither?).

This is where the general-versus-specialized split stopped being theoretical. On the clinical NER benchmarks that matter for this work (i2b2 2010 for concepts, n2c2 for medications and adverse drug events), domain-specific models held a steady 5 to 15 F1-point lead over general-purpose LLMs, even after prompt tuning. A 2024 *Bioinformatics* paper, BioNER-LLaMA, showed that fine-tuning a 7-billion-parameter open model on biomedical NER data yielded F1 improvements of **5 to 30 points** over few-shot GPT-4 across three standard datasets. The authors noted what anyone who has built these systems knows: LLMs are strong generative models and weaker sequence labelers, because NER is fundamentally a span-localization problem that generation-first models solve awkwardly.

Three observations from our Sentinel work sharpened the point:

The first is that negation and uncertainty dominate clinical text. In a progress note, “denies chest pain” and “c/o chest pain” are one word apart and medically opposite. A general LLM reading under a generation objective tends to smooth over these distinctions, because smooth text is what it was rewarded for producing. A task-specific assertion-status classifier is explicitly trained to tag “chest pain” as *present*, *absent*, *possible*, *conditional*, or *family history*, and the downstream pipeline treats each differently.

The second is that clinical sub-specialties are effectively different languages. A psychiatry progress note and an oncology consult share maybe 40% of their vocabulary. “RA” means rheumatoid arthritis to a rheumatologist and right atrium to a cardiologist; “MS” is multiple sclerosis in neurology and mitral stenosis in cardiology. A model that handles both correctly is one that has seen both, in quantity, with labeled context. Frontier LLMs have seen both; they have not been rewarded for distinguishing them under uncertainty.

The third is that throughput and cost are features. Sentinel work involves hundreds of thousands of notes per study. A 110M-parameter specialized NER model runs on a single commodity GPU at thousands of notes per minute. A frontier LLM running the same task through an API costs orders of magnitude more per note and introduces a per-request latency that turns a three-hour job into a three-day one. For programs with statutory deadlines and bounded compute budgets, that difference decides whether the analysis happens at all.
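To make the first observation concrete, here is a minimal sketch of what an assertion-status step returns. The trigger phrases and the `assertion_status` function are hypothetical and rule-based purely for illustration; the classifiers discussed above are trained models, not phrase lists.

```python
import re

# Hypothetical trigger phrases, for illustration only. Real systems learn these
# distinctions from labeled clinical text rather than from a fixed list.
TRIGGERS = [
    ("family_history", r"\b(family history of|mother had|father had)\b"),
    ("absent", r"\b(denies|no evidence of|ruled out|negative for)\b"),
    ("possible", r"\b(possible|cannot rule out|suspected)\b"),
    ("conditional", r"\b(if|in case of)\b"),
]

def assertion_status(sentence: str, entity: str) -> str:
    """Label `entity` using only the text that precedes it in the sentence."""
    lowered = sentence.lower()
    idx = lowered.find(entity.lower())
    if idx == -1:
        raise ValueError(f"{entity!r} not found in sentence")
    prefix = lowered[:idx]
    for label, pattern in TRIGGERS:
        if re.search(pattern, prefix):
            return label
    return "present"  # default when no trigger precedes the entity

for note in ("Patient denies chest pain.", "Pt c/o chest pain since this morning."):
    print(note, "->", assertion_status(note, "chest pain"))
```

The same entity string gets opposite labels depending on one preceding word, which is why the downstream pipeline has to carry the label alongside the entity rather than the entity alone.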

Lesson two: a list of findings is not an answer

The second lesson came from a different kind of failure. A pharma-safety team wanted to know, for a specific cohort of asthma patients on montelukast, whether neuropsychiatric symptoms documented in the clinical record appeared *after* the drug was started and plausibly followed from it. The information was in the notes. The challenge was not finding it — it was reasoning over the timeline.

Modern extraction pipelines are excellent at pulling structured findings from unstructured text: medication start dates, diagnosis dates, symptom onset, dosage changes. What they do not do out of the box is answer questions that require ordering those findings and reasoning over the order. “Did the insomnia start within 14 days of starting montelukast, and was there a documented attempt to rule out other causes?” is a question no single extraction returns. It requires the system to assemble a per-patient timeline and reason over it.

This is the emerging pattern in medical AI, and it is where LLMs genuinely help, once the underlying data is clean. The architecture that worked for us was a three-layer stack. At the bottom sit small, accurate, task-specific models: entity recognition, relation extraction, assertion status, entity resolution to SNOMED and RxNorm. In the middle sits a timeline-assembly layer that links the extracted facts into a per-patient longitudinal record, normalizes dates, and resolves coreference across notes. On top sits a reasoning layer, which can be an LLM, used for what generative models are actually good at: reading an already-structured timeline and answering natural-language questions about it.
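A minimal sketch of the middle layer, assuming the bottom layer has already produced normalized facts. The `Fact` fields and the `build_timeline` and `days_from_start_to_onset` helpers are hypothetical names chosen for illustration, and the LLM reasoning layer is deliberately left out; the sketch only shows why an ordered, assertion-filtered timeline makes the 14-day question answerable.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Fact:
    """One extracted, normalized finding (assumed output of the bottom layer)."""
    patient_id: str
    kind: str        # e.g. "drug_start", "symptom_onset"
    concept: str     # normalized concept name; terminology codes omitted
    when: date
    assertion: str   # "present", "absent", ...

def build_timeline(facts: list[Fact], patient_id: str) -> list[Fact]:
    """Keep one patient's asserted-present facts, in date order."""
    kept = [f for f in facts if f.patient_id == patient_id and f.assertion == "present"]
    return sorted(kept, key=lambda f: f.when)

def days_from_start_to_onset(timeline: list[Fact], drug: str, symptom: str) -> Optional[int]:
    """Days from the first drug start to the first symptom onset on or after it."""
    starts = [f.when for f in timeline if f.kind == "drug_start" and f.concept == drug]
    if not starts:
        return None
    start = starts[0]
    onsets = [f.when for f in timeline
              if f.kind == "symptom_onset" and f.concept == symptom and f.when >= start]
    return (onsets[0] - start).days if onsets else None

# Invented example facts, in the order they might be extracted from notes.
facts = [
    Fact("p1", "symptom_onset", "insomnia", date(2023, 3, 10), "present"),
    Fact("p1", "drug_start", "montelukast", date(2023, 3, 1), "present"),
    Fact("p1", "symptom_onset", "insomnia", date(2023, 2, 1), "absent"),  # negated mention
]
timeline = build_timeline(facts, "p1")
gap = days_from_start_to_onset(timeline, "montelukast", "insomnia")
within_14_days = gap is not None and gap <= 14
print(gap, within_14_days)
```

If the negated February mention had been mislabeled as present, it would still be excluded here only because it predates the drug start; a mislabeled mention after the start date would change the answer, which is the dependency on the bottom layer that the stack makes explicit.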

When we tried to collapse the stack by sending raw notes plus a reasoning question directly to a general-purpose LLM, accuracy fell. Not because the LLM couldn’t reason, but because the extraction errors it made at the bottom compounded into the reasoning errors it made at the top. A misclassified assertion status on page 3 became a phantom adverse event on page 14. The fix was not a better prompt. The fix was to stop asking one model to do both jobs.
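The compounding effect can be sketched with simple arithmetic. The accuracy figures and the count of 20 facts below are invented for illustration, and the independence of errors is a simplifying assumption rather than something we measured.

```python
def chance_all_facts_correct(per_fact_accuracy: float, facts_needed: int) -> float:
    """Probability that every fact a question depends on was extracted correctly,
    assuming (simplistically) that extraction errors are independent."""
    return per_fact_accuracy ** facts_needed

# Illustrative numbers only: a timeline question that depends on 20 extracted facts.
for accuracy in (0.90, 0.99):
    print(accuracy, round(chance_all_facts_correct(accuracy, 20), 3))
```

Under these assumptions, a 0.90 per-fact accuracy leaves about a 12% chance that all 20 facts are right, against about 82% at 0.99. A reasoning layer cannot recover from that, however capable it is.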

The broader pattern shows up across the peer-reviewed literature. For structured extraction (NER, relation extraction, entity resolution to ICD-10 and SNOMED), domain-specific models win. For summarization, question-answering over already-structured data, and conversational reasoning, LLMs win, especially when the input they operate on has been cleaned up by the specialized layer beneath them. On social determinants of health extraction, a 2024 benchmark found that GPT-4 made roughly three times as many errors as fine-tuned models, because SDOH is a long tail of subtle social context that benefits disproportionately from domain training. On de-identification of clinical notes, domain-tuned models routinely reach above 99% PHI detection while general LLMs sit in the low 90s, with the downstream consequence that one system can run without human review and the other cannot.

What this means for how you build

For executive buyers (CIO, CMIO, CAIO, CDO), this translates into four things worth pushing on when a vendor pitches a medical LLM.

First, ask what benchmarks the accuracy claims come from, and whether those benchmarks include entity-level extraction on clinical text, not just multiple-choice question answering. The headline “passes the medical exam” tells you almost nothing about production performance.

Second, ask for per-task accuracy, not aggregate accuracy. A single number hides the places where general LLMs are genuinely strong (summarization, patient-facing Q&A over a curated knowledge base) and the places where they are not yet strong enough for unsupervised production use (NER, assertion status, de-identification at scale).

Third, ask about the architecture. A system that layers task-specific models for extraction under a generative model for reasoning is a more honest answer than a system that claims one frontier model will do everything. The first will be measurable and improvable per component. The second will be expensive to run and hard to diagnose when it’s wrong.

Fourth, ask about cost and deployment. Running frontier LLMs on every clinical note in a real safety or cohorting program is not economically viable today for most organizations, and in many cases the data cannot leave the environment at all. Specialized models that run on commodity hardware, inside the hospital firewall, are not a second-best option — for production-grade healthcare text work, they are usually the only option that finishes.
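The second question above, per-task versus aggregate accuracy, is easy to illustrate. The task names and confusion counts below are invented solely to show the arithmetic; they are not measurements from any system.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: a high-volume task the model does well on,
# and a lower-volume task where it is much weaker.
counts = {
    "reference_qa": {"tp": 9000, "fp": 300, "fn": 300},
    "assertion_status": {"tp": 700, "fp": 250, "fn": 250},
}

per_task = {task: f1(**c) for task, c in counts.items()}
pooled = f1(*(sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")))

for task, score in per_task.items():
    print(f"{task}: {score:.3f}")
print(f"pooled: {pooled:.3f}")
```

With these made-up counts the pooled score lands near 0.95 while the weaker task sits near 0.74, which is the kind of gap a single headline number conceals.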

The short version

Exam scores made healthcare LLMs credible. Production work made them specific. The two lessons from the field, that unstructured notes are where safety signals actually live, and that a list of extracted findings is not yet an answer, both point in the same direction. Purpose-built medical language models, composed into a clean stack, outperform general-purpose LLMs on the clinical tasks that matter and cost a fraction of the price. The work of the next several years is continuing to build that stack, component by measurable component, and resisting the temptation to let a board-exam headline substitute for it.

FAQ

Why does GPT-4 underperform specialized models on clinical NER if it’s the more capable model overall?

Because named entity recognition is a sequence-labeling task and LLMs are trained as text generators. The objective mismatch shows up as over-confident labeling of non-entity spans and under-recall on entities that don’t look like the training distribution. Specialized encoder models (PubMedBERT, BioClinicalBERT) are trained with a labeling objective and domain data. On the MTSamples and VAERS corpora, BioClinicalBERT outperformed baseline-prompted GPT-4 by roughly 10 and 21 F1 points respectively (0.901 versus 0.804, and 0.802 versus 0.593).

Does prompt engineering close the gap?

Partially, not fully. A 2024 *JAMIA* paper showed GPT-4’s F1 on MTSamples rising from 0.804 to 0.861 with a four-component task-specific prompt framework. BioClinicalBERT, without prompt work, sat at 0.901. And the prompt engineering required enough labeled clinical data to have trained a specialized model in the first place.
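For readers who want the size of "partially", the three F1 figures quoted above are enough to work it out. The variable names are ours; the numbers are the ones cited in the answer.

```python
baseline_gpt4 = 0.804    # GPT-4 with baseline prompting, MTSamples
prompted_gpt4 = 0.861    # GPT-4 with the task-specific prompt framework
bioclinicalbert = 0.901  # BioClinicalBERT on the same dataset

gap_before = bioclinicalbert - baseline_gpt4
gap_after = bioclinicalbert - prompted_gpt4
share_closed = (gap_before - gap_after) / gap_before

print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}, closed: {share_closed:.0%}")
```

On these figures, prompt engineering closes a little under 60% of the gap on MTSamples and leaves about 4 F1 points on the table.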

Where do LLMs earn their place in clinical pipelines?

On summarization of already-structured inputs, conversational question-answering over curated knowledge, and reasoning over an assembled patient timeline. In each case, the LLM operates on clean input produced by specialized models upstream, not on raw clinical text.

Why is de-identification treated as a production bottleneck for general LLMs?

Because the accuracy bar is unusually high and the volumes are unusually large. Below about 99% PHI recall, every note needs human review, which defeats the automation. Above 99%, the pipeline can run unsupervised. General LLMs have not consistently crossed that threshold on real clinical text; domain-tuned systems have, which is what makes hospital-scale de-identification economically viable.
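A rough sketch of why a few points of recall matter so much at the note level. The figure of 10 PHI items per note, the two recall values, and the independence of misses are all assumptions made for illustration, as is the helper name; item-level recall does not translate one-to-one into note-level leak rates.

```python
def share_of_notes_with_a_miss(recall: float, phi_per_note: int) -> float:
    """Fraction of notes with at least one undetected PHI item, assuming each
    item is detected independently with probability `recall`."""
    return 1 - recall ** phi_per_note

# Illustrative inputs only: 10 PHI items per note, two item-level recall values.
for recall in (0.92, 0.995):
    print(recall, round(share_of_notes_with_a_miss(recall, 10), 3))
```

Under these assumptions, roughly 57% of notes contain at least one miss at 0.92 recall, against roughly 5% at 0.995. That is the difference between reviewing most notes by hand and sampling a small fraction for audit.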

What about the data-privacy side?

Sending clinical notes to a third-party cloud API raises HIPAA, GDPR, and data-sovereignty issues that in many organizations rule the option out before accuracy even enters the conversation. On-premises or private-cloud deployment, with no data leaving the customer’s control, is a hard requirement for most regulated buyers, and most real healthcare AI workloads end up running that way regardless of which model family delivers the best benchmark number.

---

David Talby is CEO of John Snow Labs, whose healthcare NLP and medical LLMs are used by 500+ healthcare and life sciences organizations, including collaborations with the FDA on post-market drug safety. He also leads Pacific AI, which focuses on governance for healthcare AI.

What synthetic patient data quietly breaks in clinical AI

David Talby — Thu, 14 May 2026 16:20:28 GMT

Based on an article published in Forbes Tech Council (May 2023)

Synthetic patient data looks like an elegant answer to a hard problem. Healthcare AI teams need data. Real patient data is protected by HIPAA, GDPR, and a patchwork of national regimes; acquiring, de-identifying, and sharing it is slow and expensive. Synthetic data promises a way around that — generate patients who never existed, train on them instead, ship a model. Gartner has forecast that the majority of AI training data will be synthetic within a few years, and a vendor ecosystem has grown up to meet the demand ([Preprints.org review, 2025](https://www.preprints.org/manuscript/202507.2567/v1)).

The elegance hides a set of real problems. Synthetic data is useful for specific things and dangerous for others. Teams that do not know the difference are shipping clinical models trained on data that systematically misrepresents the patients the model will eventually see — and that failure mode is not visible in any benchmark until something goes wrong in production.

Why synthetic data looks so attractive

The case for synthetic data is straightforward. Patient privacy regulations restrict data sharing. De-identification is technically difficult, legally fraught, and expensive to do at scale. Rare diseases and small subpopulations are systematically under-represented in real data, which limits the accuracy of AI models built to diagnose or serve them. Synthetic datasets appear to solve all three: no real patients means no privacy risk, no de-identification overhead, and — in theory — the ability to oversample rare cohorts to balance the training set.

The approach has genuine uses. Synthetic data is valuable for software testing, where you need realistic-looking records to exercise data pipelines without exposing PHI. It is useful for demos, training exercises, and teaching environments. It can help validate data-integration workflows before real data is loaded. For infrastructure, it is a reasonable substitute for the real thing.

The problems start when synthetic data is used to train or validate models that will then make decisions about real patients.

Problem 1: synthetic data is too clean

Real patient records are messy in specific, consequential ways. Lab values are missing because the test was never ordered or because the sample was hemolyzed and rejected. Medications appear with typos, brand-name and generic-name variants, and dosages that disagree between the med list and the clinical note. Diagnoses are documented once and never reconciled across subsequent encounters. Clinical notes contain contradictions, hedged language, and the trail of reasoning that preceded a provisional diagnosis being changed.

Generative models trained to produce synthetic records optimize for statistical plausibility, not for this kind of messiness. The resulting data looks normal. It passes visual inspection. It trains a model that works beautifully on more synthetic data — and then fails on the real thing, because the real thing is nothing like what the model was trained to handle.
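One simple way to catch the "too clean" problem early is to compare missingness between the real and synthetic sets, field by field. This is a minimal sketch under assumptions: records are plain dictionaries, the `hba1c` field and the 5-point tolerance are invented for illustration, and missingness is only one of the kinds of mess described above.

```python
def missing_rate(records: list[dict], field: str) -> float:
    """Share of records where `field` is absent or None."""
    if not records:
        return 0.0
    return sum(r.get(field) is None for r in records) / len(records)

def cleanliness_gaps(real: list[dict], synthetic: list[dict],
                     fields: list[str], tolerance: float = 0.05) -> dict[str, float]:
    """Fields that are missing noticeably less often in synthetic data than in real data."""
    gaps = {}
    for field in fields:
        diff = missing_rate(real, field) - missing_rate(synthetic, field)
        if diff > tolerance:
            gaps[field] = round(diff, 3)
    return gaps

# Invented example: half the real records lack the lab value, no synthetic ones do.
real = [{"hba1c": 7.1}, {"hba1c": None}, {}, {"hba1c": 6.4}]
synthetic = [{"hba1c": 6.9}, {"hba1c": 7.0}, {"hba1c": 6.5}, {"hba1c": 7.3}]
print(cleanliness_gaps(real, synthetic, ["hba1c"]))
```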

A recent systematic review in *The Lancet Digital Health* put the problem in clinical terms: synthetic data generators frequently fail to preserve the complex interactions between variables that matter for outcomes, such as how obesity and socioeconomic status jointly influence diabetes severity ([Synthetic data, synthetic trust, 2025](https://pmc.ncbi.nlm.nih.gov/articles/PMC12778113/)). When those relationships break down in the training data, predictive models systematically underestimate risk for vulnerable populations — precisely the populations that motivated the move to synthetic data in the first place.

Problem 2: the biases come along for the ride

Generative models are only as good as the data they are trained on, and every bias in the source data is reproduced — often amplified — in the synthetic data derived from it.

This is easy to see once you trace the data provenance. A hospital in a wealthy U.S. metro serves a patient population with specific demographic, socioeconomic, and clinical patterns. A synthetic generator trained on that hospital’s data will produce synthetic patients who look like those patterns: the right age distributions, the right disease prevalences, the right medication histories, the right income and insurance mixes. Ship the generator to another team; they train a model on synthetic data that preserves those patterns; the model then deploys in a rural clinic, a safety-net hospital, or an international setting, and underperforms in ways no one sees coming.

Peer-reviewed work has documented multiple mechanisms by which generative models amplify bias: distributional distortion in diffusion models, truncation-trick bias in GANs, and loss of rare-case representation across generations of iterative re-generation ([Shumailov et al., arXiv:2305.17493](https://arxiv.org/abs/2305.17493)). The authors of that paper coined the term “the curse of recursion” for the observation that models trained on generated data progressively forget the tails of the original distribution — the exact patients, rare presentations, and unusual combinations of conditions that a clinical AI needs to identify correctly.
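The tail-forgetting effect can be illustrated with a toy simulation. This is not the experiment from the paper cited above: it swaps in plain resampling of a dataset from its own empirical frequencies as a stand-in for "train a generator on the previous generator's output". The cohort, the `regenerate` function, and the seed are all hypothetical.

```python
import random
from collections import Counter

def regenerate(population: list[str], generations: int, seed: int = 0) -> list[int]:
    """Resample a dataset from itself repeatedly and return how many distinct
    categories survive at each generation (generation 0 is the original)."""
    rng = random.Random(seed)
    current = list(population)
    surviving = [len(set(current))]
    for _ in range(generations):
        freq = Counter(current)
        labels = list(freq)
        weights = [freq[label] for label in labels]
        # The next "generation" only ever sees what the previous one produced.
        current = rng.choices(labels, weights=weights, k=len(current))
        surviving.append(len(set(current)))
    return surviving

# Hypothetical cohort: one common presentation and twenty rare ones.
cohort = ["common"] * 480 + [f"rare_{i}" for i in range(20)]
print(regenerate(cohort, generations=10))
```

Once a rare category fails to appear in one generation it can never return in a later one, so the count of surviving categories can only stay flat or fall. In a clinical dataset, those lost categories are the rare presentations.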

Advocates point out that synthetic data can correct bias if you actively design the generator to do so. This is true. It is also a substantially harder engineering problem than “generate a statistically plausible version of our existing data,” and teams routinely conflate the two.

Problem 3: privacy risk is not eliminated, it is redistributed

The third pitch for synthetic data is privacy — no real patients, no risk. In practice, synthetic data carries privacy risks of its own, just in different forms.

Peer-reviewed work has shown that synthetic datasets can leak identifiable information about the individuals in the source data, either through membership-inference attacks or through direct reproduction of rare combinations of attributes that appear in only one real patient. The European Medicines Agency, *Nature*, and *PNAS* have all published on the risk ([Nature, 2025](https://www.nature.com/articles/d41586-025-02869-0); [PNAS, 2024](https://www.pnas.org/doi/10.1073/pnas.2414310121)). As with de-identification, the problem is hardest for exactly the cases synthetic data is supposed to help with: rare diseases, unusual presentations, and tail-of-distribution patients whose profile is close enough to unique that a generator cannot mask them while preserving their clinical reality.
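A basic screen for the direct-reproduction risk is to look for synthetic records whose combination of quasi-identifiers matches exactly one real patient. This sketch is not a membership-inference attack and does not establish safety; the field names and the `risky_synthetic_records` helper are hypothetical.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("age_band", "zip3", "diagnosis")  # hypothetical field names

def key(record: dict) -> tuple:
    """The quasi-identifier combination for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def risky_synthetic_records(real: list[dict], synthetic: list[dict]) -> list[dict]:
    """Synthetic records whose quasi-identifier combination matches exactly one
    real patient, i.e. a possible reproduction of a unique individual."""
    real_counts = Counter(key(r) for r in real)
    return [s for s in synthetic if real_counts.get(key(s), 0) == 1]

# Invented example: two real patients share a common profile, one is unique.
real = [
    {"age_band": "60-69", "zip3": "981", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "zip3": "981", "diagnosis": "type 2 diabetes"},
    {"age_band": "20-29", "zip3": "981", "diagnosis": "Erdheim-Chester disease"},
]
synthetic = [
    {"age_band": "60-69", "zip3": "981", "diagnosis": "type 2 diabetes"},
    {"age_band": "20-29", "zip3": "981", "diagnosis": "Erdheim-Chester disease"},
]
print(risky_synthetic_records(real, synthetic))
```

The flagged record is the rare-disease patient, which is the pattern described above: the cases synthetic data is meant to help with are the ones hardest to mask.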

There is also a research-ethics dimension that has quietly emerged. *Nature* reported in 2025 that some institutions have waived ethical review for research conducted on synthetic data, on the reasoning that no real humans are involved. The reasoning is wrong. Synthetic data derives from real human data, and the insights it produces affect real human patients when those insights are used to build clinical AI. The privacy layer may have moved; the ethical responsibility has not.

Problem 4: validation against real outcomes is still required

Even synthetic data that preserves statistical patterns well does not tell you how the model built on it will perform on real patients. That requires validation against real-world outcomes, which requires real-world data — the very thing synthetic data was supposed to let you skip.

This is the point regulators have been clearest on. The UK Medicines and Healthcare products Regulatory Agency, the FDA, and the EMA all require that clinical AI models used in regulated contexts be validated against real patient data before deployment. Synthetic data can supplement that validation and support model development; it cannot replace it. A recent review of synthetic data in laboratory medicine, hosted on ScienceDirect, put it directly: any insight derived from synthetic data must be rigorously validated against real-world outcomes before clinical implementation ([Synthetic data in the clinical laboratory, 2026](https://www.sciencedirect.com/science/article/pii/S0009898126000604)).

When to use synthetic data — and when not to

After building healthcare AI systems for a decade, my rule of thumb is straightforward. Use synthetic data when the goal is non-clinical: testing a data pipeline, exercising a privacy-preserving ETL, building a demo environment, training data scientists who are not yet cleared for PHI access, or validating an integration. In those contexts, synthetic data is often the right answer and carries few downsides.

Do not use synthetic data as the training or validation substrate for a clinical model that will eventually make decisions about real patients. Even the best synthetic data is downstream of real data and inherits its biases, its blind spots, and its privacy risks. A model trained on it may pass its synthetic benchmarks and fail its first real patient.

The alternative is not to give up on privacy protection. The alternative is to do the harder, more valuable work on real data: regulatory-grade de-identification that meets HIPAA Expert Determination, not just Safe Harbor; de-identification pipelines that handle clinical notes, PDFs, DICOM images, and structured records uniformly; federated or in-environment training that never moves data across organizational boundaries; auditable data provenance so that anyone who needs to can see where every training example came from.

This is where our team at John Snow Labs has put most of our investment over the years. Our de-identification pipeline achieves 96% F1 on PHI detection on peer-reviewed benchmarks — compared with 91% for Azure’s clinical NLP, 83% for AWS Comprehend Medical, and 79% for GPT-4o on the same evaluation — and runs inside the customer’s environment. We de-identified roughly 2 billion clinical notes at Providence under HIPAA Expert Determination, red-teamed by an independent third party, with zero confirmed re-identifications ([johnsnowlabs.com/case-studies](https://www.johnsnowlabs.com/case-studies/)). The real answer to the data-access problem is not generating fake patients well enough to fool your own model. It is handling real patient data well enough that you can use it safely.

Key takeaways

Synthetic patient data is a useful tool for non-clinical work: pipeline testing, demos, training exercises, integration validation. It is a poor training and validation substrate for clinical AI, because it tends to be too clean, inherits and often amplifies the biases of its source data, carries privacy risks of its own, and cannot be used to validate performance against real outcomes. The better path for healthcare organizations that want to build AI on patient data safely is regulatory-grade de-identification, in-environment processing, and auditable provenance on real data — not a synthetic layer that looks cleaner but hides the same problems and adds new ones.

FAQ

Is all synthetic patient data problematic for healthcare AI?

No. Synthetic data is fine — often ideal — for purposes that do not involve training or validating clinical models. Pipeline testing, user demonstrations, integration work, classroom exercises, and pre-production ETL are all appropriate uses. The problems arise when synthetic data becomes the basis on which a model that will affect real patients is trained or validated.

Doesn’t synthetic data solve the privacy problem?

It moves the problem rather than solving it. Synthetic datasets can still leak information about the individuals in the source data through membership-inference attacks, and rare-case patients are particularly at risk because their clinical profile is often close to unique. Institutions should treat synthetic data as deriving from real patients and apply appropriate oversight accordingly.

Why does synthetic data tend to be “too clean”?

Generative models optimize for statistical plausibility, not for the specific messiness of real clinical records — missing labs, typo’d medications, contradictory notes, unreconciled diagnoses. Models trained on the output perform well on synthetic test data and worse on real data, because they never learned to handle the noise that dominates real clinical work.

Can well-engineered synthetic data correct bias in the source data?

In principle yes, but this requires deliberate engineering of the generator to over-represent under-represented cohorts, not just a statistically faithful copy. Most off-the-shelf synthetic-data workflows do the latter and therefore preserve or amplify source-data biases. The “we’ll correct for bias with synthetic data” argument is real engineering work, not a property of synthetic data as such.

What about Synthea, MDClone, and other established synthetic-data systems?

Tools like Synthea generate patients from publicly available health statistics and clinical guidelines, which suits their actual use cases: software testing, teaching, and research prototyping. MDClone-style tools that derive synthetic versions from real hospital data are useful for certain research purposes. Neither was designed as a replacement for real-data training and validation in regulated clinical AI, and teams should be cautious about using them that way.

What should organizations do instead?

Invest in regulatory-grade de-identification that meets HIPAA Expert Determination, handles structured and unstructured data uniformly, and runs inside your environment. Combine that with in-environment model training so data never leaves your control, and auditable provenance so every training example can be traced back to a real source. That is the path that scales, meets regulatory expectations, and produces models that actually work on real patients.

How do regulators view synthetic data in clinical AI validation?

The FDA, EMA, and UK MHRA all currently require validation of clinical AI models against real patient data before regulatory approval. Synthetic data can supplement model development and testing, but it does not replace the real-world evidence step. Any vendor or team that tells you otherwise is not reading the current regulatory guidance accurately.

---

David Talby is CEO of John Snow Labs, a healthcare AI company whose de-identification, medical NLP, and Medical LLM technology is used by 500+ healthcare and life sciences organizations. The Providence 2-billion-note de-identification project referenced here is described in a public case study at [johnsnowlabs.com/case-studies](https://www.johnsnowlabs.com/case-studies/).

Three non-negotiables that separate regulatory-grade healthcare AI from LLM pilots

David Talby — Wed, 13 May 2026 14:58:42 GMT

*Based on articles in Forbes (April 2023) and CIO (June 2023)*

Medical question-answering benchmarks flatter general-purpose LLMs. The same model that scores 85% on USMLE-style multiple choice can produce medically unsupported statements at a rate of 19.7% on textbook-grounded questions ([Quantifying Hallucinations, 2025](https://arxiv.org/html/2603.09986)), and at even higher rates on open-ended clinical generation. For regulated work, that gap between benchmark score and real behavior is the gap between a useful demo and a production-grade system. Closing it requires three specific properties no general-purpose LLM comes with by default.

I call that bar regulatory-grade AI. Organizations buying AI for healthcare, life sciences, finance, or law should require all three before putting a model into any workflow that will be audited.

What “regulatory-grade” actually means

Regulatory-grade AI is shorthand for the set of properties an AI system needs to operate inside a regulated industry — where an auditor can ask about any decision, where data sovereignty is not negotiable, and where hallucinated outputs are not a quirky failure mode but a patient-safety event or a compliance violation.

It is a higher bar than “high-performing” or “state-of-the-art.” A model can top a leaderboard and fail this bar. A model that clears this bar will never be as flexible or as broadly capable as the latest frontier model, and that is the point: in regulated work you are trading capability for accountability, and the trade is the right one.

The three non-negotiables below are what I have seen separate AI that gets past procurement, legal, security, and compliance review from AI that stalls there indefinitely.

Non-negotiable 1: every answer cites its source

A peer-reviewed benchmark using textbook-grounded questions showed LLaMA-70B-Instruct hallucinating in roughly 20% of answers, with 98.8% of those responses still receiving “maximal plausibility” ratings from human evaluators ([arXiv:2603.09986](https://arxiv.org/html/2603.09986)). Read that again. One in five answers was medically unsupported, and humans found those answers just as plausible-sounding as the correct ones.

This is the core problem with using a general LLM for regulated decisions. The model does not know when it is hallucinating, and neither do you. Confidence and correctness are decoupled.

The fix is structural, not statistical. A regulatory-grade system does not produce an answer without also producing its evidence. Every claim returned to the user comes with the source it was drawn from — a clinical guideline, a peer-reviewed paper, a patient chart, a terminology code — in a form the user can click through and verify.

A clinician reading the system’s output sees not just “the guideline recommends X” but the specific guideline, the specific section, and — if applicable — the study size and year. That lets the clinician accept a recommendation grounded in a 40,000-patient randomized trial differently from one grounded in a case report of 12 patients. The system’s confidence in the answer stops mattering; the evidence behind it starts mattering.

Practically, this requires retrieval-augmented generation over a trusted corpus, with citations passed through to the output and preserved in the audit log. A model that answers from its pre-training weights alone cannot meet this bar, no matter how well it scores on any benchmark. If you cannot click the citation and read the source, the answer is not regulatory-grade.
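A minimal sketch of that requirement, assuming a toy in-memory corpus and a keyword-overlap retriever standing in for a real search index and language model. The names `Passage`, `CORPUS`, `retrieve`, and `answer_with_citations` are hypothetical, and the corpus sentences are synthetic placeholders rather than clinical guidance. It illustrates two structural points: an answer is only returned together with the passage it came from, and the same record is written to an audit log.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str   # hypothetical identifier of a trusted document
    section: str  # section the text was taken from
    text: str


# Synthetic placeholder corpus (assumption); a real one would hold curated sources.
CORPUS = [
    Passage("guideline-A", "4.1", "Drug X is recommended as initial therapy for condition Y in adults."),
    Passage("guideline-A", "6.3", "Patients on drug X should have renal function checked annually."),
    Passage("paper-B", "results", "The randomized trial of drug X enrolled 40000 patients."),
]


def tokenize(text):
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,?").lower() for w in text.split()}


def retrieve(question, corpus, min_overlap=2):
    """Return passages sharing at least min_overlap words with the question, best first."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p.text)), p) for p in corpus]
    hits = [(s, p) for s, p in scored if s >= min_overlap]
    hits.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in hits]


def answer_with_citations(question, corpus, audit_log):
    """Answer only from retrieved text; refuse (answer=None) when nothing supports it."""
    passages = retrieve(question, corpus)
    if passages:
        top = passages[0]
        result = {
            "question": question,
            "answer": top.text,
            "citations": [{"doc_id": top.doc_id, "section": top.section}],
        }
    else:
        result = {"question": question, "answer": None, "citations": []}
    # The audit log keeps the question, the answer, and the evidence together.
    audit_log.append(json.dumps(result, sort_keys=True))
    return result


audit_log = []
supported = answer_with_citations("What is the recommended initial therapy for condition Y?", CORPUS, audit_log)
unsupported = answer_with_citations("Who won the game yesterday?", CORPUS, audit_log)
print(supported["citations"], unsupported["answer"])
```

The design choice worth noting is the refusal branch: when retrieval finds no supporting passage, the sketch returns no answer at all instead of falling back to generation from weights.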

This is where healthcare-specific models begin to diverge structurally from general LLMs. John Snow Labs’ Medical LLM is built around cited outputs — every answer returns the supporting document with the specific paragraph, not just a confident paragraph of prose. The difference shows up in clinician trust, and it shows up in audit defensibility years later.

Non-negotiable 2: the system is tested and documented against responsible-AI criteria

A model that passes a clinical accuracy benchmark has shown one thing: it performs well on that benchmark. It has not shown that it performs equitably across demographic groups, that it refuses unsafe questions, that its training data is free of privacy leakage, or that its outputs are reproducible.

Regulators increasingly expect evidence on all of these. The EU AI Act classifies most healthcare AI as high-risk, which triggers requirements around fairness testing, risk management documentation, post-market monitoring, and transparency to deployers ([EU Regulation 2024/1689](https://eur-lex.europa.eu/eli/reg/2024/1689/oj)). The European Medicines Agency’s guiding principles on LLMs in regulatory science explicitly require evaluation against fundamental rights principles including fairness, human oversight, privacy, and explicability ([EMA Guiding Principles, 2024](https://www.ema.europa.eu/en/documents/other/guiding-principles-use-large-language-models-regulatory-science-medicines-regulatory-activities_en.pdf)). Multiple U.S. states have passed or are passing AI-specific legislation with comparable requirements. None of this is going away.

Meeting the bar in practice means running a battery of tests on every model before deployment and on a scheduled cadence after. The tests cover robustness against paraphrased and adversarial inputs, bias across demographic slices, toxicity and refusal behavior, representation of populations in training data, and leakage of personally identifiable or copyrighted material. The results are captured in a document that a regulator or internal compliance team can read without having to ask for a Jupyter notebook.

The tests also have to be executable — not a one-off PDF that freezes the system in time, but a suite that runs on every model update. Models drift. Data distributions drift. Regulatory expectations evolve. A responsible-AI evaluation that is not repeatable is a responsible-AI evaluation that is out of date within a quarter.

This is the piece most teams underestimate when they first pilot an LLM in healthcare. The model works in the demo. Then legal asks for the bias report, or security asks for the data-leakage audit, or the procurement team asks for the explainability documentation, and the project gets paused for months while the team scrambles to build something that should have existed from day one. Pacific AI’s governance platform exists for exactly this reason — to run the required tests continuously, produce documentation regulators can read, and keep the evaluation current as models and regulations change.

Non-negotiable 3: the system runs where your data lives

The third non-negotiable is the simplest to state and the most commonly violated: your data must not leave your control. That means the model runs inside your environment — on-premises, in your private cloud, or in an air-gapped region — with no API calls to an external service and no logs of patient data on a third-party’s infrastructure.

This is not a nice-to-have. HIPAA, GDPR, and most national data-protection regimes impose strict limits on where protected data can go and who can access it. Business associate agreements, data processing agreements, and sub-processor audit requirements all break down when the AI system is a black-box API hosted in a jurisdiction your legal team has not approved.

The common workaround, “we’ll just de-identify the data before we send it to the API,” does not survive contact with reality. Clinical notes contain vast amounts of implicit identifying information beyond the eighteen HIPAA Safe Harbor identifiers: rare diseases, unusual procedures, named providers, geographic references, temporal anchors. Large-scale re-identification studies have repeatedly shown that naive de-identification is not enough for complex clinical text. Regulatory-grade de-identification is itself a hard problem requiring purpose-built models, which is why our team at John Snow Labs has published the evidence behind the 96% F1 score we ship, versus 91% for Azure’s clinical NLP service, 83% for AWS Comprehend Medical, and 79% for GPT-4o on the same peer-reviewed evaluation.

Running in your environment is not incompatible with cloud. It means you control the infrastructure, the encryption keys, the logs, and the network perimeter. AWS, Azure, and GCP all support deployment models that keep your data within your tenancy and out of a shared service. What the requirement rules out is handing patient data to a multi-tenant API whose provider can read, log, retain, or use it for any purpose beyond answering your specific query.

The practical effect on architecture is substantial. In-environment deployment means the model has to be small enough, efficient enough, and operationally hardened enough to run on infrastructure your team already manages. Our Medical LLM runs on a single GPU at the scale of hundreds of thousands of documents per day, precisely because that is what customers deploying in-environment need. A model that requires a 10-GPU cluster and an unconstrained internet connection to function is not a realistic option for a U.S. health system or an EU-based pharmaceutical company. It is a leaderboard exhibit.

Why this is harder than it looks — and why it is worth it

None of these three properties comes for free. Citations require a retrieval layer and a trusted document corpus. Responsible-AI testing requires evaluation infrastructure, labeled test data, and documentation discipline. In-environment deployment requires smaller, more efficient models and the engineering to operate them. Every one of these raises the cost of building the system and lowers its apparent capability relative to the latest frontier API.

The trade-off is real. It is also the trade-off every regulated industry has made for every technology it has ever adopted. Clinical laboratories do not use the fastest assays; they use the validated ones. Trading systems do not deploy the highest-performing models straight from a Kaggle notebook; they deploy the ones that have cleared risk and compliance review. Aircraft avionics do not run on the latest operating system; they run on the software that has cleared DO-178C.

Healthcare AI is going through the same maturation. The systems that actually reach production in regulated environments will not be the ones that win the most benchmarks. They will be the ones that can cite their sources, show their test results, and run where the data lives — because those are the systems a CIO, a CMIO, a compliance officer, and a regulator can all sign off on.

Key takeaways

Regulatory-grade AI is the bar healthcare organizations should require before deploying an LLM into any workflow that will be audited. The three non-negotiables are that the system cites its sources rather than generating from weights alone, that it is documented against responsible-AI criteria in a form auditors can read, and that it runs inside the customer’s environment with no data leaving. General-purpose LLMs meet none of these by default. Healthcare-specific systems can be built to meet all three. For executive buyers evaluating vendors, these three questions — ask for the citation example, ask for the responsible-AI report, ask for the deployment architecture — are the fastest way to separate the demos from the systems that will make it to production.

FAQ

What does regulatory-grade AI mean?

It is a bar higher than “high-performing.” A regulatory-grade AI system cites its sources for every answer, has documented and executable responsible-AI tests, and runs inside the customer’s environment with no data leaving. Systems missing any of these three cannot reliably pass compliance review in healthcare, life sciences, finance, or law.

Why isn’t a high benchmark score sufficient?

Benchmarks measure a narrow slice of performance. A peer-reviewed study found LLaMA-70B-Instruct hallucinating on roughly 20% of textbook-grounded medical questions while 98.8% of those responses still sounded plausible to evaluators. Benchmarks do not capture hallucination rate in open-ended generation, fairness across populations, reproducibility, or privacy risk.

How does citing sources reduce hallucination risk?

It changes the locus of trust. Rather than trusting the model’s output, the user trusts the underlying source the model retrieved and presented. If the source is wrong, the user can see that. If the model fabricates a source or the citation does not actually support the claim, the user sees that too. The model’s confidence stops being the basis for the decision; the evidence does.

What responsible-AI tests should a healthcare LLM undergo?

At minimum: robustness to adversarial and paraphrased inputs, bias across demographic slices, toxicity and refusal behavior, representation of training-data populations, data-leakage testing, and reproducibility of outputs. The tests should be executable and re-runnable on every model update, and results should be presented in a form auditors and regulators can read without technical handholding.

Does “running in the customer’s environment” preclude cloud?

No. It means you control the infrastructure, the encryption keys, the logs, and the network perimeter. AWS, Azure, and GCP all support configurations that satisfy this bar. What it rules out is sending patient data to a multi-tenant API whose provider can read, log, retain, or use it beyond answering the specific query.

Why can’t naive de-identification let us use general-purpose APIs?

Clinical notes contain implicit identifying information well beyond the eighteen HIPAA Safe Harbor fields: rare diseases, unusual procedures, named providers, geographic references. Regulatory-grade de-identification is itself a hard problem that requires purpose-built models; peer-reviewed evaluations show general-purpose LLMs trailing purpose-built systems by multiple F1 points on this task. Even if you solve de-identification, you still face the data-sovereignty, logging, and sub-processor issues that BAAs and DPAs are built around.

Which buyer in the organization owns this bar?

It depends on the organization. CIOs and CAIOs typically own the deployment and governance side. CMIOs and CDOs weigh in on the accuracy and clinical-fit side. Compliance and legal review the documentation and data-flow side. The three non-negotiables exist precisely because healthcare AI purchases cross all four of these desks, and a system that fails any one of them fails the purchase.

Why the FDA chose NLP to close a blind spot in post-market drug safety

David Talby — Tue, 12 May 2026 16:21:36 GMT

*Originally covered in FedScoop, Enterprise AI News, and Pharmaceutical Technology — April 2023*

The FDA’s Sentinel Initiative monitors the safety of drugs used by roughly 138 million people in the United States. It is the largest active post-market surveillance system in the world, and since 2016 it has informed more than 120 regulatory decisions, including label changes on hydrochlorothiazide, beta-blockers, and other commonly prescribed medications. Sentinel works. It also has a structural gap that insurance claims alone cannot close, and in April 2023 the FDA chose Cerner Enviza (an Oracle company) and John Snow Labs to help close it using natural language processing on clinical notes.

The blind spot Sentinel has always had

Sentinel’s backbone is insurance claims data. Claims are excellent for what they capture: prescriptions dispensed, procedures billed, diagnosis codes submitted, hospitalizations recorded. They cover clearly defined enrollment periods and are close to complete for the events a payer pays for.

They are also missing most of what a clinician actually writes down. Symptom descriptions, suspected adverse drug reactions that never become a formal diagnosis, patient-reported side effects, medication adherence notes, lifestyle and social context, the reasoning behind a treatment change — these live in unstructured clinical notes, not in claims fields. Research in *npj Digital Medicine* has shown that the most frequently cited reasons analysts cannot use the current claims-based Sentinel for a given safety question are the lack of clinical detail needed to identify outcomes accurately, missing or inaccurate measures of confounders, and the absence of computable phenotyping algorithms precise enough for the study population ([Desai et al., 2021](https://www.nature.com/articles/s41746-021-00542-0)).

Congress recognized the gap. The 21st Century Cures Act and subsequent directives required the FDA to expand Sentinel’s infrastructure to include electronic health record data on at least 10 million lives. The FDA’s Innovation Center stood up a Real-World Evidence Data Enterprise linking EHRs to claims for more than 25 million patients across commercial and academic partners ([Maro et al., *Clinical Pharmacology & Therapeutics*, 2023](https://pubmed.ncbi.nlm.nih.gov/39385712/)).

Getting the data is the easy part. Making it usable is the harder one — because the most valuable fields in an EHR are free text.

Why manual chart review does not scale to a population

Post-market surveillance operates at a scale where a human reviewer is a bottleneck, not a solution. Consider a signal-identification study of a drug used by a million patients: even if suspected adverse events show up in the notes of only 5% of those patients, that is 50,000 charts to review, with several relevant notes per chart. A team of ten experienced reviewers working at the typical rate of 20 to 30 charts per day would need roughly 170 to 250 working days, most of a year, to finish a single study. By then the signal has either been confirmed from another source or the drug has been on the market long enough that the safety question has shifted.
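The workload arithmetic can be checked directly. The sketch below uses the figures from the paragraph above plus one assumption that is not in the text, a 250-day working year, and it counts a single review pass per chart, so any second review or adjudication would add to the total.

```python
# Figures taken from the paragraph above.
patients = 1_000_000
event_share_pct = 5                     # percent of patients whose notes mention a suspected event
reviewers = 10
charts_per_reviewer_per_day = (20, 30)  # low and high end of the typical rate

# Assumption for illustration only: about 250 working days per year.
WORKING_DAYS_PER_YEAR = 250

# Integer arithmetic avoids floating-point noise in the chart count.
charts = patients * event_share_pct // 100


def working_days(total_charts, team_size, daily_rate):
    """Working days for the whole team to review every chart once."""
    return total_charts / (team_size * daily_rate)


slowest = working_days(charts, reviewers, min(charts_per_reviewer_per_day))
fastest = working_days(charts, reviewers, max(charts_per_reviewer_per_day))

print(f"charts to review: {charts}")
print(f"team working days: {fastest:.0f} to {slowest:.0f}")
print(f"team working years: {fastest / WORKING_DAYS_PER_YEAR:.2f} to {slowest / WORKING_DAYS_PER_YEAR:.2f}")
```

With these inputs the team needs between about 167 and 250 working days for one pass through one study, before a second question about the same drug can even be started.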

The result is that safety investigations that require clinical detail either get rescoped down to what claims alone can answer, or they get dropped. Both outcomes are bad. Rescoping loses the granularity of the question. Dropping means the agency waits longer for a signal it could have caught.

This is the bottleneck that NLP is built to remove. Adverse event mentions, negation context, temporal relationships to drug exposure, symptom severity — these are standard outputs of modern clinical NLP pipelines. What has held the technology back from Sentinel is not the algorithms. It is regulatory-grade accuracy at production scale, and the governance required for an FDA program.

What the FDA actually chose

The project is called MOSAIC-NLP — Multi-source Observational Safety Study for Advanced Information Classification Using NLP models. It is a two-year effort under the Sentinel Innovation Center, with Cerner Enviza contributing the EHR and life-sciences research platform and John Snow Labs contributing the clinical NLP. Children’s Hospital of Orange County, National Jewish Health, and Kaiser Permanente Washington Health Research Institute are providing clinical expertise and access to real-world data.

The first use case is deliberately hard: montelukast. The asthma drug carries an FDA boxed warning for neuropsychiatric side effects including mood changes, sleep disturbance, and suicidal ideation. These events are the kind of signal that is frequently documented in clinical notes (a parent tells the pediatrician that a child has been having nightmares or sudden mood swings since starting the medication) but rarely shows up as a billed diagnosis. If claims alone could answer the question, the boxed warning would not have arrived only after years of post-market accumulation.

The technical work is straightforward in outline and unforgiving in execution. Pipelines must extract drug exposures, adverse events, symptoms, severity, negation, and temporal relationships from notes written by thousands of different clinicians across different institutions. They must run at the scale of tens of millions of documents. They must do it under HIPAA Safe Harbor or Expert Determination, with no patient data leaving the institutions’ secure environments. And the outputs must be auditable and reproducible, because they will contribute to regulatory decisions.
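To make the extraction requirement concrete, here is a deliberately small rule-based sketch of negation-aware adverse event detection. It is not the MOSAIC-NLP pipeline or any John Snow Labs model: the term list, the negation cues, and the function name `extract_events` are assumptions chosen for illustration, and a production system would use trained models and also handle severity and temporality, which this sketch ignores.

```python
import re

# Hypothetical mini-lexicons for illustration only.
ADVERSE_EVENT_TERMS = ["nightmares", "mood swings", "suicidal ideation", "insomnia"]
NEGATION_CUES = ["no", "denies", "without", "negative for"]


def extract_events(note):
    """Return each lexicon term found in the note, flagged as negated or not.

    A term counts as negated when a negation cue appears earlier in the
    same sentence. This is a crude stand-in for real negation detection.
    """
    findings = []
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        lowered = sentence.lower()
        for term in ADVERSE_EVENT_TERMS:
            pos = lowered.find(term)
            if pos == -1:
                continue
            preceding = lowered[:pos]
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", preceding)
                for cue in NEGATION_CUES
            )
            findings.append({"event": term, "negated": negated, "sentence": sentence})
    return findings


# Synthetic note written for this example; it contains no patient data.
note = (
    "Mother reports nightmares since starting montelukast. "
    "Patient denies suicidal ideation. "
    "No insomnia."
)
events = extract_events(note)
for e in events:
    print(e["event"], "negated" if e["negated"] else "affirmed")
```

Even this toy version shows why negation matters: two of the three mentions in the note are events the patient does not have, and counting them as signals would distort any downstream analysis.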

Why this is a credibility event for clinical NLP

For those of us who have been building medical language models, the MOSAIC-NLP selection is worth pausing on. The FDA picks its Sentinel partners carefully. The agency’s Innovation Center has been explicit that moving NLP into routine pharmacovigilance is a strategic priority, and that any system in that role has to meet the same bar of methodological rigor that the rest of Sentinel meets — peer-reviewed methods, reproducible analyses, distributed data governance, validated phenotypes.

Two things about the 2023 announcement signal where the field is going. First, the FDA selected partners whose NLP is already deployed at scale in other healthcare settings rather than a research prototype. Spark NLP for Healthcare is in production at health systems and pharma companies whose models need to run on billions of notes without sending data to an external API. Second, the program is not an exploratory study. It is a production evaluation designed to inform how the FDA routinely uses unstructured data in future safety investigations.

For those evaluating NLP for regulated healthcare workflows, three questions are now effectively settled at the agency level, and anyone procuring this technology should expect to answer them: Does the pipeline run inside your environment, with no data leaving it? Are the underlying models published, peer-reviewed, and independently validated on clinical benchmarks? Can the entire pipeline’s outputs be reproduced and audited years later? If any of those answers is “no,” the pipeline is not ready for regulated work.

What MOSAIC-NLP tells us about where real-world evidence is heading

The deeper story here is not about one asthma drug. It is about what real-world evidence looks like when it no longer depends on structured claims alone.

Claims-based Sentinel analyses can tell you that a patient was prescribed a drug, filled it, and later had an emergency room visit. They cannot tell you whether the patient’s mood changed, whether the dose was adjusted, whether a family member reported the change, or whether the clinician attributed the event to the drug and documented that attribution. All of that context sits in free text, and all of it changes what a pharmacoepidemiologist can and cannot conclude.

The Sentinel Innovation Center has laid out four strategic priorities to close this gap: data infrastructure that links EHRs with claims, feature engineering to turn notes into analyzable variables, causal inference methods that account for the new sources of bias these data introduce, and detection analytics that flag signals sooner. NLP is a prerequisite for the middle two. Without reliable extraction of adverse events, severity, negation, and temporality, the statistical methods downstream have nothing new to work with.

The direction is clear. The agency’s recent 2022-to-2024 Sentinel assessment notes that NLP-supported signal identification is being folded into routine pharmacovigilance activities ([FDA Sentinel System Assessment, 2025](https://www.fda.gov/media/189028/download)). Life sciences sponsors whose post-approval commitments include Sentinel-ready analyses will face the same expectations. The pharmaceutical companies already planning their next generation of post-marketing safety programs are asking the same three procurement questions the FDA is — accuracy on clinical benchmarks, in-environment deployment, reproducibility — and building their vendor evaluations around the answers.

Key takeaways

Post-market drug safety surveillance has lived inside claims data for most of its history because the technology to read clinical notes at scale was not ready. That is changing. The FDA’s selection of Cerner Enviza and John Snow Labs for MOSAIC-NLP is a concrete signal that production-grade clinical NLP is now considered reliable enough to contribute to regulatory decisions, starting with a drug whose most serious side effects show up in notes far more often than in claims.

For healthcare AI leaders, the takeaway is a procurement one. The same criteria the FDA is applying — peer-reviewed accuracy, in-environment deployment, reproducible outputs — are the criteria that should govern any clinical NLP procurement aimed at regulated work. The bar has moved. The technology has to meet it.

FAQ

What is the FDA Sentinel Initiative?

Sentinel is the FDA’s national post-market surveillance system for drugs, vaccines, biologics, and medical devices. It uses a distributed data network covering approximately 138 million people, primarily through insurance claims linked with a growing subset of electronic health records, and has informed more than 120 regulatory decisions since 2016.

Why is NLP being added to Sentinel now?

Claims data captures what was billed, not what was observed. Many adverse events — especially symptom-based events like mood changes, fatigue, or cognitive complaints — are documented in clinical notes but never reach a structured diagnosis code. NLP is the only scalable way to extract that information, and production-grade medical NLP has reached the accuracy and governance bar the FDA requires.

What is MOSAIC-NLP?

Multi-source Observational Safety Study for Advanced Information Classification Using NLP models — a two-year FDA Sentinel Innovation Center project with Cerner Enviza, John Snow Labs, Children’s Hospital of Orange County, National Jewish Health, and Kaiser Permanente Washington Health Research Institute. The first study evaluates neuropsychiatric side effects of montelukast.

Why montelukast as the first use case?

Montelukast carries an FDA boxed warning for neuropsychiatric side effects. These events are frequently documented in clinical notes but rarely coded as diagnoses, which makes them a strong test case for NLP-augmented signal identification and a question claims data alone cannot answer well.

What technical bar does NLP have to clear to be used in FDA regulatory work?

Three things, in my view. Peer-reviewed, independently validated accuracy on clinical benchmarks. Deployment inside the data owner’s secure environment with no patient data leaving — distributed data analytics is a core Sentinel principle. Fully reproducible and auditable outputs, because the results contribute to regulatory decisions that may be revisited years later.

Does this mean NLP will replace claims-based Sentinel analyses?

No. Claims data remains the backbone of Sentinel and is well-suited to the questions it was designed to answer. NLP extends Sentinel to questions claims alone cannot answer — questions where the clinical detail, symptom documentation, or temporal relationship to a drug exposure lives in the note text.

How should healthcare AI leaders interpret this announcement?

It is the clearest signal yet that clinical NLP is moving from research into regulated production. If you are procuring NLP technology for pharmacovigilance, real-world evidence, or any clinical workflow that may be audited, ask the same questions the FDA is asking: peer-reviewed accuracy, in-environment deployment, reproducibility. Vendors who cannot answer all three are not ready for this work.

Building Responsible Language Models with the NLP Test Library

David Talby — Tue, 02 May 2023 15:57:00 GMT

The nlptest library is designed to help you build responsible language models by providing comprehensive testing capabilities for both models and data. It allows you to quickly generate, run, and customize tests to ensure your NLP systems are production-ready. With support for popular NLP libraries like transformers, Spark NLP, OpenAI, and spaCy, nlptest is an extensible and flexible solution for any NLP project.

In this article, we’ll dive into three main tasks that the nlptest library helps you automate: Generating tests, running tests, and augmenting data.

Automatically Generate Tests

Unlike the testing libraries of the past, nlptest allows for the automatic generation of tests – to an extent. Each TestFactory can specify multiple test types and implement a test case generator and runner for each one.

The generated tests are presented as a table with ‘test case’ and ‘expected result’ columns that correspond to the specific test. These columns are designed to be easily understood by business analysts who can manually review, modify, add, or remove test cases as needed. For instance, consider the test cases generated by the RobustnessTestFactory for an NER task on the phrase “I live in Berlin.”:

Starting from the text “John Smith is responsible”, the BiasTestFactory has generated test cases for a text classification task using US ethnicity-based name replacement.

Generated by the FairnessTestFactory and RepresentationTestFactory classes, here are test cases that can ensure representation and fairness in the model’s evaluation. For instance, representation testing might require a test dataset with a minimum of 30 samples of male, female, and unspecified genders each. Meanwhile, fairness testing can set a minimum F1 score of 0.85 for the tested model when evaluated on data subsets with individuals from each of these gender categories.
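The two thresholds in this example (at least 30 samples per gender category, and a minimum F1 score of 0.85 per category) can be written as plain pass/fail checks. The sketch below is a stand-in written for illustration, not the FairnessTestFactory or RepresentationTestFactory implementation; the function names and the synthetic labels are assumptions.

```python
from collections import Counter

CATEGORIES = ("male", "female", "unspecified")


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def representation_test(groups, min_count=30):
    """Pass for a category only if the dataset holds at least min_count samples of it."""
    counts = Counter(groups)
    return {g: counts.get(g, 0) >= min_count for g in CATEGORIES}


def fairness_test(groups, y_true, y_pred, min_f1=0.85):
    """Return (F1, passed) for each category present in the data."""
    results = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        score = f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx])
        results[g] = (score, score >= min_f1)
    return results


# Synthetic evaluation data (assumption): 30 male, 30 female, 10 unspecified samples.
groups = ["male"] * 30 + ["female"] * 30 + ["unspecified"] * 10
y_true = [1] * 70
y_pred = [1] * 30 + [1] * 20 + [0] * 10 + [1] * 10

representation = representation_test(groups)
fairness = fairness_test(groups, y_true, y_pred)
print(representation)
print(fairness)
```

In this synthetic data the "unspecified" category fails the representation check because it has only 10 samples, and the "female" category fails the fairness check with an F1 of 0.8, which is the kind of result a reviewer would then decide how to act on.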

The following are important points to take note of regarding test cases:

Each test type has its interpretation of “test case” and “expected result,” which should be human-readable. After calling h.generate(), it is possible to manually review the list of generated test cases and determine which ones to keep or modify.
Given that the test table is a pandas data frame, it is editable within the notebook (with Qgrid) or exportable as a CSV file to allow business analysts to edit it in Excel.
While automation handles 80% of the work, manual checks are necessary. For instance, a fake news detector’s test case may show a mismatch between the expected and actual prediction if it replaces “Paris is the capital of France” with “Paris is the capital of Sudan” using a replace_to_lower_income_country test.
Tests must align with business requirements, and one must validate this. For instance, the FairnessTestFactory does not test non-binary or other gender identities or mandate nearly equal accuracy across genders. However, the decisions made are clear, human-readable, and easy to modify.
Test types may produce only one test case or hundreds of them, depending on the configuration. Each TestFactory defines a set of parameters.
By design, TestFactory classes are usually task, language, locale, and domain-specific, enabling simpler and more modular test factories.

Running Tests

To use the test cases that have been generated and edited, follow these steps:

Execute h.run() to run all the tests. For each test case in the test harness’s table, the corresponding TestFactory will be called to execute the test and return a flag indicating whether the test passed or failed, along with a descriptive message.
After calling h.run(), call h.report(). This function will group the pass ratio by test type, display a summary table of the results, and return a flag indicating whether the model passed the entire test suite.
To store the test harness, including the test table, as a set of files, call h.save(). This will enable you to load and run the same test suite later, for example, when conducting a regression test.
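The generate, run, report, and save steps can be pictured with a small stand-in written in plain Python. `ToyHarness` and `toy_ner` below are hypothetical and are not the nlptest API; they only mimic the shape of the workflow described above, using two casing perturbations as the robustness tests and a toy "model" that is intentionally brittle on lowercase text.

```python
import json

GAZETTEER = {"berlin", "paris"}


def toy_ner(text):
    """Stand-in model: tags gazetteer cities, but only when the token is capitalized."""
    tokens = [t.strip(".,") for t in text.split()]
    return [t.lower() for t in tokens if t and t[0].isupper() and t.lower() in GAZETTEER]


class ToyHarness:
    """Hypothetical mini-harness mirroring the generate/run/report/save steps."""

    PERTURBATIONS = {"uppercase": str.upper, "lowercase": str.lower}

    def __init__(self, model, sentences):
        self.model = model
        self.sentences = sentences
        self.tests = []

    def generate(self):
        # One row per (sentence, perturbation); the expected result is the
        # model's prediction on the unperturbed sentence.
        self.tests = [
            {"test_type": name, "original": s, "test_case": fn(s),
             "expected_result": self.model(s)}
            for s in self.sentences
            for name, fn in self.PERTURBATIONS.items()
        ]
        return self

    def run(self):
        for t in self.tests:
            t["actual_result"] = self.model(t["test_case"])
            t["passed"] = t["actual_result"] == t["expected_result"]
        return self

    def report(self, min_pass_rate=1.0):
        # Group the pass ratio by test type and return an overall flag.
        summary = {}
        for t in self.tests:
            passed, total = summary.get(t["test_type"], (0, 0))
            summary[t["test_type"]] = (passed + int(t["passed"]), total + 1)
        rates = {k: p / n for k, (p, n) in summary.items()}
        return rates, all(r >= min_pass_rate for r in rates.values())

    def save(self, path):
        # Persist the test table so the same suite can be re-run later.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.tests, f, indent=2)


h = ToyHarness(toy_ner, ["I live in Berlin.", "She moved to Paris."])
rates, suite_passed = h.generate().run().report(min_pass_rate=0.9)
print(rates, suite_passed)
```

Because the expected result is recorded at generation time, the saved table doubles as a regression suite: a later model version can be run against exactly the same rows.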

Below is an example of a report generated for a Named Entity Recognition (NER) model, applying tests from five test factories:

All the metrics calculated by nlptest, including the F1 score, bias score, and robustness score, are framed as tests with pass or fail outcomes. This approach requires you to specify the functionality of your application clearly, allowing for quicker and more confident model deployment. Furthermore, it enables you to share your test suite with regulators who can review or replicate your results.

Data Augmentation

A common approach to improving the robustness or reducing the bias of your model is to include new training data that specifically targets these gaps. For instance, if the original dataset primarily consists of clean text without typos, slang, or grammatical errors, or doesn’t represent Muslim or Hindi names, adding such examples to the training dataset will help the model learn to handle them more effectively.

Generating examples automatically to improve the model’s performance is possible using the same method that is used to generate tests. Here is the workflow for data augmentation:

To automatically generate augmented training data based on the results from your tests, call h.augment() after generating and running the tests. However, note that this dataset must be freshly generated, and the test suite cannot be used to retrain the model, as testing a model on data it was trained on would result in data leakage and artificially inflated test scores.
You can review and edit the freshly generated augmented dataset as needed, and then utilize it to retrain or fine-tune your original model. It is available as a pandas dataframe.
To evaluate the newly trained model on the same test suite it failed on before, create a new test harness and call h.load() followed by h.run() and h.report().
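The leakage constraint in this workflow can be sketched in a few lines. The `augment` function below is a hypothetical illustration, not the implementation behind h.augment(): it applies the perturbations that the model failed on to the training sentences and drops any candidate that already appears in the held-out test cases.

```python
# Hypothetical perturbations matching two simple robustness test types.
PERTURBATIONS = {"uppercase": str.upper, "lowercase": str.lower}


def augment(training_sentences, failed_test_types, test_cases):
    """Create fresh training examples for failed test types, excluding test cases."""
    held_out = set(test_cases)
    augmented = []
    for sentence in training_sentences:
        for test_type in failed_test_types:
            candidate = PERTURBATIONS[test_type](sentence)
            # Skip anything the test suite already contains, to avoid leakage,
            # and skip duplicates.
            if candidate not in held_out and candidate not in augmented:
                augmented.append(candidate)
    return augmented


training_sentences = ["He flew to Berlin.", "I live in Berlin."]
test_cases = ["i live in berlin."]  # already part of the saved test suite
new_examples = augment(training_sentences, ["lowercase"], test_cases)
print(new_examples)
```

Here only "he flew to berlin." is added; the lowercase form of the second sentence is rejected because it is identical to an existing test case.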

By following this iterative process, NLP data scientists are able to improve their models while ensuring compliance with their ethical standards, corporate guidelines, and regulatory requirements.

Getting Started

Visit nlptest.org or run pip install nlptest to get started with the nlptest library, which is freely available. Additionally, nlptest is an early-stage open-source community project that you are welcome to join.

John Snow Labs has assigned a full development team to the project, and will continue to enhance the library for years, like our other open-source libraries. Regular releases with new test types, tasks, languages, and platforms are expected. However, contributing, sharing examples and documentation, or providing feedback will help you get what you need faster. Join the discussion on nlptest’s GitHub page. Let’s work together to make safe, reliable, and responsible NLP a reality.

3 Criteria for Regulatory-Grade Large Language Models

David Talby — Thu, 27 Apr 2023 05:33:00 GMT

Large language models (LLMs) have the potential to revolutionize decision-making and creative processes in many industries. In regulated sectors such as healthcare and life sciences, however, issues, gaps, and limitations remain, and they create the need for a higher standard of AI, known as Regulatory Grade AI. This article defines three criteria that make an AI model "regulatory grade": suitable for use in highly regulated fields while ensuring the utmost level of compliance, accuracy, and safety.

The No BS Principle

The first criterion is the "No Bullshit" principle. This simply means that LLMs should be designed in a way that prevents them from generating hallucinations or returning false information. Instead, they should be able to cite the source of any answer they provide.

This feature allows human experts to review the cited source and assess its reliability. For instance, a doctor may receive an answer regarding a clinical guideline from the AI model. If the model cites a study that involved fewer than 100 patients, the doctor can decide not to trust that specific paper, as it may not be sufficiently robust or representative. By providing a transparent trail of evidence, the "No BS" principle ensures that AI-generated information is held to the same standard as any other expert opinion.

Responsible AI

The second criterion for regulatory-grade AI is Applied Responsible AI. This means that AI models should undergo rigorous testing to ensure robustness, bias mitigation, fairness, toxicity reduction, accuracy, representation and prevention of data leakage.

These tests should be executable and presented in a human-readable format that can be easily shared with regulators. By demonstrating a commitment to responsible AI practices, organizations can reassure regulators, customers, and other stakeholders that their AI models are not only compliant but also adhere to the highest ethical and technical standards.

Privacy: No Sharing

The third criterion for regulatory-grade AI is the ability to run privately within an organization's firewall. This ensures that no proprietary or sensitive data is shared or transmitted outside the organization, maintaining security and confidentiality. Systems should be designed from the ground up to work seamlessly in high-compliance, air-gapped environments, protecting organizations from data breaches and other cyber threats.

By keeping data and processing in-house, organizations can maintain control over their information, which is essential for compliance with stringent regulations in fields such as healthcare and life sciences. Note that this criterion does not preclude running in a cloud environment - as long as you control the infrastructure and encryption keys and no one else ever sees your data.

Establishing high standards for AI models is crucial in today's rapidly evolving technological landscape. By developing and adopting regulatory-grade AI, organizations can ensure that their AI-driven decision-making processes are safe and effective. This will ultimately lead to better outcomes for patients, more efficient research, and increased trust in AI-powered solutions across regulated industries.

Medical Large Language Models Are Available Now And More Accurate Than General-Purpose LLM's

David Talby — Thu, 13 Apr 2023 05:38:00 GMT

Large language models (LLM’s) unlock new use cases in healthcare NLP, so as part of our commitment to always keep you at the state of the art, the latest 4.4 release of John Snow Labs’ Healthcare NLP includes a suite of new LLM’s that are healthcare specific, highly accurate, and production ready. Here’s what you need to know:

1. They cover a range of common healthcare use cases

Ask medical questions: Try asking the new BioGPT-JSL (the first ever closed-book medical question answering LLM based on BioGPT) “how to treat asthma”.
Understand medical research: Give the MedicalQuestionAnswering annotator a PubMed abstract and ask it what the key results were.
Generate clinical text: Prompt the MedicalTextGenerator annotator to complete “66yo male patient presents with severe back pain and …”.
Summarize clinical encounters: Ask the MedicalSummarizer annotator to turn a visit summary, discharge note, radiology report, or pathology report into one paragraph.
Summarize questions from patients: With 5 models for 5 contexts, MedicalSummarizer can also turn an email or post from a patient into a one-sentence question.

2. They’re more accurate than general-purpose LLM’s.

Clinical note summarization is 30% more accurate than general state-of-the-art LLMs (BART, Flan-T5, Pegasus).
On clinical entity recognition, our models make half as many errors as ChatGPT.
De-Identification out-of-the-box accuracy is 93% compared to ChatGPT’s 60% on detecting PHI in clinical notes.
Extracting ICD-10-CM codes is done with a 76% success rate versus 26% for GPT-3.5 and 36% for GPT-4.

It should come as no surprise that models trained with domain-specific data & experts outperform general-purpose models. We’re happy to share the Python notebooks if you need to reproduce or customize the benchmarks.

3. They’re production ready.

Runs on your infrastructure, behind your firewall, under your security controls. No text is ever sent to any third party or cloud service.
No need to buy a shipload of GPU’s. We’ve engineered these LLM’s to run on commodity hardware, which makes them both much faster and much cheaper to scale.
Regularly updated. LLM’s are regularly tuned as new research papers, clinical trials, guidelines and terminologies are published. Never go to production with a stale model.

Most importantly, models will be frequently rebuilt: We’ll keep rebuilding as research evolves. Because only one thing is certain about today’s state-of-the-art LLM’s: If you train one today, it will be outdated in 3-6 months.

If you’re a John Snow Labs customer, all these capabilities are included in your Healthcare NLP subscription. Install the new 4.4 release and give it a go. If you’d like to learn more, join the next webinar on automated summarization of clinical notes on April 26th.

An early evaluation of ChatGPT on common medical NLP tasks

David Talby — Mon, 20 Mar 2023 16:51:00 GMT

Motivation

John Snow Labs’ main promise to the healthcare industry is that we will keep you at the state of the art. We’ve reimplemented our core algorithms every year since 2017 – migrating to BERT, then BioBERT, then our own fine-tuned language models, then token classification & sequence classification models, then zero-shot learning, and recently end-to-end visual document understanding and speech recognition. Our biggest customers work with us to be future-proof, because unlike others we do not advocate or stick to a specific technology or approach, but instead evolve quickly to productize the best-performing techniques as they become available.

As such, we regularly try and benchmark new papers, models, libraries, or services that come out claiming new capabilities in healthcare NLP. This includes recently released models like ChatGPT and BioGPT. Since we get asked about them a lot, this blog post summarizes early findings in benchmarking them versus current state-of-the-art models for medical natural language processing tasks: named entity recognition, relation extraction, assertion status detection, entity resolution, and de-identification.

TL;DR: We do not recommend these models for production use today. They are impressive research advances, and we use them internally to bootstrap smaller and more accurate models, but they are not fit for the vast majority of real-world use cases.

What are the issues?

Simply put, ChatGPT doesn’t do what you need it to:

In our internal evaluations of such models, they significantly lag in accuracy compared to current state-of-the-art models. Precision ranged from 0.66 to 0.86, and recall was particularly problematic at between 0.40 and 0.52. This means that the human abstractors you have in place will still have to read entire documents, resulting in minimal time & cost savings for you.
There is no way to tune and provide feedback to these models. While they are very fast to bootstrap, you cannot tune ChatGPT, meaning that these models won’t improve over time based on feedback from your abstractors (or the many historical documents you’ve already abstracted). This is critical in healthcare systems, where models have to be localized due to differences in clinical guidelines, writing styles, and business processes.
These models are far slower and more expensive to run than their productized & more accurate counterparts. The cost of a single ChatGPT query is estimated to be $0.36. Given the length of typical patient stories, this implies paying tens of dollars in computing costs alone to analyze a single cancer patient’s story. Beyond hardware costs, there is also the issue of clock time: For example, reproducing clinical NER benchmarks on Facebook’s Galactica model required 4 hours to process a single 2-page note on a machine with 8 GPUs.
These models are not regularly updated. They are typically not retrained (new versions come out instead), or retrained annually. This means that you’ll be missing new clinical terms, medications, and guidelines – with no ability to tune or train these models.
There is no support for visual documents: these models cannot operate on scanned documents or images at all. This means that a portion of the work that is often required in real-world use cases – like clinical abstraction, clinical decision support, or real-world data – will either require a separate OCR pipeline or have to remain manual (since the models won’t be able to consider both text & images together when providing answers or recommendations).
There is no pre-processing pipeline. Clinical text like EHR records includes about 50% copy-and-pasted content, sections, and multiple pages. This has to be normalized first; note that the exact same sentence can mean different things if it’s under “chief complaint”, “history of present illness”, or “plan”. You’ll need to build that yourself as well, instead of using a pre-built & widely validated solution. Like scanned documents, this will also have to be built in a custom way outside the large language models.
Models like GPT-3 or ChatGPT require calling a cloud API – and sharing your data with the company providing them. Even if the setup becomes HIPAA compliant, and even if you’re allowed to share the data that way, you’re providing that company with the intellectual property needed to train & tune better clinical models, instead of building that intellectual property internally by privately tuning your own oncology abstraction models (which is how John Snow Labs’ software & license works).
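
To put the precision and recall ranges quoted in the first item above into a single number, the sketch below computes the F1 score (the harmonic mean of precision and recall) at both ends of the reported ranges. Pairing the two low values and the two high values is an assumption made for illustration; the evaluation summary does not say which precision goes with which recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


low = f1_score(0.66, 0.40)
high = f1_score(0.86, 0.52)
print(f"F1 between {low:.2f} and {high:.2f}")

# Recall is the share of true items that were found, so 1 - recall is missed.
missed_best = 1 - 0.52
missed_worst = 1 - 0.40
print(f"Missed items: {missed_best:.0%} to {missed_worst:.0%}")
```

A recall of 0.40 to 0.52 means that roughly half or more of the relevant items are missed, which is why abstractors would still have to read the whole document.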

Zero-Shot Learning

The good news is that you can get the benefits of prompt engineering right now with John Snow Labs, without these downsides. There are three first-to-market features that are already available and in use by early adopters to build production-grade, accurate, tunable, scalable, private, cheaper to run, kept current, healthcare-tuned, and compliant NLP solutions.

First, Zero-shot named entity recognition and zero-shot relation extraction enable you to extract custom entities and relationships from medical text without any training, tuning, or data labeling. This is useful when your goal is to optimize go-to-market time over accuracy – i.e. can I get a model that’s 80% accurate today, instead of a model that’s 95% accurate in 3 months? For example, if you’re automating the process of creating a cancer registry, then you may wish to invest to optimize the models for fields relating to tumor staging & histology, but go with zero-shot models for the 400+ rarely filled data fields.

Models based on Longformer, Albert, Bert, CamemBert, DeBerta, DistilBert, Roberta, and XlmRoberta have been implemented already. John Snow Labs has also gone a step beyond enabling prompt engineering by adding automated prompt generation, based on the T5 transformer. This functionality is currently not available in Hugging Face, which only supports zero-shot text classification.

No-Code Prompt Engineering

A second use for zero-shot prompt engineering is to bootstrap higher-accuracy models. Instead of labeling data from scratch, you can pre-annotate data with prompts, after which your domain experts only need to correct what it got wrong. This is similar to what we’ve done with programmatic labeling: we’ve added it to the NLP Lab as another way to bootstrap models, but not as a full replacement.

The NLP Lab lets you seamlessly combine models (transfer learning), rules (programmatic labeling), and prompts (zero-shot learning) to bootstrap NLP model development. This typically makes labeling projects 80%-90% faster than “from scratch” projects. Importantly, the ability to combine models, rules, and prompts for different tasks gives you the best of all worlds, in contrast to systems that focus on AI-assisted labeling with models (e.g. LabelBox), rules (e.g. Snorkel), or prompts (e.g. ChatGPT).
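
As a minimal sketch of the rules side of this workflow, the example below pre-annotates a sentence with regular expressions and returns candidate spans for an expert to review. It does not use the NLP Lab; the `RULES` table, the `pre_annotate` function, and the sample note are hypothetical.

```python
import re

# Hypothetical labeling rules: entity label -> regular expression.
RULES = {
    "BIOMARKER": re.compile(r"\b(ER|PR|HER2|EGFR|ALK)\b"),
    "STATUS": re.compile(r"\b(positive|negative)\b", re.IGNORECASE),
}


def pre_annotate(text, rules=RULES):
    """Return candidate spans (start, end, label, text) sorted by position.

    A domain expert would then review and correct these suggestions.
    """
    spans = []
    for label, pattern in rules.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


note = "Tumor is ER positive and HER2 negative."
suggestions = pre_annotate(note)
for start, end, label, value in suggestions:
    print(f"{start:>2}-{end:<2} {label:<9} {value}")
```

Suggestions from a pre-trained model or a prompt could be merged into the same span list, so the reviewer sees one set of pre-annotations regardless of where each came from.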

The NLP Lab is the first to market with a user interface intended for non-technical domain experts (i.e. medical doctors use it to train & tune models) that allows you to start with a prompt (or a pre-trained model, or a rule), see how well it performs on real data, provide feedback as needed, and publish that model. That is how you can quickly scale this effort to support a broad range of document types, medical contexts, and entities & relationships to extract. We see other customers already doing the same, and we’ve been doing this internally for a while now. There’s no point in reinventing the wheel by starting from a cloud API and rebuilding this entire workflow and user experience.

Zero-Shot Visual Question Answering

Zero-shot visual question answering is also already available. This model is based on the architecture of Donut: an OCR-free Document Understanding Transformer that can answer questions (i.e. do fact extraction) directly from an image or visual document. This does not require any training or tuning. As of October 2022, this architecture delivers state-of-the-art accuracy on a variety of visual document understanding benchmarks covering receipts, invoices, tickets, letters, memos, emails, and business cards – in English, Chinese, Japanese, and Korean.

The supported visual NLP tasks are document classification, information extraction, and question answering. For example, the model can be given an image of an event agenda and asked two questions about it. Without any training or tuning, it will answer both.

[Image, questions, and answers from the original post are not reproduced here.]

Note that these questions require visual understanding in addition to reading the text: The model should implicitly deduce that this image is an agenda of an event, that it’s most likely that the times on the left column state when each event happens, and that if what looks like a person’s name appears next to a topic, it’s most likely that this person is the speaker for that session. This “common sense” knowledge is available out of the box – in a production-grade, scalable, and private library.

What Next?

These are early days for large language models. We expect high-speed innovation to continue – with new entrants building on GPT-3, DALL-E, and ChatGPT flooding the commercial & open-source arenas. John Snow Labs will provide you the benefits of these models as soon as they’re reliable and ready for prime time in the healthcare & life science industries. We highly recommend that you start using what’s currently available – and welcome feedback and requests. Prompt engineering, No-code, and Responsible AI are the three major NLP trends we’re focused on in 2023 and you’ll see much more in all three areas in the software we’re building for you this year.

3 Pragmatic Differences Between Academic And Production Software Libraries

David Talby — Wed, 15 Mar 2023 06:29:00 GMT

Image credit: https://ssl.engineering.nyu.edu/blog/2019-09-03-bridging-pt2

Starting a new AI project will often confront you with the paradox of choice: there are too many great libraries and models to start from. One way to avoid analysis paralysis due to feature-by-feature comparisons is to focus on tools that were designed for your kind of project.

At my company, John Snow Labs, we often get compared to Allen NLP, Stanza, SciSpacy, and other libraries focused on academic use cases. Many AI libraries started in academia to help researchers write papers faster, while others were created specifically to help enterprises build production systems.

These are very different communities which result in different design decisions and priorities. Some academic libraries may indeed go on to become mainstream, but there are fundamental differences that should be considered depending on your goals.

Reproducibility Vs. Freshness

Models perform differently on academic datasets versus real-world data. In industry, you need current, state-of-the-art models to succeed, and these models have to be regularly updated.

Take BioBERT, a pre-trained biomedical language representation model for biomedical text mining. This is an adaptation of BERT (Bidirectional Encoder Representations from Transformers), a neural network-based technique for natural language processing (NLP) pre-training, specifically for biomedical use cases. You want BioBERT pretrained on a regular basis on the latest research, not only on general English but also on biomedical language.

BioBERT was trained in early 2019, and as we know, a lot has happened in healthcare and society since then. BioBERT considers “Covid-19” to be an unrecognized, out-of-vocabulary keyword. This isn’t a problem if you’re only using BioBERT to reproduce old papers and results (in fact, having a frozen model is a requirement for such reproducibility), but imagine using such a model in a production system.
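
A toy example can show what "out of vocabulary" means for a frozen model. The sketch below implements greedy longest-match subword tokenization in the style of WordPiece over a tiny, made-up vocabulary. The real BioBERT vocabulary and tokenizer are far larger and differ in detail; `wordpiece` and `vocab` here are illustrative assumptions only.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization (WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces


# Tiny hypothetical vocabulary, frozen before the term "covid" existed.
vocab = {"influenza", "co", "##vid"}

print(wordpiece("influenza", vocab))  # ['influenza']
print(wordpiece("covid", vocab))      # ['co', '##vid']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```

A term the model saw during training gets its own entry, while a newer term is either broken into fragments that carry little of its meaning or mapped to an unknown token. Retraining on current text is what gives new terms a useful representation.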

Medical terminologies and practices keep evolving: If you have a model that identifies drug names in a medical text, you need it updated nearly weekly in order to track new drugs that come to market. The same goes for diseases, procedures, medical devices, biomarkers, antibodies, surgical techniques, and other terms.

Production-Grade Codebase Vs. Fast Prototyping

Production-grade software implies code that has strong test coverage, automated CI/CD infrastructure, regular tests for security vulnerabilities, a release process that ensures that faulty or malicious software can’t be slipped into the codebase, and a focus on optimizing speed, memory consumption, and compatibility with major cloud providers and compute platforms. In contrast, research frameworks are focused on speed of prototyping, which leads to very different software designs and processes.

For example, in October 2021, researchers from Google introduced SCENIC, an open-source JAX library with a focus on Transformer-based models for computer vision research. Its aim is to make large-scale model prototyping faster and thus easier for people to make small changes and write papers.

Historically, research libraries like SCENIC have been very successful at prioritizing rapid prototyping. This enables the creation of product simulations for testing and validation during the product development process.

Here is the "Philosophy" section from the project’s GitHub homepage: “Scenic aims to facilitate rapid prototyping of large-scale vision models. To keep the code simple to understand and extend, we prefer forking and copy-pasting over adding complexity or increasing abstraction. Only when functionality proves to be widely useful across many models and tasks, it may be upstreamed to Scenic's shared libraries.”

SCENIC has been successful precisely because it has made explicit trade-offs to achieve its goal—helping researchers move faster instead of building a reusable and well-abstracted codebase. It’s another example where an academic-focused library is not fit for production systems, not because it’s poorly designed but because it is well designed and managed to achieve a different goal.

Roadmap Prioritization

A third major difference between academic- and industry-focused libraries is what they prioritize. For example, in an academic setting, when you train a new model you’ll want to run it against standard academic benchmarks. Being able to run your model versus the entire SuperGLUE benchmark for natural language understanding in one line of code, and to easily reproduce results from other models on different metrics, is an amazing feature. Having additional helper scripts that organize the output and provide detailed comparisons to other models is also very useful.

In contrast, enterprise customers don’t care about this. They care about reliability, scalability, cost, security, and compliance. What type of data will you have to share to get your AI project off the ground, and what protective measures are in place? Do they meet regulations such as GDPR, CCPA, or industry-specific laws like HIPAA? How will you factor in explainability and avoid bias or concept drift over time? How will monitoring take place? What are the versioning and release processes, and do they integrate with enterprise-wide tools? How would the processes of training, tuning and inference of the model integrate as part of the overall enterprise architecture?

Two Communities, Two Needs

Ultimately, there are major technical gaps between building a model and getting it ready for use in real-world products and services. Closing them is largely a software engineering effort, not a data science effort, and the right skill sets must be involved.

In practice, there are two different communities that need to be served—those in academia and those in industry. For enterprise AI users especially, it would seem that picking the right library before the right tool is the best way to ensure your AI projects have the greatest chance of success.

Applying Responsible NLP in Real-World Projects

David Talby — Mon, 20 Feb 2023 16:46:00 GMT

Responsible AI: Getting from Goals to Daily Practices

How is it possible to develop AI models that are transparent, safe, and equitable? As AI impacts more aspects of our daily lives, concerns about discrimination, privacy, and bias are on the rise. The good news is that there is a growing movement towards Responsible AI with the goal of ensuring that models are designed and deployed in ways that align with ethical principles, which include [NIST 2023]:

Validity and Reliability: Developers should take steps to ensure that models perform as they should under a variety of circumstances.

Security and Resiliency: Models should show robustness to data and context that is different than what they were trained on or different than what is normally expected. Models should not violate system or personal security or enable security violations.

Explainability and Interpretability: Models should be capable of answering stakeholder questions about the decision-making processes of AI systems.

Fairness with Mitigation of Harmful Bias: Models should be designed to avoid bias and ensure equitable treatment of all individuals and groups impacted by the data. This includes ensuring the proper representation of protected groups in the dataset.

Privacy: Data privacy and security should be prioritized in all stages of the AI pipeline. This includes both respecting the rights of people who do not wish to be included in training data, as well as preventing leakage of private data by a model.

Safety: Models should be designed to avoid harm to users and mitigate potential risks. This includes model behavior in unexpected conditions or edge cases.

Accountability: Developers should be accountable for the impact of their models on society and should take steps to address any negative consequences.

Transparency: Developers should be transparent about the data sources, model design, and potential limitations or biases of the models.

However, today there is a gap between these principles and current state-of-the-art NLP models.

According to [Ribeiro 2020], the sentiment analysis services of the top three cloud providers fail 9-16% of the time when replacing neutral words, and 7-20% of the time when changing neutral named entities. These systems also failed 36-42% of the time on temporal tests and almost 100% of the time on some negation tests. Personal information leakage has been shown to be as high as 50-70% in popular word and sentence embeddings, according to [Song & Raghunathan 2020]. In addition, state-of-the-art question-answering models have been shown to exhibit biases around race, gender, physical appearance, disability, and religion [Parrish et al. 2021] – sometimes changing the likely answer more than 80% of the time. Finally, [van Aken et al. 2022] showed that adding any mention of ethnicity to a patient note reduces their predicted risk of mortality – with the most accurate model producing the largest error.
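
The "changing neutral named entities" failures above come from invariance tests: swap one neutral entity for another and check that the prediction does not change. Here is a minimal sketch of such a test. The `toy_sentiment` model is a deliberately flawed stand-in, and the template, city list, and function names are assumptions invented for this example, not part of any cited study.

```python
def toy_sentiment(text):
    """Stand-in model with a deliberate flaw: it reacts to one city name."""
    lowered = text.lower()
    score = 0
    score += sum(w in lowered for w in ("great", "helpful"))
    score -= sum(w in lowered for w in ("rude", "chicago"))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failure_rate(model, template, fillers):
    """Fill one template with different neutral entities and return the share
    of variants whose prediction differs from the first variant's."""
    predictions = [model(template.format(entity=f)) for f in fillers]
    baseline = predictions[0]
    failures = sum(p != baseline for p in predictions[1:])
    return failures / (len(predictions) - 1)


template = "The flight to {entity} was great."
cities = ["Boston", "Denver", "Chicago", "Dallas", "Austin"]
rate = invariance_failure_rate(toy_sentiment, template, cities)
print(f"Failure rate: {rate:.0%}")
```

In this toy setup one of the four substitutions flips the prediction, so the failure rate is 25%. A real test suite would run many templates and entity lists and compare the rate against a threshold chosen for the application.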

These findings suggest that current NLP systems are unreliable and flawed. We would not accept a calculator that works correctly only some of the time, or a microwave that randomly alters its strength based on the type of food or time of day. A well-engineered production system should work reliably on standard inputs and be safe & robust when handling uncommon ones. Three fundamental software engineering principles can help us get there.

Software Engineering Fundamentals

Testing software is crucial to ensure it works as intended. The reason why NLP models often fail is straightforward: they are not tested enough. While recent research papers have shed light on this issue, testing should be standard practice before deploying any software to production. Furthermore, testing should be carried out every time the software is changed, as NLP models can also regress over time [Xie et al. 2021].

Most academics make their models publicly available and easily reusable, which makes research faster and enables benchmarks like SuperGLUE, LM-Harness, and BIG-bench. Even so, reusing academic models as production-ready ones is not recommended, because tools that are designed to reproduce research results may not be suitable for production use. Reproducibility requires that models remain the same rather than being continuously updated and improved. For example, BioBERT, a commonly used biomedical embedding model, was published in early 2019 and therefore does not include COVID-19 as a vocabulary word. This illustrates how relying solely on academic models may hinder the effectiveness of NLP systems in production environments.

It is important to test beyond accuracy in your NLP system. This is because the business requirements for the system include robustness, reliability, fairness, toxicity, efficiency, lack of bias, lack of data leakage, and safety. Therefore, your test suites should reflect these requirements. A comprehensive review of definitions and metrics for these terms in different contexts is provided in the Holistic Evaluation of Language Models [Liang et al. 2022], which is well worth reading. However, you will need to write your own tests to determine what inclusiveness means for your specific application.

Your tests should be specific, isolated, and easy to maintain, as well as versioned and executable so that they can be incorporated into an automated build or MLOps workflow. To simplify this process, you can use the nlptest library, which is a straightforward framework.

Design Principles of the NLP Test Library

Designed around five principles, the nlptest library is intended to make it easier for data scientists to deliver reliable, safe, and effective language models.

Open Source. It is an open-source community project under the Apache 2.0 license, free to use forever for commercial and non-commercial purposes with no caveats. It has an active development team that welcomes contributions and code forks.

Lightweight. The library is lightweight and can run offline (i.e., in a VPN or a high-compliance enterprise environment) on a laptop, eliminating the need for a high-memory server, cluster, or GPU. Installation is as simple as running pip install nlptest, and generating and running tests can be done in just three lines of code.

You import the library, create a new test harness for the specified Named Entity Recognition (NER) model from John Snow Labs’ NLP models hub, and run it. The library then automatically generates test cases (based on the default configuration), runs them, and produces a report, simplifying the process for data scientists.

Storing tests in a pandas data frame makes it simple to edit, filter, import, or export them. The entire test harness can be saved and loaded, allowing you to run a regression test of a previously configured test suite simply by calling h.load("filename").run().

Cross Library. The framework provides out-of-the-box support for transformers, Spark NLP, and spaCy, and can be easily extended to support additional libraries. It allows testing of both pre-trained and custom NLP pipelines from any of these libraries. As an AI community, there is no need for us to build the test generation and execution engines multiple times.

Extensible. Since there are hundreds of potential types of tests and metrics to support, additional NLP tasks of interest, and custom needs for many projects, much thought has been put into making it easy to implement and reuse new types of tests.

For instance, the framework includes a built-in test type for bias in US English, which replaces first and last names with names that are common for White, Black, Asian, or Hispanic people. But what if your application is intended for India or Brazil, or if the testing needs to consider bias based on age or disability, or if a different metric is needed for when a test should pass?

The nlptest library makes it easy to write and then mix and match test types. The TestFactory class defines a standard API for different tests to be configured, generated, and executed. We’ve put in a lot of effort to ensure that the library can be easily tailored to meet your needs and that you can contribute or customize it with ease.

Test Models and Data. A common reason a model is not ready for production lies in the dataset used for training or evaluation, rather than in the modeling architecture. A widely prevalent issue in commonly used datasets, as demonstrated by [Northcutt et al. 2021], is the mislabeling of training examples. Additionally, representation bias presents a challenge for assessing a model’s performance across ethnic lines, as there may not be enough test labels to calculate a usable metric. In such cases, it is appropriate for the library to fail a test and suggest changes to the training and test sets to better represent other groups, fix likely mistakes, or train for edge cases.

Therefore, a test scenario is defined by three things: a task, a model, and a dataset.
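One way to picture such a scenario in code is the small sketch below, which bundles the three parts into a single object. The Scenario class, its field names, and the keyword-based stand-in model are assumptions made for illustration; they are not the library’s own interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container: a test scenario is a task, a model, and a dataset."""
    task: str
    model: Callable[[str], str]      # maps input text to a predicted label
    dataset: List[Tuple[str, str]]   # (text, expected label) pairs

    def accuracy(self) -> float:
        """Fraction of dataset examples the model labels correctly."""
        if not self.dataset:
            raise ValueError("dataset is empty")
        correct = sum(self.model(text) == label for text, label in self.dataset)
        return correct / len(self.dataset)


def keyword_model(text: str) -> str:
    """Stand-in classifier used only for this example."""
    return "positive" if "good" in text.lower() else "negative"


scenario = Scenario(
    task="text-classification",
    model=keyword_model,
    dataset=[
        ("Good value", "positive"),
        ("Arrived broken", "negative"),
        ("Not good at all", "negative"),  # the keyword model gets this one wrong
    ],
)
print(scenario.accuracy())
```

Keeping the dataset inside the scenario is what lets a test point at the data, and not only the model, when something fails.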

This setup not only allows the library to offer a complete testing strategy for both models and data but also enables you to use generated tests to augment your training and test datasets, which can considerably reduce the time required to fix models and prepare them for production.
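The augmentation idea can be sketched as follows: perturbed examples that the model gets wrong are appended to the training set with their original labels. The helper name augment_with_failures, the uppercase perturbation, and the case-sensitive toy model are all assumptions for illustration, and the sketch assumes the perturbation does not change the correct label.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_with_failures(
    train: List[Example],
    model: Callable[[str], str],
    perturb: Callable[[str], str],
) -> List[Example]:
    """Return the training set plus perturbed variants the model mislabels.

    Assumes `perturb` is label-preserving, so each variant keeps the
    label of the example it was derived from.
    """
    augmented = list(train)
    seen = {text for text, _ in train}
    for text, label in train:
        variant = perturb(text)
        # Only add variants that are new and that expose a model failure.
        if variant not in seen and model(variant) != label:
            augmented.append((variant, label))
            seen.add(variant)
    return augmented


def case_sensitive_model(text: str) -> str:
    """Stand-in model that breaks on uppercase input."""
    return "positive" if "great" in text else "negative"


train = [("great phone", "positive"), ("bad battery", "negative")]
augmented = augment_with_failures(train, case_sensitive_model, str.upper)
```

Only the failing variant is added here, which keeps the augmented set focused on cases the model has not yet learned to handle.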

The next sections describe how the nlptest library helps you automate three tasks: generating tests, running tests, and augmenting data.

Getting Started

Ready to improve the safety, reliability, and accuracy of your NLP models? Get started with John Snow Labs’ nlptest library by visiting nlptest.org and installing it with pip install nlptest. With its support for multiple NLP libraries, its extensible framework for creating custom tests, and its ability to generate and run tests on both models and datasets, you can quickly identify issues and improve the accuracy of your models.

Join our open-source community project on GitHub, share examples and documentation, and contribute to the development of the library.

Free No-Code NLP: The Annotation Lab Becomes the NLP Lab

David Talby — Tue, 10 Jan 2023 06:41:00 GMT

The Annotation Lab was adopted by more than 100 organizations worldwide last year. The AI community took to the idea of a 100% free data labeling platform that offers all “enterprise” features, including:

AI-Assisted Labeling
Team & Project Management
Enterprise-Grade Security
Custom Workflows
Analytics
Scalability
Privacy
Versioned audit trails
All with no limits and support for both cloud and on-premise deployment.

After two years of releasing new versions every two weeks, the lab has now evolved to become the NLP Lab: a free no-code NLP platform. Doctors, pharmacists, lawyers, and financial analysts use it to train, tune, test, and publish NLP models, often without getting data scientists involved. The platform now supports the full lifecycle of creating a new project & goals, starting from a pre-trained model, teaching it domain expertise (by combining labels, rules, and models), training or tuning a model, testing it, and publishing it as an API.

The lab includes a Private Models Hub, letting you search, filter, manage, and safely share custom models you’ve built. The hub’s new Playground lets you quickly test any model or rule on a snippet of text, without the need to create a project and import tasks.

The next major feature is prompt engineering - democratizing zero-shot learning by putting it in the hands of business domain experts. To learn more about this exciting new capability, watch the webinar on “Combining Prompt Engineering, Programmatic Labeling, and Model Tuning in the No-Code NLP Lab.”

The lab’s name change does not change our commitment to you: to keep it free, to keep improving it with frequent releases, and to keep making it the best platform for teams who build and deploy state-of-the-art NLP models. Ready to give the NLP Lab a go? Install it with a few clicks:

AWS Marketplace
Azure Marketplace
One-Liner Kubernetes Script
