Why cancer registries stay years out of date - and what regulatory-grade oncology AI changes

Jun 18, 2026

Originally published in Forbes, July 2025.

Most of what matters in a cancer patient’s record is free text. Stage, histology, biomarker status, treatment response, progression: none of it lives cleanly in a discrete EHR field. It lives in pathology reports, radiology narratives, oncologist progress notes, and multidisciplinary tumor board summaries. Manually abstracting that information for a cancer registry takes a certified tumor registrar about two hours per case, and only 14% of US registries consistently meet the National Program of Cancer Registries’ target of reporting 90% of cases within 12 months of diagnosis. Regulatory-grade oncology AI changes that arithmetic, and with it the timelines for research, trial matching, and quality reporting that cancer care depends on.

Why oncology data is uniquely hard to structure

Oncology is a free-text discipline. Cancer staging follows AJCC rules that depend on tumor size, nodal involvement, metastatic spread, and (for a growing number of solid tumors) molecular or genomic features that are described, not coded. A prostate cancer report references Gleason patterns. A breast cancer report specifies ER, PR, and HER2 status in language that varies by pathologist. A lung cancer case depends on EGFR, ALK, ROS1, KRAS, and increasingly a dozen more biomarkers, each with its own testing methodology and reporting convention.

Claims data captures almost none of this. Discrete EHR fields capture some of it, inconsistently. The rest, which is typically the clinically decisive information, sits in narrative reports that a human has to read.

That is why, as of the latest assessment, central cancer registries in the US take an average of two hours per case to abstract, with complex patients consuming several days of registrar time. A single full-time registrar can process 6 to 10 cases per day. A thousand-patient cohort costs roughly a full year of certified registrar labor. Oncology informatics research out of the Cancer Institute of New Jersey has documented the structural reasons: EHR infrastructure varies widely across treating facilities, patient-reported physician lists disagree with registry records in 42% of cases, and the average lung cancer patient generates about 300 pages of records that have to be reviewed line by line.
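The cohort figure follows from simple arithmetic. A minimal sketch, using the two-hours-per-case average from the text and assuming a full-time registrar works about 2,000 hours per year (that annual figure is an assumption for illustration, not a number from the assessments cited here):

```python
# Back-of-envelope check of the abstraction workload quoted above.
HOURS_PER_CASE = 2.0         # average abstraction time cited in the text
COHORT_SIZE = 1_000          # the thousand-patient cohort from the text
FTE_HOURS_PER_YEAR = 2_000   # ASSUMPTION: ~50 working weeks x 40 hours


def registrar_years(cases: int, hours_per_case: float,
                    fte_hours_per_year: float = FTE_HOURS_PER_YEAR) -> float:
    """Return the full-time registrar-years needed to abstract `cases`."""
    return cases * hours_per_case / fte_hours_per_year


manual_years = registrar_years(COHORT_SIZE, HOURS_PER_CASE)
print(f"{manual_years:.2f} registrar-years for {COHORT_SIZE} cases")
# prints "1.00 registrar-years for 1000 cases"
```

Changing `HOURS_PER_CASE` or `FTE_HOURS_PER_YEAR` shows how sensitive the estimate is to local staffing assumptions.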

The consequence is that cancer surveillance operates on a two-to-four-year lag. Clinical trial matching misses eligible patients because their biomarker status is not yet coded. Outcomes research depends on cohorts that are systematically incomplete. Quality measures under CMS’s Enhancing Oncology Model (the successor to the Oncology Care Model) and the MIPS Promoting Interoperability category depend on structured data that often does not exist until long after the care episode is closed.

What regulatory-grade accuracy means in oncology

General-purpose LLMs can read a pathology report and produce a summary that looks right. Regulatory-grade oncology extraction is a higher bar. It means:

Every extracted entity (diagnosis, stage, biomarker, medication, procedure) is mapped to a controlled terminology such as SNOMED CT, ICD-O-3, RxNorm, or LOINC, with documented confidence and provenance to the source sentence.

Negation and temporality are handled correctly: “no evidence of metastatic disease” does not become a metastasis flag, and a history of prior tamoxifen therapy is not confused with current treatment.

Results are reproducible: the same input produces the same output, which regulators and auditors require.

The pipeline runs in the customer’s environment so PHI never leaves their control.
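One way to picture those requirements is as a record schema. The sketch below is illustrative only: the `ExtractedEntity` class, its field names, and the placeholder codes are assumptions made for this example, not the output format of any particular product, and the codes are deliberately not real terminology codes.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ExtractedEntity:
    """One extracted fact plus the metadata needed to audit it (hypothetical schema)."""
    text: str                  # span as written in the note
    category: str              # e.g. "diagnosis", "medication"
    terminology: str           # e.g. "SNOMED CT", "RxNorm"
    code: str                  # placeholder values below, not real codes
    assertion: Literal["present", "absent", "possible"]
    temporality: Literal["current", "historical"]
    confidence: float          # model confidence in [0, 1]
    source_sentence: str       # provenance: the sentence the fact came from


def is_active_finding(entity: ExtractedEntity) -> bool:
    """Only asserted, current facts should raise a flag downstream."""
    return entity.assertion == "present" and entity.temporality == "current"


metastasis = ExtractedEntity(
    text="metastatic disease", category="diagnosis",
    terminology="SNOMED CT", code="PLACEHOLDER-DX",
    assertion="absent", temporality="current", confidence=0.97,
    source_sentence="No evidence of metastatic disease.",
)
tamoxifen = ExtractedEntity(
    text="tamoxifen", category="medication",
    terminology="RxNorm", code="PLACEHOLDER-RX",
    assertion="present", temporality="historical", confidence=0.94,
    source_sentence="History of prior tamoxifen therapy.",
)

for entity in (metastasis, tamoxifen):
    print(entity.text, "->", "flag" if is_active_finding(entity) else "no flag")
```

Neither example raises a flag: the first is negated and the second is historical. Because the dataclass is frozen, a record cannot be altered silently after extraction, which is one simple way to support auditability.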

Peer-reviewed work on cancer-specific information extraction has shown the gap between healthcare-specific models and frontier LLMs on exactly these tasks. On the CACER (Clinical Concept Annotations for Cancer Events and Relations) benchmark, GPT-4 scored below 0.50 F1 on cancer entity extraction, while a medical language model tuned for oncology reached materially higher accuracy. On structured extraction from diagnostic reports, a medical language model reached 0.80 F1 on relation extraction, while GPT-4 stayed below 0.60. Assertion classification, which handles negation and uncertainty and matters more in oncology than in almost any other clinical domain, exceeded 90% accuracy with healthcare-specific assertion models; general LLMs produced inconsistent output under prompt variation.

These differences are not cosmetic. At scale, a 15-point F1 gap means a different cohort in every study, a different denominator in every quality measure, and a different set of eligible patients for every trial.
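To make the cohort effect concrete, the sketch below computes F1 from raw counts for two hypothetical extractors run over the same 1,000 patients who truly carry a biomarker of interest. All counts are invented for illustration and are not drawn from the benchmarks above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# HYPOTHETICAL counts: 1,000 patients truly carry the biomarker of interest.
extractors = {
    "extractor_a": {"tp": 900, "fp": 100, "fn": 100},
    "extractor_b": {"tp": 750, "fp": 250, "fn": 250},
}

for name, counts in extractors.items():
    cohort_size = counts["tp"] + counts["fp"]   # patients placed in the cohort
    print(f"{name}: F1={f1_score(**counts):.2f}, cohort={cohort_size}, "
          f"wrongly included={counts['fp']}, missed={counts['fn']}")
```

In this made-up case both cohorts contain 1,000 patients, yet the 15-point F1 gap means 150 more true positives are missed and 150 more patients are included in error, so the two cohorts overlap far less than their equal size suggests.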

What changes when extraction runs in minutes instead of hours

Three downstream consequences follow when regulatory-grade oncology extraction is available at scale.

Cancer registry reporting converges toward real time. Our Medical LLMs cut per-case abstraction from two hours to one to two minutes, a 60–120x reduction in abstraction time, while preserving human-in-the-loop review for complex cases. That changes the operating model of a registry from “build the backlog, then catch up” to “review by exception.” A recent medRxiv preprint out of China Medical University, using a locally deployed 20B-parameter open-weight model on a single professional-grade GPU, reached similar conclusions: autonomous multi-stage extraction of pathology reports is now feasible inside a hospital firewall, without depending on an external API.
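A "review by exception" workflow can be sketched as a confidence-based triage step. The threshold, field names, and confidence values below are assumptions chosen for illustration; a real registry would set its own rules and validate them against audited cases.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.90   # ASSUMPTION: a site would tune this against audit results


def low_confidence_fields(field_confidence: Dict[str, float],
                          threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Return the fields whose confidence falls below the threshold."""
    return [name for name, conf in field_confidence.items() if conf < threshold]


def triage(cases: Dict[str, Dict[str, float]]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split cases into routine IDs and an exception queue for registrars."""
    routine: List[str] = []
    exceptions: Dict[str, List[str]] = {}
    for case_id, confidences in cases.items():
        flagged = low_confidence_fields(confidences)
        if flagged:
            exceptions[case_id] = flagged   # registrar reviews these fields
        else:
            routine.append(case_id)
    return routine, exceptions


# HYPOTHETICAL per-field confidences for three cases.
cases = {
    "case-001": {"histology": 0.99, "stage": 0.97, "her2_status": 0.98},
    "case-002": {"histology": 0.98, "stage": 0.71, "her2_status": 0.96},
    "case-003": {"histology": 0.95, "stage": 0.93, "her2_status": 0.62},
}

routine, exceptions = triage(cases)
print("routine:", routine)
print("registrar queue:", exceptions)
```

The exception queue names the specific fields that need attention, so a registrar can go straight to the ambiguous stage or biomarker rather than re-abstracting the whole case.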

Clinical trial matching catches patients earlier in their journey. A trial protocol requires EGFR mutation status, ECOG performance status, prior lines of therapy, and measurable disease per RECIST 1.1. When those variables are extracted continuously from incoming pathology and oncology notes, rather than abstracted months later for registry purposes, matching happens in the window where enrollment is still possible.
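As a rough illustration of continuous pre-screening, the sketch below checks a patient snapshot against the four variables named above. The cut-offs (ECOG of 0 or 1, at most two prior lines) and the `PatientSnapshot` structure are hypothetical and stand in for whatever a real protocol specifies.

```python
from typing import List, Optional, TypedDict


class PatientSnapshot(TypedDict):
    egfr_mutated: Optional[bool]         # None means "not yet extracted"
    ecog: Optional[int]
    prior_lines: Optional[int]
    measurable_disease: Optional[bool]   # per RECIST 1.1, as documented in notes


def screen(patient: PatientSnapshot) -> str:
    """Return 'candidate', 'ineligible', or 'pending data' (HYPOTHETICAL criteria)."""
    checks: List[Optional[bool]] = [
        patient["egfr_mutated"],
        None if patient["ecog"] is None else patient["ecog"] <= 1,
        None if patient["prior_lines"] is None else patient["prior_lines"] <= 2,
        patient["measurable_disease"],
    ]
    if any(check is False for check in checks):
        return "ineligible"          # one failed criterion is enough
    if any(check is None for check in checks):
        return "pending data"        # re-screen when the next note arrives
    return "candidate"


ready: PatientSnapshot = {"egfr_mutated": True, "ecog": 1,
                          "prior_lines": 1, "measurable_disease": True}
waiting: PatientSnapshot = {"egfr_mutated": None, "ecog": 0,
                            "prior_lines": 0, "measurable_disease": True}
excluded: PatientSnapshot = {"egfr_mutated": True, "ecog": 3,
                             "prior_lines": None, "measurable_disease": True}

for label, patient in (("ready", ready), ("waiting", waiting), ("excluded", excluded)):
    print(label, "->", screen(patient))
```

Keeping "pending data" distinct from "ineligible" is the design point: a patient whose biomarker result has not yet been extracted stays in the queue and is re-screened as new notes arrive, instead of being dropped.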

Quality and outcomes reporting become operationally feasible. Measures like 30-day readmissions after cancer surgery, time from diagnosis to treatment initiation, and adherence to NCCN guideline recommendations require structured clinical data that most health systems cannot produce reliably from claims or discrete EHR fields. Regulatory-grade extraction closes that gap.
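Once the relevant dates exist as structured fields, a measure such as time from diagnosis to treatment initiation reduces to date arithmetic. A minimal sketch with invented episodes (the record layout and dates are assumptions for illustration):

```python
from datetime import date
from statistics import median
from typing import Optional

# HYPOTHETICAL extracted dates; None means treatment start not yet documented.
episodes = [
    {"patient": "A", "diagnosed": date(2025, 1, 6), "treatment_started": date(2025, 1, 27)},
    {"patient": "B", "diagnosed": date(2025, 2, 3), "treatment_started": date(2025, 3, 17)},
    {"patient": "C", "diagnosed": date(2025, 2, 10), "treatment_started": None},
]


def days_to_treatment(episode: dict) -> Optional[int]:
    """Days from diagnosis to first treatment, or None if treatment is undocumented."""
    if episode["treatment_started"] is None:
        return None
    return (episode["treatment_started"] - episode["diagnosed"]).days


intervals = [d for d in map(days_to_treatment, episodes) if d is not None]
print("intervals (days):", intervals)
print("median days to treatment:", median(intervals))
print("episodes still missing a start date:", len(episodes) - len(intervals))
```

Reporting the count of episodes without a documented start date alongside the median keeps the measure's denominator visible.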

What this does not do

Oncology AI that extracts structured data from notes does not make clinical decisions. It does not recommend treatment, diagnose cancer, or replace the judgment of a multidisciplinary tumor board. The pipeline produces structured, auditable data; clinicians and certified tumor registrars continue to interpret and act on that data.

That distinction matters for two reasons. First, it defines the regulatory path. Extracting a structured representation of information that already exists in the clinical record is a different regulatory question from generating novel clinical recommendations. The FDA’s 2025 Predetermined Change Control Plan guidance and the evolving framework around decision support interventions apply differently to each. Second, it defines where the accuracy bar sits. An extraction pipeline that is 98% accurate on biomarker status is useful immediately; a decision-support tool at the same accuracy level is not, because the 2% tail sits on clinical outcomes rather than on a data field a human will review.

What to ask when evaluating oncology AI

For health systems, pharma RWE teams, and cancer centers looking at vendors in this space, four questions separate regulatory-grade offerings from demos:

What is the peer-reviewed accuracy on your use case, against a published benchmark and ground-truth data? Benchmark results on MedQA or USMLE-style questions do not predict performance on pathology report extraction.

Where does the data live during processing? If the answer is “our API,” that is a different compliance, cost, and data-sovereignty posture than “in your environment, behind your firewall.”

What terminology does the output map to, and how is provenance tracked? A number without a code and a source sentence is not registry-grade data.

How is human-in-the-loop review supported? Complex oncology cases (rare tumors, ambiguous staging, contradictory reports) require registrar judgment. The tool either supports that workflow or forces a shadow system around it.

The shift underway

Oncology was the first clinical domain where the gap between what the record contains and what the structured data captures became operationally unacceptable. It is now the first domain where that gap is closing at scale, driven by healthcare-specific language models that run inside the customer’s environment and hit the accuracy, terminology, and provenance bar that registries, trials, and quality programs require.

The practical consequence for cancer centers and cancer research is that the two-to-four-year surveillance lag is no longer inevitable. For pharma RWE, it is that oncology cohorts can be built from real clinical narrative rather than from claims proxies. For patients, it is that trial opportunities show up while they are still options, not months after another line of treatment has started.

Regulatory-grade accuracy is what makes all of that possible.

Frequently asked questions

Why can’t general-purpose LLMs handle oncology extraction out of the box?

They can read a pathology report and produce a reasonable summary. They struggle on the specific tasks oncology data pipelines require: mapping to controlled terminologies like SNOMED CT and ICD-O-3, handling nested negation, distinguishing current from historical treatment, and producing reproducible output under prompt variation. Peer-reviewed benchmarks on cancer-specific information extraction show healthcare-specific models meaningfully outperforming GPT-4 on relation extraction, assertion classification, and entity resolution.

What is a realistic accuracy target for a registry-grade extraction pipeline?

It depends on the entity. For straightforward diagnoses and medications, an F1 above 0.95 is routine with healthcare-specific models. For staging, biomarker status, and response assessment, the bar is 0.90 or higher, with human review of ambiguous cases. Published benchmarks and reproducible notebooks are the right way to evaluate vendors; demo videos are not.

Does this replace certified tumor registrars?

No. It changes what they spend their time on. Registrars move from line-by-line abstraction of routine cases to review of complex cases, validation of AI output, and the judgment calls on rare tumors and ambiguous staging that automation cannot handle.

Can the same pipeline run on pathology, radiology, and oncology notes?

Yes, with the right architecture. Healthcare-specific pipelines combine document classifiers that route input to the right extraction engine, cancer-specific NER models tuned for each report type, and a unified output representation (typically the OMOP Common Data Model with its Oncology extension, or a comparable common data model) that supports downstream research and reporting.
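The routing idea can be sketched in a few lines. Here a keyword rule stands in for a trained document classifier, and the extractor functions are empty stubs; every name, keyword, and the output shape is a hypothetical placeholder rather than a description of a specific product.

```python
from typing import Callable, Dict


def classify(text: str) -> str:
    """Keyword stand-in for a trained document classifier (illustrative only)."""
    lowered = text.lower()
    if "specimen" in lowered or "microscopic" in lowered:
        return "pathology"
    if "impression" in lowered:
        return "radiology"
    return "oncology_note"


def make_stub_extractor(report_type: str) -> Callable[[str], dict]:
    """Build a stub extractor; a real one would run a report-specific NER model."""
    def extract(text: str) -> dict:
        # Every extractor returns the same shape so downstream code is uniform.
        return {"report_type": report_type, "entities": [], "source_length": len(text)}
    return extract


EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    name: make_stub_extractor(name)
    for name in ("pathology", "radiology", "oncology_note")
}


def process(text: str) -> dict:
    """Route a document to the extractor registered for its type."""
    return EXTRACTORS[classify(text)](text)


documents = [
    "Specimen: left breast core biopsy.",
    "IMPRESSION: No new pulmonary nodules.",
    "Patient tolerating cycle 3 well.",
]
for doc in documents:
    print(process(doc)["report_type"])
```

Because every extractor returns the same shape, a new report type can be added by registering one more entry in `EXTRACTORS` without touching downstream consumers.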

How does this intersect with FDA guidance on AI in clinical care?

Extracting structured data from an existing clinical record is a different regulatory question than generating clinical recommendations. The FDA’s Predetermined Change Control Plan guidance and the broader framework around decision-support interventions apply, but the accuracy and validation requirements for data extraction are primarily about auditability and reproducibility, not about the model making a clinical decision. Oncology AI that supports registries, trial matching, and RWE is data infrastructure.

What about privacy and data sovereignty?

Any oncology AI pipeline processing identifiable patient records should run inside the customer’s environment (on-premises or in a private cloud tenant), with no PHI leaving the firewall. API-based approaches that send clinical notes to an external LLM vendor are difficult to reconcile with HIPAA, GDPR, and the data-use agreements that cancer centers and pharma RWE teams operate under.

What is the biggest operational change when this is deployed at scale?

The registry workflow shifts from “build the backlog” to “review by exception.” Timeliness improves; cases that used to take weeks to abstract are available within days or hours of documentation, and registrar time moves to the cases where human judgment actually changes the output.