Unlocking biomarker intelligence from clinical text: what we learned building BIOPSY
Sanya A. Chetwani, Technical Lead, Data Science, Kognitic
Earlier this year, Jaseem Mahmmdla and I published a paper at EMNLP 2025, one of the top conferences in natural language processing. The paper introduced BIOPSY, a pipeline we built at Kognitic for extracting structured biomarker data from clinical text. It was the first end-to-end system to handle entity recognition, mutation linking, stratification, and expression levels in a single framework.
To put our work in context, I want to explain why we built the pipeline, what it does, and what the results tell us about the future of biomarker intelligence in pharma.
The problem we kept running into
Biomarkers are at the center of modern oncology. They determine patient eligibility for clinical trials. They define treatment selection. They shape competitive positioning across therapeutic areas. There are now over 150 biomarkers recognized by the FDA and more than 10 major scoring techniques used by clinical specialists worldwide.
The challenge is not that this information is missing. It is buried in clinical text. Trial protocols, conference abstracts, registry records, and published literature all contain biomarker data. But it is written in natural language, not structured tables. And clinical language is far more complex than it appears.
For example, consider a sentence from a trial eligibility section: “Patients should not have any EGFR sensitizing mutation to qualify for enrollment.” While a domain expert recognizes that this implies EGFR-positive patients are excluded, computationally parsing it requires understanding negation, interpreting that “should not have” modifies biomarker status, and recognizing that “to qualify” establishes an exclusion criterion.
Nested negation is harder. The sentence “Patients will be excluded if no EGFR-sensitizing mutation is found” contains two layers of negation, so inclusion actually requires EGFR positivity. General-purpose language models often underperform on sentences like this, not for lack of capability but because they have had too little training on these domain-specific patterns.
This challenge was exactly what we set out to solve.
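To make the two layers concrete, here is a toy sketch of the logic a reader applies to the two example sentences. It is only an illustration of the reasoning, not how BIOPSY resolves negation; the function name and its two parameters are hypothetical, and it assumes the sentence has already been reduced to an outcome ("qualify" or "excluded") and a condition (mutation found or not found).

```python
def required_status(outcome: str, mutation_found: bool) -> str:
    """Return the biomarker status a patient needs in order to enroll.

    outcome: what the sentence says happens when the condition holds,
             either "qualify" or "excluded" (hypothetical labels).
    mutation_found: True if the condition is "mutation is found",
                    False if the condition is "no mutation is found".
    """
    if outcome not in ("qualify", "excluded"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    # "qualify" keeps patients who meet the condition;
    # "excluded" keeps patients who do NOT meet it (the second negation layer).
    enrolls_when_found = mutation_found if outcome == "qualify" else not mutation_found
    return "positive" if enrolls_when_found else "negative"


# "Patients should not have any EGFR sensitizing mutation to qualify for enrollment."
single_negation = required_status("qualify", mutation_found=False)

# "Patients will be excluded if no EGFR-sensitizing mutation is found."
nested_negation = required_status("excluded", mutation_found=False)

print(single_negation, nested_negation)
```

The hard part in practice is the step this sketch skips: getting from free text to the outcome and condition in the first place.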
What BIOPSY does
BIOPSY is a modular pipeline with four stages. Each one addresses a specific task that existing tools either handled in isolation or did not handle at all.
The first stage is entity extraction. The system identifies biomarkers, mutations, and drug targets in clinical text. We evaluated multiple NER models on our proprietary dataset and found that GLiNER, a generalist NER model that we fine-tuned on our biomarker-specific data, achieved the highest F1 score, 0.88. It reliably distinguished between biomarkers and drug targets, which is a critical distinction. In one sentence, HER2 might be a biomarker of disease severity. In the next sentence, it might be a drug target for therapeutic intervention. Same term, different meaning. Context is everything.
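For readers less familiar with how an entity-level F1 behaves, the sketch below shows why the biomarker versus drug target distinction matters for the score. It assumes exact-match scoring on (start, end, label) spans, which is a common convention but may not be the exact criterion used in the paper; the offsets and label names are made up for illustration.

```python
def span_f1(gold: set, predicted: set) -> float:
    """F1 over labeled spans; a span counts only if offsets AND label match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: HER2 appears twice with different roles.
gold = {(0, 4, "biomarker"), (30, 34, "drug_target")}
# The model finds both mentions but labels the second one as a biomarker.
predicted = {(0, 4, "biomarker"), (30, 34, "biomarker")}

score = span_f1(gold, predicted)
print(score)  # 0.5
```

Under this convention, finding the right text with the wrong role is penalized twice: once as a missed gold span and once as a spurious prediction.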
The second stage is relation extraction. Once the system identifies biomarkers and mutations separately, it needs to link them. Consider: “The patient must test positive for EGFR Exon 19, ALK, BRAF V600, and HER2 mutations.” That single sentence contains four biomarker-mutation pairs that need to be correctly associated. We built an ensemble model that combines BioBERT, BlueBERT, and PubMedBERT using an attention-based stacking mechanism. It achieved an F1 of 0.87.
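The general idea behind attention-based stacking can be sketched in a few lines: each base model gives a probability that a candidate biomarker-mutation pair is linked, and a softmax over attention scores decides how much each model counts. This is a simplified illustration, not the BIOPSY implementation. In a trained ensemble the attention scores would come from learned parameters; here they are fixed, hypothetical numbers, as are the three probabilities.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stacked_probability(model_probs, attention_scores):
    """Attention-weighted average of the base models' probabilities."""
    weights = softmax(attention_scores)
    return sum(w * p for w, p in zip(weights, model_probs))


# Hypothetical link probabilities from three base models for one candidate pair.
model_probs = [0.90, 0.60, 0.80]

# Equal attention scores reduce to a plain average.
uniform = stacked_probability(model_probs, [0.0, 0.0, 0.0])

# A much larger score for the first model makes its vote dominate.
first_dominates = stacked_probability(model_probs, [10.0, 0.0, 0.0])

print(uniform, first_dominates)
```

The appeal of this design over simple voting is that the weighting can, in principle, favor whichever base model is most reliable for a given kind of input.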
The third stage is stratification. This is where the pipeline classifies each biomarker as positive, negative, or assessment (meaning the patient needs to be tested). The nested negation problem I described earlier lives here. We found that Llama 3.1 70B, fine-tuned on our dataset, best handled these complexities, achieving an F1 score of 0.85.
The fourth stage determines biomarker expression levels. Clinical texts often report qualification scores rather than explicit positive or negative labels. “PD-L1 Tumor Proportion Score should be greater than 50%” implies PD-L1 positive status, but you have to understand the scoring system to make that inference. We built a syntax-guided extraction approach using constituency parsing and domain-specific normalization rules, developed in collaboration with our clinical science team.
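As a rough illustration of the inference involved, the sketch below pulls a comparator and a percentage threshold out of a sentence and maps it to a status. It deliberately swaps in a plain regular expression for the constituency parsing described above, and the comparator table and the "lower bound implies positive" rule are illustrative assumptions, not the clinically reviewed normalization rules used in the pipeline.

```python
import re

# Longer phrases come first so "greater than or equal to" wins over "greater than".
COMPARATORS = {
    "greater than or equal to": ">=",
    "at least": ">=",
    "greater than": ">",
    "more than": ">",
    "less than": "<",
}

THRESHOLD_PATTERN = re.compile(
    r"(?P<comparator>" + "|".join(re.escape(c) for c in COMPARATORS) + r")"
    r"\s+(?P<value>\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def parse_threshold(text):
    """Return (comparator symbol, percentage) or None if no threshold is found."""
    match = THRESHOLD_PATTERN.search(text)
    if match is None:
        return None
    symbol = COMPARATORS[match.group("comparator").lower()]
    return symbol, float(match.group("value"))


def implied_status(symbol):
    """Illustrative rule only: a lower bound on the score implies positive status."""
    return "positive" if symbol in (">", ">=") else None


sentence = "PD-L1 Tumor Proportion Score should be greater than 50%"
threshold = parse_threshold(sentence)
print(threshold)  # ('>', 50.0)
print(implied_status(threshold[0]))  # positive
```

A regex like this breaks quickly on real protocol text (ranges, multiple scores in one sentence, units other than percent), which is part of why a syntax-aware approach is attractive.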
The final output is a structured tuple for each biomarker: the biomarker name, the mutation (if present), the stratification class, and the expression score. Traceable, structured, and ready for downstream use.
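One way to picture that tuple is as a small record type. The sketch below is a hypothetical schema for illustration; the class name, field names, and example values are assumptions and do not reflect the actual output format of the pipeline or the platform.

```python
from dataclasses import dataclass
from typing import Optional

STRATIFICATION_CLASSES = ("positive", "negative", "assessment")


@dataclass(frozen=True)
class BiomarkerRecord:
    biomarker: str                 # e.g. "EGFR"
    mutation: Optional[str]        # None when no mutation is mentioned
    stratification: str            # one of STRATIFICATION_CLASSES
    expression: Optional[str]      # e.g. "TPS > 50%", None when not reported

    def __post_init__(self):
        # Reject anything outside the three classes described in the text.
        if self.stratification not in STRATIFICATION_CLASSES:
            raise ValueError(f"unknown stratification: {self.stratification!r}")


# Hypothetical records for two of the sentences quoted earlier in the post.
records = [
    BiomarkerRecord("EGFR", "Exon 19", "positive", None),
    BiomarkerRecord("PD-L1", None, "positive", "TPS > 50%"),
]

for record in records:
    print(record)
```

Keeping the record immutable and validating the class on construction makes it harder for a downstream step to introduce a status the rest of the system does not understand.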
What the results showed
We evaluated BIOPSY on 5,000 hand-labeled oncology abstracts sourced from ClinicalTrials.gov and PubMed. The pipeline achieved an overall F1 of 0.86.
We then tested it on 2,000 neuroscience abstracts, a completely different therapeutic domain, without any additional training. It scored 0.87. The pipeline generalized.
We also benchmarked against GPT-4o, one of the most capable general-purpose language models available. GPT-4o scored 0.73 in oncology and 0.74 in neuroscience. It demonstrated strong zero-shot capabilities, as expected. But our fine-tuned pipeline outperformed it by 13 F1 points in both domains (0.86 vs. 0.73 in oncology, 0.87 vs. 0.74 in neuroscience).
This is not intended as a criticism of GPT-4o. It is a capable model. Our results reinforce an assumption we held before running the experiment: effective clinical biomarker extraction demands domain-specific architecture. General models can offer useful starting points, but they lack the precision necessary for high-stakes clinical and commercial decisions.
Why this matters beyond the paper
At Kognitic, this research is not academic. It feeds directly into how the platform structures biomarker intelligence for competitive analysis, trial matching, and evidence benchmarking.
Every day, pharma teams make decisions that depend on biomarker data. Which patients are eligible for a trial? How does a competitor’s biomarker selection strategy compare to yours? Does the evidence support a specific line-of-therapy positioning?
These decisions require structured, accurate, traceable biomarker data. Not summaries. Not approximations.
BIOPSY is part of how we deliver that. The pipeline processes clinical text at scale and produces structured outputs that our platform uses to build the biomarker layer underneath competitive landscapes, evidence benchmarks, and trial intelligence views.
We also chose to publish the evaluation dataset on GitHub and to submit the paper for peer review. In a field where many companies claim AI capabilities without showing their methodology, we wanted to be transparent about how our models perform, where they succeed, and where there is still room to improve.
What comes next
BIOPSY currently handles oncology and neuroscience. We are extending it to additional therapeutic areas. We are also exploring automatic dataset generation pipelines to reduce the manual annotation burden and enable faster adaptation to new domains.
The biomarker landscape in oncology is evolving, with new targets and scoring systems emerging. Our key takeaway: intelligence infrastructure must advance at the same pace to deliver accurate, up-to-date biomarker insights for clinical and competitive needs. That is our focus moving forward.
If you are interested in the technical details, the full paper is available through the EMNLP 2025 proceedings. The evaluation dataset is on GitHub. And if you want to see how this works in the Kognitic platform, we are happy to walk you through it.
The BIOPSY paper was co-authored by Jaseem Mahmmdla, Co-founder and CEO of Kognitic, and me, and was presented in the EMNLP 2025 Industry Track.