LangExtract: AI Transforms Clinical Notes to Structured Data

Towards Data Science

In the vast and complex world of healthcare, a significant portion of critical patient data remains buried within unstructured text—primarily clinical notes. These documents, often lengthy and filled with abbreviations, inconsistencies, and medical jargon, pose a formidable challenge to data extraction and analysis. Important details, such as drug names, dosages, and especially adverse drug reactions (ADRs), frequently get lost in the textual deluge, making rapid detection and response difficult. Addressing this challenge, Google developers have introduced LangExtract, a new open-source project that uses large language models (LLMs) to transform messy, unstructured text into clean, structured data. Although it comes from Google developers, it is worth noting that LangExtract is not an officially supported Google product.

The timely detection of adverse drug reactions is paramount for patient safety and the broader field of pharmacovigilance. An ADR is any harmful, unintended consequence arising from medication use, ranging from mild side effects like nausea to severe outcomes requiring immediate medical attention. Identifying these reactions quickly is crucial, yet in clinical notes, ADRs are often intertwined with a patient’s medical history, lab results, and other contextual information, making manual extraction a laborious and error-prone process. While LLMs are an active area of research for ADR detection, recent studies indicate they can effectively flag potential issues but are not yet reliably precise for definitive extraction. This makes ADR extraction an excellent stress test for LangExtract, evaluating its ability to pinpoint specific adverse reactions amidst a host of other medical entities.

LangExtract operates on a straightforward three-step workflow. Users begin by defining their extraction task through a clear, descriptive prompt that specifies the exact information they wish to extract. Next, they provide a few high-quality examples, known as “few-shot examples,” which serve to guide the model toward the desired format and level of detail for the output. Finally, users submit their input text, select their preferred LLM (which can be either a proprietary API-based model or a local model via platforms like Ollama), and allow LangExtract to process the data. The resulting structured data can then be reviewed, visualized, or directly integrated into downstream analytical pipelines. The tool’s versatility extends beyond clinical notes, with examples ranging from entity extraction in literary texts to structuring radiology reports.
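The three steps map directly onto LangExtract's Python interface. The sketch below is a minimal illustration of that workflow, adapted from the kind of literary-text example the project documents; the prompt wording, class names, and input sentence are illustrative rather than taken from this article's experiment.

```python
import textwrap
import langextract as lx

# Step 1: describe the extraction task in plain language.
prompt = textwrap.dedent("""\
    Extract characters and emotions in order of appearance.
    Use exact text from the input; do not paraphrase.""")

# Step 2: guide the model with one or more high-quality few-shot examples.
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO"),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
        ],
    )
]

# Step 3: run the extraction with the chosen model and inspect the results.
result = lx.extract(
    text_or_documents="JULIET. O Romeo, Romeo! wherefore art thou Romeo?",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # an API-based model; local models can be used instead
)

for extraction in result.extractions:
    print(extraction.extraction_class, "->", extraction.extraction_text)
```

The returned object keeps each extraction anchored to its source span, which is what makes the downstream review and visualization steps possible.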

To demonstrate its capabilities in a clinical context, LangExtract was tested on its ability to identify ADRs using Google’s Gemini 2.5 Flash model. The extraction task was clearly defined: extract medication, dosage, adverse reaction, and any action taken, including the severity of the reaction as an attribute if mentioned. Crucially, the prompt instructed the model to use exact text spans from the original note, avoiding any paraphrasing, and to return entities in their order of appearance. A guiding example was provided, illustrating how a note detailing “ibuprofen 400 mg” leading to “mild stomach pain” and the patient “stopping the medicine” should be structured. When presented with a real clinical sentence from the ADE Corpus v2 dataset, LangExtract successfully identified the adverse drug reaction without confusing it with the patient’s pre-existing conditions—a common hurdle in such tasks.
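A rough sketch of how that guiding example might be encoded is shown below. The article does not spell out the exact schema, so the class names (medication, dosage, adverse_reaction, action_taken) and attribute keys here are assumptions chosen to mirror the description above.

```python
import textwrap
import langextract as lx

# Task description for the ADR experiment, paraphrased from the article.
adr_prompt = textwrap.dedent("""\
    Extract medication, dosage, adverse reaction, and any action taken.
    Include the severity of the reaction as an attribute when mentioned.
    Use exact text spans from the original note; do not paraphrase.
    Return entities in their order of appearance.""")

# One guiding few-shot example mirroring the ibuprofen case described above.
adr_examples = [
    lx.data.ExampleData(
        text="Patient took ibuprofen 400 mg and developed mild stomach pain, "
             "so she stopped the medicine.",
        extractions=[
            lx.data.Extraction(extraction_class="medication", extraction_text="ibuprofen"),
            lx.data.Extraction(extraction_class="dosage", extraction_text="400 mg"),
            lx.data.Extraction(
                extraction_class="adverse_reaction",
                extraction_text="mild stomach pain",
                attributes={"severity": "mild"},
            ),
            lx.data.Extraction(extraction_class="action_taken",
                               extraction_text="stopped the medicine"),
        ],
    )
]
```

With this prompt and example in place, a sentence from the ADE Corpus v2 dataset can be passed to lx.extract exactly as in the earlier sketch.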

Real-world clinical notes are often much longer than single sentences. LangExtract accommodates these extended texts through a few tuning parameters: extraction_passes runs multiple scans over the text to improve recall and capture subtler details; max_workers enables parallel processing of text chunks for faster handling of large documents; and max_char_buffer splits the text into smaller, manageable chunks so the model maintains accuracy even on very long inputs. LangExtract can also run against local LLMs through Ollama, a significant advantage for organizations whose privacy-sensitive clinical data cannot leave a secure, on-premise environment.
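A sketch of a long-document run is shown below, reusing the ADR prompt and examples from the previous snippet. The file name and parameter values are illustrative; appropriate settings depend on document length and the rate limits of the chosen model.

```python
import langextract as lx

# Illustrative input: a long discharge summary rather than a single sentence.
with open("discharge_summary.txt") as f:
    long_note = f.read()

result = lx.extract(
    text_or_documents=long_note,
    prompt_description=adr_prompt,
    examples=adr_examples,
    model_id="gemini-2.5-flash",  # a local Ollama-served model could be specified instead
    extraction_passes=3,    # re-scan the text to improve recall on subtle mentions
    max_workers=10,         # process chunks in parallel for faster throughput
    max_char_buffer=1000,   # split the note into smaller chunks to preserve accuracy
)

print(f"Extracted {len(result.extractions)} entities from the note")
```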

In summary, LangExtract presents a promising solution for transforming unstructured clinical notes into actionable, structured data, saving substantial preprocessing effort for information retrieval systems and metadata extraction applications. Its performance in ADR experiments was encouraging, accurately identifying medications, dosages, and reactions. The quality of the extracted output, however, is directly influenced by the quality of the few-shot examples provided, highlighting that human expertise remains a vital component in the loop. While the initial results are positive, the high-risk nature of clinical data necessitates more extensive and rigorous testing across diverse datasets before LangExtract can be widely adopted for production use.