
Releases: I2C9W/fromtexttotables

v0.5.0

29 Nov 12:06
Pre-release

Release Version 0.5.0

Release Title:

From Text to Tables: A Local Privacy Preserving Large Language Model for Structured Information Retrieval from Medical Documents

Background and Overview

Version 0.5.0 of our software pipeline embodies the principles and findings detailed in our paper. Focused on extracting structured information from medical texts, this release introduces a local, privacy-preserving pipeline built on the Large Language Model (LLM) "Llama 2", applied to clinical feature detection in medical reports.

Key Highlights of Version 0.5.0

  • Local Deployment of LLM: "Llama 2" runs entirely on local hardware, addressing the privacy concerns inherent in processing personal healthcare data, unlike cloud-hosted LLMs such as ChatGPT that depend on remote data centers.

  • Clinical Feature Extraction: The tool efficiently extracts key clinical features associated with decompensated liver cirrhosis, such as abdominal pain, shortness of breath, confusion, liver cirrhosis, and ascites.

  • Model Versions and Sizes: The tool incorporates three versions of "Llama 2" (7, 13, and 70 billion parameters), allowing extraction and analysis performance to be compared across model scales.

  • Improved Data Processing and Formatting: The pipeline uses the llama.cpp framework for efficient and consistent JSON output formatting, subsequently converting this data into CSV format via Python's pandas library.

  • Zero-shot and Chain-of-Thought Prompting: Incorporates advanced prompting techniques for enhanced extraction accuracy and explainability.
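The JSON-to-CSV step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the field names (`report_id`, `abdominal_pain`, `ascites`, `confusion`) are hypothetical stand-ins for the clinical features listed above, and the JSON lines mimic what a llama.cpp run constrained to JSON output might emit.

```python
import json

import pandas as pd

# Hypothetical per-report JSON outputs, as a llama.cpp run might emit them.
# Field names are illustrative, not the pipeline's exact schema.
raw_outputs = [
    '{"report_id": 1, "abdominal_pain": true, "ascites": true, "confusion": false}',
    '{"report_id": 2, "abdominal_pain": false, "ascites": true, "confusion": true}',
]

# Parse each JSON line into a dict, then build a table:
# one row per report, one column per clinical feature.
records = [json.loads(line) for line in raw_outputs]
df = pd.DataFrame(records)

# Serialize the table as CSV (here to a string; the pipeline
# would typically write a file with df.to_csv(path, index=False)).
csv_text = df.to_csv(index=False)
print(csv_text)
```

Keeping the model output as strict JSON makes this step trivial: pandas infers the columns directly from the parsed records, so downstream analysis never has to re-parse free text.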

Performance and Evaluation

  • High Sensitivity and Specificity: The 70 billion parameter model showed high sensitivity and specificity in detecting explicitly documented clinical features such as liver cirrhosis and ascites.

  • Diverse Clinical Feature Detection: Demonstrated varied sensitivity and specificity across different clinical features, showcasing the model's ability to parse and interpret medical text effectively.

  • Bootstrapping for Robust Evaluation: Utilized bootstrapping with 1,000 iterations to obtain robust statistical estimates of the model's performance.
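The bootstrapping step above can be sketched in a few lines. This is an illustrative percentile-bootstrap sketch only, using toy labels; the actual evaluation operates on the gold annotations and model predictions from the study, and its exact implementation may differ.

```python
import random

# Toy gold labels and model predictions for one clinical feature
# (illustrative data, not from the study).
gold = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

def sensitivity(gold, pred):
    """True-positive rate; NaN if the sample contains no positives."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return tp / (tp + fn) if (tp + fn) else float("nan")

random.seed(0)
n = len(gold)
boot = []
for _ in range(1000):  # 1,000 bootstrap iterations, as in the release
    idx = [random.randrange(n) for _ in range(n)]
    boot.append(sensitivity([gold[i] for i in idx], [pred[i] for i in idx]))

boot = [s for s in boot if s == s]  # drop resamples with no positives (NaN)
boot.sort()
lo = boot[int(0.025 * len(boot))]       # 2.5th percentile
hi = boot[int(0.975 * len(boot)) - 1]   # 97.5th percentile
print(f"sensitivity={sensitivity(gold, pred):.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
```

Resampling the report set with replacement and recomputing the metric each time yields an empirical distribution, from which percentile confidence intervals follow without any normality assumption.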

Ethical and Methodological Approach

  • Use of Anonymized Data: Employed anonymized patient data from the MIMIC-IV database, adhering to ethical standards and providing a broad spectrum of patient data for analysis.

  • Focus on Decompensated Liver Cirrhosis: Selected 500 patient histories from MIMIC-IV, aiming to detect early signs of decompensated liver cirrhosis, which is critical for timely and effective patient care.

Conclusion and Future Directions

Version 0.5.0 of our pipeline marks a significant step towards on-premise and point-of-care deployment of LLMs for medical text analysis, aligning with our study's aim of demonstrating clinical information extraction with locally deployed LLMs.


Keywords: Text Mining, Artificial Intelligence in Medicine, Large Language Models, LLM, Medical Text Analysis

All source codes and detailed methodologies are available at [GitHub Repository Link].


This release is part of an ongoing project, and we welcome contributions and feedback from the medical and AI research communities.