This repository contains a completed capstone project for the Udacity "Applying AI to 2D Medical Imaging Data" course, part of the AI for Healthcare Nanodegree program. It has been reviewed by Udacity instructors and met project specifications.
Advancements in deep learning and computer vision create new opportunities for software that assists physicians. Assistive software can improve patient prioritization and reduce the effort physicians spend examining medical images. In this project, a convolutional neural network (CNN) is trained to predict the presence or absence of pneumonia from chest X-Ray images. The VGG16 CNN model was fine-tuned for this classification task. The intended use of this model is to pre-screen chest X-Ray images prior to radiologists' review in order to reduce their workload.
The paper by Pranav Rajpurkar et al., "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning", provides benchmarks against which to compare pneumonia classification performance [1]. The paper established the F1 score as the metric for comparing radiologists' and algorithms' performance in identifying pneumonia. The F1 score is the harmonic mean of the precision and recall of a model's predictions against ground-truth labels.
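For example, the F1 score can be computed directly from precision and recall (the numbers below are illustrative, not taken from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. a model with precision 0.40 and recall 0.48:
print(round(f1_score(0.40, 0.48), 3))  # -> 0.436
```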
In a subset of 420 images from the ChestX-ray14 dataset [2], the CheXNet algorithm achieved an F1 score of 0.435, while a panel of four independent radiologists averaged an F1 score of 0.387 [1].
This repo's CNN model achieved an F1 score of 0.366, comparable to the performance of the radiologist panel.
References
[1] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, Andrew Y. Ng, "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning," arXiv:1711.05225, Dec 2017.
[2] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers, "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases," IEEE CVPR, pp. 3462-3471, 2017.
This project is organized into three Jupyter Notebooks:
- 1_EDA (Exploratory Data Analysis): NIH X-Ray Dataset metadata analysis and X-ray image pixel-level analysis.
- 2_Build_and_Train_Model: Image pre-processing with Keras ImageDataGenerator, dataset splitting with Scikit-Learn, building and training a Keras Sequential model, and converting probabilistic outputs to binary predictions.
- 3_Inference: DICOM pixel data extraction, pixel data normalization and standardization, and applying the trained model to make predictions.
Figure 1. Example of in-line prediction output in 3_Inference.ipynb
This project uses the ChestX-ray14 dataset curated by Wang et al. and released by NIH Clinical Center.
It comprises 112,120 X-Ray images with disease labels from 30,805 unique patients.
The disease labels for each image were created by applying Natural Language Processing (NLP) to the associated radiology reports for fourteen common pathologies. The accuracy of the NLP labeling is estimated to be >90%.
References
[1] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers, "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases," IEEE CVPR, pp. 3462-3471, 2017.
- Set up your Anaconda environment.
- Clone the https://github.com/ElliotY-ML/Pneumonia_Detection_ChestX.git GitHub repo to your local machine.
- Open `1_EDA.ipynb` with Jupyter Notebook for exploratory data analysis.
- Open `2_Build_and_Train_Model.ipynb` with Jupyter Notebook for image pre-processing with Keras ImageDataGenerator, ImageNet VGG16 CNN model fine-tuning, and threshold analysis.
- Open `3_Inference.ipynb` with Jupyter Notebook for inference with a DICOM file.
- The complete project results discussion can be found in `FDA_Preparation.md`.
Using Anaconda consists of the following steps:

1. Install `miniconda` on your computer, selecting the latest Python version for your operating system. If you already have `conda` or `miniconda` installed, you can skip this step and move on to step 2.
2. Create and activate a new `conda` environment.
Download the latest version of miniconda that matches your system.

|        | Linux                   | Mac                     | Windows                |
|--------|-------------------------|-------------------------|------------------------|
| 64-bit | 64-bit (bash installer) | 64-bit (bash installer) | 64-bit (exe installer) |
| 32-bit | 32-bit (bash installer) |                         | 32-bit (exe installer) |
Install miniconda on your machine. Detailed instructions:
- Linux: https://docs.conda.io/en/latest/miniconda.html#linux-installers
- Mac: https://docs.conda.io/en/latest/miniconda.html#macosx-installers
- Windows: https://docs.conda.io/en/latest/miniconda.html#windows-installers
For Windows users, the following commands need to be executed from the Anaconda prompt rather than a Windows terminal window. For Mac, a normal terminal window will work.

These instructions also assume you have `git` installed for working with GitHub from a terminal window. If you do not, you can install it first with the command:

```shell
conda install git
```
Create local environment

1. Clone the repository, and navigate to the downloaded folder. This may take a minute or two to clone due to the included image data.

   ```shell
   git clone https://github.com/ElliotY-ML/Pneumonia_Detection_ChestX.git
   cd Pneumonia_Detection_ChestX
   ```

2. Create and activate a new environment, named `ChestX-Pneumonia`, with Python 3.8. Be sure to run the command from the project root directory, since the `environment.yml` file is there. If prompted to proceed with the install (`Proceed [y]/n`), type `y` and press `ENTER`.

   - Linux or Mac:

     ```shell
     conda env create -f environment.yml
     source activate ChestX-Pneumonia
     ```

   - Windows:

     ```shell
     conda env create -f environment.yml
     conda activate ChestX-Pneumonia
     ```

At this point your command line should look something like `(ChestX-Pneumonia) <User>:USER_DIR <user>$`. The `(ChestX-Pneumonia)` prefix indicates that your environment has been activated.
The original Udacity project instructions can be read in `Udacity_Project_Overview.md`.
Project Overview
- Exploratory Data Analysis
- Building and Training Your Model
- Inference
- FDA Preparation
Open `1_EDA.ipynb` with Jupyter Notebook for exploratory data analysis. The following data are examined:
- ChestX-ray14 Dataset metadata, which contains information for each X-Ray image file: the associated disease findings, patient gender, age, patient position during the X-ray, and image shape.
- Pixel-level assessment of X-Ray image files by graphing intensity profiles of normalized image pixels. X-Rays are also displayed using scikit-image.
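The metadata analysis can be sketched with pandas. The toy frame below mimics `data/Data_Entry_2017.csv`; the column names are assumptions based on the NIH dataset's layout, and the values are invented for illustration:

```python
import pandas as pd

# Toy metadata frame mimicking data/Data_Entry_2017.csv
# (column names assumed; values invented for illustration)
meta = pd.DataFrame({
    "Image Index": ["00000001_000.png", "00000002_000.png", "00000003_000.png"],
    "Finding Labels": ["Pneumonia|Infiltration", "No Finding", "Pneumonia"],
    "Patient Age": [58, 44, 73],
    "Patient Gender": ["M", "F", "M"],
    "View Position": ["PA", "AP", "PA"],
})

# Derive a binary pneumonia label from the pipe-delimited finding labels
meta["pneumonia"] = meta["Finding Labels"].str.contains("Pneumonia").astype(int)

print(meta["pneumonia"].value_counts())
print(meta.groupby("Patient Gender")["Patient Age"].mean())
```

The same `str.contains` pattern extends to any of the fourteen pathologies in the dataset.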
Part 2: Building and Training Your Model, Fine Tuning Convolutional Neural Network VGG16 for Pneumonia Detection from X-Rays
Inputs:

- ChestX-ray14 dataset containing 112,120 X-Ray images (.png) in `data/images` and metadata in the `data/Data_Entry_2017.csv` file [1].

NOTE: The dataset is not included in this GitHub repo because its size is greater than 42 GB. Please download a copy of the dataset from https://nihcc.app.box.com/v/ChestXray-NIHCC and unpack it into `/data/images`.
Outputs:

- CNN model trained to classify a chest X-Ray image for the presence or absence of pneumonia, in `/out/my_model1.json`.
- `/out/xray_class_my_model.best.hdf5` containing the model weights.

NOTE: These files are not included in this GitHub repo.
1. Open `2_Build_and_Train_Model.ipynb` with Jupyter Notebook.
2. Create training and validation data splits with the scikit-learn `train_test_split` function.
3. Ensure the training data split is balanced for positive and negative cases. Ensure the validation data split has a positive-to-negative case ratio that reflects clinical scenarios. Also check that each split has demographics reflective of the overall dataset.
4. Prepare image preprocessing for each data split using the Keras `ImageDataGenerator`.
5. To fine-tune the ImageNet VGG16 model, create a new Keras Sequential model by adding the VGG16 model layers and freezing their ImageNet-trained weights, then add Dense and Dropout layers whose weights will be trained for classifying chest X-Ray images for pneumonia.
6. Model training produces a history showing loss metrics at each training epoch. The best model weights are also captured at each epoch.
7. Model predictions are initially returned as probabilities between 0 and 1. These probabilistic results were compared against ground-truth labels.
8. A threshold analysis was completed to select the boundary at which probabilistic results are converted into binary results of pneumonia presence or absence.
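The fine-tuning step can be sketched as follows. The Dense/Dropout sizes and the optimizer are illustrative assumptions, not the notebook's exact settings, and `weights=None` is used here only to avoid the ImageNet download (the project loads `weights='imagenet'`):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential

# VGG16 convolutional base; weights=None avoids the ImageNet download
# in this sketch -- the notebook uses weights='imagenet'.
vgg = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))
vgg.trainable = False  # freeze the pre-trained convolutional layers

# New trainable head for binary pneumonia classification
# (layer sizes and dropout rates are illustrative assumptions)
model = Sequential([
    vgg,
    Flatten(),
    Dense(1024, activation="relu"),
    Dropout(0.5),
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # probability of pneumonia
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Freezing the convolutional base keeps the ImageNet features intact while only the new head's weights are updated during training.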
The CheXNet algorithm achieved an F1 score of 0.435, while a panel of four independent radiologists averaged an F1 score of 0.387 [2]. This project's final F1 score is 0.366, which is similar in performance to the radiologist panel.
References
[1] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers, "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases," IEEE CVPR, pp. 3462-3471, 2017.
[2] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, Andrew Y. Ng, "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning," arXiv:1711.05225, Dec 2017.
The `3_Inference.ipynb` Jupyter Notebook contains the functions to load DICOM files, pre-process the DICOM image, load the model built in `2_Build_and_Train_Model.ipynb`, and predict the presence of pneumonia from the DICOM image.
Inputs:
- .dcm DICOM medical imaging file, containing metadata and a medical image
Output:
- DICOM image is displayed with a prediction of whether the patient is Positive or Negative for Pneumonia
The following steps should be performed to analyze a chest X-Ray DICOM file:

1. Load the DICOM file with the `check_dicom(filename)` function. Its output is the DICOM `pixel_array`, or an error message if the DICOM file is not a chest X-Ray.
2. Pre-process the loaded DICOM image with the `preprocess_image(img=pixel_array, img_mean=0, img_std=1, img_size=(1,224,224,3))` function.
3. Load the trained model with `load_model(model_path, weight_path)`.
4. Make a prediction with `predict_image(model, img, thresh=0.245)`.
The complete project results discussion can be found in `FDA_Preparation.md`.

This project is licensed under the MIT License - see the `LICENSE.md` file for details.