Deep learning for Chest X-ray diagnosis

In the US, over 150 million chest X-rays are taken every year and read by radiologists to help doctors with diagnosis. Chest X-rays are currently the best available method for diagnosing diseases like pneumonia. Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings. In this blog, we outline how we tried to automate this crucial aspect of modern medicine using deep learning.

Dataset description

The dataset of chest X-rays that we use for this project is CheXpert, curated by the Stanford ML Group. Two variants of the dataset are provided: low resolution (15 GB) and high resolution (236 GB). The dataset contains X-rays of patients taken at different studies (think of visits to the doctor), with each study having one or more X-rays. It is important to note that some studies have two X-ray views (frontal and lateral). Each X-ray has multiple labels (diseases) assigned to it, and multiple labels can occur simultaneously. For instance, the most common cause of an enlarged cardiomediastinum is cardiomegaly, so the two pathologies often appear together. This makes the problem a multi-label classification problem.

Besides the training dataset provided, we also have validation data intended to test the performance of our model.

Some images from the dataset:

Healthy X-ray:

Cardiomegaly:

Edema:

Pleural Effusion:

Lung lesion:

Dataset enhancements

Think of a neural network as a black box that is only capable of memorizing images in the orientation it was trained with. For instance, if a neural net was shown a car with its front pointed towards the right, it will have a hard time recognizing the same car if it were flipped and facing left. To handle this, we randomly flip X-rays so that our models generalize better. Additionally, for diagnoses that typically appear on one side, like pleural effusion, this makes sure the network can handle the finding occurring in either lung. X-rays taken in different settings may also show slight variations in brightness and saturation, so we augment our dataset by adding slight variations in brightness, contrast and saturation.
These augmentations make the network more robust to such variations in the images.
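For concreteness, here is a minimal sketch of such an augmentation pipeline using torchvision; the resize size and jitter values are illustrative assumptions, not our exact training settings:

```python
from torchvision import transforms

# Random horizontal flips plus small photometric jitter; exact parameters
# here are illustrative, not the precise training configuration.
train_transforms = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])
```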

Uncertain Labels

For some patients, the Stanford dataset marks a diagnosis as uncertain. In the field of medicine, it is better to have false positives as opposed to false negatives. Therefore, we decided to treat all uncertain diagnoses as the presence of the pathology. Interestingly, this gave us the best performance among the approaches we tried for handling the uncertain labels.
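In the CheXpert label files, 1.0 means positive, 0.0 negative, -1.0 uncertain, and a blank means the finding was not mentioned, so treating uncertainty as presence is a simple relabelling. A sketch of this (the CSV path and column subset are assumptions):

```python
import pandas as pd

# Columns chosen for illustration; the full CheXpert CSV has 14 pathology columns.
PATHOLOGIES = ["Edema", "Pleural Effusion", "Cardiomegaly", "Lung Lesion"]

df = pd.read_csv("CheXpert-v1.0-small/train.csv")  # path is an assumption
# 1.0 = positive, 0.0 = negative, -1.0 = uncertain, blank = unmentioned.
# Map uncertain labels to positive and unmentioned labels to negative.
df[PATHOLOGIES] = df[PATHOLOGIES].fillna(0.0).replace(-1.0, 1.0)
```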

Data hypotheses

During data exploration, we tried to understand and visualize the different pathologies. We were specifically interested in the cause of the different diseases and how they manifest in X-rays. After careful examination of the dataset, we were able to formulate hypotheses for 4 of the pathologies based on how we expected the diseases to manifest on X-rays:

  1. Edema: The likelihood of edema increases with how foggy the image is, due to the buildup of fluid in the lungs.
  2. Pleural Effusion: The likelihood of pleural effusion increases with a buildup of white towards the bottom of one of the lungs. This represents fluid collecting in the pleural space between the lung and the chest wall.
  3. Lung Lesion: The likelihood of a lung lesion increases if there is a small, somewhat transparent white oval on the lungs.
  4. Cardiomegaly: The likelihood of cardiomegaly increases as the space between the two lungs increases.

These are the features that we think the model should look for as identifiers for these pathologies. We will later test our models against these hypotheses to verify whether they recognize the expected features.

Models

Before jumping to complex models, we evaluated some basic models in order to create a baseline. This also allowed us to establish the performance gain between different models.

For our base models, we tried logistic regression, a basic feed-forward network, and a CNN designed and trained from scratch. The mean AUC scores (higher is better) of these models were:

Once we had our baseline ready, we experimented with transfer learning approaches using deeper architectures. Transfer learning is exactly what the name suggests. A child can clearly identify a Granny Smith apple once she knows what a red apple looks like and that a Granny Smith looks the same except that it is green. The same idea can be used for neural networks. Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. This is very useful in reducing training time and in getting good performance in domains without a lot of data. Generally, only the last layer of the network is fine-tuned for the problem by freezing (not changing) the weights of the intermediate layers and training on the given dataset. The goal is to reduce the number of parameters that need to be optimised while reusing the lower-level layers. The lower-level layers are likely to be similar regardless of the problem domain, and the model has the freedom to combine the higher-level layers in a way specific to the problem.
For our dataset, we lack a pretrained model from a similar domain, so we take models pretrained on ImageNet and retrain them for our problem. Since chest X-rays are very different from the images ImageNet models were trained on, we don't freeze the intermediate layers; instead, we use the ImageNet weights as an initialization for our model weights and train the entire network on our dataset. We also experimented with progressive freezing of the weights (FreezeOut) on ResNets.
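A minimal sketch of this setup in PyTorch, assuming 14 output pathologies and torchvision's ImageNet-pretrained DenseNet-121 (no layers are frozen; the pretrained weights only serve as an initialization):

```python
import torch.nn as nn
from torchvision import models

NUM_PATHOLOGIES = 14  # assumption: one output per CheXpert pathology

# Start from ImageNet weights, but do not freeze anything; the whole
# network is fine-tuned on the chest X-ray data.
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, NUM_PATHOLOGIES)

# Multi-label problem: an independent sigmoid/BCE term per pathology.
criterion = nn.BCEWithLogitsLoss()
```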

DenseNet121

Neural networks can be thought of as many people standing in line playing telephone: information is lost as the number of people (layers) increases. DenseNets overcome this by creating skip connections (analogous to skipping people in telephone). Each layer of the DenseNet is connected to all subsequent layers, which ensures that information isn't lost. In a DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers via concatenation, so each layer receives the "collective knowledge" of all preceding layers.
Since each layer receives feature maps from all preceding layers, the network can be thinner and more compact, i.e. the number of channels can be smaller. The growth rate k is the number of additional channels each layer contributes.

As a result, DenseNets have higher computational and memory efficiency. The following figure shows the concept of concatenation during forward propagation:

Dense blocks
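To make the concatenation concrete, here is a toy dense block sketch; the real DenseNet-121 additionally uses 1x1 bottleneck convolutions and transition layers between blocks, which are omitted here:

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Toy dense block: each layer sees the concatenation of the block input
    and all earlier layers' outputs, and adds k (growth rate) new channels."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate  # concatenation grows the channel count by k

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # "collective knowledge" so far
            features.append(out)
        return torch.cat(features, dim=1)
```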

Advantages of DenseNets over ResNets:

  1. Stronger gradient and information flow, since every layer has direct access to the feature maps of all preceding layers.
  2. Feature reuse through concatenation, which lets the network be thinner and use fewer parameters.
  3. Better computational and memory efficiency for a given level of accuracy.

Performance
Using the DenseNet model, we saw a significant jump in performance over the CNN baseline. With the DenseNet model, we achieved a mean AUC of 0.766 on the Stanford validation set. The model did really well on a number of pathologies, especially pleural effusion and edema, with AUC scores of 0.89 and 0.87 respectively. Here are the ROC curves (true positive rate vs. false positive rate) for the pathologies:

It can be seen that the DenseNet improves performance quite a bit over the CNN. However, it performs poorly on enlarged cardiomediastinum. We believe this is because it generally co-occurs with cardiomegaly, making the two hard to differentiate; in fact, the most common cause of an enlarged cardiomediastinum is cardiomegaly. This highlights an important shortcoming of neural nets: a radiologist can trivially make use of the knowledge that cardiomegaly is one of the most common causes of an enlarged cardiomediastinum, but there is no straightforward way to add this knowledge to a neural net.
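For reference, the mean AUC reported above is simply the average of the per-pathology ROC AUCs. A sketch of how it can be computed with scikit-learn (array and list names are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(y_true, y_score, pathologies):
    """Per-pathology ROC AUC for a multi-label problem, plus their mean.
    y_true and y_score are (num_samples, num_pathologies) arrays."""
    per_class = {name: roc_auc_score(y_true[:, i], y_score[:, i])
                 for i, name in enumerate(pathologies)}
    return per_class, float(np.mean(list(per_class.values())))
```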

View based models

We devised an architecture to make classifications at the study level. Since radiologists had access to both the frontal and lateral X-rays for some patients, we wanted to incorporate this into our model. We trained two separate DenseNets from the previous section: one trained only on lateral images and one trained only on frontal images. Once again, we see how neural networks aren't capable of handling different kinds of data in a straightforward manner. We combine the results from the two views at the study level by taking the maximum of the two predictions if both images are available for a study; if not, we only use the model corresponding to the available view. This incorporates our human understanding that if one of the views indicates a disease is present, the disease is highly likely to be present regardless of what the other view depicts. We also tried taking the average and a weighted average of the two results, but taking the maximum worked better in general.
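A minimal sketch of this study-level combination rule (model and tensor names are placeholders):

```python
import torch

@torch.no_grad()
def study_prediction(frontal_model, lateral_model, frontal_img=None, lateral_img=None):
    """Combine the view-specific models at the study level by taking the
    element-wise maximum of their per-pathology probabilities."""
    probs = []
    if frontal_img is not None:
        probs.append(torch.sigmoid(frontal_model(frontal_img.unsqueeze(0)))[0])
    if lateral_img is not None:
        probs.append(torch.sigmoid(lateral_model(lateral_img.unsqueeze(0)))[0])
    # If both views exist, a pathology flagged by either view wins;
    # otherwise fall back to whichever view is available.
    return torch.stack(probs).max(dim=0).values
```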


Novel architecture
This model was trained with the same parameters as the vanilla DenseNet model, the only difference being that the frontal-view model was trained for 3 epochs and the lateral-view model for 6 epochs. An epoch trains the neural network on the entire dataset once. The intuition behind training the lateral model for more epochs is that there are far fewer lateral views than frontal views in the dataset.

Results
We got a mean AUC score of 0.803 on the Stanford validation dataset. On the validation set, this model outperformed the DenseNet121. Here are the ROC curves for the combined model:

Model interpretation

We interpreted the model using Class Activation Mapping (CAM), which highlights the features in an image that are most important for making a classification.
The best-performing model was used for this section. The CAM images generally looked as expected. For instance, with edema, the entire image is highlighted because the model needs to check whether both lungs are hazy and filling with fluid. Similarly, for cardiomegaly, the model correctly focuses on the lung that was pushed away by the enlargement of the heart. For lung lesions, the model correctly focuses on the lesion.
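As a sketch of how a CAM can be produced for a DenseNet-style classifier (this assumes the torchvision layout with a `features` backbone followed by a linear `classifier`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_activation_map(model, img, class_idx):
    """Weight the final convolutional feature maps by the classifier weights
    of the chosen class, then upsample the result to the input resolution."""
    model.eval()
    feats = F.relu(model.features(img.unsqueeze(0)))     # (1, C, h, w)
    weights = model.classifier.weight[class_idx]          # (C,)
    cam = torch.einsum('c,chw->hw', weights, feats[0])    # weighted sum over channels
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample so the map can be overlaid on the original X-ray.
    return F.interpolate(cam[None, None], size=img.shape[-2:],
                         mode='bilinear', align_corners=False)[0, 0]
```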

Pleural effusion CAM

Edema CAM

Lesion CAM

We also experimented with another visualization method called Guided Backpropagation, which aims to identify the locations in the image that activate the final-layer neurons. In order to better visualize the regions of interest, only the neurons activated by the image (i.e. those with a non-zero gradient) are retained during backpropagation, and all negative gradients are clipped to 0.
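A minimal sketch of how this gradient clipping can be wired up with backward hooks in PyTorch (simplified; a full implementation would also remove the hooks afterwards):

```python
import torch
import torch.nn as nn

def enable_guided_backprop(model):
    """Register hooks so that only positive gradients flow back through ReLUs."""
    def clip_negative_gradients(module, grad_input, grad_output):
        return (torch.clamp(grad_input[0], min=0.0),)

    handles = []
    for module in model.modules():
        if isinstance(module, nn.ReLU):
            handles.append(module.register_full_backward_hook(clip_negative_gradients))
    return handles  # keep these handles to remove the hooks later

# Usage sketch: forward an image with requires_grad=True, backprop the target
# logit, and visualize image.grad as the guided-backprop saliency map.
```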

In the example below, we can observe how the model focuses on the position of the diaphragm and the lower boundaries of the lungs to detect cardiomegaly. By observing the extremities of the lung lobes as well as the heart boundary, the model focuses on whether the heart pushes too far into the lobe or not.

Hypotheses verification

To verify our initial hypotheses, we edited the images and checked whether our model increased the score for the corresponding class. For instance, for cardiomegaly we added white space between the lungs and checked if the score for cardiomegaly went up. Similar edits were done for the other hypotheses. A few sample images are attached below. We are happy to report that all of our initial hypotheses were correct. The scores for each class before and after the edit are reported below.

Diagnosis           Un-edited image    Edited (hypothesized) image
Edema               0.036              0.4133
Pleural effusion    0.8927             0.4536
Cardiomegaly        0.072              0.5925
Lesion              0.016              0.5785

Table: scores for each class before and after the edit.
For pleural effusion, an X-ray already diagnosed with pleural effusion was edited, so its score was expected to drop after the edit.
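As an illustration, here is a minimal sketch of such an edit-and-rescore check; the patch coordinates, class index and model are placeholders rather than our exact edits:

```python
import torch

def add_white_patch(img, top, left, height, width, value=1.0):
    """Paint a bright rectangle onto a (C, H, W) image tensor, e.g. extra
    opacity between the lungs for the cardiomegaly hypothesis."""
    edited = img.clone()
    edited[:, top:top + height, left:left + width] = value
    return edited

@torch.no_grad()
def scores_before_after(model, img, class_idx, **patch_kwargs):
    """Return the class probability for the original and the edited image."""
    model.eval()
    before = torch.sigmoid(model(img.unsqueeze(0)))[0, class_idx].item()
    edited = add_white_patch(img, **patch_kwargs)
    after = torch.sigmoid(model(edited.unsqueeze(0)))[0, class_idx].item()
    return before, after
```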

Cardiomegaly: left (unedited) and right (edited)
Edema: left (unedited) and right (edited)
Lesion: left (unedited) and right (edited)
Pleural effusion: left (unedited) and right (edited)