Author Information
- aDepartment of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts
- bInstitute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts
- cDivision of Cardiology, Massachusetts General Hospital, Boston, Massachusetts
- ∗Address for correspondence:
Dr. Collin Stultz, Massachusetts Institute of Technology, Institute for Medical Engineering and Science and Department of Electrical Engineering and Computer Science, MIT Building 36-796, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139.
Machine learning is all the rage. At the time of this writing, a PubMed search using the phrase “machine learning” found almost 20,000 articles published within the last 5 years, and more than 5,000 of these papers report results using “deep learning.” Although much of this enthusiasm is understandable, applications of deep learning to problems in health care pose unique challenges (1–3).
Machine learning refers to a class of methods that allow computer systems to acquire knowledge from data; the learned knowledge is then typically used to accomplish some pre-specified task. Although recent years have seen a significant increase in the application of these methods in the clinical domain, machine learning has actually been used in health care for some time. Logistic regression, cluster analysis, and many data mining techniques, for example, all fall under this rubric (2).
Deep learning refers to a class of machine learning methods that strives to “learn” abstract ways to represent data. In most applications, these learned data abstractions are used to accomplish some task of interest, for example, patient risk stratification, diagnoses from visual images of pathology slides, and so forth. Deep learning models are typically complex neural networks that contain on the order of 10⁶ modifiable parameters (3). Given their complexity, understanding what a successful deep learning model has actually learned is far from straightforward. Such “black boxes,” which provide little insight into how the model arrives at a given result, are therefore particularly difficult for a clinician to trust. This perception is compounded by the fact that many computer scientists who work in this domain are, unfortunately, too enamored with building sophisticated tools and relatively unconcerned with developing approaches that help users understand what knowledge these models have garnered.
A necessary condition for the success of any machine learning model is that it achieve an accuracy superior to that of pre-existing methods intended to accomplish the same task. In health care, however, accuracy alone does not, nor should it, ensure that a model will gain clinical acceptance. So, what constitutes a good deep learning model for clinical applications? Unlike models applied to problems outside of medicine, clinical models that perform poorly can have deleterious consequences for patients. Because no model in practice has 100% accuracy, attempts to understand when a given model is likely to fail should form an important part of the evaluation of any machine learning model that will be used clinically. In addition, the most useful clinical models are explainable in the sense that it is possible to describe in clearly understandable language why the model arrives at a particular result for a given set of inputs. Admittedly, translating the higher-level data abstractions that arise from deep learning models into language that the health care provider can understand is challenging. Nevertheless, the difficulty of this endeavor only highlights its necessity.
In this issue of JACC: Clinical Electrophysiology, Howard et al. (4) use a deep neural network to determine the model of a cardiac rhythm device using only radiographic images. The importance of the problem is undeniable because knowing a device’s model and manufacturer is a prerequisite for interrogating and programming the device. Patients admitted with inappropriate implantable cardioverter-defibrillator (ICD) shocks, for example, often benefit from early interrogation, and the sooner the device model is identified, the earlier corrective measures can be taken. Furthermore, in addition to addressing a clinically important problem, this work is a nice example of a study that strives to address other issues that are important for generating clinically useful deep learning models.
Howard et al. (4) begin their work by retraining 5 previously constructed convolutional neural networks (CNNs) that exhibited impressive performance as part of the ImageNet Large Scale Visual Recognition Challenge, an ongoing competition that evaluates different algorithms for object identification and image classification (5). For context, a CNN is a type of deep neural network that is inspired by our knowledge of how images are processed by the visual cortex (6). The adjective “convolutional” refers to a set of mathematical functions that are used to quantify correlations within data. Convolutions are an effective way to extract features from visual images because images have a fair amount of underlying structure (i.e., 1 pixel in an image is more similar to other nearby pixels than it is to pixels that are far away in the image). Simply put, convolution functions provide an efficient platform for capturing the underlying organization in an image. For the sake of completeness, it is worth noting that CNNs have many other components besides convolutional layers. However, much of the motivation for their application to image analysis relies on the fact that spatial dependencies between pixels in an image can be captured with appropriate convolution functions.
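To make the idea concrete, the sliding-window computation described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the hand-built vertical-edge filter and the 2-tone "image" below are invented for demonstration. Note that every output value depends only on a small neighborhood of input pixels, which is precisely how CNNs exploit the spatial structure of an image.

```python
def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution over nested-list arrays."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Each output value is a weighted sum over a small
            # neighborhood of pixels: local structure, captured locally.
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter: responds where bright and dark columns meet,
# and not at all in flat regions.
edge_filter = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# Toy "image": bright left half, dark right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(5)]

response = conv2d(image, edge_filter)
# The response is zero in the flat regions and large at the edge.
```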
The retrained CNNs achieved an accuracy that exceeded that of expert-guided classification using a previously published cardiac rhythm device algorithm; the overall accuracy of the best-performing CNN is above 99%. Because the best-performing CNN has more than 20 million modifiable parameters, understanding what the model has learned and how it arrives at a given classification is a daunting task. Nevertheless, to their credit, the authors help the reader understand when the model is most likely to fail and how it arrives at a given prediction (4). CNN accuracy is reduced on portable radiographs relative to sharper departmental radiographs, suggesting that the model should be used with caution when applied to portable studies. By contrast, accuracy does not seem to vary with the type of cardiac device (e.g., ICD vs. permanent pacemaker) or with the manufacturer. Additionally, in the publicly available version of the authors’ method, the CNN’s best guess for the device model is presented along with 2 alternatives corresponding to the model’s next best predictions. In principle, when the predicted probabilities of these 3 candidate device models are close to one another, the CNN’s top prediction should be viewed with caution.
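This reliability heuristic can be illustrated with a short, hypothetical Python sketch: raw network outputs are converted into probabilities with a softmax, the top 3 candidates are reported, and the gap between the best and second-best probabilities serves as a rough measure of confidence. The device names and scores below are invented for illustration and have no connection to the authors' tool.

```python
import math

def softmax(scores):
    """Convert raw network outputs into probabilities that sum to 1."""
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top3(model_names, scores):
    """Return the 3 device models with the highest predicted probability."""
    probs = softmax(scores)
    ranked = sorted(zip(model_names, probs), key=lambda p: -p[1])
    return ranked[:3]

# Hypothetical output scores for 4 candidate device models.
names = ["Model A", "Model B", "Model C", "Model D"]
confident = top3(names, [9.0, 2.0, 1.0, 0.5])   # one model dominates
uncertain = top3(names, [2.1, 2.0, 1.9, 1.8])   # probabilities are close

# A simple reliability heuristic: the gap between the best and
# second-best probabilities. A small gap argues for caution.
gap_confident = confident[0][1] - confident[1][1]
gap_uncertain = uncertain[0][1] - uncertain[1][1]
```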
To understand what the model has learned, the authors relied on saliency maps, a visualization technique that identifies the pixels in an image that are most responsible for the model arriving at a given classification (7). Saliency maps have lately received considerable attention in the machine learning literature, and applying them to this problem is a natural and welcome extension. The calculated saliency maps suggest that the CNNs identify small circuit board components that are unique to different device models. A corollary is that the CNN will be less likely to identify the correct device model when these circuit board components are poorly visualized.
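For readers unfamiliar with the technique, the gradient-based saliency map of Simonyan et al. (7) assigns each pixel the magnitude of the class score's derivative with respect to that pixel. The sketch below uses a deliberately simple, hypothetical linear scoring function as a stand-in for a CNN, with the derivative estimated by finite differences; it is a conceptual illustration only.

```python
def class_score(pixels, weights):
    """Toy stand-in for a CNN's pre-softmax score for one class."""
    return sum(p * w for p, w in zip(pixels, weights))

def saliency_map(pixels, weights, eps=1e-6):
    """|d(score)/d(pixel)| for every pixel, via finite differences."""
    base = class_score(pixels, weights)
    grads = []
    for i in range(len(pixels)):
        bumped = list(pixels)
        bumped[i] += eps                  # perturb one pixel
        grads.append(abs(class_score(bumped, weights) - base) / eps)
    return grads

# 4-pixel "image"; the 3rd pixel (weight 5.0) drives the score, much as
# a distinctive circuit-board component drives the CNN's classification.
pixels = [0.2, 0.8, 0.9, 0.1]
weights = [0.1, -0.2, 5.0, 0.0]
sal = saliency_map(pixels, weights)
# The saliency map highlights the 3rd pixel as most influential.
```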
A potential deficiency of this study relates to the size of the training and test sets. CNNs constructed for image classification are typically trained on millions of visual images. Given that CNNs usually have millions of modifiable parameters, using a small dataset (where the number of patients is much smaller than the number of modifiable parameters) raises the concern of overfitting. Although evaluating the model on a test set that was not used to train the model helps to mitigate this concern, it does not eliminate it, especially given that the training and test sets contain 1,451 and 225 images, respectively, and the best-performing CNN has almost 21 million modifiable parameters. Again, to their great credit, the authors used a series of standard machine learning techniques (dropout and regularization) that are known to minimize overfitting (4). More importantly, the authors made their method publicly available, thereby allowing users all over the world to apply the method to different radiographic images. These efforts will enable a more robust assessment of the model’s real-world accuracy.
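The 2 regularization techniques mentioned above can be sketched schematically in Python; this is a generic illustration of the methods, not the authors' training code. Inverted dropout randomly silences units during training (rescaling the survivors so that the expected activation is unchanged), and an L2 penalty added to the loss discourages large weights.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale survivors by 1/(1 - rate) so the expected value is preserved."""
    keep = 1.0 - rate
    return [a * (1.0 / keep) if rng.random() < keep else 0.0
            for a in activations]

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)

rng = random.Random(0)                       # fixed seed for repeatability
acts = [1.0] * 1000
dropped = dropout(acts, rate=0.5, rng=rng)

# Roughly half the units survive, each scaled from 1.0 up to 2.0.
surviving = sum(1 for a in dropped if a != 0.0)

# The penalty grows with the squared magnitude of the weights.
penalty = l2_penalty([0.5, -1.5, 2.0], lam=0.01)
```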
Overall, the work of Howard et al. (4) is a tour de force and represents a nice example of how complex models intended for medical image classification can be constructed and tested in a manner that increases the likelihood that they will actually be used clinically. This study is an important stepping stone toward realizing the full potential that deep learning can have when applied to medical image data.
∗ Editorials published in JACC: Clinical Electrophysiology reflect the views of the authors and do not necessarily represent the views of JACC: Clinical Electrophysiology or the American College of Cardiology.
Dr. Stultz is a member of the scientific advisory board for UnlearnAI; is a consultant for Peach Intellihealth; and is an adviser for BloomerTech Inc.
The author attests he is in compliance with human studies committees and animal welfare regulations of the author’s institutions and Food and Drug Administration guidelines, including patient consent where appropriate. For more information, visit the JACC: Clinical Electrophysiology author instructions page.
- © 2019 American College of Cardiology Foundation
References
- Singh J.P.
- Deo R.C.
- Ching T., Himmelstein D.S., Beaulieu-Jones B.K., et al.
- Howard J.P., Fisher L., Shun-Shin M.J., et al.
- Rawat W., Wang Z.H.
- Simonyan K., Vedaldi A., Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034 [cs.CV], 2013.