Machine Learning Approaches to Breast Cancer Classification with All of Us Data

Purpose and Overview

This project was my Honors undergraduate thesis at the University of Hawai’i at Mānoa. I used Python to prepare and analyze publicly available breast cancer patient data, then trained machine learning (ML) models to predict whether each patient's cancer was benign or malignant. The data was sourced through the All of Us Research Program, a program organized by the United States National Institutes of Health (NIH) that aggregates, anonymizes, and makes patient health data available for research projects.

The project had three major stages: data cleaning and preparation, exploratory data analysis, and model training and evaluation. The model types used in this study were imported from the scikit-learn (sklearn) library: multilayer perceptron, support vector classifier, random forest, AdaBoost classifier, and gradient boosting classifier.

This webpage summarizes the main points of the project. For a more detailed description of the background research, methodology, and results, the full project write-up is linked at the bottom of the page. Lastly, I especially want to thank my primary mentor, Dr. Peter Washington, and my committee member, Dr. Mahdi Belcaid, for their scholarly insight and support during this project.

Specific Aims

Dataset Source

Data Cleaning and Preparation

Exploratory Data Analysis

Eosinophils and Basophils Levels - Graphs

The primary method of exploratory data analysis was kernel density estimation (KDE) plots, which estimate the general distribution of each data feature. KDE plots were generated for every feature (column) in both the original-type and Fitbit-type datasets. Some features, such as the eosinophil and basophil counts from the selected liquid biopsy data, showed clear separation between the benign and malignant distributions. Many other features, however, had nearly identical distributions for benign and malignant patients. This similarity foreshadowed the inconclusive findings after model evaluation.
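
As an illustration of the per-class KDE comparison described above, here is a minimal sketch. It is not the thesis code: the feature name `eosinophil_count`, the synthetic values, and the helper names `class_densities` and `plot_feature_kde` are all assumptions made for this example, and SciPy's `gaussian_kde` stands in for whatever plotting routine the project actually used.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


def class_densities(values, labels, grid_size=200):
    """Estimate one KDE curve per class label over a shared grid."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grid = np.linspace(values.min(), values.max(), grid_size)
    curves = {}
    for label in np.unique(labels):
        kde = gaussian_kde(values[labels == label])
        curves[label] = kde(grid)
    return grid, curves


def plot_feature_kde(values, labels, feature_name, path):
    """Draw the per-class density curves for one feature and save the figure."""
    grid, curves = class_densities(values, labels)
    fig, ax = plt.subplots()
    for label, density in curves.items():
        ax.plot(grid, density, label=str(label))
    ax.set_xlabel(feature_name)
    ax.set_ylabel("Estimated density")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-in for one feature (hypothetical values, not All of Us data).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.1, 0.03, 300), rng.normal(0.3, 0.05, 300)])
labels = np.array(["benign"] * 300 + ["malignant"] * 300)

grid, curves = class_densities(values, labels)
# plot_feature_kde(values, labels, "eosinophil_count", "eosinophil_kde.png")
```

When the two curves peak at clearly different values, the feature separates the classes in the sense used above. When the curves lie almost on top of each other, the feature carries little signal on its own.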

Model Training and Evaluation
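
The five scikit-learn model types named in the overview can be compared along the following lines. This is a sketch under stated assumptions rather than the thesis pipeline: the synthetic dataset from `make_classification`, the default hyperparameters, the feature scaling for the perceptron and support vector classifier, 5-fold cross-validation, and the accuracy metric are all choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the patient features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumption: scale inputs for the two models that are sensitive to feature scale.
models = {
    "Multilayer perceptron": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
    "Support vector classifier": make_pipeline(StandardScaler(), SVC()),
    "Random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
}

# One array of per-fold accuracy scores for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f}")
```

Keeping the per-fold scores, rather than only their mean, gives a distribution for each model, which is the kind of data a boxplot of classification scores summarizes.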

Model Performance Table

Boxplot of Classification Scores: SVC Nonzero KNN Fitbit-Type
Boxplot of Classification Scores: SVC Nonzero Median Original-Type

Conclusions and Personal Takeaways

Links