#################################################################################################
# NAME : Jeremiah Dy
# ASSIGNMENT: Final Project
# INSTRUCTOR: Peter Fuleky
# DATE: December 5, 2022
#################################################################################################
#################################################################################################
# Research Question:
# To what extent can we predict whether a breast cancer mass is benign or malignant from
# cell features measured via fine needle aspirates (FNAs)?
#
# Part 1 - Preparatory Analysis
# This dataset comprises breast cancer tumor data from the UC Irvine Machine Learning
# repository. Features are computed from digitized images of a fine needle aspirate (FNA) of
# a breast mass and describe the cell nuclei visible in each image.
#
# The dataset is relatively small, comprising only 569 entries (357 benign, 212 malignant).
# It has 32 columns: the first two are ID number and diagnosis (benign/malignant); the
# remaining 30 hold the mean, standard error, and worst (largest) value of ten features
# computed for each image. These values are pre-computed; the original individual image
# samples are unavailable. The dataset is free to use at:
# https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download
#
# Because of its small size, the dataset has relatively low external validity and is prone to
# overfitting, an issue that recurs throughout this report. Additionally, the feature values
# are pre-calculated, so some round-off error can be expected. The mean feature values
# (columns 3-12) are the primary inputs for model building; the "worst" values also enter
# the larger models.
#
# Summaries and histograms of the ten features are available in this section.
#
# Part 2 - Model Building and Selection
# For the first section of Part 2, I built an LPM to estimate the average association between
# the diagnosis and the raw feature variables. Judging by the coefficients, perimeter, area,
# and fractal_dimension are the only raw features associated with a lower probability of a
# malignant tumor. This interpretation is very naive, however, so take it with a grain of
# salt (a hedged logit marginal-effects cross-check is sketched after the LPM table below).
#
# For the second section of Part 2, I built 5 logit models from the raw mean features, the
# "worst" values, and transformed versions (squares and square roots) of the raw features.
# The predictor variables and their coefficients can be inspected in the code below. Of the
# five, logit2 performed best, with a mean RMSE of 0.191 across 5-fold cross-validation. To
# double-check model performance, I ran a LASSO search over the full 40-variable set. LASSO
# produced a model with 17 predictors (3 fewer than logit2's 20) with a 5-fold CV mean RMSE
# of 0.139; its RMSE on the holdout set was 0.165.
#
# Using the LASSO model, I computed confusion matrices at two thresholds: the default 0.5 and
# the mean predicted probability of 0.416 (cm1 and cm2, respectively). The two matrices are
# nearly identical, differing by a single observation. At the mean-probability threshold, the
# LASSO model has an accuracy of 0.9646, a sensitivity of 0.9574, and a specificity of 0.9697.
# Fun fact: while coding on autopilot I wrote code for a calibration curve on the binary
# outcome itself. I left the code in as a reminder for myself if I ever look back at this
# file (a minimal hand-rolled version of the same curve is sketched beside it below).
#
# After those 6 models, I built two random forest models (one for probability predictions and
# one for classification), again feeding in the full set of variables (the same list as the
# LASSO model). The probability forest's 5-fold CV RMSE was 0.173; its RMSE on the holdout
# set was 0.211. The best mtry (number of randomly selected features to split on) was 7, with
# a best minimum node size of 10. A confusion matrix was also computed for the classification
# forest, giving an accuracy of 0.9292, a sensitivity of 0.8936, and a specificity of 0.9545.
# The best mtry for the classification forest was 6, again with a minimum node size of 10.
#
# Part 3 - Conclusions & Write-Up
# Overall, the LASSO and random forest models, unsurprisingly, seem to be the best at
# predicting probabilities and classifying observations (followed by logit2). Both have high
# accuracy, sensitivity, and specificity. While this is encouraging, it is important to note
# that the dataset is remarkably small (under 1,000 observations), which limits its external
# validity. Interestingly, the LASSO placed heavy weight on the radius_worst feature (the
# nonzero coefficients can be listed with the snippet after the LASSO fit below), suggesting
# that the larger the mass, the more likely it is to be malignant. Given my rudimentary
# medical domain knowledge, that seems plausible. Both the rudimentary LPM and the LASSO also
# put notable emphasis on the concave.points feature; perhaps this has some meaning in
# oncology. For future study, I would add ROC curves and perhaps a GBM. I tried coding both
# during this project but could not get either to run; hedged sketches of each are now
# included below (after the cm2 confusion matrix, and at the end of the file, respectively).
# ROC curves would let me optimize the classification threshold rather than picking naive
# values such as the default 0.5 or the mean of the predicted probabilities. Something else
# I'd love to try (though it would be impractical) is fitting a logit on every subset of my
# variables (including the transformed ones) to see whether some combination produces a
# better fit. The LASSO settled on 17 predictors, and logit2, with 20 predictors, performed
# best among the "handmade" logit models, so an upper limit of 20 predictors seems
# reasonable. Of course, this is essentially best-subset selection, an NP-hard problem: the
# quick count sketched right after this header comes to roughly 6 x 10^11 candidate subsets,
# so my laptop would probably overheat and explode long before finishing the model building
# and cross-validation required. Still, food for thought. As for concluding thoughts, working
# through this entire project and in-depth analysis really helped reinforce my command of R
# as well as the theoretical data-analysis concepts learned in the course. I even used a
# naive LPM in one of my other classes (ICS 484); it isn't especially robust, since
# robustness wasn't the point of that project, but it's neat to tie subjects together like
# this. Overall, I think this was an excellent exercise in data analysis and good preparation
# for future projects in data science.
#################################################################################################
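# A quick, self-contained count (base R only) backing up the Part 3 claim that trying every
# subset of predictors is impractical. A sketch: the 40 variables are the 10 raw means,
# 10 "worst" values, 10 squares, and 10 square roots collected in vars4 below, and the cap
# of 20 follows logit2; n_vars/max_size are names introduced here for illustration.
n_vars <- 40
max_size <- 20
sum(choose(n_vars, 1:max_size))   # ~6.2e11 candidate subsets, before any cross-validation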
#### SET UP
# It is advised to start a new session for every case study
# CLEAR MEMORY
rm(list=ls())
library(haven)
library(modelsummary)
library(fixest)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
library(purrr)
library(margins)
library(skimr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:modelsummary':
##
## Mean
## The following objects are masked from 'package:base':
##
## format.pval, units
library(cowplot)
library(gmodels)
library(lspline)
library(sandwich)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
##
## Attaching package: 'bitops'
## The following object is masked from 'package:Matrix':
##
## %&%
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ranger)
##
## Attaching package: 'ranger'
## The following object is masked from 'package:rattle':
##
## importance
library(rpart)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(rpart.plot)
source("ch00-tech-prep/theme_bg.R")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ dplyr 1.0.9
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ✖ tidyr::unpack() masks Matrix::unpack()
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:fixest':
##
## pvalue
source("ch00-tech-prep/da_helper_functions.R")
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Local path
data_raw <- read.csv("/Users/jeremiah/Desktop/UHM/C-Junior 2022-2023/ECON 427/data.csv")
############################################ Part 1 #############################################
### Data prep ###
# Expand raw variables for future model building
data_raw <- data_raw %>% mutate (log_radius_mean = log(radius_mean),
log_texture_mean = log(texture_mean),
log_perimeter_mean = log(perimeter_mean),
log_area_mean = log(area_mean),
log_smoothness_mean = log(smoothness_mean),
log_compactness_mean = log(compactness_mean),
log_concavity_mean = log(concavity_mean),
log_concave.points_mean = log(concave.points_mean),
log_symmetry_mean = log(symmetry_mean),
log_fractal_dimension_mean = log(fractal_dimension_mean))
data_raw <- data_raw %>% mutate (sq_radius_mean = (radius_mean)^2,
sq_texture_mean = (texture_mean)^2,
sq_perimeter_mean = (perimeter_mean)^2,
sq_area_mean = (area_mean)^2,
sq_smoothness_mean = (smoothness_mean)^2,
sq_compactness_mean = (compactness_mean)^2,
sq_concavity_mean = (concavity_mean)^2,
sq_concave.points_mean = (concave.points_mean)^2,
sq_symmetry_mean = (symmetry_mean)^2,
sq_fractal_dimension_mean = (fractal_dimension_mean)^2)
data_raw <- data_raw %>% mutate (sqrt_radius_mean = sqrt(radius_mean),
sqrt_texture_mean = sqrt(texture_mean),
sqrt_perimeter_mean = sqrt(perimeter_mean),
sqrt_area_mean = sqrt(area_mean),
sqrt_smoothness_mean = sqrt(smoothness_mean),
sqrt_compactness_mean = sqrt(compactness_mean),
sqrt_concavity_mean = sqrt(concavity_mean),
sqrt_concave.points_mean = sqrt(concave.points_mean),
sqrt_symmetry_mean = sqrt(symmetry_mean),
sqrt_fractal_dimension_mean = sqrt(fractal_dimension_mean))
# Keep only the ID, diagnosis, mean, and "worst" feature columns. This drops the unnamed
# trailing column (an artifact of the CSV conversion; it contains no useful data) and the
# standard-error columns, which will not be considered for model building.
keep_cols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean", "area_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean", "radius_worst","texture_worst","perimeter_worst", "area_worst",
"smoothness_worst","compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
data <- select(data_raw, all_of(keep_cols))
# Convert binary character variable "diagnosis" into numeric binary variable for LPM
# Malignant: 'M' -> 1
# Benign : 'B' -> 0
lookup <- c('B' = 0, 'M' = 1)
data$diagnosis <- lookup[data_raw$diagnosis]
# Sanity check: check number of MALIGNANT rows (212 rows)
count(filter(data, diagnosis == 1))
## n
## 1 212
# Sanity check: check number of BENIGN rows (357 rows)
count(filter(data, diagnosis == 0))
## n
## 1 357
### Raw variable summary ###
# Radius
summary(data$radius_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.981 11.700 13.370 14.127 15.780 28.110
radius_hist <- ggplot(data=data, aes(x=radius_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "radius_mean",y = "Percent")+
theme_bg()
radius_hist
# Texture
summary(data$texture_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.71 16.17 18.84 19.29 21.80 39.28
texture_hist <- ggplot(data=data, aes(x=texture_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "texture_mean",y = "Percent")+
theme_bg()
texture_hist
# Perimeter
summary(data$perimeter_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.79 75.17 86.24 91.97 104.10 188.50
perimeter_hist <- ggplot(data=data, aes(x=perimeter_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "perimeter_mean",y = "Percent")+
theme_bg()
perimeter_hist
# Area
summary(data$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 143.5 420.3 551.1 654.9 782.7 2501.0
area_hist <- ggplot(data=data, aes(x=area_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "area_mean",y = "Percent")+
theme_bg()
area_hist
# Smoothness
summary(data$smoothness_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05263 0.08637 0.09587 0.09636 0.10530 0.16340
smoothness_hist <- ggplot(data=data, aes(x=smoothness_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "smoothness_mean",y = "Percent")+
theme_bg()
smoothness_hist
# Compactness
summary(data$compactness_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01938 0.06492 0.09263 0.10434 0.13040 0.34540
compactness_hist <- ggplot(data=data, aes(x=compactness_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "compactness_mean",y = "Percent")+
theme_bg()
compactness_hist
# Concavity
summary(data$concavity_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02956 0.06154 0.08880 0.13070 0.42680
concavity_hist <- ggplot(data=data, aes(x=concavity_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "concavity_mean",y = "Percent")+
theme_bg()
concavity_hist
# Concave.points
summary(data$concave.points_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02031 0.03350 0.04892 0.07400 0.20120
concave.points_hist <- ggplot(data=data, aes(x=concave.points_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "concave.points_mean",y = "Percent")+
theme_bg()
concave.points_hist
# Symmetry
summary(data$symmetry_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1060 0.1619 0.1792 0.1812 0.1957 0.3040
symmetry_hist <- ggplot(data=data, aes(x=symmetry_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "symmetry_mean",y = "Percent")+
theme_bg()
symmetry_hist
# Fractal Dimension
summary(data$fractal_dimension_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04996 0.05770 0.06154 0.06280 0.06612 0.09744
fractal_dimension_hist <- ggplot(data=data, aes(x=fractal_dimension_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "fractal_dimension_mean",y = "Percent")+
theme_bg()
fractal_dimension_hist
############################################ Part 2 #############################################
# Build LPM to get average association between diagnosis and raw variables
lpm <- feols( diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + smoothness_mean +
compactness_mean + concavity_mean + concave.points_mean + symmetry_mean + fractal_dimension_mean,
data = data, vcov = 'hetero' )
etable(lpm, digits="r3")
## lpm
## Dependent Var.: diagnosis
##
## Constant -2.052*** (0.410)
## radius_mean 0.490*** (0.129)
## texture_mean 0.022*** (0.003)
## perimeter_mean -0.055** (0.020)
## area_mean -0.001*** (0.000)
## smoothness_mean 1.941 (1.277)
## compactness_mean 0.097 (1.127)
## concavity_mean 0.810 (0.634)
## concave.points_mean 6.431*** (1.426)
## symmetry_mean 1.012. (0.529)
## fractal_dimension_mean -0.119 (4.128)
## ______________________ _________________
## S.E. type Heteroskeda.-rob.
## Observations 569
## R2 0.68276
## Adj. R2 0.67708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
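# As a cross-check on the naive LPM above (a sketch, not part of the original model pipeline),
# average marginal effects from a plain logit can be computed with the already-attached
# margins package; logit_raw is a name introduced here for illustration.
logit_raw <- glm(diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean +
                 smoothness_mean + compactness_mean + concavity_mean + concave.points_mean +
                 symmetry_mean + fractal_dimension_mean,
                 data = data, family = binomial(link = "logit"))
summary(margins(logit_raw))   # average change in P(malignant) per one-unit feature change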
# Create list of variables to build models on
X1 <- c("radius_mean","texture_mean","perimeter_mean", "area_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean")
X2 <- c("radius_worst","texture_worst","perimeter_worst", "area_worst",
"smoothness_worst","compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
X3 <- c("sq_radius_mean","sq_texture_mean","sq_perimeter_mean", "sq_area_mean",
"sq_smoothness_mean","sq_compactness_mean","sq_concavity_mean","sq_concave.points_mean",
"sq_symmetry_mean","sq_fractal_dimension_mean")
X4 <- c("sqrt_radius_mean","sqrt_texture_mean","sqrt_perimeter_mean", "sqrt_area_mean",
"sqrt_smoothness_mean","sqrt_compactness_mean","sqrt_concavity_mean","sqrt_concave.points_mean",
"sqrt_symmetry_mean","sqrt_fractal_dimension_mean")
vars0.8 <- c("radius_mean","perimeter_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean")
vars1 <- c(X1)
vars2 <- c(X1, X2)
vars3 <- c(X1, X2, X3)
vars4 <- c(X1, X2, X3, X4)
# Train-test split
set.seed(13505)
train_indices <- as.integer(createDataPartition(data$diagnosis, p = 0.8, list = FALSE))
data_train <- data_raw[train_indices, ]
data_holdout <- data_raw[-train_indices, ]
### Build logit models ###
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE
)
logit_model_vars <- list("logit0" = vars0.8, "logit1" = vars1, "logit2" = vars2, "logit3" = vars3, "logit4" = vars4)
CV_RMSE_folds <- list()
logit_models <- list()
for (model_name in names(logit_model_vars)) {
features <- logit_model_vars[[model_name]]
set.seed(13505)
glm_model <- train(
formula(paste0("diagnosis ~", paste0(features, collapse = " + "))),
method = "glm",
data = data_train,
family = binomial,
trControl = train_control
)
logit_models[[model_name]] <- glm_model
# Calculate RMSE on test for each fold
CV_RMSE_folds[[model_name]] <- glm_model$resample[,c("Resample", "RMSE")]
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# (The warnings above repeat many times across the five folds, chiefly for the larger models
# logit3 and logit4: with 30-40 correlated transformed predictors the logit achieves perfect
# separation on some folds and the design matrix is rank-deficient -- further evidence of
# overfitting.)
# LASSO Logit
lambda <- 10^seq(-1, -4, length = 10)
grid <- expand.grid("alpha" = 1, lambda = lambda)
set.seed(13505)
system.time({
logit_lasso_model <- train(
formula(paste0("diagnosis ~", paste0(vars4, collapse = " + "))),
data = data_train,
method = "glmnet",
preProcess = c("center", "scale"),
family = "binomial",
trControl = train_control,
tuneGrid = grid,
na.action=na.exclude
)
})
## user system elapsed
## 1.033 0.062 1.178
tuned_logit_lasso_model <- logit_lasso_model$finalModel
best_lambda <- logit_lasso_model$bestTune$lambda
logit_models[["LASSO"]] <- logit_lasso_model
lasso_coeffs <- as.matrix(coef(tuned_logit_lasso_model, best_lambda))
CV_RMSE_folds[["LASSO"]] <- logit_lasso_model$resample[,c("Resample", "RMSE")]
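# Sketch: list the predictors LASSO actually kept (relevant to the radius_worst and
# concave.points discussion in Part 3), sorted by coefficient magnitude; nz is a name
# introduced here for illustration.
nz <- lasso_coeffs[lasso_coeffs[, 1] != 0, , drop = FALSE]
nz[order(abs(nz[, 1]), decreasing = TRUE), , drop = FALSE]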
# Calculate the mean CV RMSE of each model across the 5 folds
CV_RMSE <- list()
for (model_name in names(logit_models)) {
CV_RMSE[[model_name]] <- mean(CV_RMSE_folds[[model_name]]$RMSE)
}
# We have 6 models (5 logit and the logit LASSO). For each we have a 5-fold CV RMSE;
# we pick our preferred model based on that. ------------------------------------------------
data_holdout_numeric <- data_holdout
lookup <- c('B' = 0, 'M' = 1)
data_holdout_numeric$diagnosis <- lookup[data_holdout$diagnosis]
nvars <- lapply(logit_models, FUN = function(x) length(x$coefnames))
nvars[["LASSO"]] <- sum(lasso_coeffs != 0)
logit_summary1 <- data.frame("Number of predictors" = unlist(nvars),
"CV RMSE" = unlist(CV_RMSE))
print(logit_summary1)
## Number.of.predictors CV.RMSE
## logit0 8 0.2466331
## logit1 10 0.2303568
## logit2 20 0.1906512
## logit3 30 0.2589467
## logit4 40 0.2619661
## LASSO 17 0.1391337
best_logit_no_loss <- logit_models[["LASSO"]]
logit_predicted_probabilities_holdout <- predict(best_logit_no_loss, newdata = data_holdout_numeric, type = "prob")
data_holdout_numeric[,"best_logit_no_loss_pred"] <- logit_predicted_probabilities_holdout[,'M']
RMSE(data_holdout_numeric[, "best_logit_no_loss_pred", drop=TRUE], data_holdout_numeric$diagnosis)
## [1] 0.1651481
# Confusion Matrix for best model (LASSO) on holdout
logit_predicted_class_holdout <- predict(best_logit_no_loss, newdata = data_holdout, type = "raw")
cm_object1 <- confusionMatrix(logit_predicted_class_holdout, as.factor(data_holdout$diagnosis), positive = "M")
cm1 <- cm_object1$table
cm1
## Reference
## Prediction B M
## B 65 2
## M 1 45
# Confusion Matrix for best model (LASSO) on holdout with mean probability as threshold
mean_predicted_malignant_prob <- mean(data_holdout_numeric$best_logit_no_loss_pred)
mean_predicted_malignant_prob
## [1] 0.4155854
holdout_prediction <-
ifelse(data_holdout_numeric$best_logit_no_loss_pred < mean_predicted_malignant_prob, "B", "M") %>%
factor(levels = c("B", "M"))
cm_object2 <- confusionMatrix(holdout_prediction, as.factor(data_holdout$diagnosis), positive = "M")
cm2 <- cm_object2$table
cm2
## Reference
## Prediction B M
## B 64 2
## M 2 45
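# Sketch of the ROC analysis flagged as future work in Part 3, using the already-attached
# pROC package on the LASSO holdout predictions; roc_lasso is a name introduced here for
# illustration. coords() returns the Youden-optimal threshold, an alternative to the naive
# 0.5 and mean-probability cutoffs above.
roc_lasso <- roc(response = data_holdout_numeric$diagnosis,
                 predictor = data_holdout_numeric$best_logit_no_loss_pred)
auc(roc_lasso)
coords(roc_lasso, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))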
# Probably overfitting? Calibration Curve is a little wonky
output <- "/Users/jeremiah/Desktop/UHM/C-Junior 2022-2023/ECON 427/"
create_calibration_plot(data_holdout_numeric,
file_name = "427-calibration-curve-logit-no-loss",
prob_var = "best_logit_no_loss_pred",
actual_var = "diagnosis",
n_bins = 10)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Removed 1 row(s) containing missing values (geom_path).
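# For reference, a minimal hand-rolled calibration curve (deciles of predicted probability
# vs. the observed malignant share); dplyr and ggplot2 are already attached, and calib is a
# name introduced here for illustration. A well-calibrated model's points should hug the
# dashed 45-degree line.
calib <- data_holdout_numeric %>%
  mutate(bin = ntile(best_logit_no_loss_pred, 10)) %>%
  group_by(bin) %>%
  summarise(mean_pred = mean(best_logit_no_loss_pred),
            share_malignant = mean(diagnosis))
ggplot(calib, aes(x = mean_pred, y = share_malignant)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Mean predicted probability", y = "Observed share malignant")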
### Random Forests ###
# Probability Random Forest
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # same as probability = TRUE in ranger
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE
)
train_control$verboseIter <- TRUE
tune_grid <- expand.grid(
.mtry = c(5, 6, 7),
.splitrule = "gini",
.min.node.size = c(10, 15)
)
# getModelInfo("ranger")
set.seed(13505)
rf_model_p <- train(
formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
method = "ranger",
data = data_train,
tuneGrid = tune_grid,
trControl = train_control
)
## (verbose fold-by-fold training log omitted: 5 folds x 6 tuning combinations)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = gini, min.node.size = 10 on full training set
rf_model_p$results
## mtry splitrule min.node.size Accuracy Kappa RMSE AccuracySD
## 1 5 gini 10 0.9539417 0.8993422 0.1754368 0.03242173
## 2 5 gini 15 0.9517678 0.8945893 0.1788756 0.03256956
## 3 6 gini 10 0.9517917 0.8950499 0.1743855 0.03250347
## 4 6 gini 15 0.9517917 0.8957472 0.1780799 0.02958612
## 5 7 gini 10 0.9539895 0.9000295 0.1727440 0.03322637
## 6 7 gini 15 0.9495939 0.8905176 0.1776455 0.03603859
## KappaSD RMSESD
## 1 0.07173141 0.04928049
## 2 0.07180542 0.04821365
## 3 0.07060087 0.04968911
## 4 0.06331282 0.04807245
## 5 0.07210735 0.05031556
## 6 0.07798343 0.05263144
best_mtry_p <- rf_model_p$bestTune$mtry
best_min_node_size_p <- rf_model_p$bestTune$min.node.size
CV_RMSE_folds[["rf_p"]] <- rf_model_p$resample[,c("Resample", "RMSE")]
CV_RMSE[["rf_p"]] <- mean(CV_RMSE_folds[["rf_p"]]$RMSE)
# Calculate RMSE of Random Forest probability model on holdout set
rf_predicted_probabilities_holdout <- predict(rf_model_p, newdata = data_holdout_numeric, type = "prob")
data_holdout$rf_p_prediction <- rf_predicted_probabilities_holdout[,"M"]
RMSE(data_holdout$rf_p_prediction, data_holdout_numeric$diagnosis)
## [1] 0.2114045
# Classification Random Forest
train_control <- trainControl(
method = "cv",
number = 5
)
train_control$verboseIter <- TRUE
set.seed(13505)
rf_model_f <- train(
formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
method = "ranger",
data = data_train,
tuneGrid = tune_grid,
trControl = train_control
)
## (verbose fold-by-fold training log omitted: 5 folds x 6 tuning combinations)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 6, splitrule = gini, min.node.size = 10 on full training set
data_train$rf_f_prediction_class <- predict(rf_model_f,type = "raw")
data_holdout$rf_f_prediction_class <- predict(rf_model_f, newdata = data_holdout, type = "raw")
rf_model_f$results
## mtry splitrule min.node.size Accuracy Kappa AccuracySD KappaSD
## 1 5 gini 10 0.9539417 0.8993422 0.03242173 0.07173141
## 2 5 gini 15 0.9473961 0.8851581 0.03234154 0.07094166
## 3 6 gini 10 0.9539895 0.9000295 0.03322637 0.07210735
## 4 6 gini 15 0.9539656 0.9001988 0.02839942 0.06108399
## 5 7 gini 10 0.9539895 0.9000295 0.03322637 0.07210735
## 6 7 gini 15 0.9539895 0.9005328 0.03322637 0.07130251
best_mtry_f <- rf_model_f$bestTune$mtry
best_min_node_size_f <- rf_model_f$bestTune$min.node.size
cm_object3 <- confusionMatrix(data_holdout$rf_f_prediction_class, as.factor(data_holdout$diagnosis), positive = "M")
cm3 <- cm_object3$table
cm3
## Reference
## Prediction B M
## B 63 5
## M 3 42
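# Sketch: impurity-based variable importance from ranger, to compare the forest's rankings
# with the LASSO coefficients discussed in Part 3. This refits the classification forest with
# importance = "impurity" passed through train() to ranger(); rf_model_imp is a name
# introduced here for illustration.
set.seed(13505)
rf_model_imp <- train(
  formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
  method = "ranger",
  data = data_train,
  tuneGrid = tune_grid,
  trControl = train_control,
  importance = "impurity"
)
varImp(rf_model_imp)   # scaled 0-100 importance per predictor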
# Generate overall summary table
nvars[["rf_p"]] <- length(vars4)
summary_results <- data.frame("Number of predictors" = unlist(nvars),"CV RMSE" = unlist(CV_RMSE))
model_names <- c("Logit 0","Logit 1", "Logit 2","Logit 3","Logit 4",
"Logit LASSO","RF probability")
summary_results <- summary_results %>%
filter(rownames(.) %in% c("logit0","logit1", "logit2","logit3","logit4", "LASSO", "rf_p"))
print(summary_results)
## Number.of.predictors CV.RMSE
## logit0 8 0.2466331
## logit1 10 0.2303568
## logit2 20 0.1906512
## logit3 30 0.2589467
## logit4 40 0.2619661
## LASSO 17 0.1391337
## rf_p 40 0.1727440
rownames(summary_results) <- model_names
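# Sketch of the GBM mentioned as future work in Part 3. Untested here: caret's "gbm" method
# requires the gbm package, which this script does not load, and gbm_grid/gbm_model are names
# introduced for illustration only. It slots into the same train()/trainControl() workflow as
# the models above.
gbm_grid <- expand.grid(n.trees = c(100, 300),
                        interaction.depth = c(2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
set.seed(13505)
gbm_model <- train(
  formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
  method = "gbm",
  data = data_train,
  tuneGrid = gbm_grid,
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE
)
gbm_model$results   # CV accuracy per tuning combination, comparable to the tables above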