#################################################################################################
# NAME : Jeremiah Dy
# ASSIGNMENT: Final Project
# INSTRUCTOR: Peter Fuleky
# DATE: December 5, 2022
#################################################################################################
#################################################################################################
# Research Question:
# To what extent can we predict whether a breast cancer mass is benign or malignant from
# cell features measured via fine needle aspirates (FNAs)?
#
# Part 1 - Preparatory Analysis
# This dataset comprises breast cancer tumor data from the UC Irvine Machine Learning
# repository. Features are computed from digitized images of a fine needle aspirate (FNA) of
# a breast mass and describe the cell nuclei visible in each image.
#
# The dataset is relatively small, comprising only 569 entries (357 benign, 212 malignant).
# It has 32 columns: the first two are ID number and diagnosis (benign/malignant); the
# remaining 30 hold the mean, standard error, and worst (largest) value of ten features
# computed for each image. These values are pre-computed; the original individual image
# samples are unavailable. The dataset is free to use at:
# https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download
#
# Because of its small size, the dataset has relatively low external validity and is prone to
# overfitting, an issue that recurs throughout this report. Additionally, the feature values
# are pre-calculated, so some round-off error can be expected. The mean feature values
# (columns 3-12) are the primary inputs for model building; the "worst" values also enter
# the larger models.
#
# Summaries and histograms of the ten features are available in this section.
#
# Part 2 - Model Building and Selection
# For the first section of Part 2, I built an LPM to estimate the average association between
# the diagnosis and the raw feature variables. Judging by the coefficients, perimeter, area,
# and fractal_dimension are the only raw features associated with a lower probability of a
# malignant tumor. This interpretation is very naive, however, so take it with a grain of
# salt (a hedged logit marginal-effects cross-check is sketched after the LPM table below).
#
# For the second section of Part 2, I built 5 logit models from the raw mean features, the
# "worst" values, and transformed versions (squares and square roots) of the raw features.
# The predictor variables and their coefficients can be inspected in the code below. Of the
# five, logit2 performed best, with a mean RMSE of 0.191 across 5-fold cross-validation. To
# double-check model performance, I ran a LASSO search over the full 40-variable set. LASSO
# produced a model with 17 predictors (3 fewer than logit2's 20) with a 5-fold CV mean RMSE
# of 0.139; its RMSE on the holdout set was 0.165.
#
# Using the LASSO model, I computed confusion matrices at two thresholds: the default 0.5 and
# the mean predicted probability of 0.416 (cm1 and cm2, respectively). The two matrices are
# nearly identical, differing by a single observation. At the mean-probability threshold, the
# LASSO model has an accuracy of 0.9646, a sensitivity of 0.9574, and a specificity of 0.9697.
# Fun fact: while coding on autopilot I wrote code for a calibration curve on the binary
# outcome itself. I left the code in as a reminder for myself if I ever look back at this
# file (a minimal hand-rolled version of the same curve is sketched beside it below).
#
# After those 6 models, I built two random forest models (one for probability predictions and
# one for classification), again feeding in the full set of variables (the same list as the
# LASSO model). The probability forest's 5-fold CV RMSE was 0.173; its RMSE on the holdout
# set was 0.211. The best mtry (number of randomly selected features to split on) was 7, with
# a best minimum node size of 10. A confusion matrix was also computed for the classification
# forest, giving an accuracy of 0.9292, a sensitivity of 0.8936, and a specificity of 0.9545.
# The best mtry for the classification forest was 6, again with a minimum node size of 10.
#
# Part 3 - Conclusions & Write-Up
# Overall, the LASSO and random forest models, unsurprisingly, seem to be the best at
# predicting probabilities and classifying observations (followed by logit2). Both have high
# accuracy, sensitivity, and specificity. While this is encouraging, it is important to note
# that the dataset is remarkably small (under 1,000 observations), which limits its external
# validity. Interestingly, the LASSO placed heavy weight on the radius_worst feature (the
# nonzero coefficients can be listed with the snippet after the LASSO fit below), suggesting
# that the larger the mass, the more likely it is to be malignant. Given my rudimentary
# medical domain knowledge, that seems plausible. Both the rudimentary LPM and the LASSO also
# put notable emphasis on the concave.points feature; perhaps this has some meaning in
# oncology. For future study, I would add ROC curves and perhaps a GBM. I tried coding both
# during this project but could not get either to run; hedged sketches of each are now
# included below (after the cm2 confusion matrix, and at the end of the file, respectively).
# ROC curves would let me optimize the classification threshold rather than picking naive
# values such as the default 0.5 or the mean of the predicted probabilities. Something else
# I'd love to try (though it would be impractical) is fitting a logit on every subset of my
# variables (including the transformed ones) to see whether some combination produces a
# better fit. The LASSO settled on 17 predictors, and logit2, with 20 predictors, performed
# best among the "handmade" logit models, so an upper limit of 20 predictors seems
# reasonable. Of course, this is essentially best-subset selection, an NP-hard problem: the
# quick count sketched right after this header comes to roughly 6 x 10^11 candidate subsets,
# so my laptop would probably overheat and explode long before finishing the model building
# and cross-validation required. Still, food for thought. As for concluding thoughts, working
# through this entire project and in-depth analysis really helped reinforce my command of R
# as well as the theoretical data-analysis concepts learned in the course. I even used a
# naive LPM in one of my other classes (ICS 484); it isn't especially robust, since
# robustness wasn't the point of that project, but it's neat to tie subjects together like
# this. Overall, I think this was an excellent exercise in data analysis and good preparation
# for future projects in data science.
#################################################################################################
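# A quick, self-contained count (base R only) backing up the Part 3 claim that trying every
# subset of predictors is impractical. A sketch: the 40 variables are the 10 raw means,
# 10 "worst" values, 10 squares, and 10 square roots collected in vars4 below, and the cap
# of 20 follows logit2; n_vars/max_size are names introduced here for illustration.
n_vars <- 40
max_size <- 20
sum(choose(n_vars, 1:max_size))   # ~6.2e11 candidate subsets, before any cross-validation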
#### SET UP
# It is advised to start a new session for every case study
# CLEAR MEMORY
rm(list=ls())
library(haven)
library(modelsummary)
library(fixest)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
library(purrr)
library(margins)
library(skimr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:modelsummary':
##
## Mean
## The following objects are masked from 'package:base':
##
## format.pval, units
library(cowplot)
library(gmodels)
library(lspline)
library(sandwich)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
##
## Attaching package: 'bitops'
## The following object is masked from 'package:Matrix':
##
## %&%
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ranger)
##
## Attaching package: 'ranger'
## The following object is masked from 'package:rattle':
##
## importance
library(rpart)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(rpart.plot)
source("ch00-tech-prep/theme_bg.R")
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ dplyr 1.0.9
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ✖ tidyr::unpack() masks Matrix::unpack()
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:fixest':
##
## pvalue
source("ch00-tech-prep/da_helper_functions.R")
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Local path
data_raw <- read.csv("/Users/jeremiah/Desktop/UHM/C-Junior 2022-2023/ECON 427/data.csv")
############################################ Part 1 #############################################
### Data prep ###
# Expand raw variables for future model building
data_raw <- data_raw %>% mutate (log_radius_mean = log(radius_mean),
log_texture_mean = log(texture_mean),
log_perimeter_mean = log(perimeter_mean),
log_area_mean = log(area_mean),
log_smoothness_mean = log(smoothness_mean),
log_compactness_mean = log(compactness_mean),
log_concavity_mean = log(concavity_mean),
log_concave.points_mean = log(concave.points_mean),
log_symmetry_mean = log(symmetry_mean),
log_fractal_dimension_mean = log(fractal_dimension_mean))
data_raw <- data_raw %>% mutate (sq_radius_mean = (radius_mean)^2,
sq_texture_mean = (texture_mean)^2,
sq_perimeter_mean = (perimeter_mean)^2,
sq_area_mean = (area_mean)^2,
sq_smoothness_mean = (smoothness_mean)^2,
sq_compactness_mean = (compactness_mean)^2,
sq_concavity_mean = (concavity_mean)^2,
sq_concave.points_mean = (concave.points_mean)^2,
sq_symmetry_mean = (symmetry_mean)^2,
sq_fractal_dimension_mean = (fractal_dimension_mean)^2)
data_raw <- data_raw %>% mutate (sqrt_radius_mean = sqrt(radius_mean),
sqrt_texture_mean = sqrt(texture_mean),
sqrt_perimeter_mean = sqrt(perimeter_mean),
sqrt_area_mean = sqrt(area_mean),
sqrt_smoothness_mean = sqrt(smoothness_mean),
sqrt_compactness_mean = sqrt(compactness_mean),
sqrt_concavity_mean = sqrt(concavity_mean),
sqrt_concave.points_mean = sqrt(concave.points_mean),
sqrt_symmetry_mean = sqrt(symmetry_mean),
sqrt_fractal_dimension_mean = sqrt(fractal_dimension_mean))
# Keep only the ID, diagnosis, mean, and "worst" feature columns. This drops the unnamed
# trailing column (an artifact of the CSV conversion; it contains no useful data) and the
# standard-error columns, which will not be considered for model building.
keep_cols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean", "area_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean", "radius_worst","texture_worst","perimeter_worst", "area_worst",
"smoothness_worst","compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
data <- select(data_raw, all_of(keep_cols))
# Convert binary character variable "diagnosis" into numeric binary variable for LPM
# Malignant: 'M' -> 1
# Benign : 'B' -> 0
lookup <- c('B' = 0, 'M' = 1)
data$diagnosis <- lookup[data_raw$diagnosis]
# Sanity check: check number of MALIGNANT rows (212 rows)
count(filter(data, diagnosis == 1))
## n
## 1 212
# Sanity check: check number of BENIGN rows (357 rows)
count(filter(data, diagnosis == 0))
## n
## 1 357
### Raw variable summary ###
# Radius
summary(data$radius_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.981 11.700 13.370 14.127 15.780 28.110
radius_hist <- ggplot(data=data, aes(x=radius_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "radius_mean",y = "Percent")+
theme_bg()
radius_hist
# Texture
summary(data$texture_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.71 16.17 18.84 19.29 21.80 39.28
texture_hist <- ggplot(data=data, aes(x=texture_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "texture_mean",y = "Percent")+
theme_bg()
texture_hist
# Perimeter
summary(data$perimeter_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.79 75.17 86.24 91.97 104.10 188.50
perimeter_hist <- ggplot(data=data, aes(x=perimeter_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "perimeter_mean",y = "Percent")+
theme_bg()
perimeter_hist
# Area
summary(data$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 143.5 420.3 551.1 654.9 782.7 2501.0
area_hist <- ggplot(data=data, aes(x=area_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "area_mean",y = "Percent")+
theme_bg()
area_hist
# Smoothness
summary(data$smoothness_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05263 0.08637 0.09587 0.09636 0.10530 0.16340
smoothness_hist <- ggplot(data=data, aes(x=smoothness_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "smoothness_mean",y = "Percent")+
theme_bg()
smoothness_hist
# Compactness
summary(data$compactness_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01938 0.06492 0.09263 0.10434 0.13040 0.34540
compactness_hist <- ggplot(data=data, aes(x=compactness_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "compactness_mean",y = "Percent")+
theme_bg()
compactness_hist
# Concavity
summary(data$concavity_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02956 0.06154 0.08880 0.13070 0.42680
concavity_hist <- ggplot(data=data, aes(x=concavity_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "concavity_mean",y = "Percent")+
theme_bg()
concavity_hist
# Concave.points
summary(data$concave.points_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02031 0.03350 0.04892 0.07400 0.20120
concave.points_hist <- ggplot(data=data, aes(x=concave.points_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "concave.points_mean",y = "Percent")+
theme_bg()
concave.points_hist
# Symmetry
summary(data$symmetry_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1060 0.1619 0.1792 0.1812 0.1957 0.3040
symmetry_hist <- ggplot(data=data, aes(x=symmetry_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "symmetry_mean",y = "Percent")+
theme_bg()
symmetry_hist
# Fractal Dimension
summary(data$fractal_dimension_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04996 0.05770 0.06154 0.06280 0.06612 0.09744
fractal_dimension_hist <- ggplot(data=data, aes(x=fractal_dimension_mean)) +
geom_histogram_da(type="percent", bins = 20) +
labs(x = "fractal_dimension_mean",y = "Percent")+
theme_bg()
fractal_dimension_hist
############################################ Part 2 #############################################
# Build LPM to get average association between diagnosis and raw variables
lpm <- feols( diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + smoothness_mean +
compactness_mean + concavity_mean + concave.points_mean + symmetry_mean + fractal_dimension_mean,
data = data, vcov = 'hetero' )
etable(lpm, digits="r3")
## lpm
## Dependent Var.: diagnosis
##
## Constant -2.052*** (0.410)
## radius_mean 0.490*** (0.129)
## texture_mean 0.022*** (0.003)
## perimeter_mean -0.055** (0.020)
## area_mean -0.001*** (0.000)
## smoothness_mean 1.941 (1.277)
## compactness_mean 0.097 (1.127)
## concavity_mean 0.810 (0.634)
## concave.points_mean 6.431*** (1.426)
## symmetry_mean 1.012. (0.529)
## fractal_dimension_mean -0.119 (4.128)
## ______________________ _________________
## S.E. type Heteroskeda.-rob.
## Observations 569
## R2 0.68276
## Adj. R2 0.67708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
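# As a cross-check on the naive LPM above (a sketch, not part of the original model pipeline),
# average marginal effects from a plain logit can be computed with the already-attached
# margins package; logit_raw is a name introduced here for illustration.
logit_raw <- glm(diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean +
                 smoothness_mean + compactness_mean + concavity_mean + concave.points_mean +
                 symmetry_mean + fractal_dimension_mean,
                 data = data, family = binomial(link = "logit"))
summary(margins(logit_raw))   # average change in P(malignant) per one-unit feature change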
# Create list of variables to build models on
X1 <- c("radius_mean","texture_mean","perimeter_mean", "area_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean")
X2 <- c("radius_worst","texture_worst","perimeter_worst", "area_worst",
"smoothness_worst","compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
X3 <- c("sq_radius_mean","sq_texture_mean","sq_perimeter_mean", "sq_area_mean",
"sq_smoothness_mean","sq_compactness_mean","sq_concavity_mean","sq_concave.points_mean",
"sq_symmetry_mean","sq_fractal_dimension_mean")
X4 <- c("sqrt_radius_mean","sqrt_texture_mean","sqrt_perimeter_mean", "sqrt_area_mean",
"sqrt_smoothness_mean","sqrt_compactness_mean","sqrt_concavity_mean","sqrt_concave.points_mean",
"sqrt_symmetry_mean","sqrt_fractal_dimension_mean")
vars0.8 <- c("radius_mean","perimeter_mean",
"smoothness_mean","compactness_mean","concavity_mean","concave.points_mean",
"symmetry_mean","fractal_dimension_mean")
vars1 <- c(X1)
vars2 <- c(X1, X2)
vars3 <- c(X1, X2, X3)
vars4 <- c(X1, X2, X3, X4)
# Train-test split
set.seed(13505)
train_indices <- as.integer(createDataPartition(data$diagnosis, p = 0.8, list = FALSE))
data_train <- data_raw[train_indices, ]
data_holdout <- data_raw[-train_indices, ]
### Build logit models ###
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE
)
logit_model_vars <- list("logit0" = vars0.8, "logit1" = vars1, "logit2" = vars2, "logit3" = vars3, "logit4" = vars4)
CV_RMSE_folds <- list()
logit_models <- list()
for (model_name in names(logit_model_vars)) {
features <- logit_model_vars[[model_name]]
set.seed(13505)
glm_model <- train(
formula(paste0("diagnosis ~", paste0(features, collapse = " + "))),
method = "glm",
data = data_train,
family = binomial,
trControl = train_control
)
logit_models[[model_name]] <- glm_model
# Calculate RMSE on test for each fold
CV_RMSE_folds[[model_name]] <- glm_model$resample[,c("Resample", "RMSE")]
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# (The warnings above repeat many times across the five folds, chiefly for the larger models
# logit3 and logit4: with 30-40 correlated transformed predictors the logit achieves perfect
# separation on some folds and the design matrix is rank-deficient -- further evidence of
# overfitting.)
# LASSO Logit
lambda <- 10^seq(-1, -4, length = 10)
grid <- expand.grid("alpha" = 1, lambda = lambda)
set.seed(13505)
system.time({
logit_lasso_model <- train(
formula(paste0("diagnosis ~", paste0(vars4, collapse = " + "))),
data = data_train,
method = "glmnet",
preProcess = c("center", "scale"),
family = "binomial",
trControl = train_control,
tuneGrid = grid,
na.action=na.exclude
)
})
## user system elapsed
## 1.033 0.062 1.178
tuned_logit_lasso_model <- logit_lasso_model$finalModel
best_lambda <- logit_lasso_model$bestTune$lambda
logit_models[["LASSO"]] <- logit_lasso_model
lasso_coeffs <- as.matrix(coef(tuned_logit_lasso_model, best_lambda))
CV_RMSE_folds[["LASSO"]] <- logit_lasso_model$resample[,c("Resample", "RMSE")]
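# Sketch: list the predictors LASSO actually kept (relevant to the radius_worst and
# concave.points discussion in Part 3), sorted by coefficient magnitude; nz is a name
# introduced here for illustration.
nz <- lasso_coeffs[lasso_coeffs[, 1] != 0, , drop = FALSE]
nz[order(abs(nz[, 1]), decreasing = TRUE), , drop = FALSE]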
# Calculate the mean CV RMSE of each model across the 5 folds
CV_RMSE <- list()
for (model_name in names(logit_models)) {
CV_RMSE[[model_name]] <- mean(CV_RMSE_folds[[model_name]]$RMSE)
}
# We have 6 models (5 logit and the logit LASSO). For each we have a 5-fold CV RMSE;
# we pick our preferred model based on that. ------------------------------------------------
data_holdout_numeric <- data_holdout
lookup <- c('B' = 0, 'M' = 1)
data_holdout_numeric$diagnosis <- lookup[data_holdout$diagnosis]
nvars <- lapply(logit_models, FUN = function(x) length(x$coefnames))
nvars[["LASSO"]] <- sum(lasso_coeffs != 0)
logit_summary1 <- data.frame("Number of predictors" = unlist(nvars),
"CV RMSE" = unlist(CV_RMSE))
print(logit_summary1)
## Number.of.predictors CV.RMSE
## logit0 8 0.2466331
## logit1 10 0.2303568
## logit2 20 0.1906512
## logit3 30 0.2589467
## logit4 40 0.2619661
## LASSO 17 0.1391337
best_logit_no_loss <- logit_models[["LASSO"]]
logit_predicted_probabilities_holdout <- predict(best_logit_no_loss, newdata = data_holdout_numeric, type = "prob")
data_holdout_numeric[,"best_logit_no_loss_pred"] <- logit_predicted_probabilities_holdout[,'M']
RMSE(data_holdout_numeric[, "best_logit_no_loss_pred", drop=TRUE], data_holdout_numeric$diagnosis)
## [1] 0.1651481
# Confusion Matrix for best model (LASSO) on holdout
logit_predicted_class_holdout <- predict(best_logit_no_loss, newdata = data_holdout, type = "raw")
cm_object1 <- confusionMatrix(logit_predicted_class_holdout, as.factor(data_holdout$diagnosis), positive = "M")
cm1 <- cm_object1$table
cm1
## Reference
## Prediction B M
## B 65 2
## M 1 45
# Confusion Matrix for best model (LASSO) on holdout with mean probability as threshold
mean_predicted_malignant_prob <- mean(data_holdout_numeric$best_logit_no_loss_pred)
mean_predicted_malignant_prob
## [1] 0.4155854
holdout_prediction <-
ifelse(data_holdout_numeric$best_logit_no_loss_pred < mean_predicted_malignant_prob, "B", "M") %>%
factor(levels = c("B", "M"))
cm_object2 <- confusionMatrix(holdout_prediction, as.factor(data_holdout$diagnosis), positive = "M")
cm2 <- cm_object2$table
cm2
## Reference
## Prediction B M
## B 64 2
## M 2 45
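# Sketch of the ROC analysis flagged as future work in Part 3, using the already-attached
# pROC package on the LASSO holdout predictions; roc_lasso is a name introduced here for
# illustration. coords() returns the Youden-optimal threshold, an alternative to the naive
# 0.5 and mean-probability cutoffs above.
roc_lasso <- roc(response = data_holdout_numeric$diagnosis,
                 predictor = data_holdout_numeric$best_logit_no_loss_pred)
auc(roc_lasso)
coords(roc_lasso, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))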
# Probably overfitting? Calibration Curve is a little wonky
output <- "/Users/jeremiah/Desktop/UHM/C-Junior 2022-2023/ECON 427/"
create_calibration_plot(data_holdout_numeric,
file_name = "427-calibration-curve-logit-no-loss",
prob_var = "best_logit_no_loss_pred",
actual_var = "diagnosis",
n_bins = 10)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Removed 1 row(s) containing missing values (geom_path).
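# For reference, a minimal hand-rolled calibration curve (deciles of predicted probability
# vs. the observed malignant share); dplyr and ggplot2 are already attached, and calib is a
# name introduced here for illustration. A well-calibrated model's points should hug the
# dashed 45-degree line.
calib <- data_holdout_numeric %>%
  mutate(bin = ntile(best_logit_no_loss_pred, 10)) %>%
  group_by(bin) %>%
  summarise(mean_pred = mean(best_logit_no_loss_pred),
            share_malignant = mean(diagnosis))
ggplot(calib, aes(x = mean_pred, y = share_malignant)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Mean predicted probability", y = "Observed share malignant")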
### Random Forests ###
# Probability Random Forest
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # same as probability = TRUE in ranger
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE
)
train_control$verboseIter <- TRUE
tune_grid <- expand.grid(
.mtry = c(5, 6, 7),
.splitrule = "gini",
.min.node.size = c(10, 15)
)
# getModelInfo("ranger")
set.seed(13505)
rf_model_p <- train(
formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
method = "ranger",
data = data_train,
tuneGrid = tune_grid,
trControl = train_control
)
## (verbose fold-by-fold training log omitted: 5 folds x 6 tuning combinations)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = gini, min.node.size = 10 on full training set
rf_model_p$results
## mtry splitrule min.node.size Accuracy Kappa RMSE AccuracySD
## 1 5 gini 10 0.9539417 0.8993422 0.1754368 0.03242173
## 2 5 gini 15 0.9517678 0.8945893 0.1788756 0.03256956
## 3 6 gini 10 0.9517917 0.8950499 0.1743855 0.03250347
## 4 6 gini 15 0.9517917 0.8957472 0.1780799 0.02958612
## 5 7 gini 10 0.9539895 0.9000295 0.1727440 0.03322637
## 6 7 gini 15 0.9495939 0.8905176 0.1776455 0.03603859
## KappaSD RMSESD
## 1 0.07173141 0.04928049
## 2 0.07180542 0.04821365
## 3 0.07060087 0.04968911
## 4 0.06331282 0.04807245
## 5 0.07210735 0.05031556
## 6 0.07798343 0.05263144
best_mtry_p <- rf_model_p$bestTune$mtry
best_min_node_size_p <- rf_model_p$bestTune$min.node.size
CV_RMSE_folds[["rf_p"]] <- rf_model_p$resample[,c("Resample", "RMSE")]
CV_RMSE[["rf_p"]] <- mean(CV_RMSE_folds[["rf_p"]]$RMSE)
# Calculate RMSE of Random Forest probability model on holdout set
rf_predicted_probabilities_holdout <- predict(rf_model_p, newdata = data_holdout_numeric, type = "prob")
data_holdout$rf_p_prediction <- rf_predicted_probabilities_holdout[,"M"]
RMSE(data_holdout$rf_p_prediction, data_holdout_numeric$diagnosis)
## [1] 0.2114045
# Classification Random Forest
train_control <- trainControl(
method = "cv",
number = 5
)
train_control$verboseIter <- TRUE
set.seed(13505)
rf_model_f <- train(
formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
method = "ranger",
data = data_train,
tuneGrid = tune_grid,
trControl = train_control
)
## (verbose fold-by-fold training log omitted: 5 folds x 6 tuning combinations)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 6, splitrule = gini, min.node.size = 10 on full training set
data_train$rf_f_prediction_class <- predict(rf_model_f,type = "raw")
data_holdout$rf_f_prediction_class <- predict(rf_model_f, newdata = data_holdout, type = "raw")
rf_model_f$results
## mtry splitrule min.node.size Accuracy Kappa AccuracySD KappaSD
## 1 5 gini 10 0.9539417 0.8993422 0.03242173 0.07173141
## 2 5 gini 15 0.9473961 0.8851581 0.03234154 0.07094166
## 3 6 gini 10 0.9539895 0.9000295 0.03322637 0.07210735
## 4 6 gini 15 0.9539656 0.9001988 0.02839942 0.06108399
## 5 7 gini 10 0.9539895 0.9000295 0.03322637 0.07210735
## 6 7 gini 15 0.9539895 0.9005328 0.03322637 0.07130251
best_mtry_f <- rf_model_f$bestTune$mtry
best_min_node_size_f <- rf_model_f$bestTune$min.node.size
cm_object3 <- confusionMatrix(data_holdout$rf_f_prediction_class, as.factor(data_holdout$diagnosis), positive = "M")
cm3 <- cm_object3$table
cm3
## Reference
## Prediction B M
## B 63 5
## M 3 42
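# Sketch: impurity-based variable importance from ranger, to compare the forest's rankings
# with the LASSO coefficients discussed in Part 3. This refits the classification forest with
# importance = "impurity" passed through train() to ranger(); rf_model_imp is a name
# introduced here for illustration.
set.seed(13505)
rf_model_imp <- train(
  formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
  method = "ranger",
  data = data_train,
  tuneGrid = tune_grid,
  trControl = train_control,
  importance = "impurity"
)
varImp(rf_model_imp)   # scaled 0-100 importance per predictor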
# Generate overall summary table
nvars[["rf_p"]] <- length(vars4)
summary_results <- data.frame("Number of predictors" = unlist(nvars),"CV RMSE" = unlist(CV_RMSE))
model_names <- c("Logit 0","Logit 1", "Logit 2","Logit 3","Logit 4",
"Logit LASSO","RF probability")
summary_results <- summary_results %>%
filter(rownames(.) %in% c("logit0","logit1", "logit2","logit3","logit4", "LASSO", "rf_p"))
print(summary_results)
## Number.of.predictors CV.RMSE
## logit0 8 0.2466331
## logit1 10 0.2303568
## logit2 20 0.1906512
## logit3 30 0.2589467
## logit4 40 0.2619661
## LASSO 17 0.1391337
## rf_p 40 0.1727440
rownames(summary_results) <- model_names
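# Sketch of the GBM mentioned as future work in Part 3. Untested here: caret's "gbm" method
# requires the gbm package, which this script does not load, and gbm_grid/gbm_model are names
# introduced for illustration only. It slots into the same train()/trainControl() workflow as
# the models above.
gbm_grid <- expand.grid(n.trees = c(100, 300),
                        interaction.depth = c(2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
set.seed(13505)
gbm_model <- train(
  formula(paste0("diagnosis ~ ", paste0(vars4, collapse = " + "))),
  method = "gbm",
  data = data_train,
  tuneGrid = gbm_grid,
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE
)
gbm_model$results   # CV accuracy per tuning combination, comparable to the tables above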