• 1 Introduction
    • 1.1 Identify Prediction Task
    • 1.2 Prepare Data
    • 1.3 Construct Formula
    • 1.4 Select Model
    • 1.5 Build Pipeline
    • 1.6 Now…Creating Effective Models
  • 2 Resampling
    • 2.1 Overview
    • 2.2 Cross Validation
    • 2.3 Variations
      • 2.3.1 Repeated CV
      • 2.3.2 Bootstrapping
      • 2.3.3 Rolling Origin
    • 2.4 Summary
  • 3 Performance
    • 3.0.1 Parallelization
  • 4 Models
    • 4.1 Note on Random Forest Explainability
    • 4.2 Gradient Boosting
    • 4.3 Neural Networks
  • 5 Ch 11 Comparing Models
    • 5.1 11.2 Comparing Resample Performance Stats

1 Introduction

Now that we have begun to see different models and concepts, hopefully all the pieces are starting to come together. Prediction problems are hard and there is no single approach that works for every problem, but the following steps will be applicable to most of our work.

1.1 Identify Prediction Task

Since we are only focused on supervised learning here, identifying the task means confirming that we have adequate labeled data and knowing whether our outcome is a class or a continuous variable.

1.2 Prepare Data

This task varies depending on the requirements of our model. Some models can only ingest certain types of data (one-hot encoded matrices, for example). Other variables may need to be rescaled or imputed. We can codify all of our preprocessing steps into a recipe.
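
A minimal sketch of such a recipe (assuming a hypothetical training frame train_df with a numeric outcome, some numeric predictors with missing values, and some nominal predictors):

library(tidymodels)

rec <- recipe(outcome ~ ., data = train_df) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_dummy(all_nominal_predictors())              # one-hot style encoding of factors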

1.3 Construct Formula

There are a variety of tests to determine which variables to include in the formula. This step is fairly subjective; in practice, exploratory data analysis (EDA) is usually the best guide.

1.4 Select Model

There are hundreds of model types. In this class, we are going to use boosted trees, decision trees, linear regression, logistic regression, nearest neighbors, and random forests from parsnip.
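
In parsnip, a model specification combines a model type with an engine and a mode. For example, a random forest classifier (the parameter value here is a placeholder, not a recommendation):

rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")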

1.5 Build Pipeline

Your pipeline is where you bring the steps above together into a cohesive object, and it is the foundation of our process. A pipeline bundles a recipe and a model into a workflow. In typical ML applications, hundreds or thousands of models will be built on this foundation in order to find the best model specification.
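
Continuing the sketch above (with the hypothetical rec and rf_spec), a workflow ties the pieces together so they can be fit or resampled as a single object:

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)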

1.6 Now…Creating Effective Models

Ch 10-15 of tidymodels offer methods to create numerous models and select the best one for predictive performance. In combination, these methods artificially create hundreds of different datasets by resampling, and try hundreds of different hyperparameter specifications. We then select the ‘best’ model based on an objective function (such as accuracy or RMSE).
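
As a preview, the core of that process looks something like this (a hedged sketch, assuming a workflow wf with tunable parameters and a set of resamples folds):

tuned <- tune_grid(wf, resamples = folds, grid = 20)  # try 20 hyperparameter combinations
show_best(tuned, metric = "rmse")                     # rank the candidates by RMSE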

2 Resampling

Section 10.1 in tidymodels gives an example of a linear model and a random forest model that perform differently on training and testing data, to frame why resampling is important.

RMSE estimates:

object   train    test
lm_fit   0.0754   0.0736
rf_fit   0.0365   0.0704

In this scenario, the random forest model performed much better on training than on testing because it overfit the training data. If we had picked which model to use based only on training RMSE, this could have led to some issues!

2.1 Overview

Resampling involves splitting, slicing, or bootstrapping the training set. In tidymodels, this could mean creating ten pairs of analysis and assessment subsamples from the training data. Model performance is then estimated by averaging performance across these samples.

Recall our example from week 6,

library(tidymodels)

data(cells, package = "modeldata")

# Split the data, stratifying on the outcome, then create 10 folds
# from the training set only
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
folds <- rsample::vfold_cv(cell_train, v = 10)

cell_test  <- testing(cell_split)


rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

2.2 Cross Validation

In V-fold cross-validation, the training data are split into V folds of roughly equal size; each resample holds out one fold for assessment and fits the model on the rest. With V = 3 (3-fold cross-validation), one third of the data is held out in each resample.

Our folds, using 5-fold cross-validation:

folds <- rsample::vfold_cv(cell_train, v = 5)
folds
splits              id
<list>              <chr>
<S3: vfold_split>   Fold1
<S3: vfold_split>   Fold2
<S3: vfold_split>   Fold3
<S3: vfold_split>   Fold4
<S3: vfold_split>   Fold5

Fold 1:

folds[[1]][[1]]
## <Analysis/Assess/Total>
## <1211/303/1514>
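
Each split records which rows form the analysis set (used for fitting) and the assessment set (held out). The underlying data frames can be pulled out with rsample's accessors:

analysis(folds$splits[[1]])    # the 1211 rows used for fitting
assessment(folds$splits[[1]])  # the 303 rows held out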

We then fit the model on each resample.

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control = control_resamples(save_pred = TRUE))

You can see the model performance on each fold…

collect_metrics(rf_fit_rs, summarize=FALSE)
id      .metric    .estimator   .estimate   .config
<chr>   <chr>      <chr>        <dbl>       <chr>
Fold1   accuracy   binary       0.8547855   Preprocessor1_Model1
Fold1   roc_auc    binary       0.9241195   Preprocessor1_Model1
Fold2   accuracy   binary       0.8184818   Preprocessor1_Model1
Fold2   roc_auc    binary       0.8877670   Preprocessor1_Model1
Fold3   accuracy   binary       0.8448845   Preprocessor1_Model1
Fold3   roc_auc    binary       0.9141798   Preprocessor1_Model1
Fold4   accuracy   binary       0.8382838   Preprocessor1_Model1
Fold4   roc_auc    binary       0.9195122   Preprocessor1_Model1
Fold5   accuracy   binary       0.8079470   Preprocessor1_Model1
Fold5   roc_auc    binary       0.9031015   Preprocessor1_Model1

Or aggregated.

collect_metrics(rf_fit_rs, summarize=TRUE)
.metric    .estimator   mean        n       std_err       .config
<chr>      <chr>        <dbl>       <int>   <dbl>         <chr>
accuracy   binary       0.8328765   5       0.008608482   Preprocessor1_Model1
roc_auc    binary       0.9097360   5       0.006514820   Preprocessor1_Model1

We can also aggregate predictions across folds to get an out-of-sample prediction for each row of the training data.

rf_testing_pred <- collect_predictions(rf_fit_rs, summarize = TRUE)

rf_testing_pred %>%                   
  accuracy(class, .pred_class)
.metric    .estimator   .estimate
<chr>      <chr>        <dbl>
accuracy   binary       0.832893

rf_testing_pred %>%                   
  roc_curve(class, .pred_PS) %>%
  autoplot()

2.3 Variations

2.3.1 Repeated CV

Continuing the theme of using resampling to get more information about model performance, a key variation is repeated cross-validation, which runs the entire V-fold procedure multiple times:

vfold_cv(data, v=10, repeats=5)

This would create 10 × 5 = 50 resamples.

2.3.2 Bootstrapping

Bootstrapping creates analysis sets the same size as the original data, drawn with replacement; the observations that are not selected form the assessment set.

bootstraps(data, times=5)

This code would create five bootstrap resamples.

2.3.3 Rolling Origin
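
Rolling-origin resampling is designed for time-ordered data: each analysis set contains observations up to a cutoff, and the assessment set contains the observations that immediately follow, so the model is never trained on the future. A minimal sketch (assuming a hypothetical data frame ts_df sorted by time; the sizes are placeholders):

ro_folds <- rsample::rolling_origin(
  ts_df,
  initial    = 52,   # size of the first analysis set
  assess     = 4,    # size of each assessment set
  cumulative = TRUE  # let the analysis set grow as the origin rolls forward
)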

2.4 Summary

During resampling, the analysis set is used to estimate any preprocessing parameters, that preprocessing is applied to the analysis set itself, and the processed data are used to fit the model.

The preprocessing statistics produced by the analysis set are then applied to the assessment set, and the predictions on the assessment set estimate performance.
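
For a single resample, that sequence looks roughly like this (a sketch, assuming a recipe rec and one split object split):

trained_rec  <- prep(rec, training = analysis(split))            # estimate preprocessing on the analysis set
assess_baked <- bake(trained_rec, new_data = assessment(split))  # apply those statistics to the assessment set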

3 Performance
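
By default, fit_resamples() discards predictions and fitted models to save memory. The control options below keep each fold's out-of-sample predictions and the workflow, and log progress while fitting: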

control_options <- control_resamples(
  save_pred = TRUE,      # keep each fold's out-of-sample predictions
  save_workflow = TRUE,  # attach the workflow to the results for later use
  verbose = TRUE         # print progress as each resample is fit
)

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control=control_options)

rf_preds <- rf_fit_rs %>% collect_predictions()

3.0.1 Parallelization

library(doParallel)

# Create a cluster object and then register: 
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Now run fit_resamples()...

stopCluster(cl)
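
With a backend registered, fit_resamples() and the tune_*() functions will automatically spread the resamples across the workers; stopCluster() releases them when you are finished.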

4 Models

4.1 Note on Random Forest Explainability

Partial dependence: a way of determining which variables are most important to the model. This is not the same as a model coefficient, but roughly a way of seeing what the model relies on. It can be useful for understanding your model and explaining it to others. For example, from Table 10.8 in DSPP, the number of hours worked provides the most information in that random forest model for predicting wages. If instead the most important variable were something you would expect to be completely random, like the weather, you might want to respecify your model.
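
As a rough sketch of computing partial dependence (assuming a fitted random forest rf_fit, its training frame train_df, and a predictor named hours, none of which come from the code in these notes; the pdp package is one option):

library(pdp)

# Vary one predictor over a grid and average the model's predictions
# at each grid value, holding the rest of the data as-is
partial(rf_fit, pred.var = "hours", train = train_df) %>%
  autoplot()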

There are other methods to decompose effects mentioned in the textbook, such as Locally Interpretable Model-Agnostic Explanations (LIME). If we have time at the end of the course, we will revisit these.

4.2 Gradient Boosting

The CCAO presenters talked about gradient boosting (LightGBM). LightGBM is not as well integrated into tidymodels as the other boosted tree engines.

Gradient boosting is similar to a decision tree model that gradually improves itself: each new tree is fit to correct the errors of the trees before it. I will show an example using XGBoost later on. Part of making a good model here involves cleverly creating predictor variables based on the context.
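
For reference, a boosted tree specification via parsnip's XGBoost engine looks like this (a sketch; the parameter values are placeholders, not recommendations):

boost_tree(trees = 500, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")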

4.3 Neural Networks

See Figure 10.14 in DSPP.

5 Ch 11 Comparing Models

Workflow sets let us define several preprocessing and model combinations and resample them all at once.

Let’s bring out a new toy data set: precinct-level mayoral election results.

We can construct a few different prediction questions here which make varying levels of sense.

For example:

  • Predicting the total number of votes cast in a precinct
  • Predicting which candidate would receive the most votes in a precinct
  • Predicting how many votes a candidate would receive in a precinct

Let’s make three different models to try and ‘predict’ the total number of votes cast.

library(tidyverse)

election <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_8_data.csv') %>%
  mutate(total_votes = rowSums(.[18:26]),  # columns 18-26 hold the per-candidate vote counts
         ward = as.character(ward),
         precinct = as.character(precinct))

ele_split <- initial_split(election)
ele_train <- training(ele_split)
folds <- rsample::vfold_cv(ele_train, v = 10)

#recipe
simple_rec <- recipe(total_votes ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train) %>% 
  step_dummy(all_nominal_predictors())


# Build on the simple recipe with interaction terms
# (population x each race count, housing units x occupancy/vacancy)
interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

# Add natural splines on the coordinates to capture spatial patterns
interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(lm = linear_reg())

#workflow_set
lm_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)

lm_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_lm         <tibble[,4]>   <wrkflw__>   <list [0]>
interactions_lm   <tibble[,4]>   <wrkflw__>   <list [0]>
inter_geo_lm      <tibble[,4]>   <wrkflw__>   <list [0]>

Now we need to resample each of the models. This is similar to ‘mapping’ a function over each workflow in the set.

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

lm_models <- 
  lm_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds, control = keep_pred)

lm_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_lm         <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>
interactions_lm   <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>
inter_geo_lm      <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>

collect_metrics(lm_models) %>% 
  filter(.metric == "rmse")
wflow_id          .config                preproc   model        .metric   .estimator   mean       n
<chr>             <chr>                  <chr>     <chr>        <chr>     <chr>        <dbl>      <int>
simple_lm         Preprocessor1_Model1   recipe    linear_reg   rmse      standard     125.9721   10
interactions_lm   Preprocessor1_Model1   recipe    linear_reg   rmse      standard     116.6216   10
inter_geo_lm      Preprocessor1_Model1   recipe    linear_reg   rmse      standard     118.3691   10

library(ggrepel)
autoplot(lm_models, metric = "rsq") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

5.1 11.2 Comparing Resample Performance Stats

Let’s try and see if the differences between models are statistically significant.

rsq_indiv_estimates <- 
  collect_metrics(lm_models, summarize = FALSE) %>% 
  filter(.metric == "rsq") 

rsq_wider <- 
  rsq_indiv_estimates %>% 
  select(wflow_id, .estimate, id) %>% 
  pivot_wider(id_cols = "id", names_from = "wflow_id", values_from = ".estimate")

corrr::correlate(rsq_wider %>% select(-id), quiet = TRUE)
term              simple_lm   interactions_lm   inter_geo_lm
<chr>             <dbl>       <dbl>             <dbl>
simple_lm         NA          0.7779450         0.7813525
interactions_lm   0.7779450   NA                0.8765089
inter_geo_lm      0.7813525   0.8765089         NA

rsq_indiv_estimates %>% 
  mutate(wflow_id = reorder(wflow_id, .estimate)) %>% 
  ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) + 
  geom_line(alpha = .5, lwd = 1.25) + 
  theme(legend.position = "none")

This compares model performance on each resample. If there were no resample-to-resample effect, the lines would not track in parallel. A statistical test on the correlations evaluates whether their magnitudes are simply noise. For the linear models:

rsq_wider %>% 
  with( cor.test(simple_lm, interactions_lm ) ) %>% 
  tidy() %>% 
  select(estimate, starts_with("conf"))
estimate   conf.low    conf.high
<dbl>      <dbl>       <dbl>
0.777945   0.2907152   0.9447963

The results of the correlation test (the estimate of the correlation and the confidence intervals) show us that the within-resample correlation appears to be real.

See other techniques in 11.3 and 11.4.

# Find the top vote-getter in each precinct
p_max <- election %>% select(precinct, ward, Green:Garcia) %>%
  pivot_longer(-c(precinct, ward)) %>%
  group_by(precinct, ward) %>%
  slice_max(order_by=`value`, n=1)

# Attach the winning candidate to each precinct as the outcome `top`
election_class <- election %>% left_join(
  p_max %>% rename(top=name)
)

ele_split2 <- initial_split(election_class)
ele_train2 <- training(ele_split2)
folds2 <- rsample::vfold_cv(ele_train2, v = 3)  # resample the training set, not the full data

#recipe
simple_rec <- recipe(top ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train2) %>% 
  step_dummy(all_nominal_predictors())


interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(dt = decision_tree() %>% set_mode('classification'))

#workflowset

dt_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)
dt_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_dt         <tibble[,4]>   <wrkflw__>   <list [0]>
interactions_dt   <tibble[,4]>   <wrkflw__>   <list [0]>
inter_geo_dt      <tibble[,4]>   <wrkflw__>   <list [0]>

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

dt_models <- 
  dt_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds2, control = keep_pred)

collect_metrics(dt_models)
wflow_id          .config                preproc   model           .metric    .estimator   mean
<chr>             <chr>                  <chr>     <chr>           <chr>      <chr>        <dbl>
simple_dt         Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8182521
simple_dt         Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8847618
interactions_dt   Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8205723
interactions_dt   Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8861958
inter_geo_dt      Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8058778
inter_geo_dt      Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8831686

autoplot(dt_models, metric = "roc_auc") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

dt_preds <- collect_predictions(dt_models)

autoplot(dt_models, metric = "accuracy") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

roc_auc(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson))
.metric   .estimator   .estimate
<chr>     <chr>        <dbl>
roc_auc   hand_till    0.8782879

roc_curve(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson)) %>%
  autoplot()

selected_workflow <- 
  workflow() %>%
  add_model(decision_tree() %>% 
              set_engine("rpart") %>%
              set_mode('classification')) %>%
  add_recipe(simple_rec)

# Refit the chosen workflow on the full training set
ft_tree <- selected_workflow %>% fit(ele_train2)

# Pull out the underlying rpart object and plot the tree
tree_fit <- ft_tree %>% 
  extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)