1 Introduction

Now that we have begun to see different models and concepts, hopefully all the pieces are starting to come together. Prediction problems are hard, and there is no single approach that works for every problem, but the steps below apply to most of our work.

1.1 Identify Prediction Task

Since we are only focused on supervised learning here, identifying the task means having adequate labeled data and knowing whether our outcome is a class or a continuous variable.

1.2 Prepare Data

This task varies depending on the requirements of our model. Some models can only ingest certain types of data (one-hot encoded matrices, for example). Other variables may need to be rescaled or imputed. We can codify all of our preprocessing steps into a recipe.
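As a minimal sketch of what that looks like (the names outcome and train_data are placeholders, not from our data):

# hypothetical names: `outcome` and `train_data` are placeholders
library(tidymodels)

example_rec <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_dummy(all_nominal_predictors())              # one-hot style indicator columns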

1.3 Construct Formula

There are a variety of tests to determine which variables to include in the formula. This step is fairly subjective, and exploratory data analysis (EDA) is usually the best guide.
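There is no single recipe for EDA; one hedged example is a quick correlation scan with corrr (which we also use later in these notes), again with placeholder names:

# hypothetical: `train_data` has a numeric outcome column named `outcome`
library(dplyr)
library(corrr)

train_data %>%
  select(where(is.numeric)) %>%
  correlate() %>%                 # pairwise correlations
  focus(outcome) %>%              # keep each predictor's correlation with the outcome
  arrange(desc(abs(outcome)))     # strongest relationships first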

1.4 Select Model

There are hundreds of model types. In this class, we are going to use boosted trees, decision trees, linear regression, logistic regression, nearest neighbors, and random forests from parsnip.
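For reference, the parsnip specification functions for these six model types are shown below; the engines and modes are just illustrative defaults.

library(parsnip)

boost_tree()       %>% set_engine("xgboost") %>% set_mode("regression")
decision_tree()    %>% set_engine("rpart")   %>% set_mode("classification")
linear_reg()       %>% set_engine("lm")
logistic_reg()     %>% set_engine("glm")
nearest_neighbor() %>% set_engine("kknn")    %>% set_mode("classification")
rand_forest()      %>% set_engine("ranger")  %>% set_mode("classification")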

1.5 Build Pipeline

Your pipeline is where you bring the steps above together into a cohesive object, and it is the foundation of our process. A pipeline is a workflow that bundles a recipe and a model. In typical ML applications, hundreds or thousands of models will be built on this foundation in order to find the best model specification.
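A minimal sketch of that assembly (my_rec, my_mod, and train_data are placeholder names for a recipe, a parsnip model specification, and the training data):

my_wf <- workflow() %>%
  add_recipe(my_rec) %>%   # the preprocessing steps
  add_model(my_mod)        # the model specification

# fitting the whole pipeline at once:
# my_fit <- my_wf %>% fit(train_data)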

1.6 Now…Creating Effective Models

Chapters 10-15 of the tidymodels book offer methods to create numerous models and select the best one for predictive performance. In combination, these methods artificially create hundreds of different datasets by resampling and try hundreds of different hyperparameter specifications. We then select the 'best' model based on an objective function (such as accuracy or RMSE).

2 Resampling

Section 10.1 of the tidymodels book gives an example of a linear model and a random forest model that perform differently on training and testing data, to frame why resampling is important.

RMSE Estimates

object    train     test
lm_fit    0.0754    0.0736
rf_fit    0.0365    0.0704

In this scenario, the random forest model performed much better on training than testing since it overfit the training data. If we were going to pick which model to use based only on training RMSE performance, this could have led to some issues!

2.1 Overview

Resampling involves repeatedly splitting the training set, for example by slicing it into folds or by bootstrapping. In tidymodels, this could mean creating ten pairs of analysis and assessment subsamples from the training data. Model performance is then estimated by averaging performance across these resamples.

Recall our example from week 6,

library(tidymodels)

data(cells, package = "modeldata")

# split into training and testing sets, stratified by the outcome
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

# ten cross-validation folds from the training set
folds <- rsample::vfold_cv(cell_train, v = 10)


rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

2.2 Cross Validation

With V = 3 (3-fold cross-validation), the training data are split into three folds of roughly equal size. In each iteration, one fold (one third of the data) is held out for assessment and the other two are used for analysis.

Our folds, using 5-fold cross validation:

folds <- rsample::vfold_cv(cell_train, v = 5)
folds

Fold 1:

folds[[1]][[1]]
## <Analysis/Assess/Total>
## <1211/303/1514>
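Each row of folds holds an rsample split object; if you want to look inside one, rsample's analysis() and assessment() pull out the two pieces:

first_split <- folds$splits[[1]]
nrow(analysis(first_split))    # 1211 rows used to fit the model
nrow(assessment(first_split))  # 303 rows held out for assessment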

We then fit the model on each resample.

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control=control_resamples(save_pred=TRUE))

You can see the model performance on each fold…

collect_metrics(rf_fit_rs, summarize=FALSE)

Or aggregated.

collect_metrics(rf_fit_rs, summarize=TRUE)

We can also aggregate predictions across the folds to get out-of-sample predictions for every row of our training data.

rf_testing_pred <- collect_predictions(rf_fit_rs, summarize = TRUE)

rf_testing_pred %>%                   
  accuracy(class, .pred_class)
rf_testing_pred %>%                   
  roc_curve(class, .pred_PS) %>%
  autoplot()

2.3 Variations

2.3.1 Repeated CV

Continuing the theme of using resampling to learn more about model performance, a key variation is repeated cross-validation.

vfold_cv(data, v=10, repeats=5)

With v = 10 and repeats = 5, this would create 50 analysis/assessment resamples.

2.3.2 Bootstrapping

Bootstrapping creates samples drawn with replacement, each the same size as the original data; the rows left out of a given bootstrap sample form its assessment set.

bootstraps(data, times=5)

This code would create 5 bootstrap resamples.

2.3.3 Rolling Origin
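Rolling origin resampling is designed for time series: each resample uses an initial window of consecutive observations for analysis and the observations immediately after it for assessment, so the model is never assessed on data that precedes its training window. A hedged sketch with rsample::rolling_origin() (ts_data and the window sizes are hypothetical):

# hypothetical: `ts_data` is a data frame sorted in time order
rsample::rolling_origin(ts_data,
                        initial    = 52,   # first 52 rows form the initial analysis set
                        assess     = 4,    # the next 4 rows are the assessment set
                        cumulative = TRUE) # the analysis window grows with each resample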

2.4 Summary

During resampling, the analysis set is used to estimate any preprocessing steps, that preprocessing is applied to the analysis set itself, and the processed data are used to fit the model.

The preprocessing statistics estimated from the analysis set are then applied to the assessment set, and predictions on the assessment set are used to estimate performance.

3 Performance

control_options <- control_resamples(
  save_pred     = TRUE,  # keep the assessment-set predictions
  save_workflow = TRUE,  # keep the workflow attached to the results
  verbose       = TRUE   # print progress while fitting
)

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control=control_options)

rf_preds <- rf_fit_rs %>% collect_predictions()
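Because the predictions were saved (save_pred = TRUE), we can also compute metrics directly from them, for example accuracy within each fold:

rf_preds %>%
  group_by(id) %>%                              # one set of metrics per fold
  accuracy(truth = class, estimate = .pred_class)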

3.0.1 Parallelization

library(doParallel)

# Create a cluster object and then register: 
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Now run fit_resamples()...

stopCluster(cl)

4 Models

4.1 Note on Random Forest Explainability

Partial dependence: a way of determining which variables are the most important to the model. This is not the same as a model coefficient, but roughly a way of seeing what the model relies on. This can be useful for understanding your model and explaining it to others. For example, from Table 10.8 in DSPP, the number of hours worked provides the most information in that random forest model for predicting wages. If instead the most important variable were something you would expect to be completely random, like the weather, you might want to respecify your model.
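One practical way to get at variable importance (related to, though not the same as, partial dependence) is the vip package. A hedged sketch that reuses the cells workflow from earlier; note that the ranger engine only records importance scores if we ask for them:

library(vip)

# ask ranger to track impurity-based importance
rf_imp_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_imp_fit <- rf_wf %>%
  update_model(rf_imp_mod) %>%   # swap the model in the existing workflow
  fit(cell_train)

rf_imp_fit %>%
  extract_fit_parsnip() %>%
  vip()                          # plot the most influential predictors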

There are other methods to decompose effects mentioned in the textbook, such as Locally Interpretable Model-Agnostic Explanations (LIME). If we have time at the end of the course, we will revisit these.

4.2 Gradient Boosting

The CCAO presenters talked about gradient boosting (LightGBM). LightGBM is not as well integrated into tidymodels as the other boosted tree engines.

Gradient boosting builds on decision trees: trees are fit sequentially, with each new tree correcting the errors of the ones before it. I will show an example using XGBoost later on. Part of making a good model here involves cleverly creating predictor variables based on the context.
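As a hedged preview of the kind of XGBoost specification that example will use in parsnip (the hyperparameter values here are arbitrary):

xgb_mod <- boost_tree(trees = 500, tree_depth = 6, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")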

4.3 Neural Networks

See Figure 10.14 in the textbook.
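For completeness, a single-hidden-layer neural network can also be specified in parsnip; a minimal sketch with arbitrary values, using the default nnet engine:

nn_mod <- mlp(hidden_units = 10, epochs = 100) %>%
  set_engine("nnet") %>%
  set_mode("classification")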

5 Ch 11 Comparing Models

This chapter introduces workflow sets, which let us fit several preprocessing/model combinations at once and compare them.

Let's bring out a new toy data set: precinct-level mayoral election results.

We can construct a few different prediction questions here which make varying levels of sense.

For example:

  • Predicting the total number of votes cast in a precinct
  • Predicting which candidate would receive the most votes in a precinct
  • Predicting how many votes a candidate would receive in a precinct

Let’s make three different models to try and ‘predict’ the total number of votes cast.

library(tidyverse)

election <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_8_data.csv') %>%
  mutate(total_votes = rowSums(.[18:26]),  # total votes = sum of the candidate vote columns
         ward = as.character(ward),
         precinct = as.character(precinct))

ele_split <- initial_split(election)
ele_train <- training(ele_split)
folds <- rsample::vfold_cv(ele_train, v = 10)

#recipe
simple_rec <- recipe(total_votes ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train) %>% 
  step_dummy(all_nominal_predictors())


interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(lm = linear_reg())

#workflow_set
lm_models <- workflow_set(
  preproc = preproc,
  models  = init_mod
)

lm_models

Now we need to resample each of the models. This is a similar task to ‘mapping.’

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

lm_models <- 
  lm_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds, control = keep_pred)

lm_models
collect_metrics(lm_models) %>% 
  filter(.metric == "rmse")
library(ggrepel)
autoplot(lm_models, metric = "rsq") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

5.1 Comparing Resample Performance Stats (11.2)

Let’s try and see if the differences between models are statistically significant.

rsq_indiv_estimates <- 
  collect_metrics(lm_models, summarize = FALSE) %>% 
  filter(.metric == "rsq") 

rsq_wider <- 
  rsq_indiv_estimates %>% 
  select(wflow_id, .estimate, id) %>% 
  pivot_wider(id_cols = "id", names_from = "wflow_id", values_from = ".estimate")

corrr::correlate(rsq_wider %>% select(-id), quiet = TRUE)
rsq_indiv_estimates %>% 
  mutate(wflow_id = reorder(wflow_id, .estimate)) %>% 
  ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) + 
  geom_line(alpha = .5, lwd = 1.25) + 
  theme(legend.position = "none")

This compares model performance on each resample. If the resample-to-resample effect were not real, the lines would not be roughly parallel. A statistical test of the correlations evaluates whether their magnitudes are simply noise. For the linear models:

rsq_wider %>% 
  with( cor.test(simple_lm, interactions_lm ) ) %>% 
  tidy() %>% 
  select(estimate, starts_with("conf"))

The results of the correlation test (the estimate of the correlation and the confidence intervals) show us that the within-resample correlation appears to be real.

See other techniques in 11.3 and 11.4.
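Those sections use more formal machinery (ANOVA-style and Bayesian comparisons). As a simpler, hedged illustration of the same idea, a paired t-test on the per-resample R-squared values respects the resample-to-resample effect by differencing within each fold:

rsq_wider %>% 
  with( t.test(simple_lm, interactions_lm, paired = TRUE) ) %>% 
  tidy() %>% 
  select(estimate, p.value, starts_with("conf"))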

Now let's try a classification version of the problem: predicting which candidate received the most votes in each precinct.

# find the top vote-getter in each precinct
p_max <- election %>% select(precinct, ward, Green:Garcia) %>%
  pivot_longer(-c(precinct, ward)) %>%
  group_by(precinct, ward) %>%
  slice_max(order_by = value, n = 1)

# attach the top candidate back onto the election data as `top`
election_class <- election %>% left_join(
  p_max %>% rename(top = name)
)

ele_split2 <- initial_split(election_class)
ele_train2 <- training(ele_split2)
folds2 <- rsample::vfold_cv(ele_train2, v = 3)

#recipe
simple_rec <- recipe(top ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train2) %>% 
  step_dummy(all_nominal_predictors())


interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(dt = decision_tree() %>% set_mode('classification'))

#workflowset

dt_models <- workflow_set(
  preproc = preproc,
  models  = init_mod
)
dt_models
keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

dt_models <- 
  dt_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds2, control = keep_pred)

collect_metrics(dt_models)
autoplot(dt_models, metric = "roc_auc") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

dt_preds <- collect_predictions(dt_models)

autoplot(dt_models, metric = "accuracy") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

roc_auc(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson))
roc_curve(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson)) %>%
  autoplot()

selected_workflow <- 
  workflow() %>%
  add_model(decision_tree() %>% 
              set_engine("rpart") %>%
              set_mode('classification')) %>%
  add_recipe(simple_rec)

ft_tree <- selected_workflow %>% fit(ele_train2)

tree_fit <- ft_tree %>% 
  extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)