• 1 Introduction
    • 1.1 Identify Prediction Task
    • 1.2 Prepare Data
    • 1.3 Construct Formula
    • 1.4 Select Model
    • 1.5 Build Pipeline
    • 1.6 Now…Creating Effective Models
  • 2 Resampling
    • 2.1 Overview
    • 2.2 Cross Validation
    • 2.3 Variations
      • 2.3.1 Repeated CV
      • 2.3.2 Bootstrapping
      • 2.3.3 Rolling Origin
    • 2.4 Summary
  • 3 Performance
    • 3.0.1 Parallelization
  • 4 Models
    • 4.1 Note on Random Forest Explainability
    • 4.2 Gradient Boosting
    • 4.3 Neural Networks
  • 5 Ch 11 Comparing Models
    • 5.1 11.2 Comparing Resample Performance Stats

1 Introduction

Now that we have begun to see different models and concepts, hopefully all the pieces are starting to come together. Prediction problems are hard and there is no single approach that works for every problem, but the following steps will be applicable to most of our work.

1.1 Identify Prediction Task

Since we are only focused on supervised learning here, identifying the task means confirming that we have adequate labeled data and knowing whether our outcome is a class or a continuous variable.

1.2 Prepare Data

This task varies depending on the requirements of our model. Some models can only ingest certain types of data (one-hot encoded matrices, for example). Other variables may need to be rescaled or imputed. We can codify all of our preprocessing steps into a recipe.
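
A minimal sketch of such a recipe (assuming a hypothetical training frame train_df with a numeric outcome, some numeric predictors with missing values, and some nominal predictors):

library(tidymodels)

rec <- recipe(outcome ~ ., data = train_df) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_dummy(all_nominal_predictors())              # one-hot style encoding of factors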

1.3 Construct Formula

There are a variety of tests to determine which variables to include in the formula. This step is fairly subjective; in practice, exploratory data analysis (EDA) is usually the best guide.

1.4 Select Model

There are hundreds of model types. In this class, we are going to use boosted trees, decision trees, linear regression, logistic regression, nearest neighbors, and random forests from parsnip.
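
In parsnip, a model specification combines a model type with an engine and a mode. For example, a random forest classifier (the parameter value here is a placeholder, not a recommendation):

rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")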

1.5 Build Pipeline

Your pipeline is where you bring the steps above together into a cohesive object, and it is the foundation of our process. A pipeline bundles a recipe and a model into a workflow. In typical ML applications, hundreds or thousands of models will be built on this foundation in order to find the best model specification.
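
Continuing the sketch above (with the hypothetical rec and rf_spec), a workflow ties the pieces together so they can be fit or resampled as a single object:

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec)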

1.6 Now…Creating Effective Models

Ch 10-15 of tidymodels offer methods to create numerous models and select the best one for predictive performance. In combination, these methods artificially create hundreds of different datasets by resampling, and try hundreds of different hyperparameter specifications. We then select the ‘best’ model based on an objective function (such as accuracy or RMSE).
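
As a preview, the core of that process looks something like this (a hedged sketch, assuming a workflow wf with tunable parameters and a set of resamples folds):

tuned <- tune_grid(wf, resamples = folds, grid = 20)  # try 20 hyperparameter combinations
show_best(tuned, metric = "rmse")                     # rank the candidates by RMSE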

2 Resampling

Section 10.1 in tidymodels gives an example of a linear model and a random forest model that perform differently on training and testing data, to frame why resampling is important.

RMSE estimates:

object   train    test
lm_fit   0.0754   0.0736
rf_fit   0.0365   0.0704

In this scenario, the random forest model performed much better on training than on testing because it overfit the training data. If we had picked which model to use based only on training RMSE, this could have led to some issues!

2.1 Overview

Resampling involves splitting, slicing, or bootstrapping the training set. In tidymodels, this could mean creating ten pairs of analysis and assessment subsamples from the training data. Model performance is then estimated by averaging performance across these samples.

Recall our example from week 6,

library(tidymodels)

data(cells, package = "modeldata")

# Split the data, stratifying on the outcome, then create 10 folds
# from the training set only
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
folds <- rsample::vfold_cv(cell_train, v = 10)

cell_test  <- testing(cell_split)


rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

2.2 Cross Validation

In V-fold cross-validation, the training data are split into V folds of roughly equal size; each resample holds out one fold for assessment and fits the model on the rest. With V = 3 (3-fold cross-validation), one third of the data is held out in each resample.

Our folds, using 5-fold cross-validation:

folds <- rsample::vfold_cv(cell_train, v = 5)
folds
splits              id
<list>              <chr>
<S3: vfold_split>   Fold1
<S3: vfold_split>   Fold2
<S3: vfold_split>   Fold3
<S3: vfold_split>   Fold4
<S3: vfold_split>   Fold5

Fold 1:

folds[[1]][[1]]
## <Analysis/Assess/Total>
## <1211/303/1514>
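
Each split records which rows form the analysis set (used for fitting) and the assessment set (held out). The underlying data frames can be pulled out with rsample's accessors:

analysis(folds$splits[[1]])    # the 1211 rows used for fitting
assessment(folds$splits[[1]])  # the 303 rows held out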

We then fit the model on each resample.

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control = control_resamples(save_pred = TRUE))

You can see the model performance on each fold…

collect_metrics(rf_fit_rs, summarize=FALSE)
id      .metric    .estimator   .estimate   .config
<chr>   <chr>      <chr>        <dbl>       <chr>
Fold1   accuracy   binary       0.8547855   Preprocessor1_Model1
Fold1   roc_auc    binary       0.9241195   Preprocessor1_Model1
Fold2   accuracy   binary       0.8184818   Preprocessor1_Model1
Fold2   roc_auc    binary       0.8877670   Preprocessor1_Model1
Fold3   accuracy   binary       0.8448845   Preprocessor1_Model1
Fold3   roc_auc    binary       0.9141798   Preprocessor1_Model1
Fold4   accuracy   binary       0.8382838   Preprocessor1_Model1
Fold4   roc_auc    binary       0.9195122   Preprocessor1_Model1
Fold5   accuracy   binary       0.8079470   Preprocessor1_Model1
Fold5   roc_auc    binary       0.9031015   Preprocessor1_Model1

Or aggregated.

collect_metrics(rf_fit_rs, summarize=TRUE)
.metric    .estimator   mean        n       std_err       .config
<chr>      <chr>        <dbl>       <int>   <dbl>         <chr>
accuracy   binary       0.8328765   5       0.008608482   Preprocessor1_Model1
roc_auc    binary       0.9097360   5       0.006514820   Preprocessor1_Model1

We can also aggregate predictions across folds to get an out-of-sample prediction for each row of the training data.

rf_testing_pred <- collect_predictions(rf_fit_rs, summarize = TRUE)

rf_testing_pred %>%                   
  accuracy(class, .pred_class)
.metric    .estimator   .estimate
<chr>      <chr>        <dbl>
accuracy   binary       0.832893

rf_testing_pred %>%                   
  roc_curve(class, .pred_PS) %>%
  autoplot()

2.3 Variations

2.3.1 Repeated CV

Continuing the theme of using resampling to get more information about model performance, a key variation is repeated cross-validation, which runs the entire V-fold procedure multiple times:

vfold_cv(data, v=10, repeats=5)

This would create 10 × 5 = 50 resamples.

2.3.2 Bootstrapping

Bootstrapping creates analysis sets the same size as the original data, drawn with replacement; the observations that are not selected form the assessment set.

bootstraps(data, times=5)

This code would create five bootstrap resamples.

2.3.3 Rolling Origin
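
Rolling-origin resampling is designed for time-ordered data: each analysis set contains observations up to a cutoff, and the assessment set contains the observations that immediately follow, so the model is never trained on the future. A minimal sketch (assuming a hypothetical data frame ts_df sorted by time; the sizes are placeholders):

ro_folds <- rsample::rolling_origin(
  ts_df,
  initial    = 52,   # size of the first analysis set
  assess     = 4,    # size of each assessment set
  cumulative = TRUE  # let the analysis set grow as the origin rolls forward
)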

2.4 Summary

During resampling, the analysis set is used to estimate any preprocessing parameters, that preprocessing is applied to the analysis set itself, and the processed data are used to fit the model.

The preprocessing statistics produced by the analysis set are then applied to the assessment set, and the predictions on the assessment set estimate performance.
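
For a single resample, that sequence looks roughly like this (a sketch, assuming a recipe rec and one split object split):

trained_rec  <- prep(rec, training = analysis(split))            # estimate preprocessing on the analysis set
assess_baked <- bake(trained_rec, new_data = assessment(split))  # apply those statistics to the assessment set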

3 Performance
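
By default, fit_resamples() discards predictions and fitted models to save memory. The control options below keep each fold's out-of-sample predictions and the workflow, and log progress while fitting: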

control_options <- control_resamples(
  save_pred = TRUE,      # keep each fold's out-of-sample predictions
  save_workflow = TRUE,  # attach the workflow to the results for later use
  verbose = TRUE         # print progress as each resample is fit
)

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds, control=control_options)

rf_preds <- rf_fit_rs %>% collect_predictions()

3.0.1 Parallelization

library(doParallel)

# Create a cluster object and then register: 
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Now run fit_resamples()...

stopCluster(cl)
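
With a backend registered, fit_resamples() and the tune_*() functions will automatically spread the resamples across the workers; stopCluster() releases them when you are finished.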

4 Models

4.1 Note on Random Forest Explainability

Partial dependence: a way of determining which variables are most important to the model. This is not the same as a model coefficient, but roughly a way of seeing what the model relies on. It can be useful for understanding your model and explaining it to others. For example, from Table 10.8 in DSPP, the number of hours worked provides the most information in that random forest model for predicting wages. If instead the most important variable were something you would expect to be completely random, like the weather, you might want to respecify your model.
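
As a rough sketch of computing partial dependence (assuming a fitted random forest rf_fit, its training frame train_df, and a predictor named hours, none of which come from the code in these notes; the pdp package is one option):

library(pdp)

# Vary one predictor over a grid and average the model's predictions
# at each grid value, holding the rest of the data as-is
partial(rf_fit, pred.var = "hours", train = train_df) %>%
  autoplot()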

There are other methods to decompose effects mentioned in the textbook, such as Locally Interpretable Model-Agnostic Explanations (LIME). If we have time at the end of the course, we will revisit these.

4.2 Gradient Boosting

The CCAO presenters talked about gradient boosting (LightGBM). LightGBM is not as well integrated into tidymodels as the other boosted tree engines.

Gradient boosting is similar to a decision tree model that gradually improves itself: each new tree is fit to correct the errors of the trees before it. I will show an example using XGBoost later on. Part of making a good model here involves cleverly creating predictor variables based on the context.
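
For reference, a boosted tree specification via parsnip's XGBoost engine looks like this (a sketch; the parameter values are placeholders, not recommendations):

boost_tree(trees = 500, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")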

4.3 Neural Networks

See Figure 10.14 in DSPP.

5 Ch 11 Comparing Models

Workflow sets let us define several preprocessing and model combinations and resample them all at once.

Let’s bring out a new toy data set: precinct-level mayoral election results.

We can construct a few different prediction questions here which make varying levels of sense.

For example:

  • Predicting the total number of votes cast in a precinct
  • Predicting which candidate would receive the most votes in a precinct
  • Predicting how many votes a candidate would receive in a precinct

Let’s make three different models to try and ‘predict’ the total number of votes cast.

library(tidyverse)

election <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_8_data.csv') %>%
  mutate(total_votes = rowSums(.[18:26]),  # columns 18-26 hold the per-candidate vote counts
         ward = as.character(ward),
         precinct = as.character(precinct))

ele_split <- initial_split(election)
ele_train <- training(ele_split)
folds <- rsample::vfold_cv(ele_train, v = 10)

#recipe
simple_rec <- recipe(total_votes ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train) %>% 
  step_dummy(all_nominal_predictors())


# Build on the simple recipe with interaction terms
# (population x each race count, housing units x occupancy/vacancy)
interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

# Add natural splines on the coordinates to capture spatial patterns
interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(lm = linear_reg())

#workflow_set
lm_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)

lm_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_lm         <tibble[,4]>   <wrkflw__>   <list [0]>
interactions_lm   <tibble[,4]>   <wrkflw__>   <list [0]>
inter_geo_lm      <tibble[,4]>   <wrkflw__>   <list [0]>

Now we need to resample each of the models. This is similar to ‘mapping’ a function over each workflow in the set.

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

lm_models <- 
  lm_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds, control = keep_pred)

lm_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_lm         <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>
interactions_lm   <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>
inter_geo_lm      <tibble[,4]>   <wrkflw__>   <rsmpl_rs[,5]>

collect_metrics(lm_models) %>% 
  filter(.metric == "rmse")
wflow_id          .config                preproc   model        .metric   .estimator   mean       n
<chr>             <chr>                  <chr>     <chr>        <chr>     <chr>        <dbl>      <int>
simple_lm         Preprocessor1_Model1   recipe    linear_reg   rmse      standard     125.9721   10
interactions_lm   Preprocessor1_Model1   recipe    linear_reg   rmse      standard     116.6216   10
inter_geo_lm      Preprocessor1_Model1   recipe    linear_reg   rmse      standard     118.3691   10

library(ggrepel)
autoplot(lm_models, metric = "rsq") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

5.1 11.2 Comparing Resample Performance Stats

Let’s try and see if the differences between models are statistically significant.

rsq_indiv_estimates <- 
  collect_metrics(lm_models, summarize = FALSE) %>% 
  filter(.metric == "rsq") 

rsq_wider <- 
  rsq_indiv_estimates %>% 
  select(wflow_id, .estimate, id) %>% 
  pivot_wider(id_cols = "id", names_from = "wflow_id", values_from = ".estimate")

corrr::correlate(rsq_wider %>% select(-id), quiet = TRUE)
term              simple_lm   interactions_lm   inter_geo_lm
<chr>             <dbl>       <dbl>             <dbl>
simple_lm         NA          0.7779450         0.7813525
interactions_lm   0.7779450   NA                0.8765089
inter_geo_lm      0.7813525   0.8765089         NA

rsq_indiv_estimates %>% 
  mutate(wflow_id = reorder(wflow_id, .estimate)) %>% 
  ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) + 
  geom_line(alpha = .5, lwd = 1.25) + 
  theme(legend.position = "none")

This compares model performance on each resample. If there were no resample-to-resample effect, the lines would not track in parallel. A statistical test on the correlations evaluates whether their magnitudes are simply noise. For the linear models:

rsq_wider %>% 
  with( cor.test(simple_lm, interactions_lm ) ) %>% 
  tidy() %>% 
  select(estimate, starts_with("conf"))
estimate   conf.low    conf.high
<dbl>      <dbl>       <dbl>
0.777945   0.2907152   0.9447963

The results of the correlation test (the estimate of the correlation and the confidence intervals) show us that the within-resample correlation appears to be real.

See other techniques in 11.3 and 11.4.

# Find the top vote-getter in each precinct
p_max <- election %>% select(precinct, ward, Green:Garcia) %>%
  pivot_longer(-c(precinct, ward)) %>%
  group_by(precinct, ward) %>%
  slice_max(order_by=`value`, n=1)

# Attach the winning candidate to each precinct as the outcome `top`
election_class <- election %>% left_join(
  p_max %>% rename(top=name)
)

ele_split2 <- initial_split(election_class)
ele_train2 <- training(ele_split2)
folds2 <- rsample::vfold_cv(ele_train2, v = 3)  # resample the training set, not the full data

#recipe
simple_rec <- recipe(top ~ ward + total_pop + total_units + occupied +
                       vacant + total_hisp + total_nonhisp + white_alone +
                       black_alone + aian_alone + nhpi_alone + asian_alone +
                       other_alone + two_or_more + long + lat,
                     data=ele_train2) %>% 
  step_dummy(all_nominal_predictors())


interaction_rec <- simple_rec %>% 
  step_interact( ~ total_pop:contains("alone") ) %>%
  step_interact( ~ total_units:occupied) %>%
  step_interact( ~ total_units:vacant)

interaction_geo_rec <- interaction_rec %>%
    step_ns(lat, long, deg_free = 50)

preproc <- list(
  simple = simple_rec,
  interactions = interaction_rec,
  inter_geo = interaction_geo_rec
)

#model
init_mod <- list(dt = decision_tree() %>% set_mode('classification'))

#workflowset

dt_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)
dt_models
wflow_id          info           option       result
<chr>             <list>         <list>       <list>
simple_dt         <tibble[,4]>   <wrkflw__>   <list [0]>
interactions_dt   <tibble[,4]>   <wrkflw__>   <list [0]>
inter_geo_dt      <tibble[,4]>   <wrkflw__>   <list [0]>

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)

dt_models <- 
  dt_models %>% 
  workflow_map("fit_resamples", 
               # Options to `workflow_map()`: 
               seed = 1101, verbose = TRUE,
               # Options to `fit_resamples()`: 
               resamples = folds2, control = keep_pred)

collect_metrics(dt_models)
wflow_id          .config                preproc   model           .metric    .estimator   mean
<chr>             <chr>                  <chr>     <chr>           <chr>      <chr>        <dbl>
simple_dt         Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8182521
simple_dt         Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8847618
interactions_dt   Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8205723
interactions_dt   Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8861958
inter_geo_dt      Preprocessor1_Model1   recipe    decision_tree   accuracy   multiclass   0.8058778
inter_geo_dt      Preprocessor1_Model1   recipe    decision_tree   roc_auc    hand_till    0.8831686

autoplot(dt_models, metric = "roc_auc") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

dt_preds <- collect_predictions(dt_models)

autoplot(dt_models, metric = "accuracy") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

roc_auc(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson))
.metric   .estimator   .estimate
<chr>     <chr>        <dbl>
roc_auc   hand_till    0.8782879

roc_curve(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson)) %>%
  autoplot()

selected_workflow <- 
  workflow() %>%
  add_model(decision_tree() %>% 
              set_engine("rpart") %>%
              set_mode('classification')) %>%
  add_recipe(simple_rec)

# Refit the chosen workflow on the full training set
ft_tree <- selected_workflow %>% fit(ele_train2)

# Pull out the underlying rpart object and plot the tree
tree_fit <- ft_tree %>% 
  extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)