Now that we have seen a number of different models and concepts, hopefully the pieces are starting to come together. Prediction problems are hard and there is no single approach that works for every problem, but the following steps will apply to most of our work.
Since we are focused on supervised learning here, to identify the task we need adequate labeled data and a sense of whether our outcome is a class or a continuous variable.
Preprocessing varies depending on the requirements of our model. Some models can only ingest certain types of data (one-hot encoded matrices, for example). Other variables may need to be rescaled or imputed. We can codify all of our preprocessing steps into a recipe.
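For example, a minimal recipe might look like the sketch below; train_df and the outcome y are hypothetical placeholders, and the steps shown are just common choices.

library(tidymodels)

rec <- recipe(y ~ ., data = train_df) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill in missing numeric values
  step_normalize(all_numeric_predictors()) %>%      # center and scale numeric predictors
  step_dummy(all_nominal_predictors())              # one-hot/dummy encode categorical predictors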
There are a variety of tests to help determine which variables to include in the formula. This step is fairly subjective, and exploratory data analysis (EDA) is usually the best guide.
There are hundreds of model types. In this class, we are going to use boosted trees, decision trees, linear regression, logistic regression, nearest neighbors, and random forests from parsnip.
Your pipeline is where you bring the steps above together into a cohesive object, and it is the foundation of our process. A pipeline consists of a workflow that bundles a recipe (or formula) and a model. In typical ML applications, hundreds or thousands of models will be built on this foundation in order to find the best model specification.
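In tidymodels, the workflow object plays this role. A minimal sketch, reusing the hypothetical rec from above with a linear regression model:

wf <- workflow() %>%
  add_recipe(rec) %>%                            # preprocessing
  add_model(linear_reg() %>% set_engine("lm"))   # model specification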
Chapters 10-15 of tidymodels offer methods to create numerous models and select the best one for predictive performance. In combination, these methods artificially create hundreds of different datasets by resampling and try hundreds of different hyperparameter specifications. We then select the 'best' model based on an objective function (such as accuracy or RMSE).
Section 10.1 of tidymodels gives an example of a linear model and a random forest model which perform differently on training and testing data, to frame why resampling is important.
RMSE Estimates

| object | train  | test   |
|--------|--------|--------|
| lm_fit | 0.0754 | 0.0736 |
| rf_fit | 0.0365 | 0.0704 |
In this scenario, the random forest model performed much better on training than testing since it overfit the training data. If we were going to pick which model to use based only on training RMSE performance, this could have led to some issues!
Resampling involves splitting, slicing, or bootstrapping the training set. In tidymodels, this could mean creating ten pairs of analysis and assessment subsamples from the training data. Model performance is then estimated by averaging performance across these samples.
Recall our example from week 6:
library(tidymodels)

data(cells, package = "modeldata")

# stratified train/test split
cell_split <- initial_split(cells %>% select(-case), strata = class)
cell_train <- training(cell_split)
cell_test <- testing(cell_split)

# 10-fold cross-validation on the training data
folds <- rsample::vfold_cv(cell_train, v = 10)
# model specification: a random forest with 1000 trees
rf_mod <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# workflow bundling the model with a formula
rf_wf <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)
With V = 3 (3-fold cross-validation), each iteration holds out one third of the data for assessment and fits the model on the remaining two thirds.
Our folds, using 5-fold cross-validation:
folds <- rsample::vfold_cv(cell_train, v = 5)
folds
Fold 1:
folds$splits[[1]]
## <Analysis/Assess/Total>
## <1211/303/1514>
We then fit the model on each resample.
rf_fit_rs <-
rf_wf %>%
fit_resamples(folds, control=control_resamples(save_pred=TRUE))
You can see the model performance on each fold…
collect_metrics(rf_fit_rs, summarize=FALSE)
Or aggregated.
collect_metrics(rf_fit_rs, summarize=TRUE)
We can also aggregate across the folds to get out-of-fold predictions for our training data.
rf_testing_pred <- collect_predictions(rf_fit_rs, summarize = TRUE)  # predictions on held-out folds of the training data, not the test set
rf_testing_pred %>%
accuracy(class, .pred_class)
rf_testing_pred %>%
roc_curve(class, .pred_PS) %>%
autoplot()
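We could also tabulate these out-of-fold predictions as a confusion matrix (a quick aside, using yardstick's conf_mat()):

rf_testing_pred %>%
  conf_mat(class, .pred_class)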
In the same spirit of using resampling to learn more about model performance, a key variation is repeated cross-validation, which repeats the entire V-fold procedure several times:
vfold_cv(data, v = 10, repeats = 5)
This would create 50 resamples (10 folds times 5 repeats). Bootstrapping instead creates samples of the same size as the original data, drawn with replacement:
bootstraps(data, times = 5)
During resampling, the analysis set is used to estimate any preprocessing steps, apply that preprocessing to itself, and fit the model on the processed data.
The preprocessing statistics produced by the analysis set are then applied to the assessment set, and predictions on the assessment set are used to estimate performance.
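To make the mechanics concrete, here is a minimal sketch of what fit_resamples() does internally for a single resample, using the cells folds from above and an illustrative recipe (norm_rec, split_1, analysis_1, and assessment_1 are names I made up; the random forest workflow above used a formula instead of a recipe):

split_1      <- folds$splits[[1]]
analysis_1   <- analysis(split_1)    # used to estimate preprocessing and fit the model
assessment_1 <- assessment(split_1)  # only used to measure performance

norm_rec <- recipe(class ~ ., data = analysis_1) %>%
  step_normalize(all_numeric_predictors())

rec_trained  <- prep(norm_rec, training = analysis_1)       # statistics estimated from the analysis set...
baked_assess <- bake(rec_trained, new_data = assessment_1)  # ...then applied to the assessment set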
control_options <- control_resamples(
  save_pred = TRUE,      # keep assessment-set predictions
  save_workflow = TRUE,  # keep the workflow in the results
  verbose = TRUE         # print progress
)
rf_fit_rs <-
rf_wf %>%
fit_resamples(folds, control=control_options)
rf_preds <- rf_fit_rs %>% collect_predictions()
Resampling fits many independent models, so it parallelizes well. For example, with doParallel:

library(doParallel)

# Create a cluster object and then register it:
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Now run fit_resamples() as before; it will use the registered workers
rf_fit_rs <-
  rf_wf %>%
  fit_resamples(folds, control = control_options)

stopCluster(cl)
Partial dependence: a way of seeing how a model's predictions change as a single variable changes, closely related to measuring which variables are most important to the model. This is not the same as a model coefficient, but it is roughly a way of seeing what the model relies on, which is useful for understanding your model and explaining it to others. For example, from Table 10.8 in DSPP, number of hours worked provides the most information in that random forest model for predicting wages. If instead the most important variable were something you would expect to be completely random, like the weather, you might want to respecify your model.
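As a sketch of how you might inspect variable importance for the random forest from the cells example (my own illustration, not from the textbook), the vip package works with ranger fits as long as ranger is asked to compute importance scores; rf_vi_mod and rf_vi_fit are names I made up here:

library(vip)

# refit the random forest, asking ranger to track impurity-based importance
rf_vi_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_vi_fit <- workflow() %>%
  add_model(rf_vi_mod) %>%
  add_formula(class ~ .) %>%
  fit(cell_train)

rf_vi_fit %>%
  extract_fit_engine() %>%  # pull out the underlying ranger object
  vip()                     # plot the most influential variables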
There are other methods to decompose effects mentioned in the textbook, such as Local Interpretable Model-agnostic Explanations (LIME). If we have time at the end of the course, we will revisit these.
The CCAO presentation talked about gradient boosting (LightGBM). LightGBM is not as well integrated into tidymodels as the other boosted tree engines.
Gradient boosting is similar to a decision tree model that gradually improves itself: it builds many trees in sequence, with each new tree correcting the errors of the previous ones. I will show an example using XGBoost later on. Part of making a good model here involves cleverly creating predictor variables based on the context.
See Figure 10.14.
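As a preview (a sketch of my own, not the CCAO's model), a boosted tree specification in parsnip with the xgboost engine looks like the following; the trees and learn_rate values are illustrative, and the spec could be swapped into any of the workflows below:

xgb_mod <- boost_tree(trees = 500, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")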
Workflow sets.
Let’s bring out a new toy data set: precinct-level mayoral election results.
We can construct a few different prediction questions here which make varying levels of sense.
For example, let’s make three different models to try to ‘predict’ the total number of votes cast.
election <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_8_data.csv') %>%
  mutate(total_votes = rowSums(.[18:26]),  # columns 18-26 are the candidate vote counts
         ward = as.character(ward),
         precinct = as.character(precinct))
ele_split <- initial_split(election)
ele_train <- training(ele_split)
folds <- rsample::vfold_cv(ele_train, v = 10)
#recipe
simple_rec <- recipe(total_votes ~ ward + total_pop + total_units + occupied +
vacant + total_hisp + total_nonhisp + white_alone +
black_alone + aian_alone + nhpi_alone + asian_alone +
other_alone + two_or_more + long + lat,
data=ele_train) %>%
step_dummy(all_nominal_predictors())
interaction_rec <- simple_rec %>%
step_interact( ~ total_pop:contains("alone") ) %>%
step_interact( ~ total_units:occupied) %>%
step_interact( ~ total_units:vacant)
interaction_geo_rec <- interaction_rec %>%
step_ns(lat, long, deg_free = 50)
preproc <- list(
simple = simple_rec,
interactions = interaction_rec,
inter_geo = interaction_geo_rec
)
# model
init_mod <- list(lm = linear_reg())

# workflow set: crosses each recipe with the model
lm_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)
lm_models
Now we need to resample each of the models. This is a similar task to ‘mapping.’
keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
lm_models <-
lm_models %>%
workflow_map("fit_resamples",
# Options to `workflow_map()`:
seed = 1101, verbose = TRUE,
# Options to `fit_resamples()`:
resamples = folds, control = keep_pred)
lm_models
collect_metrics(lm_models) %>%
filter(.metric == "rmse")
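workflowsets also has a helper to rank the workflows by a chosen metric (a small aside):

rank_results(lm_models, rank_metric = "rmse")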
library(ggrepel)
autoplot(lm_models, metric = "rsq") +
geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
theme(legend.position = "none")
Let’s see whether the differences between models are statistically significant.
rsq_indiv_estimates <-
collect_metrics(lm_models, summarize = FALSE) %>%
filter(.metric == "rsq")
rsq_wider <-
rsq_indiv_estimates %>%
select(wflow_id, .estimate, id) %>%
pivot_wider(id_cols = "id", names_from = "wflow_id", values_from = ".estimate")
corrr::correlate(rsq_wider %>% select(-id), quiet = TRUE)
rsq_indiv_estimates %>%
mutate(wflow_id = reorder(wflow_id, .estimate)) %>%
ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) +
geom_line(alpha = .5, lwd = 1.25) +
theme(legend.position = "none")
This compares model performance on each resample. If the resample-to-resample effect were not real, the lines would not be roughly parallel. A statistical test on the correlations evaluates whether their magnitudes are more than just noise. For the linear models:
rsq_wider %>%
with( cor.test(simple_lm, interactions_lm ) ) %>%
tidy() %>%
select(estimate, starts_with("conf"))
The results of the correlation test (the estimate of the correlation and the confidence intervals) show us that the within-resample correlation appears to be real.
See other techniques in 11.3 and 11.4.
Now let’s turn this into a classification problem by predicting each precinct’s winning candidate.

# find the top vote-getter ("top") in each precinct
p_max <- election %>% select(precinct, ward, Green:Garcia) %>%
  pivot_longer(-c(precinct, ward)) %>%
  group_by(precinct, ward) %>%
  slice_max(order_by = value, n = 1)

# join the winner back onto the full data
election_class <- election %>% left_join(
  p_max %>% rename(top = name)
)
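A quick sanity check on the new outcome before modeling (my own aside):

election_class %>% count(top, sort = TRUE)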
ele_split2 <- initial_split(election_class)
ele_train2 <- training(ele_split2)
folds2 <- rsample::vfold_cv(ele_train2, v = 3)  # resample the training data, not the full data
#recipe
simple_rec <- recipe(top ~ ward + total_pop + total_units + occupied +
vacant + total_hisp + total_nonhisp + white_alone +
black_alone + aian_alone + nhpi_alone + asian_alone +
other_alone + two_or_more + long + lat,
data=ele_train2) %>%
step_dummy(all_nominal_predictors())
interaction_rec <- simple_rec %>%
step_interact( ~ total_pop:contains("alone") ) %>%
step_interact( ~ total_units:occupied) %>%
step_interact( ~ total_units:vacant)
interaction_geo_rec <- interaction_rec %>%
step_ns(lat, long, deg_free = 50)
preproc <- list(
simple = simple_rec,
interactions = interaction_rec,
inter_geo = interaction_geo_rec
)
# model
init_mod <- list(dt = decision_tree() %>% set_mode('classification'))

# workflow set
dt_models <- workflow_set(
  preproc = preproc,
  models = init_mod
)
dt_models
keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
dt_models <-
dt_models %>%
workflow_map("fit_resamples",
# Options to `workflow_map()`:
seed = 1101, verbose = TRUE,
# Options to `fit_resamples()`:
resamples = folds2, control = keep_pred)
collect_metrics(dt_models)
autoplot(dt_models, metric = "roc_auc") +
geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
theme(legend.position = "none")
dt_preds <- collect_predictions(dt_models)
autoplot(dt_models, metric = "accuracy") +
geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
theme(legend.position = "none")
roc_auc(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson))
roc_curve(dt_preds, top, c(.pred_Garcia, .pred_Johnson, .pred_Lightfoot, .pred_Vallas, .pred_Wilson)) %>%
autoplot()
selected_workflow <-
workflow() %>%
add_model(decision_tree() %>%
set_engine("rpart") %>%
set_mode('classification')) %>%
add_recipe(simple_rec)
ft_tree <- selected_workflow %>% fit(ele_train2)
tree_fit <- ft_tree %>%
extract_fit_engine()
rpart.plot::rpart.plot(tree_fit)
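Finally, as a hedged sketch of a last step not shown above, once we settle on a workflow we would fit it on the full training set and evaluate it once on the held-out test set with last_fit(); final_res is just an illustrative name.

final_res <- last_fit(selected_workflow, ele_split2)  # fit on training(ele_split2), evaluate on testing(ele_split2)
collect_metrics(final_res)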