Causal inference aims to estimate treatment effects with a high degree of confidence; it answers the question of why. Prediction, in contrast, answers questions about who, what, when, and where, anticipating what will happen.
“Given what the model has seen before, and assuming the new data follows the same process, what will the outcomes be in this new dataset?”
\(\text{Error} = \text{Reducible} + \text{Irreducible}\)
\(\text{Error} = (\text{Bias} + \text{Variance}) + \text{Irreducible}\)
Irreducible error: natural uncertainty/sampling error that no model can remove
Reducible error: bias + variance
Bias: difference between the model and the theoretical true model (e.g., error due to erroneous assumptions)
Variance: the model learns random, irrelevant patterns in the data (e.g., predictions change/vary widely depending on the training sample)
Underfit: large bias; the model misses real variability (e.g., a straight line fit to a quadratic process)
Overfit: high variance; the model fits the noise and therefore fails out of sample
See Figure 9.7 and Table 9.3.
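As a rough illustration of underfitting versus overfitting (a minimal sketch with simulated data, not the cells example below): a straight line underfits a quadratic process, while a very flexible polynomial chases the noise.
# hypothetical simulated data for illustration (not the cells data used below)
set.seed(123)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x^2 + rnorm(n, sd = 1)   # quadratic signal plus irreducible noise
dat <- data.frame(x, y)
train <- dat[1:100, ]
test  <- dat[101:200, ]

underfit <- lm(y ~ x, data = train)            # straight line: high bias
overfit  <- lm(y ~ poly(x, 15), data = train)  # degree-15 polynomial: high variance

rmse <- function(model, newdata) sqrt(mean((newdata$y - predict(model, newdata))^2))
rmse(underfit, train); rmse(underfit, test)  # poor on both (bias)
rmse(overfit, train);  rmse(overfit, test)   # typically better on train than on test (variance)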
Resampling example from tidymodels: https://www.tidymodels.org/start/resampling/
Splitting data
library(tidymodels)

data(cells, package = "modeldata")

# stratified split so the class proportions are preserved in both sets
cell_split <- initial_split(cells %>% select(-case),
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

# check class proportions in the training set
cell_train %>%
  count(class) %>%
  mutate(prop = n / sum(n))
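The same check on the test set should show similar proportions, since the split was stratified by class (a quick sanity check, not part of the original tutorial excerpt):
cell_test %>%
  count(class) %>%
  mutate(prop = n / sum(n))  # proportions should be close to those in cell_train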
Initial model
# random forest specification: 1000 trees, ranger engine, classification mode
rf_mod <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# fit on the training set only
rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)
rf_fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 1514
## Number of independent variables: 56
## Mtry: 7
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1200332
Performance typically differs between the training and testing sets; an overfit model effectively ‘memorizes’ the training set. Compare the resubstitution (training) metrics with the test-set metrics:
# predictions on the training set (resubstitution estimates; optimistic)
rf_training_pred <-
  rf_fit %>% augment(cell_train)

rf_training_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_training_pred %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()

rf_training_pred %>%
  accuracy(truth = class, .pred_class)
# predictions on the held-out test set
rf_testing_pred <-
  rf_fit %>% augment(cell_test)

rf_testing_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%
  accuracy(truth = class, .pred_class)

rf_testing_pred %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()
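To see the train/test gap side by side, one option is to stack the two ROC AUC estimates (a small sketch using the objects above; bind_rows() and the data label column are additions, not part of the original tutorial):
bind_rows(
  rf_training_pred %>% roc_auc(truth = class, .pred_PS) %>% mutate(data = "train"),
  rf_testing_pred %>% roc_auc(truth = class, .pred_PS) %>% mutate(data = "test")
)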
Resampling gives an honest estimate of out-of-sample performance without touching the test set. Here, 10-fold cross-validation on the training set:
folds <- rsample::vfold_cv(cell_train, v = 10)

rf_wf <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

control <- control_resamples(
  verbose = TRUE,
  save_pred = FALSE  # the default; we will use save_pred = TRUE later on
)

rf_fit_rs <-
  rf_wf %>%
  fit_resamples(folds,
                control = control)
collect_metrics(rf_fit_rs, summarize = FALSE)  # per-fold metrics
collect_metrics(rf_fit_rs, summarize = TRUE)   # averaged across the 10 folds
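As noted in the control_resamples() call above, setting save_pred = TRUE keeps the held-out predictions from each fold. A sketch of how those predictions could then be pooled and plotted (the object names control_pred and rf_fit_rs_pred are placeholders):
# refit with the assessment-set predictions saved
control_pred <- control_resamples(save_pred = TRUE)

rf_fit_rs_pred <-
  rf_wf %>%
  fit_resamples(folds, control = control_pred)

# out-of-sample predictions pooled across the 10 assessment sets
collect_predictions(rf_fit_rs_pred) %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()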
If the resampling estimates are trustworthy, they should line up with performance on the held-out test set:
rf_testing_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%
  accuracy(truth = class, .pred_class)