• 1 What is prediction?
  • 2 Bias-Variance Trade-off
    • 2.1 The Trade-off
  • 3 Objective/Loss functions
  • 4 Cross Validation

1 What is prediction?

Causal inference focuses on extracting treatment effects with a high degree of confidence, answering why. Prediction, in contrast, answers questions about who, what, when, and where: anticipating what will happen.

“Given what the model has seen before, and assuming the new data follows the same pattern, what will the outcomes be in this new dataset?”

2 Bias-Variance Trade-off

  • Accuracy: how close you are to predicting a value
  • Error: how far you are from the value

Error = Reducible + Irreducible

Error = (Bias + Variance) + Irreducible

  • Irreducible error: natural uncertainty/sampling error

  • Reducible error: bias + variance

  • Bias: difference between model and theoretical true model (e.g. error due to erroneous assumptions)

  • Variance: model learns random, irrelevant patterns in the data (e.g. model predictions change/vary widely across training samples)
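For squared-error loss, the decomposition above has a standard explicit form (notation is mine, not from these notes): with true function f, fitted model f-hat, and noise variance sigma-squared,

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible}}
```

The first two terms are the reducible error; the last term is the irreducible error from the bullets above.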

2.1 The Trade-off

  • Underfit: large bias; misses the real variability (e.g. fitting a straight line to a quadratic process)
  • Overfit: high variance (e.g. fitting the noise, so the model fails out of sample)
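A minimal base-R sketch of the trade-off; the simulated data and model choices here are my own illustration, not from the notes:

```r
set.seed(1)

# Quadratic data-generating process with noise
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.3)
train <- seq_len(n) %% 2 == 0  # even indices train, odd indices test

# Underfit: straight line (high bias)
fit_line <- lm(y ~ x, subset = train)
# Overfit: degree-15 polynomial (high variance)
fit_poly <- lm(y ~ poly(x, 15), subset = train)
# Roughly the right complexity
fit_quad <- lm(y ~ poly(x, 2), subset = train)

# Held-out mean squared error for each fit
mse <- function(fit) {
  mean((y[!train] - predict(fit, newdata = data.frame(x = x[!train])))^2)
}
sapply(list(line = fit_line, poly15 = fit_poly, quad = fit_quad), mse)
```

Comparing the held-out MSEs shows the straight line paying a large bias penalty, while the quadratic fit tracks the true process.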

Figure 9.7

3 Objective/Loss functions

Table 9.3
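Table 9.3 is not reproduced here; as an illustration, two common loss functions computed by hand in base R (the toy numbers are my own, not from the table):

```r
# Squared-error loss (regression), computed by hand
truth <- c(2.0, 3.5, 1.0)
pred  <- c(2.5, 3.0, 1.5)
mse   <- mean((truth - pred)^2)  # mean squared error: 0.25
rmse  <- sqrt(mse)               # same units as the outcome: 0.5

# Log loss / cross-entropy (binary classification), computed by hand
y <- c(1, 0, 1)        # observed binary outcomes
p <- c(0.9, 0.2, 0.6)  # predicted P(y = 1)
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
```

The objective function is what the model minimizes during fitting; the same quantities can also serve as evaluation metrics afterwards.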

4 Cross Validation

Resampling example from tidymodels: https://www.tidymodels.org/start/resampling/

Splitting data

library(tidymodels)  # loads rsample, parsnip, workflows, yardstick, etc.

data(cells, package = "modeldata")

cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

cell_train %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))
class    n       prop
PS     975  0.6439894
WS     539  0.3560106

Initial model

rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_fit <- 
  rf_mod %>% 
  fit(class ~ ., data = cell_train)

rf_fit
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      1514 
## Number of independent variables:  56 
## Mtry:                             7 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1200332

Compare performance on training versus testing data: a large gap suggests the model is ‘memorizing’ the training set.

rf_training_pred <- 
  rf_fit %>% augment(cell_train)

rf_training_pred %>%                
  roc_auc(truth = class, .pred_PS)
.metric  .estimator  .estimate
roc_auc  binary      0.9999087
rf_training_pred %>%                
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()

rf_training_pred %>%                
  accuracy(truth = class, .pred_class)
.metric   .estimator  .estimate
accuracy  binary      0.994716
rf_testing_pred <- 
  rf_fit %>% augment(cell_test)

rf_testing_pred %>%                   
  roc_auc(truth = class, .pred_PS)
.metric  .estimator  .estimate
roc_auc  binary      0.8968889
rf_testing_pred %>%                  
  accuracy(truth = class, .pred_class)
.metric   .estimator  .estimate
accuracy  binary      0.8316832
rf_testing_pred %>%                
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()

folds <- rsample::vfold_cv(cell_train, v = 10)

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

control <- control_resamples(
  verbose = TRUE,
  save_pred = FALSE  # the default; we will use save_pred = TRUE later on
)

rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds,
                control=control)

collect_metrics(rf_fit_rs, summarize=FALSE)
id      .metric   .estimator  .estimate  .config
Fold01  accuracy  binary      0.8157895  Preprocessor1_Model1
Fold01  roc_auc   binary      0.8885512  Preprocessor1_Model1
Fold02  accuracy  binary      0.7960526  Preprocessor1_Model1
Fold02  roc_auc   binary      0.8580392  Preprocessor1_Model1
Fold03  accuracy  binary      0.7894737  Preprocessor1_Model1
Fold03  roc_auc   binary      0.8775287  Preprocessor1_Model1
Fold04  accuracy  binary      0.8223684  Preprocessor1_Model1
Fold04  roc_auc   binary      0.8924851  Preprocessor1_Model1
Fold05  accuracy  binary      0.8543046  Preprocessor1_Model1
Fold05  roc_auc   binary      0.9280303  Preprocessor1_Model1
collect_metrics(rf_fit_rs, summarize=TRUE)
.metric   .estimator  mean       n   std_err      .config
accuracy  binary      0.8276664  10  0.007943601  Preprocessor1_Model1
roc_auc   binary      0.9064980  10  0.008290691  Preprocessor1_Model1
These resampled estimates are close to the held-out test-set performance computed earlier:

rf_testing_pred %>%                   
  roc_auc(truth = class, .pred_PS)

.metric  .estimator  .estimate
roc_auc  binary      0.8968889
rf_testing_pred %>%                   
  accuracy(truth = class, .pred_class)

.metric   .estimator  .estimate
accuracy  binary      0.8316832