Causal inference aims to estimate treatment effects with a high degree of confidence; it answers the question of why. Prediction, in contrast, answers questions about who, what, when, and where, anticipating what will happen.
“Given what the model has seen before, and assuming the new data follows the same process, what will the outcomes be in this new dataset?”
\(\text{Error} = \text{Reducible} + \text{Irreducible}\)
\(\text{Error} = (\text{Bias} + \text{Variance}) + \text{Irreducible}\)
Irreducible error: natural uncertainty/sampling error that no model can remove
Reducible error: bias + variance
Bias: difference between the model and the theoretical true model (e.g., error due to erroneous assumptions)
Variance: the model learns random, irrelevant patterns in the data (e.g., predictions change/vary widely depending on the training sample)
Underfit: large bias; the model misses real variability (e.g., a straight line fit to a quadratic process)
Overfit: high variance; the model fits the noise and therefore fails out of sample
See Figure 9.7 and Table 9.3.
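As a rough illustration of underfitting versus overfitting (a minimal sketch with simulated data, not the cells example below): a straight line underfits a quadratic process, while a very flexible polynomial chases the noise.
# hypothetical simulated data for illustration (not the cells data used below)
set.seed(123)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x^2 + rnorm(n, sd = 1)   # quadratic signal plus irreducible noise
dat <- data.frame(x, y)
train <- dat[1:100, ]
test  <- dat[101:200, ]

underfit <- lm(y ~ x, data = train)            # straight line: high bias
overfit  <- lm(y ~ poly(x, 15), data = train)  # degree-15 polynomial: high variance

rmse <- function(model, newdata) sqrt(mean((newdata$y - predict(model, newdata))^2))
rmse(underfit, train); rmse(underfit, test)  # poor on both (bias)
rmse(overfit, train);  rmse(overfit, test)   # typically better on train than on test (variance)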
Resampling example from tidymodels: https://www.tidymodels.org/start/resampling/
Splitting data
library(tidymodels)

data(cells, package = "modeldata")

# stratified split so the class proportions are preserved in both sets
cell_split <- initial_split(cells %>% select(-case),
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

# check class proportions in the training set
cell_train %>%
  count(class) %>%
  mutate(prop = n / sum(n))
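The same check on the test set should show similar proportions, since the split was stratified by class (a quick sanity check, not part of the original tutorial excerpt):
cell_test %>%
  count(class) %>%
  mutate(prop = n / sum(n))  # proportions should be close to those in cell_train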
Initial model
# random forest specification: 1000 trees, ranger engine, classification mode
rf_mod <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# fit on the training set only
rf_fit <-
  rf_mod %>%
  fit(class ~ ., data = cell_train)
rf_fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 1514
## Number of independent variables: 56
## Mtry: 7
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1200332
Performance typically differs between the training and testing sets; an overfit model effectively ‘memorizes’ the training set. Compare the resubstitution (training) metrics with the test-set metrics:
# predictions on the training set (resubstitution estimates; optimistic)
rf_training_pred <-
  rf_fit %>% augment(cell_train)

rf_training_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_training_pred %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()

rf_training_pred %>%
  accuracy(truth = class, .pred_class)
# predictions on the held-out test set
rf_testing_pred <-
  rf_fit %>% augment(cell_test)

rf_testing_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%
  accuracy(truth = class, .pred_class)

rf_testing_pred %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()
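To see the train/test gap side by side, one option is to stack the two ROC AUC estimates (a small sketch using the objects above; bind_rows() and the data label column are additions, not part of the original tutorial):
bind_rows(
  rf_training_pred %>% roc_auc(truth = class, .pred_PS) %>% mutate(data = "train"),
  rf_testing_pred %>% roc_auc(truth = class, .pred_PS) %>% mutate(data = "test")
)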
Resampling gives an honest estimate of out-of-sample performance without touching the test set. Here, 10-fold cross-validation on the training set:
folds <- rsample::vfold_cv(cell_train, v = 10)

rf_wf <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

control <- control_resamples(
  verbose = TRUE,
  save_pred = FALSE  # the default; we will use save_pred = TRUE later on
)

rf_fit_rs <-
  rf_wf %>%
  fit_resamples(folds,
                control = control)
collect_metrics(rf_fit_rs, summarize = FALSE)  # per-fold metrics
collect_metrics(rf_fit_rs, summarize = TRUE)   # averaged across the 10 folds
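As noted in the control_resamples() call above, setting save_pred = TRUE keeps the held-out predictions from each fold. A sketch of how those predictions could then be pooled and plotted (the object names control_pred and rf_fit_rs_pred are placeholders):
# refit with the assessment-set predictions saved
control_pred <- control_resamples(save_pred = TRUE)

rf_fit_rs_pred <-
  rf_wf %>%
  fit_resamples(folds, control = control_pred)

# out-of-sample predictions pooled across the 10 assessment sets
collect_predictions(rf_fit_rs_pred) %>%
  roc_curve(truth = class, .pred_PS) %>%
  autoplot()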
If the resampling estimates are trustworthy, they should line up with performance on the held-out test set:
rf_testing_pred %>%
  roc_auc(truth = class, .pred_PS)

rf_testing_pred %>%
  accuracy(truth = class, .pred_class)