Last week we looked at the tidymodels pipeline for the first time: broom (tidy, glance, augment), rsample (splitting data), parsnip (specifying models), and yardstick (evaluating models). For a brief review, look at the solution to coding warmup 3, part 4.
Some more helpful tools.
Checking model specification
parsnip::rand_forest() %>%
set_engine("ranger") %>%
set_mode("regression") %>%
translate()
## Random Forest Model Specification (regression)
##
## Computational engine: ranger
##
## Model fit template:
## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
## num.threads = 1, verbose = FALSE, seed = sample.int(10^5,
## 1))
#?rand_forest
rand_forest(trees = 1000, min_n = 5) %>%
set_engine("ranger", verbose = TRUE) %>%
set_mode("regression") %>%
translate()
## Random Forest Model Specification (regression)
##
## Main Arguments:
## trees = 1000
## min_n = 5
##
## Engine-Specific Arguments:
## verbose = TRUE
##
## Computational engine: ranger
##
## Model fit template:
## ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
## num.trees = 1000, min.node.size = min_rows(~5, x), verbose = TRUE,
## num.threads = 1, seed = sample.int(10^5, 1))
#parsnip_addin()
As an aside, what is a random forest? It is an ensemble model (meaning it aggregates information from multiple models) that uses many decision trees to make predictions (many trees make a forest). For classification, the prediction is usually the class selected by the most trees; for regression, the average prediction across the individual trees is returned.
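Here is a minimal sketch (my own illustration, not part of the class materials, and assuming the ranger package is installed) of fitting the random forest specification above on the built-in mtcars data, just to show the fit/predict workflow:
rf_fit <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("regression") %>%
fit(mpg ~ ., data = mtcars) # each tree predicts mpg; the predictions are averaged
predict(rf_fit, new_data = mtcars[1:3, ])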
Let’s run a very basic example on Lending Club data.
library(modeldata)
data("lending_club")
glimpse(lending_club)
## Rows: 9,857
## Columns: 23
## $ funded_amnt <int> 16100, 32000, 10000, 16800, 3500, 10000, 11…
## $ term <fct> term_36, term_60, term_36, term_60, term_36…
## $ int_rate <dbl> 13.99, 11.99, 16.29, 13.67, 7.39, 11.47, 5.…
## $ sub_grade <fct> C4, C1, D1, C3, A4, B5, A1, B2, B3, C2, C3,…
## $ addr_state <fct> CT, MN, OH, NV, CA, TX, KY, MO, NY, GA, NH,…
## $ verification_status <fct> Not_Verified, Verified, Source_Verified, Ve…
## $ annual_inc <dbl> 35000, 72000, 72000, 101000, 50100, 32000, …
## $ emp_length <fct> emp_5, emp_ge_10, emp_ge_10, emp_lt_1, emp_…
## $ delinq_2yrs <int> 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0…
## $ inq_last_6mths <int> 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 3, 3, 0, 0, 1…
## $ revol_util <dbl> 67.4, 82.2, 74.2, 64.0, 78.3, 68.1, 54.8, 3…
## $ acc_now_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ open_il_6m <int> 1, 5, 3, 6, 0, 1, 2, 8, 1, 0, 3, 1, 2, 1, 2…
## $ open_il_12m <int> 0, 1, 2, 1, 0, 0, 0, 2, 0, 0, 1, 2, 1, 0, 4…
## $ open_il_24m <int> 0, 3, 3, 2, 0, 0, 0, 5, 0, 0, 1, 5, 1, 0, 6…
## $ total_bal_il <int> 1099, 49187, 33378, 55445, 0, 7574, 20998, …
## $ all_util <int> 48, 77, 79, 64, 78, 66, 54, 80, 75, 36, 43,…
## $ inq_fi <int> 0, 0, 1, 1, 0, 0, 0, 0, 2, 1, 1, 1, 0, 1, 7…
## $ inq_last_12m <int> 3, 0, 4, 4, 0, 0, 0, 4, 6, 1, 3, 5, 1, 0, 5…
## $ delinq_amnt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_il_tl <int> 3, 9, 9, 10, 3, 9, 7, 21, 7, 1, 7, 19, 7, 6…
## $ total_il_high_credit_limit <int> 13193, 65945, 39387, 60188, 0, 12131, 39486…
## $ Class <fct> good, good, good, good, good, good, good, g…
lending_club <- lending_club %>%
mutate(Class = relevel(Class, "good"))
lending_club_mini <- lending_club %>%
select(Class, annual_inc, verification_status, sub_grade)
lending_club_mini %>% count(Class)
Each row is a loan, and the Class column classifies it as either good or bad.
show_engines("logistic_reg")
split <- initial_split(lending_club_mini)
train <- training(split)
test <- testing(split)
train %>% count(Class)
We can see that there is a really large class imbalance: most loans are good. This is something to watch out for! If our model ‘predicts’ good for every loan, it will perform deceptively well.
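As a quick sketch (using the training split from above), the share of good loans is exactly the accuracy an always-predict-‘good’ model would achieve:
# accuracy of a model that labels every loan 'good'
train %>% summarize(baseline_accuracy = mean(Class == "good"))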
log_model <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification') %>%
fit(Class ~ annual_inc + verification_status + sub_grade,
data = train)
log_model %>% tidy()
# calling car::vif() directly on the parsnip fit won't work;
# we need to extract the underlying glm object first
# log_model %>%
#   car::vif()
log_model %>%
extract_fit_engine() %>%
car::vif()
## GVIF Df GVIF^(1/(2*Df))
## annual_inc 1.031247 1 1.015503
## verification_status 1.095046 2 1.022959
## sub_grade 1.119974 34 1.001668
Multicollinearity is a big issue: if predictors are strongly correlated with one another (as well as with the target), the model cannot distinguish between their individual influences. Variance Inflation Factors (VIF) quantify this: a VIF of 1 means a predictor is orthogonal (uncorrelated) to the other predictors, while values above roughly 5 signal a problem.
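As a minimal sketch of where these numbers come from (my own illustration, using the training data above): a predictor’s VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all of the other predictors.
# auxiliary regression of annual_inc on the other predictors
aux <- lm(annual_inc ~ verification_status + sub_grade, data = train)
1 / (1 - summary(aux)$r.squared) # should roughly match the GVIF for annual_inc above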
# class probabilities (.pred_good, .pred_bad)
preds <- log_model %>%
predict(test, type = 'prob')
# hard class predictions (.pred_class)
preds <- log_model %>%
predict(test, type = 'class')
test_preds <- log_model %>% augment(test)
test_preds %>% select(Class, .pred_class, .pred_good, .pred_bad)
We can predict either the class itself or the class probabilities; yardstick has separate metrics for each (class metrics and class probability metrics).
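For example (a quick sketch using the augmented predictions above), accuracy is a class metric that takes .pred_class, while roc_auc below is a class probability metric that takes .pred_good:
accuracy(test_preds, truth = Class, estimate = .pred_class)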
test_preds %>% count(Class, .pred_class)
ggplot(test_preds, aes(x=.pred_good, color=Class)) + geom_density()
test_preds %>% group_by(Class) %>% summarize(mean(.pred_good), median(.pred_good))
roc_auc(test_preds,
Class,
.pred_good)
roc_curve(test_preds,
Class,
.pred_good) %>%
autoplot()
specificity(test_preds,
truth = Class,
estimate = .pred_class)
sensitivity(test_preds,
truth = Class,
estimate = .pred_class)
test_preds %>% count(.pred_class)
We can see that this model always predicts good, yet it still has some ability to distinguish between the classes; we just need to consider different probability thresholds. This will be something we explore more as the class goes on, so don’t worry if it seems complicated!
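As a rough illustration (the 0.85 cutoff here is arbitrary, chosen purely for demonstration): applying a stricter ‘good’ cutoff by hand changes which loans get labelled bad.
test_preds %>%
mutate(.pred_strict = factor(if_else(.pred_good >= 0.85, "good", "bad"),
levels = levels(Class))) %>%
count(.pred_strict)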
The probably package lets us compute sensitivity (the true positive rate) and specificity (the true negative rate) across a range of thresholds, so we can find the threshold that best balances the two.
threshold_data <- test_preds %>%
probably::threshold_perf(Class, .pred_good, thresholds = seq(0.5, 1, by = 0.0025))
# threshold with the highest J-index (sensitivity + specificity - 1)
max_threshold <- threshold_data %>%
filter(.metric == 'j_index') %>%
filter(.estimate == max(.estimate)) %>%
pull(.threshold)
ggplot(threshold_data %>% filter(.metric != 'distance'),
aes(x = .threshold, y = .estimate, color = .metric)) +
geom_line() +
geom_vline(xintercept = max_threshold, lty = 2) +
scale_color_viridis_d(end = 0.9) +
labs(
x = "'Good' Threshold\n(above this value is considered 'good')",
y = "Metric Estimate",
title = "Balancing performance by varying the threshold",
subtitle = "Sensitivity or specificity alone might not be enough!\nVertical line = Max J-Index"
)
test_preds2 <- test_preds %>%
mutate(.pred = probably::make_two_class_pred(.pred_good, levels(Class), threshold = max_threshold)) %>%
select(Class, contains(".pred"))
test_preds2 %>% count(.pred)
# note: these metrics use .pred_class, the original predictions at the default threshold
precision(test_preds2,
truth = Class,
estimate = .pred_class)
recall(test_preds2,
truth = Class,
estimate = .pred_class)
f_meas(test_preds2,
truth = Class,
estimate = .pred_class)
Now let’s fit on the full lending_club data with all of the predictors, using a penalized (lasso-style) logistic regression via glmnet. The penalty shrinks coefficients toward zero, so uninformative predictors drop out of the model entirely; filtering on abs(estimate) > 0 shows which coefficients survive.
split <- initial_split(lending_club)
train <- training(split)
test <- testing(split)
log_model2 <- logistic_reg(penalty = 0.1) %>%
set_engine('glmnet') %>%
set_mode('classification') %>%
fit(Class ~ .,
data = train)
log_model2 %>% tidy() %>% filter(abs(estimate) > 0)
log_model3 <- logistic_reg(penalty = 0.01) %>%
set_engine('glmnet') %>%
set_mode('classification') %>%
fit(Class ~ .,
data = train)
log_model3 %>% tidy() %>% filter(abs(estimate) > 0)
test_preds_l3 <- log_model3 %>% augment(test)
roc_curve(test_preds_l3,
Class,
.pred_good) %>%
autoplot()
roc_auc(test_preds_l3,
Class,
.pred_good)
log_model4 <- logistic_reg(penalty = 0.001) %>%
set_engine('glmnet') %>%
set_mode('classification') %>%
fit(Class ~ .,
data = train)
log_model4 %>% tidy() %>% filter(abs(estimate) > 0)
test_preds_l4 <- log_model4 %>% augment(test)
roc_curve(test_preds_l4,
Class,
.pred_good) %>%
autoplot()
roc_auc(test_preds_l4,
Class,
.pred_good)