This lecture is heavily based on Supervised Machine Learning for Text Analysis in R.
Similar to our previous work, we can have both regression and classification text-based problems. A regression model would predict a numeric/continuous output ‘such as predicting the year of a United States Supreme Court opinion from the text of that opinion.’ A classification model would predict a discrete class ‘such as predicting whether a GitHub issue is about documentation or not from the text of the issue.’
Natural language needs to be standardized and transformed to numeric representations for modeling. We will use the textrecipes package to do this.
These are different types of prediction problems, but in both, we can use the tools of supervised machine learning to connect our input, which may exist entirely or partly as text data, with our outcome of interest. Most supervised models for text data are built with one of three purposes in mind:
The main goal of a predictive model is to generate the most accurate predictions possible.
An inferential model is created to test a hypothesis or draw conclusions about a population.
The main purpose of a descriptive model is to describe the properties of the observed data.
What language is and how language works is key to creating modeling features from natural language. Words in English are made of prefixes, suffixes, and root words. Defining a word can be quite difficult with compound words (like real estate or dining room). Preprocessing natural language has three primary steps: tokenization, removal of stop words, and stemming.
Tokenization can broadly be thought of as taking an input (such as a string) and a token type (such as a word) and splitting the input into pieces/tokens. This process is generally much more complex than you might think (e.g. more than splitting on non-alphanumeric characters). The tokenizers package implements a fast tool set for this (spaCy is another popular option).
Tokens have a variety of units including characters, words, sentences, lines, paragraphs, and n-grams.
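As a brief illustrative sketch (not from the textbook), the tokenizers package has a dedicated function for each unit; for example, characters and sentences:

library(tokenizers)

# Illustrative only: other token units from the same package
tokenize_characters("fir-tree")
tokenize_sentences("Far down in the forest grew a pretty little fir-tree. It wanted to be tall.")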
library(tidytext)    # unnest_tokens()
library(tidyverse)   # tibble() and friends

sample_vector <- c("Far down in the forest",
                   "grew a pretty little fir-tree")
sample_tibble <- tibble(text = sample_vector)
tokenize_words(sample_vector)
## [[1]]
## [1] "far" "down" "in" "the" "forest"
##
## [[2]]
## [1] "grew" "a" "pretty" "little" "fir" "tree"
sample_tibble %>%
  unnest_tokens(word, text, token = "words")

sample_tibble %>%
  unnest_tokens(word, text, token = "words", strip_punct = FALSE)
pride <- tibble(line = janeaustenr::prideprejudice)

pride %>%
  unnest_tokens(word, line) %>%
  count(word) %>%
  arrange(desc(n))
An n-gram is a contiguous sequence of n items from a given sequence of text. For example, here we tokenize Pride and Prejudice into trigrams:
token_ngram <- tokenize_ngrams(x = pride %>% pull(line),
                               lowercase = TRUE,
                               n = 3L,
                               n_min = 3L,
                               stopwords = character(),
                               ngram_delim = " ",
                               simplify = FALSE)
token_ngram[[100]]
## [1] "are my old" "my old friends"
## [3] "old friends i" "friends i have"
## [5] "i have heard" "have heard you"
## [7] "heard you mention" "you mention them"
## [9] "mention them with" "them with consideration"
library(jiebaR)
## Loading required package: jiebaRD
# Tokenization is language dependent: Chinese text has no spaces between
# words, so jiebaR segments these example sentences into words.
words <- c("下面是不分行输出的结果", "下面是不输出的结果")
engine1 <- worker(bylines = TRUE)
segment(words, engine1)
## [[1]]
## [1] "下面" "是" "不" "分行" "输出" "的" "结果"
##
## [[2]]
## [1] "下面" "是" "不" "输出" "的" "结果"
Some words carry less information than others, for example a, the, or of. These common words are called stop words and are generally removed entirely. Let's use the stopwords package here to provide some lists.
pride_words <- pride %>%
  unnest_tokens(word, line)

pride_words %>%
  semi_join(get_stopwords(source = "snowball")) %>%
  distinct() # present stop words
## Joining with `by = join_by(word)`
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  distinct() # unique non-stop words
## Joining with `by = join_by(word)`
What if we aren't interested in the difference between banana and bananas? The core meaning of the word is often the same (e.g. 'banana'); stemming reduces each word to its root or stem.
library(SnowballC) # provides wordStem()

pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word))
## Joining with `by = join_by(word)`
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word)) %>%
  summarize(nword = n_distinct(word),
            nstem = n_distinct(word_stem))
## Joining with `by = join_by(word)`
Stemming reduces the feature space of text data but may change the underlying meaning of some sentences. It may or may not improve models.
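As a small hedged illustration (not from the lecture), stemming can collapse distinct words onto a single stem:

# Related but distinct words; the Porter stemmer typically reduces
# all of these to the same stem ("univers").
SnowballC::wordStem(c("universe", "university", "universal"))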
We also need a data structure for text data: a document-term matrix, where each row represents a document, each column a term, and each cell a count or weight. Below we build one (quanteda calls it a document-feature matrix) from a sample of consumer complaints.
complaints <- read_csv("https://github.com/EmilHvitfeldt/smltar/raw/master/data/complaints.csv.gz")
## Rows: 117214 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): product, sub_product, issue, sub_issue, consumer_complaint_narrat...
## dbl (1): complaint_id
## date (2): date_received, date_sent_to_company
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
complaints %>%
  slice_sample(n = 10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  cast_dfm(complaint_id, stem, n)
## Document-feature matrix of: 10,000 documents, 13,364 features (99.58% sparse) and 0 docvars.
## features
## docs 18 attornei better bureau contact creat credit desper famili fcra
## 3113833 1 1 1 1 1 1 3 1 1 1
## 3113839 0 0 0 0 0 0 0 0 0 0
## 3113848 0 0 0 0 0 0 1 0 0 4
## 3113851 0 0 0 0 0 0 0 0 0 0
## 3113859 0 0 0 0 0 0 0 0 1 0
## 3113897 0 0 0 0 0 0 1 0 0 0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,354 more features ]
This is a sparse matrix (where most elements are zero). This is because most documents do not contain most words.
We could also represent text data with weighted counts. The term frequency of a word is how frequently a word occurs in a document, and the inverse document frequency of a word decreases the weight for commonly-used words and increases the weight for words that are not used often in a collection of documents.
\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\] These two quantities can be combined to calculate a term’s tf-idf (the two quantities multiplied together). This statistic measures the frequency of a term adjusted for how rarely it is used, and it is an example of a weighting scheme that can often work better than counts for predictive modeling with text features.
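For example, in a collection of 10,000 documents, a term appearing in 9,000 of them has \(idf = \ln(10000/9000) \approx 0.11\), while a term appearing in only 10 of them has \(idf = \ln(10000/10) \approx 6.91\), so the rare term is weighted much more heavily in the documents where it does occur.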
complaints %>%
  slice_sample(n = 10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  bind_tf_idf(stem, complaint_id, n) %>%
  cast_dfm(complaint_id, stem, tf_idf)
## Document-feature matrix of: 10,000 documents, 13,440 features (99.58% sparse) and 0 docvars.
## features
## docs 4 account bureau consecut credit id
## 3113826 0.1106372 0.050527763 0.16290465 0.2471296 0.041442195 0.1384554
## 3113833 0 0 0.04356752 0 0.033250133 0
## 3113842 0 0.064563253 0 0 0.052953916 0
## 3113850 0 0.008421294 0 0 0.003453516 0
## 3113884 0 0.005533993 0.03568388 0 0.009077814 0
## 3113930 0 0.022348818 0 0 0.018330202 0
## features
## docs lender made payment post
## 3113826 0.1545369 0.070965 0.1119439 0.1425259
## 3113833 0 0 0 0
## 3113842 0 0 0 0
## 3113850 0 0 0 0
## 3113884 0 0 0 0
## 3113930 0 0 0 0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,430 more features ]
Creating these matrices is very memory intensive!
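A rough back-of-the-envelope calculation using the dimensions reported above (about 10,000 documents, 13,400 features, 99.58% sparse) shows why sparse storage matters:

# ~10,000 documents x ~13,400 features stored densely as 8-byte doubles:
10000 * 13400 * 8 / 1e9          # roughly 1.1 GB
# at 99.58% sparsity, only ~0.42% of cells are nonzero:
10000 * 13400 * (1 - 0.9958)     # roughly 560,000 entries actually stored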
While you can create your own embeddings, pre-trained word embeddings such as GloVe, which was trained on Wikipedia and news sources, are readily available.
library(textdata)
glove6b <- embedding_glove6b(
dir = "C:/Users/erhla/Box/pa470spring2024/static/lectures/",
manual_download = TRUE
)
glove6b
tidy_glove <- glove6b %>%
  pivot_longer(contains("d"),
               names_to = "dimension") %>%
  rename(item1 = token)
rm(glove6b)
tidy_glove
library(widyr) # provides widely()

nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        # cosine similarity between `token` and every other word
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE,
      maximum_size = NULL
    )(item1, dimension, value) %>%
    select(-item2)
}
Let’s look at words which are ‘synonyms’ or nearby in the embedding space…
tidy_glove %>%
nearest_neighbors("error")
tidy_glove %>%
nearest_neighbors("school")
tidy_glove %>%
nearest_neighbors("fee")
tidy_glove %>%
nearest_neighbors("chocolate")
Do word embeddings create systemic unfairness or bias? Yes! Here are some examples from GloVe:
Typically Black first names are associated with more unpleasant feelings than typically white first names.
Women’s first names are more associated with family and men’s first names are more associated with career.
Terms associated with women are more associated with the arts and terms associated with men are more associated with science.
Sentiment analysis attempts to assign scores or moods to text. Recall Figure 13.3 from Chapter 13 of the textbook.
In this DIY, we turn to a Wikipedia entry about the 1979 Oil Crisis that had a substantial effect on the Western World’s economy. The article on the Oil Crisis presents both positive and negative effects – a perfect example of how sentiment analysis can summarize a body of text. The first paragraph describes the magnitude of the crisis:
The 1970s energy crisis occurred when the Western world, particularly the United States, Canada, Western Europe, Australia, and New Zealand, faced substantial petroleum shortages, real and perceived, as well as elevated prices. The two worst crises of this period were the 1973 oil crisis and the 1979 energy crisis, when the Yom Kippur War and the Iranian Revolution triggered interruptions in Middle Eastern oil exports.
Whereas the sixth paragraph softens the implications, turning to instances where the crisis had a far less adverse impact:
The period was not uniformly negative for all economies. Petroleum-rich countries in the Middle East benefited from increased prices and the slowing production in other areas of the world. Some other countries, such as Norway, Mexico, and Venezuela, benefited as well. In the United States, Texas and Alaska, as well as some other oil-producing areas, experienced major economic booms due to soaring oil prices even as most of the rest of the nation struggled with the stagnant economy. Many of these economic gains, however, came to a halt as prices stabilized and dropped in the 1980s.
Our objective with this DIY is to show how sentiment evolves over the first six paragraphs as scored by the Bing lexicon implemented in the tidytext package. To start, we read the raw text from the wiki-article.txt file.
#Read text article
wiki_article <- readLines("https://raw.githubusercontent.com/DataScienceForPublicPolicy/diys/main/data/wiki-article.txt")
To make use of the text, we tokenize the six paragraphs of text, remove stop words, then apply a left join with the Bing lexicon. The resulting table contains both matching and non-matching terms and aggregates term frequencies by paragraph (para), word, and sentiment. In total, \(n = 18\) words are labeled negative and \(n = 11\) words are positive.
# Set up in data frame
wiki_df <- data.frame(para = 1:length(wiki_article),
                      text = wiki_article,
                      stringsAsFactors = FALSE)
# Join tokens to lexicon
sent_df <- wiki_df %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(language = "en")) %>%
  left_join(get_sentiments("bing")) %>%
  count(para, word, sentiment, sort = TRUE)
#Label terms without sentiment
sent_df$sentiment[is.na(sent_df$sentiment)] <- "none"
Next, we write a basic function to calculate various sentiment metrics such as polarity and expressiveness. Packages such as sentimentr implement scoring, but as rules-based sentiment analysis is fairly simple, we implement the formulas directly to illustrate their accessibility.
sentMetrics <- function(n, s){
  # Set input metrics
  N   <- sum(n, na.rm = T)
  x_p <- sum(n[s == "positive"], na.rm = T)
  x_n <- sum(n[s == "negative"], na.rm = T)

  # Calculate scores
  return(data.frame(count.positive = x_p,
                    count.negative = x_n,
                    net.sentiment  = (x_p - x_n) / N,
                    expressiveness = (x_p + x_n) / N,
                    positivity     = x_p / N,
                    negativity     = x_n / N,
                    polarity       = (x_p - x_n) / (x_p + x_n)))
}
The function is designed to work on one document at a time, requiring a loop to score each paragraph in the article. We iterate over each paragraph using lapply, returning the results in a data frame rated.
# Apply sentiment scores to each paragraph
rated <- lapply(sort(unique(sent_df$para)), function(p){
  para_df <- sent_df %>% filter(para == p)
  output  <- sentMetrics(n = para_df$n,
                         s = para_df$sentiment)
  return(data.frame(para = p, output))
})

# Bind into a data frame
rated <- do.call(rbind, rated)
Let's examine the results by plotting sentiment by paragraph. In the first line graph (plot (A)), we see that the first four paragraphs are net negative, then turn net positive in the last two paragraphs. Notice that both polarity and net sentiment tell the same story, but the magnitudes of their values are quite different. In fact, the Pearson correlation between them is \(\rho = 0.964\). When we dive deeper into the positivity and negativity, we see that the switch in tone is widespread: earlier paragraphs have mostly negative terms, while the tone softens in later paragraphs that describe less dire conditions in the world economy.
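The rendered figure is not reproduced here; a minimal sketch of one panel, assuming the rated data frame built above, might look like this:

library(ggplot2)

# Net sentiment by paragraph (illustrative sketch of plot (A))
rated %>%
  ggplot(aes(x = para, y = net.sentiment)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Paragraph", y = "Net sentiment")

# Correlation between the two summary scores
cor(rated$net.sentiment, rated$polarity)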
#devtools::install_github("EmilHvitfeldt/scotus")
library(scotus)
scotus_filtered %>%
  slice_sample(n = 5) %>%
  as_tibble()

scotus_filtered %>%
  mutate(year = as.numeric(year),
         year = 10 * (year %/% 10)) %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_col() +
  labs(x = "Year", y = "Number of opinions per decade")
scotus_split <- scotus_filtered %>%
  mutate(year = as.numeric(year),
         text = str_remove_all(text, "'")) %>%
  slice_sample(n = 100) %>%
  initial_split()

scotus_train <- training(scotus_split)
scotus_test <- testing(scotus_split)
First, we must specify in our initial recipe() statement the form of our model (with the formula year ~ text, meaning we will predict the year of each opinion from the text of that opinion) and what our training data is.
Then, we tokenize the text of the court opinions.
Next, we filter to only keep the top 1000 tokens by term frequency. We filter out those less frequent words because we expect them to be too rare to be reliable, at least for our first attempt. (We are not removing stop words yet; we'll explore removing them later.)
The recipe step step_tfidf(), used with defaults here, weights each token frequency by the inverse document frequency.
As a last step, we normalize (center and scale) these tf-idf values. This centering and scaling is needed because we’re going to use a support vector machine model.
library(textrecipes) # step_tokenize(), step_tokenfilter(), step_tfidf()

final_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

svm_spec <- svm_linear() %>%
  set_mode("regression") %>%
  set_engine("LiblineaR")

final_workflow <- workflow() %>%
  add_recipe(final_rec) %>%
  add_model(svm_spec)
scotus_folds <- vfold_cv(scotus_train, v = 5)
final_rs <- fit_resamples(
  final_workflow,
  scotus_folds,
  metrics = metric_set(rmse, mae, mape),
  control = control_resamples(save_pred = TRUE, verbose = TRUE)
)
## i Fold1: preprocessor 1/1
## ✓ Fold1: preprocessor 1/1
## i Fold1: preprocessor 1/1, model 1/1
## ✓ Fold1: preprocessor 1/1, model 1/1
## i Fold1: preprocessor 1/1, model 1/1 (extracts)
## i Fold1: preprocessor 1/1, model 1/1 (predictions)
## i Fold2: preprocessor 1/1
## ✓ Fold2: preprocessor 1/1
## i Fold2: preprocessor 1/1, model 1/1
## ✓ Fold2: preprocessor 1/1, model 1/1
## i Fold2: preprocessor 1/1, model 1/1 (extracts)
## i Fold2: preprocessor 1/1, model 1/1 (predictions)
## i Fold3: preprocessor 1/1
## ✓ Fold3: preprocessor 1/1
## i Fold3: preprocessor 1/1, model 1/1
## ✓ Fold3: preprocessor 1/1, model 1/1
## i Fold3: preprocessor 1/1, model 1/1 (extracts)
## i Fold3: preprocessor 1/1, model 1/1 (predictions)
## i Fold4: preprocessor 1/1
## ✓ Fold4: preprocessor 1/1
## i Fold4: preprocessor 1/1, model 1/1
## ✓ Fold4: preprocessor 1/1, model 1/1
## i Fold4: preprocessor 1/1, model 1/1 (extracts)
## i Fold4: preprocessor 1/1, model 1/1 (predictions)
## i Fold5: preprocessor 1/1
## ✓ Fold5: preprocessor 1/1
## i Fold5: preprocessor 1/1, model 1/1
## ✓ Fold5: preprocessor 1/1, model 1/1
## i Fold5: preprocessor 1/1, model 1/1 (extracts)
## i Fold5: preprocessor 1/1, model 1/1 (predictions)
final_rs %>%
  collect_metrics() %>%
  ggplot(aes(x = .metric, y = mean)) +
  geom_bar(stat = 'identity')
final_fitted <- last_fit(final_workflow, scotus_split)
collect_metrics(final_fitted)
scotus_fit <- extract_fit_parsnip(final_fitted$.workflow[[1]])
scotus_fit %>%
  tidy() %>%
  filter(term != "Bias") %>%
  mutate(
    sign = case_when(estimate > 0 ~ "Later (after mean year)",
                     TRUE ~ "Earlier (before mean year)"),
    estimate = abs(estimate),
    term = str_remove_all(term, "tfidf_text_")
  ) %>%
  group_by(sign) %>%
  top_n(20, estimate) %>%
  ungroup() %>%
  ggplot(aes(x = estimate,
             y = fct_reorder(term, estimate),
             fill = sign)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  facet_wrap(~sign, scales = "free") +
  labs(
    y = NULL,
    title = paste("Variable importance for predicting year of",
                  "Supreme Court opinions"),
    subtitle = paste("These features are the most important",
                     "in predicting the year of an opinion")
  )
final_fitted %>%
  collect_predictions() %>%
  ggplot(aes(year, .pred)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Truth",
    y = "Predicted year",
    title = paste("Predicted and true years for the testing set of",
                  "Supreme Court opinions"),
    subtitle = "For the testing set, predictions are more reliable after 1850"
  )
Dimensionality reduction transforms a data set from a high-dimensional space into a low-dimensional space, and can be a good choice when you suspect there are “too many” variables. An excess of variables, usually predictors, can be a problem because it is difficult to understand or visualize data in higher dimensions.
“It tries to find components that simultaneously maximize the variation in the predictors while also maximizing the relationship between those components and the outcome.”
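As a minimal hedged sketch (not from the lecture), one common way to reduce the dimensionality of text features is to add a PCA step to a recipe; the num_comp value below is purely illustrative.

library(tidymodels)
library(textrecipes)

# Hypothetical sketch: compress tf-idf features with PCA before modeling
pca_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 64)  # 64 components is an arbitrary choice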
How do we avoid inappropriate predictions?
Equivocal zones use the predicted values to alert the user that the results may be suspect.
Applicability uses the predictors to measure the amount of extrapolation (if any) for new samples.
library(tidymodels)
tidymodels_prefer()
simulate_two_classes <-
  function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
    # Slightly correlated predictors
    sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
    dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
    colnames(dat) <- c("x", "y")
    cls <- paste0("class_", 1:2)
    dat <-
      as_tibble(dat) %>%
      mutate(
        linear_pred = !!eqn,
        # Add some misclassification noise
        linear_pred = linear_pred + rnorm(n, sd = error),
        prob = binomial()$linkinv(linear_pred),
        class = ifelse(prob > runif(n), cls[1], cls[2]),
        class = factor(class, levels = cls)
      )
    dplyr::select(dat, x, y, class)
  }
set.seed(1901)
training_set <- simulate_two_classes(200)
testing_set <- simulate_two_classes(50)
We estimate a logistic regression model using Bayesian methods (using the default Gaussian prior distributions for the parameters):
two_class_mod <-
  logistic_reg() %>%
  set_engine("stan", seed = 1902) %>%
  fit(class ~ . + I(x^2) + I(y^2), data = training_set)

print(two_class_mod, digits = 3)
## parsnip model object
##
## stan_glm
## family: binomial [logit]
## formula: class ~ . + I(x^2) + I(y^2)
## observations: 200
## predictors: 5
## ------
## Median MAD_SD
## (Intercept) 1.092 0.287
## x 2.290 0.423
## y 0.314 0.354
## I(x^2) 0.077 0.307
## I(y^2) -2.465 0.424
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
The fitted class boundary is overlaid onto the test set. The data points closest to the class boundary are the most uncertain. If their values changed slightly, their predicted class might change. One simple method for disqualifying some results is to call them “equivocal” if the values are within some range around 50% (or whatever the appropriate probability cutoff might be for a certain situation). Depending on the problem that the model is being applied to, this might indicate that another measurement should be collected or that we require more information before a trustworthy prediction is possible.
We could base the width of the band around the cutoff on how performance improves when the uncertain results are removed. However, we should also estimate the reportable rate (the expected proportion of usable results). For example, it would not be useful in real-world situations to have perfect performance but only release predictions on 2% of the samples passed to the model.
Let’s use the test set to determine the balance between improving performance and having enough reportable results. The predictions are created using:
test_pred <- augment(two_class_mod, testing_set)
test_pred %>% head()
With tidymodels, the probably package contains functions for equivocal zones. For cases with two classes, the make_two_class_pred() function creates a factor-like column that has the predicted classes with an equivocal zone:
library(probably)
lvls <- levels(training_set$class)
test_pred <-
test_pred %>%
mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = 0.15))
test_pred %>% count(.pred_with_eqz)
Rows that are within \(0.50\pm0.15\) are given a value of [EQ]. It is important to realize that [EQ] in this example is not a factor level, but an attribute of that column.
Since the factor levels are the same as the original data, confusion matrices and other statistics can be computed without error. When using standard functions from the yardstick package, the equivocal results are converted to NA and are not used in the calculations that use the hard class predictions. Notice the differences in these confusion matrices:
# All data
test_pred %>% conf_mat(class, .pred_class)
## Truth
## Prediction class_1 class_2
## class_1 20 6
## class_2 5 19
# Reportable results only:
test_pred %>% conf_mat(class, .pred_with_eqz)
## Truth
## Prediction class_1 class_2
## class_1 17 3
## class_2 5 16
There is also an is_equivocal() function available for filtering these rows from the data.
Does the equivocal zone help improve accuracy? Let’s look over different buffer sizes:
# A function to change the buffer then compute performance.
eq_zone_results <- function(buffer) {
  test_pred <-
    test_pred %>%
    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
  acc <- test_pred %>% accuracy(class, .pred_with_eqz)
  rep_rate <- reportable_rate(test_pred$.pred_with_eqz)
  tibble(accuracy = acc$.estimate, reportable = rep_rate, buffer = buffer)
}
# Evaluate a sequence of buffers and plot the results.
map_dfr(seq(0, .1, length.out = 40), eq_zone_results) %>%
  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") %>%
  ggplot(aes(x = buffer, y = value, lty = statistic)) +
  geom_step(size = 1.2, alpha = 0.8) +
  labs(y = NULL, lty = NULL)
The figure shows us that accuracy improves by a few percentage points but at the cost of nearly 10% of predictions being unusable! The value of such a compromise depends on how the model predictions will be used.
This analysis focused on using the predicted class probability to disqualify points, since this is a fundamental measure of uncertainty in classification models. A slightly better approach would be to use the standard error of the class probability. Since we used a Bayesian model, the probability estimates we found are actually the mean of the posterior predictive distribution. In other words, the Bayesian model gives us a distribution for the class probability. Measuring the standard deviation of this distribution gives us a standard error of prediction of the probability. In most cases, this value is directly related to the mean class probability. You might recall that, for a Bernoulli random variable with probability \(p\), the variance is \(p(1-p)\). Because of this relationship, the standard error is largest when the probability is 50%. Instead of assigning an equivocal result using the class probability, we could instead use a cutoff on the standard error of prediction.
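As a quick illustration of that last point, the Bernoulli variance \(p(1-p)\) (and hence the standard error) is largest at \(p = 0.5\):

# The variance p(1 - p) peaks at p = 0.5, with value 0.25
p <- seq(0, 1, by = 0.01)
plot(p, p * (1 - p), type = "l",
     xlab = "class probability p", ylab = "p(1 - p)")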
One important aspect of the standard error of prediction is that it takes into account more than just the class probability. In cases where there is significant extrapolation or aberrant predictor values, the standard error might increase. The benefit of using the standard error of prediction is that it might also flag predictions that are problematic (as opposed to simply uncertain). One reason that we used the Bayesian model is that it naturally estimates the standard error of prediction; not many models can calculate this. For our test set, using type = "pred_int" will produce upper and lower limits, and the std_error argument adds a column for that quantity. For 80% intervals:
test_pred <-
test_pred %>%
bind_cols(
predict(two_class_mod, testing_set, type = "pred_int", std_error = TRUE)
)
For our example, where the model and data are well-behaved, the standard error of prediction can be plotted across the predictor space:
Using the standard error as a measure to preclude samples from being predicted can also be applied to models with numeric outcomes.
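The other approach listed earlier, applicability, is not demonstrated in this lecture. As a hedged sketch (an assumption, not part of the lecture), the applicable package from the tidymodels ecosystem can score how far new samples fall from the training data using PCA:

library(applicable)

# Build a PCA-based applicability model on the numeric training predictors
pca_stat <- apd_pca(~ x + y, data = training_set)

# Score the test set: larger distance percentiles suggest more extrapolation
score(pca_stat, testing_set) %>% select(starts_with("distance"))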