1 Natural Language Processing (NLP) / Supervised Modeling for Text

This lecture is heavily based on Supervised Machine Learning for Text Analysis in R.

Similar to our previous work, we can have both regression and classification text-based problems. A regression model would predict a numeric/continuous output ‘such as predicting the year of a United States Supreme Court opinion from the text of that opinion.’ A classification model would predict a discrete class ‘such as predicting whether a GitHub issue is about documentation or not from the text of the issue.’

Natural language needs to be standardized and transformed to numeric representations for modeling. We will use the textrecipes package to do this.

These are different types of prediction problems, but in both, we can use the tools of supervised machine learning to connect our input, which may exist entirely or partly as text data, with our outcome of interest. Most supervised models for text data are built with one of three purposes in mind:

  • The main goal of a predictive model is to generate the most accurate predictions possible.

  • An inferential model is created to test a hypothesis or draw conclusions about a population.

  • The main purpose of a descriptive model is to describe the properties of the observed data.

2 Preprocessing

Understanding what language is and how it works is key to creating modeling features from natural language. Words in English are made of prefixes, suffixes, and root words, and even defining a word can be quite difficult with compound words (like real estate or dining room). Preprocessing natural language has three primary steps: tokenization, removal of stop words, and stemming.

2.1 Tokenization

Tokenization can broadly be thought of as taking an input (such as a string) and a token type (such as a word) and splitting the input into pieces/tokens. This process is generally much more complex than you might think (e.g. more than splitting on non-alphanumeric characters); the tokenizers package implements a fast, consistent tool set for this in R (spaCy is another popular option).

Tokens have a variety of units including characters, words, sentences, lines, paragraphs, and n-grams.

sample_vector <- c("Far down in the forest",
                   "grew a pretty little fir-tree")
sample_tibble <- tibble(text = sample_vector)

tokenize_words(sample_vector)
## [[1]]
## [1] "far"    "down"   "in"     "the"    "forest"
## 
## [[2]]
## [1] "grew"   "a"      "pretty" "little" "fir"    "tree"
sample_tibble %>%
  unnest_tokens(word, text, token = "words")
sample_tibble %>%
  unnest_tokens(word, text, token = "words", strip_punct = FALSE)
pride <- tibble(line=janeaustenr::prideprejudice)

pride %>%
  unnest_tokens(word, line) %>%
  count(word) %>%
  arrange(desc(n))

2.1.1 n-grams

An n-gram is a contiguous sequence of n items from a given sequence of text. Examples:

  • unigram: “Hello,” “day,” “my,” “little”
  • bigram: “fir tree,” “fresh air,” “to be,” “Robin Hood”
  • trigram: “You and I,” “please let go,” “no time like,” “the little mermaid”

token_ngram <- tokenize_ngrams(x = pride %>% pull(line),
                               lowercase = TRUE,
                               n = 3L,
                               n_min = 3L,
                               stopwords = character(),
                               ngram_delim = " ",
                               simplify = FALSE)

token_ngram[[100]]
##  [1] "are my old"              "my old friends"         
##  [3] "old friends i"           "friends i have"         
##  [5] "i have heard"            "have heard you"         
##  [7] "heard you mention"       "you mention them"       
##  [9] "mention them with"       "them with consideration"

2.1.2 Chinese

library(jiebaR)
## Loading required package: jiebaRD
# two short Chinese example sentences used to demonstrate word segmentation
words <- c("下面是不分行输出的结果", "下面是不输出的结果")

engine1 <- worker(bylines = TRUE)

segment(words, engine1)
## [[1]]
## [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
## 
## [[2]]
## [1] "下面" "是"   "不"   "输出" "的"   "结果"

2.2 Stop Words

Some words carry less information than others. For example, a, the, or of. These common words are called stop words and are generally removed entirely. Let’s use the stopwords package here to provide some lists.

pride_words <- 
  pride %>%
  unnest_tokens(word, line)

pride_words %>%
  semi_join(get_stopwords(source = "snowball")) %>%
  distinct() # present stop words
## Joining with `by = join_by(word)`
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  distinct() # unique non stop words
## Joining with `by = join_by(word)`

2.3 Stemming

What if we aren’t interested in the difference between banana and bananas? The core meaning of the word is often the same (e.g. ‘banana’); stemming reduces each word to a common base or stem.

library(SnowballC) # wordStem()

pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word))
## Joining with `by = join_by(word)`
pride_words %>%
  anti_join(get_stopwords(source = "snowball")) %>%
  mutate(word_stem = wordStem(word)) %>%
  summarize(nword = n_distinct(word),
            nstem = n_distinct(word_stem))
## Joining with `by = join_by(word)`

Stemming reduces the feature space of text data but may change the underlying meaning of some sentences. It may or may not improve models.

3 Word Embeddings

Before modeling, we need a numeric data structure for text data. A common starting point is a document-feature (document-term) matrix of token counts.

complaints <- read_csv("https://github.com/EmilHvitfeldt/smltar/raw/master/data/complaints.csv.gz")
## Rows: 117214 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (15): product, sub_product, issue, sub_issue, consumer_complaint_narrat...
## dbl   (1): complaint_id
## date  (2): date_received, date_sent_to_company
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
complaints %>%
  slice_sample(n=10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  cast_dfm(complaint_id, stem, n)
## Document-feature matrix of: 10,000 documents, 13,364 features (99.58% sparse) and 0 docvars.
##          features
## docs      18 attornei better bureau contact creat credit desper famili fcra
##   3113833  1        1      1      1       1     1      3      1      1    1
##   3113839  0        0      0      0       0     0      0      0      0    0
##   3113848  0        0      0      0       0     0      1      0      0    4
##   3113851  0        0      0      0       0     0      0      0      0    0
##   3113859  0        0      0      0       0     0      0      0      1    0
##   3113897  0        0      0      0       0     0      1      0      0    0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,354 more features ]

This is a sparse matrix (where most elements are zero). This is because most documents do not contain most words.
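
If the matrix above were saved to an object, quanteda can report this sparsity directly. A small sketch, where complaints_dfm is a hypothetical name for the result of the pipeline above:

# proportion of zero cells in the (hypothetical) complaints_dfm document-feature matrix
quanteda::sparsity(complaints_dfm)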

We could also represent text data with weighted counts. The term frequency of a word is how frequently a word occurs in a document, and the inverse document frequency of a word decreases the weight for commonly-used words and increases the weight for words that are not used often in a collection of documents.

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

These two quantities can be combined to calculate a term’s tf-idf (the two quantities multiplied together). This statistic measures the frequency of a term adjusted for how rarely it is used, and it is an example of a weighting scheme that can often work better than counts for predictive modeling with text features.
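
That is, for a given term and document:

\[\text{tf-idf}(\text{term, document}) = tf(\text{term, document}) \times idf(\text{term})\]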

complaints %>%
  slice_sample(n=10000) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  bind_tf_idf(stem, complaint_id, n) %>%
  cast_dfm(complaint_id, stem, tf_idf)
## Document-feature matrix of: 10,000 documents, 13,440 features (99.58% sparse) and 0 docvars.
##          features
## docs              4     account     bureau  consecut      credit        id
##   3113826 0.1106372 0.050527763 0.16290465 0.2471296 0.041442195 0.1384554
##   3113833 0         0           0.04356752 0         0.033250133 0        
##   3113842 0         0.064563253 0          0         0.052953916 0        
##   3113850 0         0.008421294 0          0         0.003453516 0        
##   3113884 0         0.005533993 0.03568388 0         0.009077814 0        
##   3113930 0         0.022348818 0          0         0.018330202 0        
##          features
## docs         lender     made   payment      post
##   3113826 0.1545369 0.070965 0.1119439 0.1425259
##   3113833 0         0        0         0        
##   3113842 0         0        0         0        
##   3113850 0         0        0         0        
##   3113884 0         0        0         0        
##   3113930 0         0        0         0        
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 13,430 more features ]

Creating these matrices is very memory intensive!

While you can create your own embeddings, pre-trained word embeddings such as GloVe, which is trained on Wikipedia and news text, are readily available.

library(textdata)

glove6b <- embedding_glove6b(
  dir = "C:/Users/erhla/Box/pa470spring2024/static/lectures/",
  manual_download = TRUE
)

glove6b
tidy_glove <- glove6b %>%
  pivot_longer(contains("d"),
               names_to = "dimension") %>%
  rename(item1 = token)

rm(glove6b)

tidy_glove
library(widyr) # widely()

# cosine similarity between `token` and every other word in the embedding space
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) / 
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE,
      maximum_size = NULL
    )(item1, dimension, value) %>%
    select(-item2)
}

Let’s look at words which are ‘synonyms’ or nearby in the embedding space…

tidy_glove %>%
  nearest_neighbors("error")
tidy_glove %>%
  nearest_neighbors("school")
tidy_glove %>%
  nearest_neighbors("fee")
tidy_glove %>%
  nearest_neighbors("chocolate")

3.1 Fairness/ethics

Do word embeddings create systemic unfairness or bias? Yes! Here are some examples from GloVe:

  • Typically Black first names are associated with more unpleasant feelings than typically white first names.

  • Women’s first names are more associated with family and men’s first names are more associated with career.

  • Terms associated with women are more associated with the arts and terms associated with men are more associated with science.

4 Sentiment

Sentiment analysis attempts to assign scores or moods to text. Recall Figure 13.3 from the textbook.

4.1 DIY: Scoring sentiment in a Wikipedia article (Section 13.2.3)

From Chapter 13

In this DIY, we turn to a Wikipedia entry about the 1979 Oil Crisis that had a substantial effect on the Western World’s economy. The article on the Oil Crisis presents both positive and negative effects – a perfect example of how sentiment analysis can summarize a body of text. The first paragraph describes the magnitude of the crisis:

The 1970s energy crisis occurred when the Western world, particularly the United States, Canada, Western Europe, Australia, and New Zealand, faced substantial petroleum shortages, real and perceived, as well as elevated prices. The two worst crises of this period were the 1973 oil crisis and the 1979 energy crisis, when the Yom Kippur War and the Iranian Revolution triggered interruptions in Middle Eastern oil exports.

Whereas the sixth paragraph softens the implications, turning to instances where the crisis had a far less adverse impact:

The period was not uniformly negative for all economies. Petroleum-rich countries in the Middle East benefited from increased prices and the slowing production in other areas of the world. Some other countries, such as Norway, Mexico, and Venezuela, benefited as well. In the United States, Texas and Alaska, as well as some other oil-producing areas, experienced major economic booms due to soaring oil prices even as most of the rest of the nation struggled with the stagnant economy. Many of these economic gains, however, came to a halt as prices stabilized and dropped in the 1980s.

Our objective with this DIY is to show how sentiment evolves over the first six paragraphs as scored by the Bing Lexicon implemented in the tidytext package. To start, we read the raw text from the wiki-article.txt file.

#Read text article
  wiki_article <- readLines("https://raw.githubusercontent.com/DataScienceForPublicPolicy/diys/main/data/wiki-article.txt")

To make use of the text, we tokenize the six paragraphs of text, remove stop words, then apply a left join with the Bing lexicon. The resulting table contains both matching and non-matching terms and aggregates term frequencies by paragraph para, word, and sentiment. In total, \(n = 18\) words are labeled negative and \(n=11\) words are positive.

#Set up in data frame
  wiki_df <- data.frame(para = 1:length(wiki_article),
                        text = wiki_article, 
                        stringsAsFactors = FALSE)

#Join tokens to lexicon
  sent_df <- wiki_df %>%
    unnest_tokens(word, text) %>% 
    anti_join(get_stopwords(language = "en")) %>%
    left_join(get_sentiments("bing")) %>%
    count(para, word, sentiment, sort = TRUE)
  
#Label terms without sentiment
  sent_df$sentiment[is.na(sent_df$sentiment)] <- "none"

Next, we write a basic function to calculate various sentiment metrics such as polarity and expressiveness. Packages such as sentimentr implement scoring, but as rules-based sentiment analysis is fairly simple, we directly implement metrics in the formula to illustrate their accessibility.

  sentMetrics <- function(n, s){

    #Set input metrics
      N <- sum(n, na.rm = T)
      x_p <- sum(n[s =="positive"], na.rm = T)
      x_n <- sum(n[s =="negative"], na.rm = T)
      
    #Calculate scores
      return(data.frame(count.positive = x_p,
                        count.negative = x_n,
                        net.sentiment = (x_p - x_n) / N,
                        expressiveness = (x_p + x_n) / N,
                        positivity = x_p / N,
                        negativity = x_n / N,
                        polarity = (x_p - x_n) / (x_p + x_n))
      )
  
  }

The function is designed to work on one document at a time, requiring a loop to score each paragraph in the article. We iterate over each paragraph using lapply, returning the results in a data frame rated.

#Apply sentiment scores to each paragraph
  rated <- lapply(sort(unique(sent_df$para)), function(p){
                    para_df <- sent_df %>% filter(para == p)
                    output <- sentMetrics(n = para_df$n, 
                                          s = para_df$sentiment)
                    return(data.frame(para = p, 
                                      output))
                  })

#Bind into a data frame
  rated <- do.call(rbind, rated)

Let’s examine the results by plotting sentiment by paragraph in the figure below. In the first line graph (plot (A)), we see that the first four paragraphs are net negative, then turn net positive in the last two paragraphs. Notice that both polarity and net sentiment tell the same story, but the magnitudes of their values are quite different. In fact, the Pearson’s correlation is \(\rho = 0.964\). When we dive deeper into the positivity and negativity, we see that the switch in tone is widespread – earlier paragraphs have mostly negative terms while the tone softens in later paragraphs that describe less dire conditions in the world economy.
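
The plotting code is not shown above; a minimal sketch of how panel (A) could be reproduced from rated (the reshaping and aesthetic choices here are illustrative, not the original figure code):

#Sketch of panel (A): net sentiment and polarity by paragraph
  rated %>%
    select(para, net.sentiment, polarity) %>%
    pivot_longer(-para, names_to = "metric", values_to = "score") %>%
    ggplot(aes(x = para, y = score, linetype = metric)) +
    geom_line() +
    labs(x = "Paragraph", y = "Score")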

Sentiment scores for the first six paragraphs of the Wikipedia article on the 1979 Oil Crisis. Graph (A) illustrates net sentiment and polarity. Graph (B) plots positivity and negativity.

5 Regression Example

#devtools::install_github("EmilHvitfeldt/scotus")
library(scotus)
library(tidymodels)
library(textrecipes) # step_tokenize(), step_tokenfilter(), step_tfidf()

scotus_filtered %>%
  slice_sample(n=5) %>%
  as_tibble()
scotus_filtered %>%
  mutate(year = as.numeric(year),
         year = 10 * (year %/% 10)) %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_col() +
  labs(x = "Year", y = "Number of opinions per decade")

scotus_split <- scotus_filtered %>%
  mutate(year = as.numeric(year),
         text = str_remove_all(text, "'")) %>%
  slice_sample(n=100) %>%
  initial_split()

scotus_train <- training(scotus_split)
scotus_test <- testing(scotus_split)

  • First, we must specify in our initial recipe() statement the form of our model (with the formula year ~ text, meaning we will predict the year of each opinion from the text of that opinion) and what our training data is.

  • Then, we tokenize the text of the court opinions.

  • Next, we filter to only keep the top 1000 tokens by term frequency. We filter out those less frequent words because we expect them to be too rare to be reliable, at least for our first attempt. (We are not removing stop words yet; we’ll explore removing them later.)

  • The recipe step step_tfidf(), used with defaults here, weights each token frequency by the inverse document frequency.

  • As a last step, we normalize (center and scale) these tf-idf values. This centering and scaling is needed because we’re going to use a support vector machine model.

final_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors())

svm_spec <- svm_linear() %>%
  set_mode("regression") %>%
  set_engine("LiblineaR")

final_workflow <- workflow() %>%
  add_recipe(final_rec) %>%
  add_model(svm_spec)

scotus_folds <- vfold_cv(scotus_train, v = 5)

final_rs <- fit_resamples(
  final_workflow,
  scotus_folds,
  metrics = metric_set(rmse, mae, mape),
  control = control_resamples(save_pred = TRUE, verbose=TRUE)
)
## i Fold1: preprocessor 1/1
## ✓ Fold1: preprocessor 1/1
## i Fold1: preprocessor 1/1, model 1/1
## ✓ Fold1: preprocessor 1/1, model 1/1
## i Fold1: preprocessor 1/1, model 1/1 (extracts)
## i Fold1: preprocessor 1/1, model 1/1 (predictions)
## i Fold2: preprocessor 1/1
## ✓ Fold2: preprocessor 1/1
## i Fold2: preprocessor 1/1, model 1/1
## ✓ Fold2: preprocessor 1/1, model 1/1
## i Fold2: preprocessor 1/1, model 1/1 (extracts)
## i Fold2: preprocessor 1/1, model 1/1 (predictions)
## i Fold3: preprocessor 1/1
## ✓ Fold3: preprocessor 1/1
## i Fold3: preprocessor 1/1, model 1/1
## ✓ Fold3: preprocessor 1/1, model 1/1
## i Fold3: preprocessor 1/1, model 1/1 (extracts)
## i Fold3: preprocessor 1/1, model 1/1 (predictions)
## i Fold4: preprocessor 1/1
## ✓ Fold4: preprocessor 1/1
## i Fold4: preprocessor 1/1, model 1/1
## ✓ Fold4: preprocessor 1/1, model 1/1
## i Fold4: preprocessor 1/1, model 1/1 (extracts)
## i Fold4: preprocessor 1/1, model 1/1 (predictions)
## i Fold5: preprocessor 1/1
## ✓ Fold5: preprocessor 1/1
## i Fold5: preprocessor 1/1, model 1/1
## ✓ Fold5: preprocessor 1/1, model 1/1
## i Fold5: preprocessor 1/1, model 1/1 (extracts)
## i Fold5: preprocessor 1/1, model 1/1 (predictions)
final_rs %>%
  collect_metrics() %>%
  ggplot(aes( x=.metric, y=mean)) +
  geom_bar(stat='identity') 

final_fitted <- last_fit(final_workflow, scotus_split)

collect_metrics(final_fitted)
scotus_fit <- extract_fit_parsnip(final_fitted$.workflow[[1]])

scotus_fit %>%
  tidy() %>%
  filter(term != "Bias") %>%
  mutate(
    sign = case_when(estimate > 0 ~ "Later (after mean year)",
                     TRUE ~ "Earlier (before mean year)"),
    estimate = abs(estimate),
    term = str_remove_all(term, "tfidf_text_")
  ) %>%
  group_by(sign) %>%
  top_n(20, estimate) %>%
  ungroup() %>%
  ggplot(aes(x = estimate,
             y = fct_reorder(term, estimate),
             fill = sign)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  facet_wrap(~sign, scales = "free") +
  labs(
    y = NULL,
    title = paste("Variable importance for predicting year of",
                  "Supreme Court opinions"),
    subtitle = paste("These features are the most important",
                     "in predicting the year of an opinion")
  )

final_fitted %>%
  collect_predictions() %>%
  ggplot(aes(year, .pred)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Truth",
    y = "Predicted year",
    title = paste("Predicted and true years for the testing set of",
                  "Supreme Court opinions"),
    subtitle = "For the testing set, predictions are more reliable after 1850"
  )

6 Ch 16 Dimensionality Reduction

Dimensionality reduction transforms a data set from a high-dimensional space into a low-dimensional space, and can be a good choice when you suspect there are “too many” variables. An excess of variables, usually predictors, can be a problem because it is difficult to understand or visualize data in higher dimensions.

  • Principal component analysis (PCA): unsupervised; produces uncorrelated components (a small recipe sketch follows this list).

  • Partial least squares (PLS): a supervised analog of PCA.

“It tries to find components that simultaneously maximize the variation in the predictors while also maximizing the relationship between those components and the outcome.”

  • Independent component analysis (ICA): produces components that are statistically independent.
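
As a rough illustration of how such a step could be combined with the text features from Section 5, here is a minimal sketch that appends a PCA step to the earlier recipe (it assumes scotus_train from above; the number of components is arbitrary):

pca_rec <- recipe(year ~ text, data = scotus_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1e3) %>%
  step_tfidf(text) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 64) # reduce the 1,000 tf-idf columns to 64 components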

7 Ch 19 Trusting Predictions

How do we avoid inappropriate predictions?

  • Equivocal zones use the predicted values to alert the user that the results may be suspect.

  • Applicability uses the predictors to measure the amount of extrapolation (if any) for new samples.

library(tidymodels)
tidymodels_prefer()
simulate_two_classes <- 
  function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2))  {
    # Slightly correlated predictors
    sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
    dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
    colnames(dat) <- c("x", "y")
    cls <- paste0("class_", 1:2)
    dat <- 
      as_tibble(dat) %>% 
      mutate(
        linear_pred = !!eqn,
        # Add some misclassification noise
        linear_pred = linear_pred + rnorm(n, sd = error),
        prob = binomial()$linkinv(linear_pred),
        class = ifelse(prob > runif(n), cls[1], cls[2]),
        class = factor(class, levels = cls)
      )
    dplyr::select(dat, x, y, class)
  }
set.seed(1901)
training_set <- simulate_two_classes(200)
testing_set  <- simulate_two_classes(50)

We estimate a logistic regression model using Bayesian methods (using the default Gaussian prior distributions for the parameters):

two_class_mod <- 
  logistic_reg() %>% 
  set_engine("stan", seed = 1902) %>% 
  fit(class ~ . + I(x^2)+ I(y^2), data = training_set)
print(two_class_mod, digits = 3)
## parsnip model object
## 
## stan_glm
##  family:       binomial [logit]
##  formula:      class ~ . + I(x^2) + I(y^2)
##  observations: 200
##  predictors:   5
## ------
##             Median MAD_SD
## (Intercept)  1.092  0.287
## x            2.290  0.423
## y            0.314  0.354
## I(x^2)       0.077  0.307
## I(y^2)      -2.465  0.424
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The fitted class boundary is overlaid onto the test set. The data points closest to the class boundary are the most uncertain. If their values changed slightly, their predicted class might change. One simple method for disqualifying some results is to call them “equivocal” if the values are within some range around 50% (or whatever the appropriate probability cutoff might be for a certain situation). Depending on the problem that the model is being applied to, this might indicate that another measurement should be collected or that we require more information before a trustworthy prediction is possible.

Simulated two class data set with a logistic regression fit and decision boundary. The scatter plot of the two classes shows fairly correlated data. The decision boundary is a parabola in the x axis that does a good job of separating the classes.

We could base the width of the band around the cutoff on how performance improves when the uncertain results are removed. However, we should also estimate the reportable rate (the expected proportion of usable results). For example, it would not be useful in real-world situations to have perfect performance but only release predictions on 2% of the samples passed to the model.

Let’s use the test set to determine the balance between improving performance and having enough reportable results. The predictions are created using:

test_pred <- augment(two_class_mod, testing_set)
test_pred %>% head()

With tidymodels, the probably package contains functions for equivocal zones. For cases with two classes, the make_two_class_pred() function creates a factor-like column that has the predicted classes with an equivocal zone:

library(probably)
lvls <- levels(training_set$class)
test_pred <- 
  test_pred %>% 
  mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = 0.15))
test_pred %>% count(.pred_with_eqz)

Rows that are within \(0.50\pm0.15\) are given a value of [EQ].

It is important to realize that [EQ] in this example is not a factor level, but an attribute of that column.

Since the factor levels are the same as the original data, confusion matrices and other statistics can be computed without error. When using standard functions from the yardstick package, the equivocal results are converted to NA and are not used in the calculations that use the hard class predictions. Notice the differences in these confusion matrices:

# All data
test_pred %>% conf_mat(class, .pred_class)
##           Truth
## Prediction class_1 class_2
##    class_1      20       6
##    class_2       5      19
# Reportable results only: 
test_pred %>% conf_mat(class, .pred_with_eqz)
##           Truth
## Prediction class_1 class_2
##    class_1      17       3
##    class_2       5      16

There is also an is_equivocal() function available for filtering these rows from the data.
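
A minimal sketch of using it to keep only the reportable rows:

test_pred %>% 
  filter(!is_equivocal(.pred_with_eqz))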

Does the equivocal zone help improve accuracy? Let’s look over different buffer sizes:

# A function to change the buffer then compute performance.
eq_zone_results <- function(buffer) {
  test_pred <- 
    test_pred %>% 
    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
  acc <- test_pred %>% accuracy(class, .pred_with_eqz)
  rep_rate <- reportable_rate(test_pred$.pred_with_eqz)
  tibble(accuracy = acc$.estimate, reportable = rep_rate, buffer = buffer)
}
# Evaluate a sequence of buffers and plot the results. 
map_dfr(seq(0, .1, length.out = 40), eq_zone_results) %>% 
  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") %>% 
  ggplot(aes(x = buffer, y = value, lty = statistic)) + 
  geom_step(size = 1.2, alpha = 0.8) + 
  labs(y = NULL, lty = NULL)

The effect of equivocal zones on model performance. There is a slight increase in accuracy at the expense of a falling reportable rate.

The figure shows us that accuracy improves by a few percentage points but at the cost of nearly 10% of predictions being unusable! The value of such a compromise depends on how the model predictions will be used.

This analysis focused on using the predicted class probability to disqualify points, since this is a fundamental measure of uncertainty in classification models. A slightly better approach would be to use the standard error of the class probability. Since we used a Bayesian model, the probability estimates we found are actually the mean of the posterior predictive distribution. In other words, the Bayesian model gives us a distribution for the class probability. Measuring the standard deviation of this distribution gives us a standard error of prediction of the probability. In most cases, this value is directly related to the mean class probability. You might recall that, for a Bernoulli random variable with probability \(p\), the variance is \(p(1-p)\). Because of this relationship, the standard error is largest when the probability is 50%. Instead of assigning an equivocal result using the class probability, we could instead use a cutoff on the standard error of prediction.

One important aspect of the standard error of prediction is that it takes into account more than just the class probability. In cases where there is significant extrapolation or aberrant predictor values, the standard error might increase. The benefit of using the standard error of prediction is that it might also flag predictions that are problematic (as opposed to simply uncertain). One reason that we used the Bayesian model is that it naturally estimates the standard error of prediction; not many models can calculate this. For our test set, using type = "pred_int" will produce upper and lower limits, and std_error = TRUE adds a column for that quantity. For 80% intervals:

test_pred <- 
  test_pred %>% 
  bind_cols(
    predict(two_class_mod, testing_set, type = "pred_int", std_error = TRUE)
  )
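
A minimal sketch of using a cutoff on this standard error instead of (or in addition to) the equivocal zone; the threshold here is arbitrary and chosen only for illustration:

# flag predictions whose standard error of prediction exceeds an illustrative cutoff
test_pred %>% 
  mutate(std_error_ok = .std_error < 0.15) %>% 
  count(std_error_ok)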

For our example where the model and data are well-behaved, the figure below shows the standard error of prediction across the space:

The effect of the standard error of prediction overlaid with the test set data. The region of large variation is very similar to the class boundary space. Additionally, there is a large amount of variation to the west of the inflection point of the boundary curve.

Using the standard error as a measure to preclude samples from being predicted can also be applied to models with numeric outcomes.

8 Ch 20 Ensembles of Models

See the textbook chapter on ensembles of models.