Today we are going to review the traditional base R methods of linear regression and then reapply that framework within a simplified version of the tidymodels pipeline.

1 Tidymodels Primer

Tidymodels is a collection of packages, analogous to the tidyverse (which bundles ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, plus related packages like lubridate, dbplyr, DBI, rvest, and readxl). Let’s briefly look at what we have with Tidymodels.

Tidymodels packages

The core tidymodels packages work together to enable a wide variety of modeling approaches:

tidymodels is a meta-package that installs and loads the core packages listed below that you need for modeling and machine learning.

rsample provides infrastructure for efficient data splitting and resampling.

parsnip is a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

recipes is a tidy interface to data pre-processing tools for feature engineering.

workflows bundle your pre-processing, modeling, and post-processing together.

tune helps you optimize the hyperparameters of your model and pre-processing steps.

yardstick measures the effectiveness of models using performance metrics.

broom converts the information in common statistical R objects into user-friendly, predictable formats.

dials creates and manages tuning parameters and parameter grids.

There are a number of additional packages, including corrr and more specialized ones (like spatialsample).

1.1 Key Functions / Packages for Today

  • corrr / correlation plots
  • broom’s tidy, glance, and augment
  • rsample’s initial_split, training, and testing
  • yardstick’s mape and rmse
  • parsnip’s linear_reg and set_engine

2 Regressions Review

Let’s look at our divvy data. It has been augmented and aggregated so that each row contains the citywide number of rides in a given hour along with information on traffic and weather.
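The code below assumes the usual setup chunk loading the packages used throughout (a minimal sketch; the page’s own setup chunk is not shown):

library(tidyverse)   # read_csv(), glimpse(), dplyr verbs, ggplot2
library(tidymodels)  # broom, rsample, parsnip, yardstick
library(lubridate)   # hour(), wday(), month()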

divvy_data <- read_csv('https://github.com/erhla/pa470spring2023/raw/main/static/lectures/week_3_data.csv')

glimpse(divvy_data)
## Rows: 8,760
## Columns: 8
## $ started_hour  <dttm> 2021-01-01 00:00:00, 2021-01-01 01:00:00, 2021-01-01 02…
## $ rides         <dbl> 35, 48, 57, 17, 14, 29, 48, 51, 49, 46, 83, 75, 55, 65, …
## $ avg_speed     <dbl> 26.93289, 25.38051, 20.30257, 19.98055, 24.68098, 26.436…
## $ temp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ wind          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ humidity      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ solar_rad     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ interval_rain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

There are a lot of missing values! Our weather data comes from the open data set on beaches, and our traffic data derives from CTA bus speeds. Beaches and buses are not active 24/7/365.
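Before modeling, it is worth quantifying those gaps. A minimal sketch (using dplyr’s across(); the column names come from the glimpse above) counts missing values per column, overall and by month:

divvy_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# broken out by month, to see when the weather/traffic gaps occur
divvy_data %>%
  group_by(month = month(started_hour)) %>%
  summarise(across(where(is.numeric), ~ sum(is.na(.x))))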

Let’s look at how the variables are related to each other using corrr from tidymodels.

divvy_data %>%
  dplyr::select(-started_hour) %>%
  corrr::correlate() %>%
  corrr::fashion()
divvy_data %>%
  select(-started_hour) %>%
  corrr::correlate() %>%
  corrr::rplot(colors='Brown')

We can see that rides is most correlated with temperature/solar radiation.

How important might categorical variables be? Let’s run an ANOVA (Analysis of Variance) test on hour of day.

aov(rides ~ factor(hour(started_hour)), data=divvy_data) %>% broom::tidy()

What other time based trends should we consider?

ggplot(divvy_data, aes(x=started_hour, y=rides)) +
  geom_smooth()

Starting with solar radiation…let’s look at what broom has to offer.

m1 <- lm(rides ~ solar_rad, data=divvy_data)

m1 %>% broom::tidy()
m1 %>% broom::glance()
m1 %>% broom::augment()

From glance we can see that the R² is 0.274, which is modest.

Residuals?

ggplot(m1 %>% augment(), aes(x=.resid)) +
  geom_density(fill='navy', alpha=.6)

How does solar radiation look?

ggplot(m1 %>% augment(), aes(x=solar_rad)) +
  geom_density(fill='navy', alpha=.6)

ggplot(m1 %>% augment(), aes(x=log(solar_rad))) +
  geom_density(fill='navy', alpha=.6)

m1_log <- lm(rides ~ log(solar_rad), data=divvy_data %>% filter(solar_rad > 0))

m1_log %>% broom::tidy()
m1_log %>% broom::glance()
m1_log %>% broom::augment()
ggplot(m1_log %>% augment(), aes(x=.std.resid)) +
  geom_density(fill='navy', alpha=.6)

This is better, but we can’t really justify dropping the nighttime hours where solar_rad is zero.

Let’s add some time-based variables.

m2 <- lm(rides ~ solar_rad + factor(hour(started_hour)), data=divvy_data)

m2 %>% glance()
m3 <- lm(rides ~ solar_rad + factor(hour(started_hour)) + 
           factor(wday(started_hour)) +
           factor(month(started_hour)) +
           temp + wind + interval_rain + avg_speed, data=divvy_data)

m3 %>% glance()
ggplot(m3 %>% augment(), aes(x=.std.resid)) +
  geom_density(fill='navy', alpha=.6)

3 Regression Prediction

3.1 yardstick

Metric types

There are three main metric types in yardstick: class, class probability, and numeric. Each type of metric has standardized argument syntax, and all metrics return the same kind of output (a tibble with 3 columns). This standardization allows metrics to easily be grouped together and used with grouped data frames for computing on multiple resamples at once. Below are the three types of metrics, along with the types of the inputs they take.

  • Class metrics (hard predictions)
    • truth - factor
    • estimate - factor
  • Class probability metrics (soft predictions)
    • truth - factor
    • estimate - multiple numeric columns containing class probabilities
  • Numeric metrics
    • truth - numeric
    • estimate - numeric
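To make the standardized interface concrete, here is a small hedged sketch with a class metric on a toy tibble (the data and column names are made up for illustration); note the uniform three-column output:

# toy hard predictions: truth and estimate are both factors with the same levels
toy <- tibble::tibble(
  truth    = factor(c("yes", "no", "yes", "yes", "no")),
  estimate = factor(c("yes", "no", "no",  "yes", "no"))
)

# every yardstick metric returns a tibble with .metric, .estimator, and .estimate
yardstick::accuracy(toy, truth = truth, estimate = estimate)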

3.2 Getting Predictions / First Pipeline

Let’s construct a basic tidymodels pipeline. This pipeline will train a regression model and then use it to score the test set. Key components:

  • rsample (splitting data)
  • parsnip (linear_reg, set_engine, set_mode, fit)
  • yardstick (mape, rmse)
  • broom (augment)

3.2.1 Split Data

grouped <- rsample::initial_split(divvy_data)  # defaults to a 75/25 train/test split

train <- training(grouped)
test  <- testing(grouped)

3.2.2 Model Framework

lm_model <-
  parsnip::linear_reg() %>%
  set_engine("lm") %>%
  fit(rides ~ solar_rad + factor(hour(started_hour)) + 
           factor(wday(started_hour)) +
           factor(month(started_hour)) +
           temp + wind + interval_rain + avg_speed, data=train)
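As a quick sanity check (not part of the pipeline itself), a parsnip fit can be tidied much like a plain lm object to inspect the coefficients, since its tidy() method delegates to the underlying engine fit:

lm_model %>% broom::tidy()  # coefficients from the underlying lm fit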

3.2.3 Predictions

# predict() returns just the .pred column; augment() attaches .pred to the test rows.
# We keep May onward, presumably because the weather covariates are largely missing earlier in the year.
preds <- 
  predict(lm_model, test %>% filter(month(started_hour) >= 5)) 

test_preds <- lm_model %>% 
  augment(test %>% filter(month(started_hour) >= 5))

3.2.4 yardstick / evaluate model

  • mape (mean absolute percentage error): the average percentage difference between the forecast and the actual values
  • rmse (root mean square error): the standard deviation of the residuals

yardstick::mape(test_preds, 
     truth = rides,
     estimate = .pred)
yardstick::rmse(test_preds, 
     truth = rides,
     estimate = .pred)
ggplot(test_preds, aes(x=.pred)) +
  geom_density()
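For intuition, here is a minimal sketch recomputing both metrics by hand from test_preds (assuming the rides and .pred columns created above; yardstick drops missing values by default, so na.rm = TRUE mirrors that):

test_preds %>%
  summarise(
    mape = 100 * mean(abs((rides - .pred) / rides), na.rm = TRUE),  # mean absolute percentage error
    rmse = sqrt(mean((rides - .pred)^2, na.rm = TRUE))              # root mean square error
  )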

This model isn’t great. Can you improve it?

4 Unsupervised

Tidymodels K Means

set.seed(27)

centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(100, 150, 50),  # number points in each cluster
  x1 = c(5, 0, -3),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points) %>% 
  unnest(cols = c(x1, x2))

ggplot(labelled_points, aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.5)

points <- 
  labelled_points %>% 
  select(-cluster)

kclust <- kmeans(points, centers = 3)
kclust
## K-means clustering with 3 clusters of sizes 148, 51, 101
## 
## Cluster means:
##            x1        x2
## 1  0.08853475  1.045461
## 2 -3.14292460 -2.000043
## 3  5.00401249 -1.045811
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 298.9415 108.8112 243.2092
##  (between_SS / total_SS =  82.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
summary(kclust)
##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
  • cluster (300 values) contains information about each point
  • centers, withinss, and size (3 values) contain information about each cluster
  • totss, tot.withinss, betweenss, and iter (1 value) contain information about the full clustering
augment(kclust, points)  # adds each point's cluster assignment (.cluster) to the original data
tidy(kclust)             # one row per cluster: center coordinates, size, withinss
glance(kclust)           # one-row summary of the whole clustering (totss, tot.withinss, betweenss, iter)

4.1 Exploratory clustering

While these summaries are useful, they would not have been too difficult to extract from the data set yourself. The real power comes from combining these analyses with other tools like dplyr.

Let’s say we want to explore the effect of different choices of k, from 1 to 9, on this clustering. First cluster the data 9 times, each using a different value of k, then create columns containing the tidied, glanced and augmented data:

kclusts <- 
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(points, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, points)
  )

kclusts

We can turn these into three separate data sets, each representing a different kind of output: one using tidy(), one using augment(), and one using glance().

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

Now we can plot the original points using the data from augment(), with each point colored according to the predicted cluster.

p1 <- 
  ggplot(assignments, aes(x = x1, y = x2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) + 
  facet_wrap(~ k)
p1

Already we get a good sense of the proper number of clusters (3), and how the k-means algorithm functions when k is too high or too low. We can then add the centers of the clusters using the data from tidy():

p2 <- p1 + geom_point(data = clusters, size = 10, shape = "x")
p2

The data from glance() fills a different but equally important purpose; it lets us view trends of some summary statistics across values of k. Of particular interest is the total within sum of squares, saved in the tot.withinss column.

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point()

This represents the variance within the clusters. It decreases as k increases, but notice a bend (or “elbow”) around k = 3. This bend indicates that additional clusters beyond the third have little value.
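To read the elbow off numerically rather than by eye, a rough sketch (purely illustrative; methods like the gap statistic are more principled) looks at how much each additional cluster reduces tot.withinss:

clusterings %>%
  select(k, tot.withinss) %>%
  mutate(pct_drop = 100 * (lag(tot.withinss) - tot.withinss) / lag(tot.withinss))  # % reduction gained by the k-th cluster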