1 Introduction

Cook County, Illinois includes almost 2 million parcels, ranging from suburban mansions to thousand-foot skyscrapers. The property tax system is administered by a series of countywide agencies and is responsible for fairly dividing property tax levies (the amounts that taxing bodies such as school districts levy against properties in their districts) among all of these properties.

In the past, assessments in Cook County were highly regressive, meaning lower-valued properties paid a significantly higher effective tax rate than higher-valued properties. A study by the Center for Municipal Finance at the University of Chicago showed that from 2011 to 2015 these regressive assessments led to over $2 billion of property tax liability being shifted onto lower-valued homes instead of higher-valued ones.

This study, alongside a public reckoning, resulted in the election of a new assessor in 2018.

This analysis focuses on single-family homes. Additionally, the IAAO arm’s-length standard was applied to the data using the cmfproperty package; this presents a conservative picture of the housing market in Cook County. Note that homes in Cook County are supposed to be assessed at 10% of their fair market value. For clarity, assessed values here are converted to market values.

The sales ratio is a key unit for analyzing the accuracy of assessments. It is calculated by dividing a property’s assessed value (here, converted to market value) by its sale price at the time of sale. The sales ratios can be evaluated using metrics from the International Association of Assessing Officers (IAAO).
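
For example, a home that sold for $300,000 while assessed at $27,000 (10% of $270,000) has a sales ratio of 270,000 / 300,000 = 0.9. A minimal sketch of the calculation (the column names here are illustrative, not from the actual data):

example_sales <- tibble(
  sale_price     = c(300000, 150000),
  assessed_value = c(27000, 16500)  # at the statutory 10% level
)
example_sales %>%
  mutate(market_value = 10 * assessed_value,        # convert AV to market value
         ratio        = market_value / sale_price)  # sales ratio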

iaao <- cmfproperty::iaao_graphs(stats=stats, ratios=ratios, min_reporting_yr = min_yr, max_reporting_yr = max_yr, jurisdiction_name = 'Cook County')

For 2023, the COD in Cook County was 18.94, which did not meet the IAAO standard for uniformity. (The coefficient of dispersion, or COD, is the average absolute deviation of sales ratios from the median ratio, expressed as a percentage of the median; the IAAO standard for single-family homes is a COD of 15 or less.)

iaao[[2]]

In 2023, the PRD in Cook County was 1.021, which meets the IAAO standard for vertical equity (a price-related differential between 0.98 and 1.03).

iaao[[4]]

In 2023, the PRB in Cook County was 0.004, which indicates that sales ratios increase by 0.4% when home values double. This meets the IAAO standard (a price-related bias coefficient between -0.05 and 0.05).

iaao[[6]]
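
For intuition, these statistics can also be computed directly from the sales ratios. A minimal sketch, assuming a data frame with RATIO, SALE_PRICE, and ASSESSED_VALUE columns as produced by cmfproperty (the PRB formulation here follows the standard IAAO definition):

cod <- function(ratio) {
  # average absolute deviation from the median ratio, as a percent of the median
  med <- median(ratio, na.rm = TRUE)
  100 * mean(abs(ratio - med), na.rm = TRUE) / med
}

prd <- function(ratio, sale_price) {
  # mean ratio over the sale-price-weighted mean ratio; above ~1.03 suggests regressivity
  mean(ratio, na.rm = TRUE) / weighted.mean(ratio, w = sale_price, na.rm = TRUE)
}

prb <- function(ratio, sale_price, assessed) {
  # slope from regressing percent deviation from the median ratio on
  # log2 of a market value proxy
  med <- median(ratio, na.rm = TRUE)
  value <- 0.5 * (sale_price + assessed / med)
  unname(coef(lm(I((ratio - med) / med) ~ I(log(value) / log(2))))[2])
}

# e.g., with(ratios, cod(RATIO)) should approximate the COD reported above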

bs <- cmfproperty::binned_scatter(ratios = ratios, min_reporting_yr = min_yr, max_reporting_yr = max_yr, jurisdiction_name = 'Cook County')

In 2023, the most expensive homes (the top decile) were assessed at 77.1% of their value and the least expensive homes (the bottom decile) were assessed at 91.8%. In other words, the least expensive homes were assessed at 1.19 times the rate applied to the most expensive homes. Across our sample from 2021 to 2023, the most expensive homes were assessed at 77.2% of their value and the least expensive homes were assessed at 82.6%, which is 1.07 times the rate applied to the most expensive homes.

bs[[2]]
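
For intuition, the decile figures above can be approximated directly from the ratios. A sketch, assuming RATIO, SALE_PRICE, and SALE_YEAR columns as produced by cmfproperty:

ratios %>%
  filter(SALE_YEAR == 2023) %>%
  mutate(decile = ntile(SALE_PRICE, 10)) %>%
  filter(decile %in% c(1, 10)) %>%   # bottom and top deciles by sale price
  group_by(decile) %>%
  summarize(avg_ratio = mean(RATIO, na.rm = TRUE))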

2 Modeling Market Values

We will now generate our own assessments for tax year 2023 for the City of Chicago. We have a base model from Part 3, which we will initially state and then expand upon.

Predictions will then be made for 2023 and compared against the baseline of actual assessed values. The workflow is the same as before, except that xgboost requires us to bake our data first.

Note that instead of joining assessments to sales, we will join sales to assessments, since we need to make a prediction for every property.

2.1 Feature Engineering

From Part 3, we will use model 5, which is specified as follows:

sp_log ~ sale_date + char_yrblt + char_bldg_sf +
         char_land_sf + township_name + mean_sale_sqft +
         nearest_cta_stop

# Neighborhood-level sale price per square foot, used for the
# mean_sale_sqft feature
nbhd_avg_chars <- ratios %>% select(pin, nbhd_code, SALE_PRICE) %>% 
  left_join(characteristics %>% 
              select(pin, char_yrblt, char_bldg_sf, char_land_sf) %>% 
              distinct(pin, .keep_all=TRUE), 
            join_by('pin')) %>%
  filter(char_bldg_sf > 0, !is.na(char_bldg_sf)) %>%
  group_by(nbhd_code) %>%
  summarize(
    mean_sale_sqft = mean(SALE_PRICE / char_bldg_sf, na.rm = TRUE),
    median_sale_sqft = median(SALE_PRICE / char_bldg_sf, na.rm = TRUE))

joined_full <- assessments %>%
  filter(township_code >= 70) %>% # Chicago townships
  # certified_tot/board_tot are at the statutory 10% level; multiply by 10
  # to express assessed values as market values
  mutate(assessed_value = if_else(tax_year == 2023, 10*certified_tot, 10*board_tot)) %>%
  select(pin, tax_year, class, township_name, assessed_value) %>%
  left_join(
    sales %>%
      mutate(year = year(ymd(sale_date)),
             sale_date = as_date(sale_date)) %>%
      # drop sales under $10k, excluded deed types, and multisales
      filter(sale_filter_less_than_10k == 0,
             sale_filter_deed_type == 0,
             is_multisale == 0) %>% 
      distinct(pin, year, .keep_all=TRUE) %>%
      select(pin, year, sale_price, sale_date),
    join_by('pin', 'tax_year'=='year')
  ) %>% 
  left_join(characteristics %>% 
              select(pin, char_yrblt, char_bldg_sf, char_land_sf) %>% 
              distinct(pin, .keep_all=TRUE), 
            join_by('pin')) %>%
  # flag the arm's-length sales identified by cmfproperty
  left_join(ratios %>% 
              select(pin, SALE_YEAR) %>% 
              mutate(arms_length = 1), 
            join_by('pin', 'tax_year'=='SALE_YEAR')) %>%
  left_join(parcel_geo %>% select(pin, nbhd_code, GEOID, nearest_cta_stop) %>%
              mutate(nbhd_code = as.character(nbhd_code),
                     nearest_cta_stop = as.double(nearest_cta_stop))) %>%
  left_join(nbhd_avg_chars)
## Joining with `by = join_by(pin)`
## Joining with `by = join_by(nbhd_code)`
model_data <- joined_full %>% 
  # training data: arm's-length sales before 2023 with complete features
  filter(tax_year < 2023, 
         !is.na(char_yrblt), 
         !is.na(sale_price),
         !is.na(GEOID),
         arms_length == 1) %>%
  mutate(sp_log = log(sale_price)) # log sale price is the modeling target

split <- rsample::initial_split(model_data)
train <- training(split) 
test <- testing(split)

reg_model <- boost_tree(trees=200) %>%
  set_mode('regression') %>%
  set_engine('xgboost')

m_five <- recipe(sp_log ~ sale_date +
                         char_yrblt + char_bldg_sf + 
                         char_land_sf + township_name +
                         mean_sale_sqft + nearest_cta_stop, 
                       data=train) %>%
  step_date(sale_date, features = c("dow", "month", "year"), keep_original_cols = FALSE) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_predictors()) %>% prep()
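
Since the recipe is already prepped, we can bake it to inspect the design matrix that xgboost will see (new_data = NULL returns the processed training set):

baked_train <- bake(m_five, new_data = NULL)
glimpse(baked_train)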

2.2 New Assessment Models / Model Evaluation

We will create a workflow_set() of three different model types: penalized linear regression (glmnet), random forest, and boosted tree.

linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 250) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

my_set <- workflow_set(
  preproc = list(r5 = m_five),
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)

We will take our training data and resample it 3-fold. Typically we would use more folds, but for computational simplicity three is fine. We then tune each of the three workflows across the resamples, evaluating five candidate parameter combinations per workflow (grid = 5).

train_resample <- rsample::vfold_cv(train, v = 3)

grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )

models <- 
  my_set %>% 
  workflow_map(seed = 1503, 
               verbose = TRUE,
               grid = 5,
               resamples = train_resample, 
               control = grid_ctrl)
## i 1 of 3 tuning:     r5_linear_reg
## ✔ 1 of 3 tuning:     r5_linear_reg (1.1s)
## i 2 of 3 tuning:     r5_random_forest
## i Creating pre-processing data to finalize unknown parameter: mtry
## ✔ 2 of 3 tuning:     r5_random_forest (42s)
## i 3 of 3 tuning:     r5_boosted
## ✔ 3 of 3 tuning:     r5_boosted (2m 12.2s)
autoplot(
   models,
   rank_metric = "rmse",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
   select_best = TRUE     # <- one point per workflow
) 

autoplot(
   models,
   id = "r5_boosted",     # <- which workflow to plot
   metric = "rmse"        # <- which metric to visualize
) 

models %>% 
  rank_results() %>%
  filter(.metric == 'rmse') %>%
  make_table()
autoplot(models, metric = "rmse") +
  ggrepel::geom_text_repel(aes(label = str_sub(wflow_id, 4, 5)), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

collect_metrics(models, summarize=FALSE) %>% 
  filter(.metric == "rmse") %>% 
  mutate(wflow_id = reorder(wflow_id, .estimate)) %>% 
  ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) + 
  geom_line(alpha = .5, lwd = 1.25) 

Results may vary, but one of the five boosted tree candidates beats all of the random forests, while the other four are worse than every random forest. We will run the boosted tree model, but it is clearly very sensitive to tuning, so if the final performance is worse than 0.32 RMSE we would need to revisit our model selection.

Below is our initial ‘best’ model, finalized and evaluated on the held-out test set. We will then want to tune this model more aggressively (a sketch follows the evaluation).

best_results <- 
   models %>% 
   extract_workflow_set_result("r5_boosted") %>%
   select_best(metric = "rmse")

best_results_fit <- 
   models %>% 
   extract_workflow("r5_boosted") %>%
   finalize_workflow(best_results) %>% 
   last_fit(split = split)
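
Because last_fit() trains on the training set and evaluates on the held-out test set, we can pull the test-set metrics directly:

best_results_fit %>% 
   collect_metrics()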

best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = sp_log, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed", y = "predicted")

2.3 Predictions

We now train our model on all of the sales data and run predictions on all properties.

trained <- workflow() %>%
  add_recipe(m_five) %>%
  add_model(reg_model) %>% fit(model_data)

all_predictions <- trained %>% augment(joined_full %>% filter(tax_year == 2023) %>%
                                         mutate(sale_date = ymd('2023-01-01')))

all_predictions %>% 
         select(pin, assessed_value, .pred) %>% 
         mutate(.pred = exp(.pred)) %>% 
         pivot_longer(!pin) %>%
ggplot(aes(x=value, color=name, fill=name)) +
  geom_density(alpha=.3) +
  scale_x_log10(labels=dollar) +
  labs(x = 'Assessed Value', y='Density', 
       fill = 'Type', color='Type', title='Density of Predictions and AV')
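
As a rough check against actual market outcomes, we can also score the 2023 predictions on the arm's-length sales that occurred in 2023 (a sketch; it assumes those sales are present in joined_full):

all_predictions %>%
  filter(arms_length == 1, !is.na(sale_price)) %>%
  mutate(sp_log = log(sale_price)) %>%
  yardstick::rmse(truth = sp_log, estimate = .pred)  # RMSE on the log scale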

To understand which features drive the model's predictions, we use the DALEX package to explain the fitted workflow.

library(DALEXtra)
## Loading required package: DALEX
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
## 
##     explain
explain_reg <- explain_tidymodels(
  trained,
  data = train %>% select(-sp_log),
  y = train$sp_log,
  verbose=TRUE
)
## Preparation of a new explainer is initiated
##   -> model label       :  workflow  (  default  )
##   -> data              :  14722  rows  16  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  14722  values 
##   -> predict function  :  yhat.workflow  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package tidymodels , ver. 1.1.0 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  9.749999 , mean =  12.63874 , max =  15.79478  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -1.197635 , mean =  0.0008017982 , max =  0.8248037  
##   A new explainer has been created!
# permutation-based feature importance, scored by increase in RMSE
targ <- model_parts(explain_reg, loss_function = loss_root_mean_square)
plot(targ)

Now let’s examine our predictions across race and make a map.

cook <- get_acs(geography = "tract", 
              variables = c(white_alone = "B02001_002",
                            total_pop = "B02001_001",
                            medincome = "B19013_001"),
              state=17, county=031,
              output='wide') %>%
  mutate(pct_nonwhite = (total_popE - white_aloneE) / total_popE) %>%
  select(GEOID, pct_nonwhite, medincomeE)
## Getting data from the 2017-2021 5-year ACS
geoid_avg <- all_predictions %>% left_join(
  parcel_geo %>% select(pin, GEOID)) %>%
  group_by(GEOID) %>%
  summarize(
    avg_prediction = mean(.pred, na.rm=T),
    size = n()
  ) %>%
  filter(!is.na(GEOID)) %>%
  left_join(cook)
## Joining with `by = join_by(pin, GEOID)`
## Joining with `by = join_by(GEOID)`
ggplot(geoid_avg, aes(x=exp(avg_prediction), y=pct_nonwhite)) +
  geom_point(alpha=.3, size=3) +
  scale_x_log10(labels=dollar) +
  geom_smooth(se=FALSE) +
  labs(x='Average Value Prediction', 
       y='Pct NonWhite Tract', title='Predicted Value by NonWhite Pop')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

library(leaflet)
geo_data <- chicago_tracts %>%
  left_join(geoid_avg) %>%
  mutate(avg_prediction = round(exp(avg_prediction), -2))
## Joining with `by = join_by(GEOID)`
label_str <- "<strong>Tract %s</strong><br> Avg Value %s<br>" # sprintf template
labels <- sprintf(label_str,
                geo_data$GEOID,
                dollar(geo_data$avg_prediction)) %>% 
  lapply(htmltools::HTML)
### make palette
pal1 <-
  colorQuantile(
    palette = "Oranges",
    domain = geo_data$avg_prediction,
    na.color = "Grey",
    n=7
  )
  
m <- leaflet() %>%
  addTiles() %>% addPolygons(
    data = geo_data,
    fillColor = ~ pal1(avg_prediction),
    weight = 0.5,
    opacity = 0.5,
    color = "white",
    dashArray = 3,
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  )
  
m