1 Introduction

Cook County, Illinois includes almost 2 million parcels, ranging from suburban mansions to thousand-foot skyscrapers. The property tax system is administered by a series of countywide agencies and is responsible for fairly dividing property tax levies (the amounts that taxing bodies such as school districts levy against properties in their districts) among all of these properties.

In the past, assessments in Cook County were highly regressive, meaning lower-valued properties paid a significantly higher effective tax rate than higher-valued properties. A study by the Center for Municipal Finance at the University of Chicago showed that from 2011 to 2015 these regressive assessments led to over $2 billion of property tax liability being shifted onto lower-valued homes instead of higher-valued ones.

This study, alongside a public reckoning, resulted in the election of a new assessor in 2018.

This analysis focuses on single-family homes. Additionally, the IAAO arm’s-length standard was applied to the data using the cmfproperty package; this presents a conservative picture of the housing market in Cook County. Note that homes in Cook County are supposed to be assessed at 10% of their fair market value. For clarity, assessed values here are converted to market values.

The sales ratio is a key unit for analyzing the accuracy of assessments. It is calculated by dividing a property’s assessed value (here, converted to market value) by its sale price at the time of sale. The sales ratios can be evaluated using metrics from the International Association of Assessing Officers (IAAO).
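
For example, a home that sold for $300,000 while assessed at $27,000 (10% of $270,000) has a sales ratio of 270,000 / 300,000 = 0.9. A minimal sketch of the calculation (the column names here are illustrative, not from the actual data):

example_sales <- tibble(
  sale_price     = c(300000, 150000),
  assessed_value = c(27000, 16500)  # at the statutory 10% level
)
example_sales %>%
  mutate(market_value = 10 * assessed_value,        # convert AV to market value
         ratio        = market_value / sale_price)  # sales ratio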

iaao <- cmfproperty::iaao_graphs(stats=stats, ratios=ratios, min_reporting_yr = min_yr, max_reporting_yr = max_yr, jurisdiction_name = 'Cook County')

For 2023, the COD in Cook County was 18.94, which did not meet the IAAO standard for uniformity. (The coefficient of dispersion, or COD, is the average absolute deviation of sales ratios from the median ratio, expressed as a percentage of the median; the IAAO standard for single-family homes is a COD of 15 or less.)

iaao[[2]]

In 2023, the PRD in Cook County was 1.021, which meets the IAAO standard for vertical equity (a price-related differential between 0.98 and 1.03).

iaao[[4]]

In 2023, the PRB in Cook County was 0.004, which indicates that sales ratios increase by 0.4% when home values double. This meets the IAAO standard (a price-related bias coefficient between -0.05 and 0.05).

iaao[[6]]
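
For intuition, these statistics can also be computed directly from the sales ratios. A minimal sketch, assuming a data frame with RATIO, SALE_PRICE, and ASSESSED_VALUE columns as produced by cmfproperty (the PRB formulation here follows the standard IAAO definition):

cod <- function(ratio) {
  # average absolute deviation from the median ratio, as a percent of the median
  med <- median(ratio, na.rm = TRUE)
  100 * mean(abs(ratio - med), na.rm = TRUE) / med
}

prd <- function(ratio, sale_price) {
  # mean ratio over the sale-price-weighted mean ratio; above ~1.03 suggests regressivity
  mean(ratio, na.rm = TRUE) / weighted.mean(ratio, w = sale_price, na.rm = TRUE)
}

prb <- function(ratio, sale_price, assessed) {
  # slope from regressing percent deviation from the median ratio on
  # log2 of a market value proxy
  med <- median(ratio, na.rm = TRUE)
  value <- 0.5 * (sale_price + assessed / med)
  unname(coef(lm(I((ratio - med) / med) ~ I(log(value) / log(2))))[2])
}

# e.g., with(ratios, cod(RATIO)) should approximate the COD reported above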

bs <- cmfproperty::binned_scatter(ratios = ratios, min_reporting_yr = min_yr, max_reporting_yr = max_yr, jurisdiction_name = 'Cook County')

In 2023, the most expensive homes (the top decile) were assessed at 77.1% of their value and the least expensive homes (the bottom decile) were assessed at 91.8%. In other words, the least expensive homes were assessed at 1.19 times the rate applied to the most expensive homes. Across our sample from 2021 to 2023, the most expensive homes were assessed at 77.2% of their value and the least expensive homes were assessed at 82.6%, which is 1.07 times the rate applied to the most expensive homes.

bs[[2]]
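
For intuition, the decile figures above can be approximated directly from the ratios. A sketch, assuming RATIO, SALE_PRICE, and SALE_YEAR columns as produced by cmfproperty:

ratios %>%
  filter(SALE_YEAR == 2023) %>%
  mutate(decile = ntile(SALE_PRICE, 10)) %>%
  filter(decile %in% c(1, 10)) %>%   # bottom and top deciles by sale price
  group_by(decile) %>%
  summarize(avg_ratio = mean(RATIO, na.rm = TRUE))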

2 Modeling Market Values

We will now generate our own assessments for tax year 2023 for the City of Chicago. We have a base model from Part 3, which we will initially state and then expand upon.

Predictions will then be made for 2023 and compared against the baseline of actual assessed values. The workflow is the same as before, except that xgboost requires us to bake our data first.

Note that instead of joining assessments to sales, we will join sales to assessments, since we need to make a prediction for every property.

2.1 Feature Engineering

From Part 3, we will use model 5, which is specified as follows:

sp_log ~ sale_date + char_yrblt + char_bldg_sf +
         char_land_sf + township_name + mean_sale_sqft +
         nearest_cta_stop

# Neighborhood-level sale price per square foot, used for the
# mean_sale_sqft feature
nbhd_avg_chars <- ratios %>% select(pin, nbhd_code, SALE_PRICE) %>% 
  left_join(characteristics %>% 
              select(pin, char_yrblt, char_bldg_sf, char_land_sf) %>% 
              distinct(pin, .keep_all=TRUE), 
            join_by('pin')) %>%
  filter(char_bldg_sf > 0, !is.na(char_bldg_sf)) %>%
  group_by(nbhd_code) %>%
  summarize(
    mean_sale_sqft = mean(SALE_PRICE / char_bldg_sf, na.rm = TRUE),
    median_sale_sqft = median(SALE_PRICE / char_bldg_sf, na.rm = TRUE))

joined_full <- assessments %>%
  filter(township_code >= 70) %>% # Chicago townships
  # certified_tot/board_tot are at the statutory 10% level; multiply by 10
  # to express assessed values as market values
  mutate(assessed_value = if_else(tax_year == 2023, 10*certified_tot, 10*board_tot)) %>%
  select(pin, tax_year, class, township_name, assessed_value) %>%
  left_join(
    sales %>%
      mutate(year = year(ymd(sale_date)),
             sale_date = as_date(sale_date)) %>%
      # drop sales under $10k, excluded deed types, and multisales
      filter(sale_filter_less_than_10k == 0,
             sale_filter_deed_type == 0,
             is_multisale == 0) %>% 
      distinct(pin, year, .keep_all=TRUE) %>%
      select(pin, year, sale_price, sale_date),
    join_by('pin', 'tax_year'=='year')
  ) %>% 
  left_join(characteristics %>% 
              select(pin, char_yrblt, char_bldg_sf, char_land_sf) %>% 
              distinct(pin, .keep_all=TRUE), 
            join_by('pin')) %>%
  # flag the arm's-length sales identified by cmfproperty
  left_join(ratios %>% 
              select(pin, SALE_YEAR) %>% 
              mutate(arms_length = 1), 
            join_by('pin', 'tax_year'=='SALE_YEAR')) %>%
  left_join(parcel_geo %>% select(pin, nbhd_code, GEOID, nearest_cta_stop) %>%
              mutate(nbhd_code = as.character(nbhd_code),
                     nearest_cta_stop = as.double(nearest_cta_stop))) %>%
  left_join(nbhd_avg_chars)
## Joining with `by = join_by(pin)`
## Joining with `by = join_by(nbhd_code)`
model_data <- joined_full %>% 
  # training data: arm's-length sales before 2023 with complete features
  filter(tax_year < 2023, 
         !is.na(char_yrblt), 
         !is.na(sale_price),
         !is.na(GEOID),
         arms_length == 1) %>%
  mutate(sp_log = log(sale_price)) # log sale price is the modeling target

split <- rsample::initial_split(model_data)
train <- training(split) 
test <- testing(split)

reg_model <- boost_tree(trees=200) %>%
  set_mode('regression') %>%
  set_engine('xgboost')

m_five <- recipe(sp_log ~ sale_date +
                         char_yrblt + char_bldg_sf + 
                         char_land_sf + township_name +
                         mean_sale_sqft + nearest_cta_stop, 
                       data=train) %>%
  step_date(sale_date, features = c("dow", "month", "year"), keep_original_cols = FALSE) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_corr(all_predictors()) %>% prep()
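
Since the recipe is already prepped, we can bake it to inspect the design matrix that xgboost will see (new_data = NULL returns the processed training set):

baked_train <- bake(m_five, new_data = NULL)
glimpse(baked_train)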

2.2 New Assessment Models / Model Evaluation

We will create a workflow_set() of three different model types: penalized linear regression (glmnet), random forest, and boosted tree.

linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 250) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

my_set <- workflow_set(
  preproc = list(r5 = m_five),
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)

We will take our training data and resample it 3-fold. Typically we would use more folds, but for computational simplicity three is fine. We then tune each of the three workflows across the resamples, evaluating five candidate parameter combinations per workflow (grid = 5).

train_resample <- rsample::vfold_cv(train, v = 3)

grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )

models <- 
  my_set %>% 
  workflow_map(seed = 1503, 
               verbose = TRUE,
               grid = 5,
               resamples = train_resample, 
               control = grid_ctrl)
## i 1 of 3 tuning:     r5_linear_reg
## ✔ 1 of 3 tuning:     r5_linear_reg (1.1s)
## i 2 of 3 tuning:     r5_random_forest
## i Creating pre-processing data to finalize unknown parameter: mtry
## ✔ 2 of 3 tuning:     r5_random_forest (42s)
## i 3 of 3 tuning:     r5_boosted
## ✔ 3 of 3 tuning:     r5_boosted (2m 12.2s)
autoplot(
   models,
   rank_metric = "rmse",  # <- how to order models
   metric = "rmse",       # <- which metric to visualize
   select_best = TRUE     # <- one point per workflow
) 

autoplot(
   models,
   id = "r5_boosted",     # <- which workflow to plot
   metric = "rmse"        # <- which metric to visualize
) 

models %>% 
  rank_results() %>%
  filter(.metric == 'rmse') %>%
  make_table()
autoplot(models, metric = "rmse") +
  ggrepel::geom_text_repel(aes(label = str_sub(wflow_id, 4, 5)), nudge_x = 1/8, nudge_y = 1/100) +
  theme(legend.position = "none")

collect_metrics(models, summarize=FALSE) %>% 
  filter(.metric == "rmse") %>% 
  mutate(wflow_id = reorder(wflow_id, .estimate)) %>% 
  ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) + 
  geom_line(alpha = .5, lwd = 1.25) 

Results may vary, but one of the five boosted tree candidates beats all of the random forests, while the other four are worse than every random forest. We will run the boosted tree model, but it is clearly very sensitive to tuning, so if the final performance is worse than 0.32 RMSE we would need to revisit our model selection.

Below is our initial ‘best’ model, finalized and evaluated on the held-out test set. We will then want to tune this model more aggressively (a sketch follows the evaluation).

best_results <- 
   models %>% 
   extract_workflow_set_result("r5_boosted") %>%
   select_best(metric = "rmse")

best_results_fit <- 
   models %>% 
   extract_workflow("r5_boosted") %>%
   finalize_workflow(best_results) %>% 
   last_fit(split = split)
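
Because last_fit() trains on the training set and evaluates on the held-out test set, we can pull the test-set metrics directly:

best_results_fit %>% 
   collect_metrics()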

best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = sp_log, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed", y = "predicted")

2.3 Predictions

We now train our model on all of the sales data and run predictions on all properties.

trained <- workflow() %>%
  add_recipe(m_five) %>%
  add_model(reg_model) %>% fit(model_data)

all_predictions <- trained %>% augment(joined_full %>% filter(tax_year == 2023) %>%
                                         mutate(sale_date = ymd('2023-01-01')))

all_predictions %>% 
         select(pin, assessed_value, .pred) %>% 
         mutate(.pred = exp(.pred)) %>% 
         pivot_longer(!pin) %>%
ggplot(aes(x=value, color=name, fill=name)) +
  geom_density(alpha=.3) +
  scale_x_log10(labels=dollar) +
  labs(x = 'Assessed Value', y='Density', 
       fill = 'Type', color='Type', title='Density of Predictions and AV')
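
As a rough check against actual market outcomes, we can also score the 2023 predictions on the arm's-length sales that occurred in 2023 (a sketch; it assumes those sales are present in joined_full):

all_predictions %>%
  filter(arms_length == 1, !is.na(sale_price)) %>%
  mutate(sp_log = log(sale_price)) %>%
  yardstick::rmse(truth = sp_log, estimate = .pred)  # RMSE on the log scale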

To understand which features drive the model's predictions, we use the DALEX package to explain the fitted workflow.

library(DALEXtra)
## Loading required package: DALEX
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
## 
##     explain
explain_reg <- explain_tidymodels(
  trained,
  data = train %>% select(-sp_log),
  y = train$sp_log,
  verbose=TRUE
)
## Preparation of a new explainer is initiated
##   -> model label       :  workflow  (  default  )
##   -> data              :  14722  rows  16  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  14722  values 
##   -> predict function  :  yhat.workflow  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package tidymodels , ver. 1.1.0 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  9.749999 , mean =  12.63874 , max =  15.79478  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -1.197635 , mean =  0.0008017982 , max =  0.8248037  
##   A new explainer has been created!
# permutation-based feature importance, scored by increase in RMSE
targ <- model_parts(explain_reg, loss_function = loss_root_mean_square)
plot(targ)

Now let’s examine our predictions across race and make a map.

cook <- get_acs(geography = "tract", 
              variables = c(white_alone = "B02001_002",
                            total_pop = "B02001_001",
                            medincome = "B19013_001"),
              state=17, county=031,
              output='wide') %>%
  mutate(pct_nonwhite = (total_popE - white_aloneE) / total_popE) %>%
  select(GEOID, pct_nonwhite, medincomeE)
## Getting data from the 2017-2021 5-year ACS
geoid_avg <- all_predictions %>% left_join(
  parcel_geo %>% select(pin, GEOID)) %>%
  group_by(GEOID) %>%
  summarize(
    avg_prediction = mean(.pred, na.rm=T),
    size = n()
  ) %>%
  filter(!is.na(GEOID)) %>%
  left_join(cook)
## Joining with `by = join_by(pin, GEOID)`
## Joining with `by = join_by(GEOID)`
ggplot(geoid_avg, aes(x=exp(avg_prediction), y=pct_nonwhite)) +
  geom_point(alpha=.3, size=3) +
  scale_x_log10(labels=dollar) +
  geom_smooth(se=FALSE) +
  labs(x='Average Value Prediction', 
       y='Pct NonWhite Tract', title='Predicted Value by NonWhite Pop')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

library(leaflet)
geo_data <- chicago_tracts %>%
  left_join(geoid_avg) %>%
  mutate(avg_prediction = round(exp(avg_prediction), -2))
## Joining with `by = join_by(GEOID)`
label_str <- "<strong>Tract %s</strong><br> Avg Value %s<br>" # sprintf template
labels <- sprintf(label_str,
                geo_data$GEOID,
                dollar(geo_data$avg_prediction)) %>% 
  lapply(htmltools::HTML)
### make palette
pal1 <-
  colorQuantile(
    palette = "Oranges",
    domain = geo_data$avg_prediction,
    na.color = "Grey",
    n=7
  )
  
m <- leaflet() %>%
  addTiles() %>% addPolygons(
    data = geo_data,
    fillColor = ~ pal1(avg_prediction),
    weight = 0.5,
    opacity = 0.5,
    color = "white",
    dashArray = 3,
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = labels,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  )
  
m