Cook Part 4 - PA 470

Assignment

Instead of traditional problem sets, this course has a single four-part assignment in which you build upon your previous work each week using new material from the course. You will explore property assessment in Cook County, Illinois and create an assessment model. After completing the assignment, you will wrap your model into a report that analyzes its effectiveness using the ethical and other frameworks from class, and you will make a brief presentation to the class.

Submissions

Each week you will submit two files on Blackboard: your code/Rmd file and the knitted output of your code. Blackboard will not accept HTML files, so you must zip the two files together.

Final Submission

Create final_report.Rmd in the reports folder, copying the yaml/framework from part_3.Rmd.

Bring together your previous submissions into one cohesive report. This report should offer a brief overview of the prediction problem (market value from sales), general trends on properties, your model, why your model is better than other models, and any technical or ethical critiques.

Your final submission will build upon your part 3 submission by ‘switching out’ the model you use and adding a conclusion.

New Assessment Models

Mirror Section 15.3 from the textbook.

Create a workflow_set() of three different model types. You may choose any that are compatible with tidymodels. Suggested models and tuning parameters are below. I encourage you to consider using one model not from this list, but that is optional.

linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 250) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

my_set <- workflow_set(
  preproc = list(name_for_your_recipe = your_recipe), # REPLACE WITH YOUR RECIPE
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)
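If you want to try a model outside the suggested list, one possibility is a single regression tree fit with the rpart engine. This is only an illustration: the name tree_spec and the choice of tuning parameters are assumptions made for this sketch, not a course requirement.

```r
library(tidymodels)

# Hypothetical extra candidate: a single regression tree.
# cost_complexity and min_n are marked for tuning; tree depth keeps its default.
tree_spec <-
  decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_spec
```

A spec like this could be added to the models list of workflow_set() under a name of your choosing, for example tree = tree_spec.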

The textbook applies slightly different preprocessing steps to these models, but they should work reasonably well with your current recipe. If you have any compatibility issues, I recommend starting with a simple recipe and adding steps back one at a time.
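A simple starting recipe might look like the sketch below. The ames data from the modeldata package stands in for your Cook County sales, and the predictors and steps are assumptions chosen only for illustration. Your own recipe will use different variables.

```r
library(tidymodels)

data(ames, package = "modeldata")  # stand-in data, not Cook County sales

# Start with a few predictors and only the steps most models need.
simple_recipe <-
  recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Neighborhood, data = ames) %>%
  step_other(Neighborhood, threshold = 0.01) %>%  # pool rare levels
  step_dummy(all_nominal_predictors()) %>%        # glmnet and xgboost need numeric inputs
  step_zv(all_predictors()) %>%                   # drop constant columns
  step_normalize(all_numeric_predictors())        # put predictors on a common scale

# Check that the recipe runs before adding more steps.
simple_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```

Once this runs cleanly through workflow_map(), steps from your earlier recipe can be added back one at a time to find the one that causes trouble.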

Now, apply workflow_map() to your workflow_set(). You should resample your training data and create a small grid. Please use at least 3 resamples and a grid of at least 5. This may take a long time (10+ minutes). I recommend starting very small and gradually increasing your resamples/grid. Note that I’ve included verbose = TRUE here to help you debug; please do not include this output in your final project.

grid_ctrl <-
   control_grid(
      save_pred = FALSE,
      save_workflow = FALSE
   )

grid_results <-
   my_set %>%
   workflow_map(
      seed = 1503,
      resamples = your_resamples,  # REPLACE WITH YOUR RESAMPLES
      grid = 5,
      control = grid_ctrl,
      verbose = TRUE
   )
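The your_resamples placeholder above could be built along these lines. This is a sketch under assumptions: the ames data stands in for your sales data, train_data is a hypothetical name, and a random split is used where your project may call for initial_time_split(). Resamples are conventionally drawn from the training portion of the split so that the test set is held back for last_fit().

```r
library(tidymodels)

data(ames, package = "modeldata")  # stand-in data, not Cook County sales

set.seed(470)
your_rsample_data <- initial_split(ames, prop = 0.8)
train_data <- training(your_rsample_data)

# Three folds is the minimum asked for; increase v once everything runs.
your_resamples <- vfold_cv(train_data, v = 3)
your_resamples
```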

Now, use rank_results() on your selected performance metric and autoplot() to replicate Figure 15.1. Does one model perform significantly better than the others? Select the one you feel is best and finalize it (see Section 15.5).
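The ranking step can be sketched on a toy workflow set as follows. Everything here is an assumption for illustration: the ames stand-in data, the two small models, the formula preprocessor, and RMSE as the metric. With your own grid_results you would only need the rank_results() and autoplot() calls.

```r
library(tidymodels)

data(ames, package = "modeldata")  # stand-in data, not Cook County sales

set.seed(470)
toy_folds <- vfold_cv(ames, v = 3)

tree_spec <-
  decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

lm_spec <- linear_reg() %>% set_engine("lm")  # nothing to tune, so it is only resampled

toy_results <-
  workflow_set(
    preproc = list(basic = Sale_Price ~ Gr_Liv_Area + Year_Built + Lot_Area),
    models = list(tree = tree_spec, lm = lm_spec)
  ) %>%
  workflow_map(seed = 1503, resamples = toy_folds, grid = 5)

# One row per workflow: its best configuration, ranked by RMSE.
toy_ranked <-
  toy_results %>%
  rank_results(rank_metric = "rmse", select_best = TRUE) %>%
  filter(.metric == "rmse") %>%
  select(wflow_id, model, mean, std_err, rank)
toy_ranked

# Plot in the style of Figure 15.1: best configuration per workflow with error bars.
autoplot(toy_results, rank_metric = "rmse", metric = "rmse", select_best = TRUE)
```

Comparing each mean against its std_err, or checking whether the error bars in the plot overlap, gives a rough sense of whether the gap between models is larger than the resampling noise.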

best_results <- 
   grid_results %>% 
   extract_workflow_set_result("best model type name") %>% # REPLACE
   select_best(metric = "your metric") # REPLACE
best_results

best_results_fit <- 
   grid_results %>% 
   extract_workflow("best model type name") %>% # REPLACE
   finalize_workflow(best_results) %>% 
   last_fit(split = your_rsample_data) #this is the output of rsample::initial_time_split() or rsample::initial_split()

Consider making a simple visualization of predicted versus observed values from your best model, similar to Figure 15.5.

best_results_fit %>% 
   collect_predictions() %>% 
   ggplot(aes(x = target_variable, y = .pred)) + 
   geom_abline(color = "gray50", lty = 2) + 
   geom_point(alpha = 0.5) + 
   coord_obs_pred() + 
   labs(x = "observed", y = "predicted")

Take only the best performing model

Mirroring Section 14.2.3, take your finalized best workflow/best model from part A and use tune_bayes() to run a small iterative search over the hyperparameters of your assessment model. You will need to:

Identify appropriate hyperparameters to be tuned for your chosen model type and set them equal to tune() in your workflow. These can be the same as the ones previously introduced. (Note: do not include mtry among your tuning parameters. Note: if you use linear_reg(), or logistic_reg() for a classification problem, you must use the glmnet engine to have tuning parameters.)
Run a Bayesian search

ctrl <- control_bayes(verbose = TRUE)

your_search <- 
  your_workflow %>%
  tune_bayes(
    resamples = ..., # REPLACE
    metrics = ..., # REPLACE
    initial = 4,
    iter = 25,
    control = ctrl
  )

Call show_best() and finalize your model
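The three steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the expected solution: the ames data stands in for your sales data, a single rpart tree stands in for your best model, RMSE stands in for your metric, and all toy_ names are hypothetical. The search is kept much shorter than the iter = 25 shown above so that it runs quickly.

```r
library(tidymodels)

data(ames, package = "modeldata")  # stand-in data, not Cook County sales

set.seed(470)
toy_split <- initial_split(ames, prop = 0.8)
toy_folds <- vfold_cv(training(toy_split), v = 3)

# Step 1: mark the hyperparameters to tune with tune().
toy_workflow <-
  workflow() %>%
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built + Lot_Area) %>%
  add_model(
    decision_tree(cost_complexity = tune(), min_n = tune()) %>%
      set_engine("rpart") %>%
      set_mode("regression")
  )

# Step 2: run a short Bayesian search over those parameters.
toy_search <-
  toy_workflow %>%
  tune_bayes(
    resamples = toy_folds,
    metrics = metric_set(rmse),
    initial = 5,
    iter = 5,
    control = control_bayes(verbose = FALSE)
  )

# Step 3: inspect the best candidates, then finalize and fit once on the split.
show_best(toy_search, metric = "rmse", n = 3)

toy_final_fit <-
  toy_workflow %>%
  finalize_workflow(select_best(toy_search, metric = "rmse")) %>%
  last_fit(split = toy_split)

collect_metrics(toy_final_fit)
```

The metrics from collect_metrics() here come from the held-out test portion of the split, so they are the numbers to report rather than the resampled estimates from the search.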

Part C, Conclusion & Presentation

Add a four-paragraph conclusion to your report. Include information on your model type, its performance on your chosen objective function, and any ethical or implementation issues (e.g. should Cook County use your model?).

In class on 4/11, everyone will give a brief presentation on their work. You may present your knitted Rmd file or pull some of your graphs into a slide deck. Your presentation should be at most five minutes. Broadly, aim to answer whether your model should be implemented, drawing on the information in your conclusion and assignment.

Grading Overview

For each assignment, you will be graded on substantial completion of the assignment (demonstrated by an attempt of all parts). When submitting parts 2, 3, and 4, you will be additionally graded on your incorporation of feedback, new concepts from the course, or the correction of any flagged issues.

The assignment will culminate in a final submission of code/report and presentation. Code will be graded based on reproducibility, conceptual understanding, and accuracy. The report will be an R Markdown file which knits together graphs, tables, and ethical frameworks. It should be concise (include only relevant information from Parts 1-4). This report will be used to give a five-minute presentation to the class on your model and ethical/technical issues with Cook County property assessment.

Asg.	Points	Category	Notes
1	5	Substantial Completion (attempted all parts)
2	5	Substantial Completion (attempted all parts)
2	5	Incorporation of Feedback/New Concepts	From Part 1
3	10	Substantial Completion (attempted all parts)
3	10	Incorporation of Feedback/New Concepts	From Part 2
4	30	Final Code	Reproducible (10), Concepts (10), Accurate (10)
4	20	Final Report	Via Rmarkdown HTML, contextualized analysis and ethics
4	15	Final Presentation	3-5 minute presentation on model and insights