Assignment
Instead of traditional problem sets, this course has a single four-part assignment in which you build on your previous work each week using new material from the course. You will explore property assessment in Cook County, Illinois, and create an assessment model. After completing the assignment, you will wrap your model into a report that analyzes its effectiveness using the ethical and other frameworks from class, and you will give a brief presentation to the class.
Submissions
Each week you will submit two files on Blackboard: your code (Rmd) file and the knitted output of your code. Blackboard will not accept HTML files, so you must zip the two files together.
Final Submission
Create final_report.Rmd in the reports folder, copying the YAML/framework from part_3.Rmd.
Bring together your previous submissions into one cohesive report. This report should offer a brief overview of the prediction problem (market value from sales), general trends on properties, your model, why your model is better than other models, and any technical or ethical critiques.
Your final submission will build upon your part 3 submission by ‘switching out’ the model you use and adding a conclusion.
New Assessment Models
Mirror Section 15.3 from the textbook.
- Create a workflow_set() of three different model types. You may choose any that are compatible with tidymodels. Suggested models and tuning parameters are below. I encourage you to consider using one model not from this list, but that is optional.
linear_reg_spec <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

rf_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 250) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgb_spec <-
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
             min_n = tune(), sample_size = tune(), trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

my_set <- workflow_set(
  preproc = list(name_for_your_recipe = your_recipe), # REPLACE WITH YOUR RECIPE
  models = list(linear_reg = linear_reg_spec, random_forest = rf_spec, boosted = xgb_spec)
)
The textbook applies slightly different preprocessing steps to these models, but they should work reasonably well with your current recipe. If you have any compatibility issues, I recommend starting with a simple recipe and adding steps back one at a time.
- Now, apply workflow_map() to your workflow_set(). You should resample your training data and create a small grid. Please use at least 3 resamples and a grid of at least 5. This may take a long time (10+ minutes), so I recommend starting very small and gradually increasing your resamples/grid. Note that I’ve included verbose = TRUE here to help you debug; please do not include this output in your final project.
grid_ctrl <-
  control_grid(
    save_pred = FALSE,
    save_workflow = FALSE
  )

grid_results <-
  my_set %>%
  workflow_map(
    seed = 1503,
    resamples = your_resamples, # REPLACE WITH YOUR RESAMPLES
    grid = 5,
    control = grid_ctrl,
    verbose = TRUE
  )
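The your_resamples placeholder above has to be built from your training data. As a minimal sketch, assuming a random split and 3-fold cross-validation (the built-in mtcars data stands in for your sales data, and the object names are only illustrative), the resamples could be created like this:

```r
library(rsample)

set.seed(1503)

# Stand-in data so the sketch runs on its own; replace with your sales data.
your_data <- mtcars

# Hold out a test set first. Use initial_time_split() instead if your
# split should respect sale dates.
your_rsample_data <- initial_split(your_data, prop = 0.8)

# Resample the training portion only; v = 3 meets the minimum of 3 resamples.
your_resamples <- vfold_cv(training(your_rsample_data), v = 3)
your_resamples
```

Starting with v = 3 keeps run time low; you can raise it once the whole pipeline runs without errors.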
- Now, use rank_results() on your selected performance metric and autoplot() to replicate Figure 15.1. Does one model perform significantly better than the others? Select the one you feel is best and finalize it (see Section 15.5).
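As a rough sketch of the ranking and plotting step, assuming grid_results is the output of your workflow_map() call and that RMSE is your chosen metric (swap in your own metric name), the calls could look like this:

```r
library(tidymodels)

# Assumes `grid_results` already exists from the workflow_map() step.
# "rmse" is an assumed metric; replace it with the metric you selected.
ranked <- grid_results %>%
  rank_results(rank_metric = "rmse", select_best = TRUE) %>%
  filter(.metric == "rmse") %>%
  select(wflow_id, model, .config, mean, std_err, rank)
ranked

# One point per workflow, ordered by rank, with error bars
autoplot(
  grid_results,
  rank_metric = "rmse",
  metric = "rmse",
  select_best = TRUE
)
```

The wflow_id column holds the names that extract_workflow_set_result() and extract_workflow() expect. They take the form of your recipe name, an underscore, and your model name.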
best_results <-
  grid_results %>%
  extract_workflow_set_result("best model type name") %>% # REPLACE
  select_best(metric = "your metric") # REPLACE

best_results

best_results_fit <-
  grid_results %>%
  extract_workflow("best model type name") %>% # REPLACE
  finalize_workflow(best_results) %>%
  last_fit(split = your_rsample_data) # this is the output of rsample::initial_time_split() or rsample::initial_split()
- Consider making a simple visualization of predicted vs. observed values from your best model, similar to Figure 15.5.
best_results_fit %>%
  collect_predictions() %>%
  ggplot(aes(x = target_variable, y = .pred)) +
  geom_abline(color = "gray50", lty = 2) +
  geom_point(alpha = 0.5) +
  coord_obs_pred() +
  labs(x = "observed", y = "predicted")
Take only the best performing model
Mirroring Section 14.2.3, take your finalized best workflow/best model from part A and use tune_bayes() to run a small hyperparameter search for your model. You will need to:
- Identify appropriate hyperparameters to be tuned for your chosen model type and set them equal to tune() in your workflow. These can be the same as the ones previously introduced. (Note: do not include mtry in your tuning grid. Note: if you use linear_reg(), you must use the glmnet engine to have tuning parameters.)
- Run a Bayes search
ctrl <- control_bayes(verbose = TRUE)

your_search <-
  your_workflow %>%
  tune_bayes(
    resamples = ..., # REPLACE
    metrics = ..., # REPLACE
    initial = 4,
    iter = 25,
    control = ctrl
  )
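For the metrics argument, one option is a yardstick metric set. The following is a minimal sketch, assuming RMSE is the metric you want the search to optimize (tune_bayes() optimizes the first metric in the set). The object name your_metrics and the tiny data frame are made up purely to show the call:

```r
library(yardstick)

# The first metric listed is the one tune_bayes() optimizes;
# any later ones are only reported.
your_metrics <- metric_set(rmse, rsq)

# Made-up values, only to show how the metric set is applied
toy <- data.frame(truth = c(100, 200, 300), estimate = c(110, 190, 320))
toy_metrics <- your_metrics(toy, truth = truth, estimate = estimate)
toy_metrics
```

You would then pass metrics = your_metrics in the tune_bayes() call.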
- Call show_best() and finalize your model.
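This last step could be sketched as follows, assuming your_search and your_workflow come from your tune_bayes() call, your_rsample_data is your initial split, and RMSE is your metric (all of these are placeholders to replace with your own objects):

```r
library(tidymodels)

# Assumes `your_search`, `your_workflow`, and `your_rsample_data` already exist.
# "rmse" is an assumed metric; replace it with your own.
show_best(your_search, metric = "rmse", n = 5)

best_params <- select_best(your_search, metric = "rmse")

final_fit <- your_workflow %>%
  finalize_workflow(best_params) %>%
  last_fit(split = your_rsample_data)

# Performance on the held-out test set
collect_metrics(final_fit)
```

Because last_fit() trains on the full training set and evaluates once on the test set, the metrics it reports are the ones to quote in your conclusion.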
Part C, Conclusion & Presentation
Write a four-paragraph conclusion for your report. Include information on your model type, its performance on your chosen objective function, and any ethical or implementation issues (e.g., should Cook County use your model?).
In class on 4/11, everyone will give a brief presentation on their work. You may present your knitted Rmd file or pull some of your graphs into a slide deck. Your presentation should be at most five minutes. Broadly, aim to answer whether your model should be implemented, drawing on the information in your conclusion and assignment.
Grading Overview
For each assignment, you will be graded on substantial completion of the assignment (demonstrated by an attempt of all parts). When submitting parts 2, 3, and 4, you will be additionally graded on your incorporation of feedback, new concepts from the course, or the correction of any flagged issues.
The assignment will culminate in a final submission of code/report and presentation. Code will be graded based on reproducibility, conceptual understanding, and accuracy. The report will be an Rmarkdown file which knits together graphs, tables, and ethical frameworks. It should be concise (include only relevant information from Parts 1-4). This report will be used to give a five minute presentation to the class on your model and ethical/technical issues with Cook County property assessment.
Asg. | Points | Category | Notes |
---|---|---|---|
1 | 5 | Substantial Completion (attempted all parts) | |
2 | 5 | Substantial Completion (attempted all parts) | |
2 | 5 | Incorporation of Feedback/New Concepts | From Part 1 |
3 | 10 | Substantial Completion (attempted all parts) | |
3 | 10 | Incorporation of Feedback/New Concepts | From Part 2 |
4 | 30 | Final Code | Reproducible (10), Concepts (10), Accurate (10) |
4 | 20 | Final Report | Via Rmarkdown HTML, contextualized analysis and ethics |
4 | 15 | Final Presentation | 3-5 minute presentation on model and insights |