---
title: "Geofaceting exercise"
author: "Your Name"
date: "2026-06-18"
params:
  number: 3
  purpose: "The purpose of this exercise is for you to practice with the geofacet package."
format: 
  html:
    toc: true
    toc-location: right
    code-fold: true
    theme: flatly
    self-contained: true
execute:
  echo: true
  warning: false
  message: false
editor_options: 
  chunk_output_type: console
---

```{r}
#| label: setup
#| include: false

knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    fig.path = "figs/",
    fig.width = 7.252,
    fig.height = 4,
    comment = "#>",
    fig.retina = 3
)

```

## Overview of the exercise

We will be working with the ucmr5_dat.csv file, which is loaded below as `ucmr5`. This is a pre-processed and simplified dataset that contains a row for every drinking water sample included in the January 2026 release of the US EPA's Fifth Unregulated Contaminant Monitoring Rule (UCMR 5) survey. This monitoring survey was conducted from 2023-2025, although data are still being released throughout this year. It included 29 different PFAS (per- and polyfluoroalkyl substances, also known as "forever chemicals") and lithium.

The full version of these data includes various columns pertaining to drinking water quality results, but we'll ignore these data for this exercise.

Our goal for this exercise is to produce a `geofacet`-ed plot of the cumulative proportion of samples received over time in each state, broken down further by how big the water systems are in terms of the populations they serve. We'll present these as relative proportions since states can vary dramatically in how many total samples they have.

**Note before we move on: I had some issues using `geofacet` with the latest version of `ggplot2` and [others](https://github.com/hafen/geofacet/issues/448) seem to have this issue too. Try running `devtools::install_version(package = "ggplot2", version = "3.5.2", repos = "http://cran.us.r-project.org")` to change your version of `gggplot2` for this exercise. Afterwards, you can update it under "Tools."**

### Load required packages and datasets

```{r}

library(tidyverse)
library(geofacet)

ucmr5 <- read_csv("data/ucmr5_dat.csv")

```

## Part 1: take a look at the data

Questions \[5 mins\]:

- How many unique public water systems (based on the `PWSID`) do we have?

- How many states are represented? Which states? Do you notice anything unexpected?

- What other columns are shown and what kinds of values do you see?

```{r initial_questions}
# answer questions above here...

```

Let's reclassify the `CollectionDate` column and drop certain values of `State`\* to simplify things before moving on.

\* Note: `State` is shown this way because it also includes Tribal public water systems that don't have a corresponding "primacy" (AKA governing) agency. These systems instead have the EPA region listed, ranging from 1-7. For simplicity, we'll drop everything outside the contiguous US, AK, DC, and HI, although if we wanted to make a more publication-ready plot we should think more about this!

```{r preprocess}
# the datasets package already contains all the state abbrevations
state_abb <- datasets::state.abb

ucmr5 <- ucmr5 %>%
  mutate(CollectionDate = as.Date(CollectionDate, format = "%m/%d/%Y")) %>%
  filter(State %in% c(state_abb, "DC"))

```

------------------------------------------------------------------------

## Part 2: making our data ready to geofacet

Questions \[5-10 mins\]:

- How would you go about calculating the number of samples in each month submitted in UCMR 5 for all system size groups (L and S) in all states? Talk amongst yourselves and then convert your thoughts into code. *Note: the `CollectionDate` column is now stored with a Date class.*

- Ideally, you want to create a dataframe with the cumulative proportion of samples submitted over time in each state *and* for each system size group. How would you do this starting with the `ucmr5` data?

```{r part2}
# here I create new columns with indicators for month and year (using `lubridate`)
ucmr5 <- ucmr5 %>%
  mutate(CollectionMonth = month(CollectionDate),
         CollectionYear = year(CollectionDate))

# TODO: then think about how to count the number of samples in each combination
  # of state, size, month, and year


# TODO: then calculate the cumulative number of samples for each combination 
  # of state and size - make sure you account for the order of dates. 
  # the 'cumsum' function is helpful here.


# TODO: create a new column for each combination of month and year.
  # you can just use substitute in the first of each month to make it simpler.
  # the `as.Date` function is helpful here.

```

------------------------------------------------------------------------

## Part 3: making a geofacet plot

Before we move on, take a brief look at the available grids in the `geofacet` package to make yourself familiar. You can do this by typing `geofacet::` and you should see a list of grids and a few functions. See any interesting or unexpected ones?

Now we're ready to make a plot. As we saw in the lecture slides, you can make these kinds of plots using the `facet_geo` function, which has a `grid` argument. We'll use `us_state_grid2` (which includes AK, HI, and DC), although you'll see several other US grids available that all differ slightly. We'll also consider system size - here, 'L' corresponds to systems serving >10k people and 'S' corresponds to systems serving ≤10k.

**Make a line plot for each state, further grouped by system size (based on the `Size` column), with this grid. Play around with the axis text sizes as well and make sure you label your axes appropriately.**

```{r part3, fig.height=6, fig.width=8}

# ggsave("exercise_plot.png", dpi = 600, width = 8, height = 6, bg = "white")
  
```

Because these plots tend to be pretty big, you can set your code chunk output to appear in the console - click on the gear icon next to "Render" to do this.

If there's time, replace the cumulative proportion with the cumulative sum of received samples. Make sure you set `facet_geo(..., scales = "free_y")` when you do this so that the y-axis is "free" in each mini-plot.

## Quick group discussion/recap

Notice anything interesting? How would you explore these data further?

Feel free now also to re-update `ggplot2` by running `install.packages()` or go to Tools -> Check for package updates.

## Bonus: my favorite package for palettes

There are many packages out there that provide additional color palettes, but the package I turn most to for palettes is the [`MetBrewer` package](https://github.com/BlakeRMills/MetBrewer). It has palettes inspired by works at the Metropolitan Museum of Art in New York. There are many options, including colorblind-friendly, diverging, sequential, and discrete palettes. Check it out!

