Code
library(tidyverse)
library(geofacet)
ucmr5 <- read_csv("data/ucmr5_dat.csv")We will be working with the ucmr5_dat.csv file, which is loaded below as ucmr5. This is a pre-processed and simplified dataset that contains a row for every drinking water sample included in the January 2026 release of the US EPA’s Fifth Unregulated Contaminant Monitoring Rule (UCMR 5) survey. This monitoring survey was conducted from 2023-2025, although data are still being released throughout this year. It included 29 different PFAS (per- and polyfluoroalkyl substances, also known as “forever chemicals”) and lithium.
The full version of these data includes various columns pertaining to drinking water quality results, but we’ll ignore these data for this exercise.
Our goal for this exercise is to produce a geofacet-ed plot of the cumulative proportion of samples received over time in each state, broken down further by how big the water systems are in terms of the populations they serve. We’ll present these as relative proportions since states can vary dramatically in how many total samples they have.
Note before we move on: I had some issues using geofacet with the latest version of ggplot2 and others seem to have this issue too. Try running devtools::install_version(package = "ggplot2", version = "3.5.2", repos = "http://cran.us.r-project.org") to change your version of gggplot2 for this exercise. Afterwards, you can update it under “Tools.”
library(tidyverse)
library(geofacet)
ucmr5 <- read_csv("data/ucmr5_dat.csv")Questions [5 mins]:
How many unique public water systems (based on the PWSID) do we have?
How many states are represented? Which states? Do you notice anything unexpected?
What other columns are shown and what kinds of values do you see?
# answer questions above here...Let’s reclassify the CollectionDate column and drop certain values of State* to simplify things before moving on.
* Note: State is shown this way because it also includes Tribal public water systems that don’t have a corresponding “primacy” (AKA governing) agency. These systems instead have the EPA region listed, ranging from 1-7. For simplicity, we’ll drop everything outside the contiguous US, AK, DC, and HI, although if we wanted to make a more publication-ready plot we should think more about this!
# the datasets package already contains all the state abbrevations
state_abb <- datasets::state.abb
ucmr5 <- ucmr5 %>%
mutate(CollectionDate = as.Date(CollectionDate, format = "%m/%d/%Y")) %>%
filter(State %in% c(state_abb, "DC"))Questions [5-10 mins]:
How would you go about calculating the number of samples in each month submitted in UCMR 5 for all system size groups (L and S) in all states? Talk amongst yourselves and then convert your thoughts into code. Note: the CollectionDate column is now stored with a Date class.
Ideally, you want to create a dataframe with the cumulative proportion of samples submitted over time in each state and for each system size group. How would you do this starting with the ucmr5 data?
# here I create new columns with indicators for month and year (using `lubridate`)
ucmr5 <- ucmr5 %>%
mutate(CollectionMonth = month(CollectionDate),
CollectionYear = year(CollectionDate))
# TODO: then think about how to count the number of samples in each combination
# of state, size, month, and year
# TODO: then calculate the cumulative number of samples for each combination
# of state and size - make sure you account for the order of dates.
# the 'cumsum' function is helpful here.
# TODO: create a new column for each combination of month and year.
# you can just use substitute in the first of each month to make it simpler.
# the `as.Date` function is helpful here.Before we move on, take a brief look at the available grids in the geofacet package to make yourself familiar. You can do this by typing geofacet:: and you should see a list of grids and a few functions. See any interesting or unexpected ones?
Now we’re ready to make a plot. As we saw in the lecture slides, you can make these kinds of plots using the facet_geo function, which has a grid argument. We’ll use us_state_grid2 (which includes AK, HI, and DC), although you’ll see several other US grids available that all differ slightly. We’ll also consider system size - here, ‘L’ corresponds to systems serving >10k people and ‘S’ corresponds to systems serving ≤10k.
Make a line plot for each state, further grouped by system size (based on the Size column), with this grid. Play around with the axis text sizes as well and make sure you label your axes appropriately.
# ggsave("exercise_plot.png", dpi = 600, width = 8, height = 6, bg = "white")Because these plots tend to be pretty big, you can set your code chunk output to appear in the console - click on the gear icon next to “Render” to do this.
If there’s time, replace the cumulative proportion with the cumulative sum of received samples. Make sure you set facet_geo(..., scales = "free_y") when you do this so that the y-axis is “free” in each mini-plot.
Notice anything interesting? How would you explore these data further?
Feel free now also to re-update ggplot2 by running install.packages() or go to Tools -> Check for package updates.
There are many packages out there that provide additional color palettes, but the package I turn most to for palettes is the MetBrewer package. It has palettes inspired by works at the Metropolitan Museum of Art in New York. There are many options, including colorblind-friendly, diverging, sequential, and discrete palettes. Check it out!