Final Project: GW Tobacco History Completeness Pilot Dashboard | PUBH 6199: Visualizing Data with R

Author

Affiliation

Sora Ely; Ashlan Jackson

George Washington University

Published

June 26, 2025

Code

library(readr)
library(tidyverse)


#| label: setup
#| include: false

knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    fig.path = "figs/",
    fig.width = 7.252,
    fig.height = 4,
    comment = "#>",
    fig.retina = 3
)

Introduction

Cancer is the leading cause of death among American adults >45yo; lung cancer is the leading cause of cancer death and 3rd leading cause of death in adults in the U.S. (and worldwide) [1]. Lung cancer is responsible for more deaths annually than the next 3 leading causes of cancer death (colon, breast, prostate) combined [2]. This is also true in the District. There is a screening test for lung cancer that reduces lung cancer mortality by at least 20% [3], but lung cancer screening (LCS) is severely under-utilized more than a decade after the USPSTF first recommended it [4]. Because lung cancer is by far the #1 cancer killer, LCS could potentially save more lives than breast or colon cancer screening. Yet, while screening participation for breast cancer reached 81% in Washington, DC in 2020 and 76% nationwide in 2021 [5], the LCS rate was only 1.5% in DC and 4.5% nationally in 2023 [6].

Because the eligibility criteria for LCS require detailed tobacco history information, the tobacco history documented in discrete data fields in the patient electronic chart are often inadequate to determine eligibility and prompt screening. Ensuring a high level of complete tobacco history documentation is therefore a critical first step to increasing LCS and decreasing lung cancer mortality.

Research Question

We were interested in examining (and tracking) the completeness rates of tobacco history among adult patient visits at GW Medicine as part of an ongoing pilot program to increase completeness rates.

Primary Research Q
**How does the percentage of visits with complete tobacco history compare between specialties/staff participating (“Pilot”) versus not participating (“Non-Pilot”) in the pilot program?** Has that changed over time (especially since the start of the pilot program)?
Secondary Research Q
Who are the top and bottom performers among all specialties and staff?

Data Sources

Data routinely collected for performance tracking as part of this approved QI project was fully de-identified and modified for use in this project.

Data De-identification - The primary data was fully de-identified of any patient-related data.
- Furthermore, the staff were also fully de-identified and assigned a unique ID number (and for the purposes of this project, no master list/link was retained; re-identification is not possible from the anonymized datasets).

Key Variables Data was structured such that each data point (row) reflects a visit (as each visit is an opportunity to collect a complete tobacco history), not a patient.

Data Wrangling

Several R packages to clean and explore the data. Tidyverse handled data manipulation, lubridate helped parse dates and group data by week and month, ggplot2 was for creating bar charts, shiny made an interactive dashboard, and quarto was for putting the report together. We filtered out “Pilot” and “Non-Pilot” groups to keep things focused on real specialties. Then we calculated how often each specialty completed tobacco history documentation, making sure to compare fairly since visit numbers vary.

Code

#| eval: true
install.packages("shiny")

#> # Downloading packages -------------------------------------------------------
#> - Downloading shiny from CRAN ...               OK [4.3 Mb in 0.63s]
#> - Downloading commonmark from CRAN ...          OK [126.1 Kb in 0.46s]
#> - Downloading promises from CRAN ...            OK [1.6 Mb in 0.51s]
#> - Downloading otel from CRAN ...                OK [273.3 Kb in 0.47s]
#> Successfully downloaded 4 packages in 3.1 seconds.
#> 
#> The following package(s) will be installed:
#> - commonmark [2.0.0]
#> - otel       [0.2.0]
#> - promises   [1.5.0]
#> - shiny      [1.13.0]
#> These packages will be installed into "~/Documents/Teaching/PUBH6199/hw6-bart copy/renv/library/macos/R-4.4/aarch64-apple-darwin20".
#> 
#> # Installing packages --------------------------------------------------------
#> - Installing commonmark ...                     OK [installed binary and cached]
#> - Installing otel ...                           OK [installed binary and cached]
#> - Installing promises ...                       OK [installed binary and cached]
#> - Installing shiny ...                          OK [installed binary and cached in 0.28s]
#> Successfully installed 4 packages in 0.7 seconds.

Code

install.packages("ggrepel")

#> # Downloading packages -------------------------------------------------------
#> - Downloading ggrepel from CRAN ...             OK [351.3 Kb in 0.48s]
#> - Downloading S7 from CRAN ...                  OK [314.8 Kb in 0.35s]
#> Successfully downloaded 2 packages in 1.1 seconds.
#> 
#> The following package(s) will be installed:
#> - ggrepel [0.9.8]
#> - S7      [0.2.1-1]
#> These packages will be installed into "~/Documents/Teaching/PUBH6199/hw6-bart copy/renv/library/macos/R-4.4/aarch64-apple-darwin20".
#> 
#> # Installing packages --------------------------------------------------------
#> - Installing S7 ...                             OK [installed binary and cached]
#> - Installing ggrepel ...                        OK [installed binary and cached]
#> Successfully installed 2 packages in 0.19 seconds.

Code

install.packages("viridis")

#> The following package(s) will be installed:
#> - viridis [0.6.5]
#> These packages will be installed into "~/Documents/Teaching/PUBH6199/hw6-bart copy/renv/library/macos/R-4.4/aarch64-apple-darwin20".
#> 
#> # Installing packages --------------------------------------------------------
#> - Installing viridis ...                        OK [linked from cache]
#> Successfully installed 1 package in 2.4 milliseconds.

Code

#| output: FALSE

library(viridisLite)
library(viridis)
tobacco_data_weekly_modified <- read_csv("data/tobacco_data_weeklymodified.csv")

tobacco_data_monthly_moified <- read_csv("data/tobacco_data_monthlymodified.csv")

Data Visualizations

Visualization 1: “Highest and Lowest Departments with Complete Tobacco History (Weekly)”

A weekly bar chart showing documentation rates by specialty — Nephrology and Gastroenterology came out on top, while Infectious Disease and Radiation Oncology were at the bottom. This plot will help providers diagnoses lung issues sooner.

Code

library(ggplot2)
library(dplyr)
library(ggrepel)
library(viridis)

summary_groups <- tobacco_data_weekly_modified %>%
  filter(!group %in% c("Pilot", "Non-Pilot")) %>%
  filter(week_num == max(week_num)) %>%
  group_by(group) %>%
  summarize(final_value = max(percent_complete, na.rm = TRUE)) %>%
  arrange(desc(final_value))

top_groups <- summary_groups %>% slice_head(n = 2) %>% pull(group)
bottom_groups <- summary_groups %>% slice_tail(n = 2) %>% pull(group)
highlight_groups <- c(top_groups, bottom_groups)

highlight_colors <- c(
  setNames(rep("steelblue", 2), top_groups),
  setNames(rep("firebrick", 2), bottom_groups),
  "Other" = "gray80"
)

ggplot(
  tobacco_data_weekly_modified %>%
    filter(!group %in% c("Pilot", "Non-Pilot")) %>%
    mutate(line_group = ifelse(group %in% highlight_groups, group, "Other")),
  aes(x = week_num, y = percent_complete, group = group)
) +
  geom_line(aes(color = line_group, size = line_group, alpha = line_group)) +
  geom_point(aes(color = line_group, alpha = line_group), size = 2) +
  geom_text_repel(
    data = tobacco_data_weekly_modified %>%
      filter(group %in% highlight_groups) %>%
      group_by(group) %>%
      filter(week_num == max(week_num)),
    aes(label = group, color = group),
    nudge_x = 0.3,
    hjust = 0,
    segment.color = NA,
    size = 4,
    show.legend = FALSE
  ) +
  scale_color_manual(values = highlight_colors) +
  scale_size_manual(values = c(setNames(rep(1.4, 4), highlight_groups), "Other" = 0.6)) +
  scale_alpha_manual(values = c(setNames(rep(1, 4), highlight_groups), "Other" = 0.3)) +
  scale_x_continuous(
    breaks = unique(tobacco_data_weekly_modified$week_num),
    labels = paste("Week", unique(tobacco_data_weekly_modified$week_num)),
    expand = expansion(mult = c(0.05, 0.2))
  ) +
  labs(
    title = "Highest and Lowest Departments with Complete Tobacco History (Weekly)",
    x = "Week of",
    y = "Percent Complete"
  ) +
  theme_minimal() +  
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)  
  )

Visualization 2: Highest and Lowest Departments Complete Tobacco History (Monthly)“,

This plot shows the departments with the highest and lowest percentages of a complete tobacco history by month and this plot will help providers diagnoses lung issues sooner.This shows a slightly different picture — Physical Medicine and Rehabilitation did best, and Rheumatology and Radiation Oncology were lowest.

Code

library(ggplot2)
library(dplyr)
library(ggrepel)
library(viridis)

clean_data <- tobacco_data_monthly_moified %>%
  filter(!group %in% c("Pilot", "Non-Pilot", "Geriatrics (Ingleside)"))

summary_groups <- clean_data %>%
  filter(month_num == max(month_num)) %>%
  group_by(group) %>%
  summarize(final_value = max(percent_complete, na.rm = TRUE)) %>%
  arrange(desc(final_value))

top_groups <- summary_groups %>% slice_head(n = 2) %>% pull(group)
bottom_groups <- summary_groups %>% slice_tail(n = 2) %>% pull(group)
highlight_groups <- c(top_groups, bottom_groups)

highlight_colors <- c(
  setNames(rep("steelblue", 2), top_groups),
  setNames(rep("firebrick", 2), bottom_groups),
  "Other" = "gray80"
)

plot_data <- clean_data %>%
  mutate(line_group = ifelse(group %in% highlight_groups, group, "Other"))

ggplot(plot_data, aes(x = month_num, y = percent_complete, group = group)) +
  geom_line(aes(color = line_group, size = line_group, alpha = line_group)) +
  geom_point(aes(color = line_group, alpha = line_group), size = 2) +
  geom_text_repel(
    data = plot_data %>%
      filter(group %in% highlight_groups) %>%
      group_by(group) %>%
      filter(month_num == max(month_num)),
    aes(label = group, color = group),
    nudge_x = 0.3,
    hjust = 0,
    segment.color = NA,
    size = 4,
    show.legend = FALSE
  ) +
  scale_color_manual(values = highlight_colors) +
  scale_size_manual(values = c(setNames(rep(1.4, 4), highlight_groups), "Other" = 0.6)) +
  scale_alpha_manual(values = c(setNames(rep(1, 4), highlight_groups), "Other" = 0.3)) +
  scale_x_continuous(
    breaks = 1:12,
    labels = month.abb[1:12], 
    expand = expansion(mult = c(0.05, 0.2))
  ) +
  scale_y_continuous(
    limits = c(0, 1),
    labels = scales::percent_format(accuracy = 1)
  ) +
  labs(
    title = "Highest and Lowest Departments Complete Tobacco History (Monthly)",
    x = "Month",
    y = "Percent Complete"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Shiny app: GW Tobacco History Initiative Dashboard

The Shiny app is a dashboard that allows users to interactively explore the completeness of tobacco history documentation across different specialties & staff at GW Medicine. Users can select time intervals and ranges to view trends, compare performance between pilot and non-pilot groups, and identify top performers.

Click here to open the interactive Shiny app

Features

Interactive Filters: Users can view data by:
- time intervals (weekly or monthly)
- by pilot/non-pilot groups
- for specialties or staff
Visualizations: The app includes bar charts that show the completeness rates of tobacco history documentation by specialty and staff, highlighting the top and bottom performers.
- Bar Chart of Pilot v Non-Pilot Groups:
  This chart compares the completeness rates of tobacco history documentation between pilot and non-pilot groups over time, allowing us to explore how the pilot program has impacted complete documentation rates.
- Top 10 Leader Board:
  This chart displays the top 10 specialties or staff with the highest average completeness rates for the selected time period, allowing us to identify the best performers for positive feedback & to motivate peers.

Shiny Assistant’s response

It was a pleasant surprise that actually completion rates are higher than expected overall, with most being between 70-80%.
There was more variability among % complete among staff than among specialties.
There has disappointingly NOT been a clear trend toward improvement over other specialties/staff as currently visualized among the pilot group, but this is still a work in progress!

Limitations

The design of the visualizations & dashboard was somewhat limited by the large number of specialties and especially staff, which made it challenging to visualize all data clearly in a single plot.

Additionally, it was important to maintain full anonymity for this public project, so we could not include staff names or data linkages, so they cannot use this dashboard in its current form to see their own performance.

The authentication (login-required) mode of shinyapps.io requires a $50/mo or $550/y subscription, and so more detailed information could not be used.

Importantly, the data used in this project was incomplete – only a subset of visits were included and development continues on creating a final dataset pulled from the EMR. Results may differ with the complete data.

Conclusion

This was actually the first time we were able to visualize data from this pilot project, although it was an incomplete dataset. Even though the trends are not as promising as we had hoped, it’s still exciting to see the data displayed. It will also serve as a great tool for reporting on the pilot program’s progress and outcomes with the funding agency and GW Medicine leadership.

A version of this dashboard will go into actual use at GW Medicine to help track the completeness of tobacco history documentation over time, and to identify top performers. The pilot participants will be able to track and compare their own performance. Hopefully, this will increase engagement and motivation.

References

Centers for Disease Control and Prevention, National Center for Health Statistics. National Vital Statistics System, Mortality 2018-2023 on CDC WONDER Online Database, released in 2024. Data are from the Multiple Cause of Death Files, 2018-2023, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at http://wonder.cdc.gov/ucd-icd10-expanded.html on Jun 22, 2025 5:09:41 PM
American Cancer Society. Facts & Figures 2025. American Cancer Society. Atlanta, Ga. 2025.
Reduced lung-cancer mortality with low-dose computed tomographic screening. (2011). New England Journal of Medicine, 365(5), 395–409. https://doi.org/10.1056/nejmoa1102873
Moyer, V. A. (2014). Screening for lung cancer: U.S. Preventive Services Task Force Recommendation Statement. Annals of Internal Medicine, 160(5), 330–338. https://doi.org/10.7326/m13-2771
American Cancer Society. Cancer Prevention & Early Detection Facts & Figures 2023-2024.
American Lung Association. (2023). (rep.). State of Lung Cancer: 2023 Report. Chicago, IL.
“All analyses were conducted in R (R version 4.0.1)”