PUBH 6199: Visualizing Data with R, Summer 2025
2025-05-20
Xindi (Cindy) Hu, ScD
Assistant Professor, Department of Environmental and Occupational Health
ScD in Environmental Health, Harvard University
Our Research:
Environmental Data Science, Drinking Water Quality, Health Equity, Climate Change, Geospatial Analysis, Machine Learning
Silas Horn
MPH Candidate
Environmental Health Science and Policy
GW SPH
Sayam Palrecha
MS Candidate
Data Science
GW CCAS
Develop a substantial visualization project!
Explain your problem out loud — as if you’re talking to a rubber duck.
“The practice of designing and creating graphic or visual representations of a large amount of complex quantitative and qualitative data and information with the help of static, dynamic or interactive visual items.”
-from Wikipedia
Made with {ggplot2}
Made with {ggplot2} and publication ready
Made with {gganimate}
Made with Shiny
1400 - 1532 AD, Inca Empire
Quipus (kee-poos) were recording devices for data collection, census records, calendaring…
Source: Smithsonian
1786, William Playfair
Created first bar chart (featuring Scottish trade data, 1780 - 1781), as well as line and pie charts.
Source: Wikipedia
1854, John Snow
Used a dot map and showed the clusters of cholera cases in the London epidemic of 1854
Source: Wikipedia
1869, Charles Minard
Created a flow map showing the number of troops lost during Napoleon’s 1812 Russian campaign.
Edward Tufte called this the greatest visualization created, displaying 6 types of data in 2D (# of troops, distance traveled, temperature, lat/lon, direction of travel, location relative to specific dates)
Source: Wikipedia
1900, William Edward Burghardt Du Bois
Organized an exhibit at the Paris 1900 Exposition, showcasing photographs, charts, and maps that documented the lives of African Americans at the time.
In 2021, people on Twitter recreated his historical data visualizations using modern tools.
Source: Nightingale
Identify the effective types of data visualization for the data at hand and the intended audience
Critique data visualizations and provide constructive feedback
Prepare datasets for developing data visualizations
Create effective, ethical, and aesthetically pleasing visualizations using the R programming language
Collaborate with classmates from diverse disciplinary backgrounds to carry out a visualization project
In-Class Activity:
In small groups of 2, discuss your favorite example of data visualization. Why do you like it? What function does that data visualization serve?
To reveal patterns that are hard to see in raw numbers…
Income Group | Males: Under 65 | Males: 65 or Over | Females: Under 65 | Females: 65 or Over |
---|---|---|---|---|
0–$24,999 | 250 | 200 | 375 | 550 |
$25,000+ | 430 | 300 | 700 | 500 |
Is the effect of age on cholesterol levels the same for all subgroups defined by sex and income?
“Exploratory Data Analysis, or EDA, is a process to use visualization and transformation to explore your data in a systematic way. EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.”
-from R for Data Science
Source: Ed Hawkins, Climate Stripes
Art by Allison Horst
Take a Break
~ This is the end of part 1 ~
{ggplot2}
Art by Allison Horst
“A graphic maps the data to the aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). The plot may also include statistical transformations of the data and information about the plot’s coordinate system. Facetting can be used to plot for different subsets of the data. The combination of these independent components are what make up a graphic.”
First these:
Then these:
The airquality dataset contains daily air quality measurements in New York, May to September 1973. The data frame has 153 observations and 6 variables:
Rows: 153
Columns: 6
$ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
$ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
$ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
$ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
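The summary above has the layout of glimpse() output. A minimal sketch of how it could be reproduced, assuming the {dplyr} package is installed (the slides do not say which function produced the summary, and `n_missing` is a name introduced here for illustration):

```r
# airquality ships with base R (the datasets package), so nothing needs downloading
library(dplyr)  # assumption: glimpse() from {dplyr} produced the summary above

# One line per column: name, type, and the first few values
glimpse(airquality)

# Count missing values per column; ggplot2 drops rows with NA in a mapped
# variable and warns about it when the plot is drawn
n_missing <- colSums(is.na(airquality))
n_missing
```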
Initialize the plot using ggplot(). It is empty because we haven’t told ggplot how to map the data to the plot yet.
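A minimal sketch of this first step, assuming the built-in airquality data frame is the input (the object name `p_empty` is only illustrative):

```r
library(ggplot2)

# Only the data is supplied: no aesthetic mappings and no layers yet
p_empty <- ggplot(data = airquality)

# Printing the object draws a blank panel
p_empty
```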
The mapping argument is used to specify how variables in the data are mapped to aesthetic attributes of the plot. The aes() function is used to define the mapping.
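As a sketch of what that mapping might look like, the call below puts Temp on the x axis and Ozone on the y axis, the same pairing used in the later examples of this lecture (`p_mapped` is an illustrative name):

```r
library(ggplot2)

# aes() pairs columns of the data with aesthetics of the plot
p_mapped <- ggplot(
  data = airquality,
  mapping = aes(x = Temp, y = Ozone)
)

# Axes and grid lines now appear, but no data is drawn because
# no geom has been added yet
p_mapped
```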
geom_point()
Next, we add a geometric object (geom) that represents the data. In this case, we use geom_point() to add points to the plot. There are many more geoms (geom_*()) built into {ggplot2} and extension packages.
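A minimal sketch of adding the point layer, again assuming Temp and Ozone as the x and y variables (`p_points` is an illustrative name):

```r
library(ggplot2)

# Layers are added with `+`; geom_point() draws one point per row
p_points <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point()

# Rows with a missing Ozone value are dropped with a warning
p_points
```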
If we would like to add more information to the plot, we can use the color aesthetic to map another variable to the color of the points. In this case, we will use Month as the color aesthetic.
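One way this could look, as a sketch (`p_color` is an illustrative name). Because Month is stored as an integer, ggplot2 treats it as continuous and draws a gradient color bar:

```r
library(ggplot2)

# Month is an integer column, so the color scale is a continuous gradient
p_color <- ggplot(airquality, aes(x = Temp, y = Ozone, color = Month)) +
  geom_point()

p_color
```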
Instead of treating Month as a continuous variable, maybe we want to treat Month like a categorical variable.
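A sketch of the categorical version, converting Month with as.factor() inside aes(). The labs() call is an addition here to give the legend a readable title, and `p_factor` is an illustrative name:

```r
library(ggplot2)

# as.factor() turns the integer months into discrete levels,
# so each month gets its own distinct color in the legend
p_factor <- ggplot(
  airquality,
  aes(x = Temp, y = Ozone, color = as.factor(Month))
) +
  geom_point() +
  labs(color = "Month")  # replaces the default "as.factor(Month)" legend title

p_factor
```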
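geom_smooth()
We can add a smoothed line to the plot using geom_smooth(). By default the method is chosen automatically: LOESS (locally estimated scatterplot smoothing) for fewer than 1,000 observations and a generalized additive model otherwise. We can also request a specific method, such as linear regression with method = "lm".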
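A minimal sketch of adding a smoother, with the method set explicitly so the choice of fit is visible in the code (`p_loess` and `p_lm` are illustrative names):

```r
library(ggplot2)

base_plot <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point()

# Flexible local fit; the shaded band is the confidence interval
p_loess <- base_plot + geom_smooth(method = "loess")

# Straight-line fit from a linear regression
p_lm <- base_plot + geom_smooth(method = "lm")

p_loess
p_lm
```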
Global mappings are passed down to each subsequent layer
ggplot(airquality, aes(x = Temp, y = Ozone, color = as.factor(Month))) +
  geom_point() +
  geom_smooth(method = "loess")
color = as.factor(Month) is passed to both geom_point() and geom_smooth(), so the points and the line are colored by month.
Local mappings are only used in that layer and don’t affect other layers
ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(aes(color = as.factor(Month))) +
  geom_smooth(method = "loess")
color = as.factor(Month) is only passed to geom_point(), so the points are colored by month, but the line is not colored by month.
facet_wrap()
We can use facet_wrap() to create small multiples of the plot, one for each month. This allows us to see how the relationship between temperature and ozone varies by month.
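A sketch of the faceted plot, assuming the same Temp and Ozone mapping as before (`p_facet` is an illustrative name):

```r
library(ggplot2)

# `~ Month` creates one panel per distinct value of Month;
# all panels share the same x and y scales by default
p_facet <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point() +
  facet_wrap(~ Month)

p_facet
```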
{ggplot2} has a number of built-in themes, which control all non-data display.
Never use the default theme
The default font size in {ggplot2} is almost always too small. This is because the base font size is set to 11 points by default, but the size of the figure is set to 10 inches by 5 inches, so when you insert the figure into a Word document or a PowerPoint slide, the text ends up being too small.
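One possible remedy, sketched under assumptions: switch to a built-in theme, raise base_size, and save at an explicit size. The value 16, the 6 by 4 inch dimensions, and the file name are illustrative choices, not recommendations from the slides:

```r
library(ggplot2)

# theme_minimal() is one of the built-in themes; base_size scales all text
p_big <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point() +
  theme_minimal(base_size = 16)

# Saving at a size close to the final display size keeps text legible
# after the figure is inserted into a document or slide
out_file <- file.path(tempdir(), "ozone_vs_temp.png")  # hypothetical file name
ggsave(out_file, plot = p_big, width = 6, height = 4, units = "in", dpi = 300)
```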
You can explore other pre-built themes in the {ggthemes} package.
Theme economist
Theme 538
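A sketch of how the two themes named above could be applied, assuming the {ggthemes} package is installed (`p_econ` and `p_538` are illustrative names):

```r
library(ggplot2)
library(ggthemes)  # assumption: installed from CRAN

base_plot <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point()

# Styled after The Economist
p_econ <- base_plot + theme_economist()

# Styled after FiveThirtyEight
p_538 <- base_plot + theme_fivethirtyeight()

p_econ
p_538
```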
Make your own ggplot evolution using the {camcorder} package
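A rough sketch of recording a plot's evolution, assuming {camcorder} is installed. The folder name, frame size, and the three steps are illustrative, and the argument values should be checked against the package documentation:

```r
library(ggplot2)
library(camcorder)  # assumption: installed from CRAN

# Hypothetical folder for the recorded frames
frames_dir <- file.path(tempdir(), "ggplot_evolution")
dir.create(frames_dir, showWarnings = FALSE, recursive = TRUE)

# From here on, each printed ggplot is also saved as an image in frames_dir
gg_record(dir = frames_dir, device = "png",
          width = 6, height = 4, units = "in", dpi = 150)

p <- ggplot(airquality, aes(x = Temp, y = Ozone))
print(p)                                              # step 1: empty axes
print(p + geom_point())                               # step 2: points
print(p + geom_point(aes(color = as.factor(Month))))  # step 3: color by month

# The frames can then be stitched into a GIF (this needs the {gifski} package):
# gg_playback(name = file.path(frames_dir, "evolution.gif"))

gg_stop_recording()
```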
Fill out the end-of-class survey
~ This is the end of Lecture 1 ~