PUBH 6199: Visualizing Data with R, Summer 2025
2025-05-27
Helps us understand trends and patterns
Makes data more accessible to different audiences
Useful in decision-making and communication
Count all the 5s in the following image
Raise your hand when you see the red dot.
“A visual variable, in data visualization, is an aspect of a graphical object that can visually differentiate it from other objects, and can be controlled during the design process.”
- Jacques Bertin, 1967, Sémiologie Graphique
In-Class Activity:
Create at least three sketches to visualize these two quantities
42, 23
Which of Bertin’s visual variables did you use in your sketches?
05:00
https://rockcontent.com/blog/45-ways-to-communicate-two-quantities/
🎯 Detection: Recognizing that a geometric object encodes a physical value.
🧩 Assembly: Grouping detected graphical elements into patterns.
📏 Estimation: Visually assessing the relative magnitude of two or more values.
Three levels of estimation
| Level | Example |
|---|---|
| 1. Discrimination | X = Y, X ≠ Y |
| 2. Ranking | X < Y, X > Y |
| 3. Ratioing | X / Y = ? |
📏 We want to get as far down this list as possible with efficiency and accuracy
Source: Yau, N. (2013). Data Points: Visualization That Means Something. Wiley.
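The three levels can be made concrete with the two quantities from the sketching activity, 42 and 23. This is only an illustrative sketch; the object name `estimation` is made up for this example.

```r
# Two quantities from the in-class activity
x <- 42
y <- 23

estimation <- list(
  discrimination = x != y,  # Level 1: are the values different?
  ranking        = x > y,   # Level 2: which one is larger?
  ratio          = x / y    # Level 3: how many times larger?
)

str(estimation)
```

A chart that only supports level 1 lets the reader see that the two values differ; a chart that supports level 3 lets the reader see that one is a bit less than twice the other.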
library(tidyverse)
library(kableExtra)
coffee_ratings <- readr::read_csv("data/coffee_ratings.csv")
glimpse(coffee_ratings)
Rows: 1,337
Columns: 43
$ total_cup_points <dbl> 90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75,…
$ species <chr> "Arabica", "Arabica", "Arabica", "Arabica", "Ara…
$ owner <chr> "metad plc", "metad plc", "grounds for health ad…
$ country_of_origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia",…
$ farm_name <chr> "metad plc", "metad plc", "san marcos barrancas …
$ lot_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mill <chr> "metad plc", "metad plc", NA, "wolensu", "metad …
$ ico_number <chr> "2014/2015", "2014/2015", NA, NA, "2014/2015", N…
$ company <chr> "metad agricultural developmet plc", "metad agri…
$ altitude <chr> "1950-2200", "1950-2200", "1600 - 1800 m", "1800…
$ region <chr> "guji-hambela", "guji-hambela", NA, "oromia", "g…
$ producer <chr> "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabes…
$ number_of_bags <dbl> 300, 300, 5, 320, 300, 100, 100, 300, 300, 50, 3…
$ bag_weight <chr> "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg"…
$ in_country_partner <chr> "METAD Agricultural Development plc", "METAD Agr…
$ harvest_year <chr> "2014", "2014", NA, "2014", "2014", "2013", "201…
$ grading_date <chr> "April 4th, 2015", "April 4th, 2015", "May 31st,…
$ owner_1 <chr> "metad plc", "metad plc", "Grounds for Health Ad…
$ variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other"…
$ processing_method <chr> "Washed / Wet", "Washed / Wet", NA, "Natural / D…
$ aroma <dbl> 8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25, …
$ flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, …
$ aftertaste <dbl> 8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50, …
$ acidity <dbl> 8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42, …
$ body <dbl> 8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33, …
$ balance <dbl> 8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50, …
$ uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
$ clean_cup <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
$ sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00,…
$ cupper_points <dbl> 8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00, …
$ moisture <dbl> 0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03, …
$ category_one_defects <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ quakers <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ color <chr> "Green", "Green", NA, "Green", "Green", "Bluish-…
$ category_two_defects <dbl> 0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0, …
$ expiration <chr> "April 3rd, 2016", "April 3rd, 2016", "May 31st,…
$ certification_body <chr> "METAD Agricultural Development plc", "METAD Agr…
$ certification_address <chr> "309fcf77415a3661ae83e027f7e5f05dad786e44", "309…
$ certification_contact <chr> "19fef5a731de2db57d16da10287413f5f99bc2dd", "19f…
$ unit_of_measurement <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m"…
$ altitude_low_meters <dbl> 1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA, …
$ altitude_high_meters <dbl> 2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA, …
$ altitude_mean_meters <dbl> 2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA, …
For each of the 18 most frequent countries of origin, calculate the average total cup points and the number of coffees rated (n). Lump all remaining countries into an Other category.
# Keep the 18 most frequent countries; lump the rest into "Other"
country_summary <- coffee_ratings %>%
  mutate(country = fct_lump(country_of_origin, 18)) %>%
  group_by(country) %>%
  # Mean rating and number of rated coffees per country
  summarize(mean_rating = mean(total_cup_points, na.rm = TRUE),
            n = n()) %>%
  arrange(desc(mean_rating))
head(country_summary, 19)
# A tibble: 19 × 3
country mean_rating n
<fct> <dbl> <int>
1 Ethiopia 85.5 44
2 Kenya 84.3 25
3 Uganda 83.5 36
4 Colombia 83.1 183
5 El Salvador 83.1 21
6 China 82.9 16
7 Costa Rica 82.8 51
8 Thailand 82.6 32
9 Indonesia 82.6 20
10 Brazil 82.4 132
11 Tanzania, United Republic Of 82.4 40
12 Taiwan 82.0 75
13 Guatemala 81.8 181
14 United States (Hawaii) 81.8 73
15 Other 81.7 80
16 India 81.1 14
17 Mexico 80.9 236
18 Honduras 80.9 52
19 Nicaragua 80.5 26
Easy: which has higher ratings, Kenya or Indonesia?
Hard: which has higher ratings, Indonesia or Costa Rica?
Observation: alphabetical ordering of a categorical variable is almost never useful; re-rank as needed.
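A minimal sketch of re-ranking, assuming ggplot2 and forcats are installed. It retypes five rows of the summary table above instead of reading the CSV, and orders the countries by mean rating with `fct_reorder()`. The object names `ratings` and `p_ranked` are illustrative.

```r
library(ggplot2)
library(forcats)

# Five rows copied from the country summary above (alphabetical on purpose)
ratings <- data.frame(
  country     = c("Costa Rica", "Ethiopia", "Indonesia", "Kenya", "Mexico"),
  mean_rating = c(82.8, 85.5, 82.6, 84.3, 80.9)
)

# Reorder the factor levels by mean rating instead of alphabetically
ratings$country <- fct_reorder(ratings$country, ratings$mean_rating)

# Dot plot: position on a common scale, countries sorted by value
p_ranked <- ggplot(ratings, aes(x = mean_rating, y = country)) +
  geom_point(size = 3) +
  labs(x = "Mean total cup points", y = NULL)

levels(ratings$country)
```

With the levels sorted by value, the "hard" comparison between Indonesia and Costa Rica becomes a comparison of two neighbouring points.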
No legend?
No problem.
Because color saturation has a natural ordering.
The ratio between Mexico and United States is…
2 or 3
Moving down to the third level of estimation
For categorical data, no more than 6 colors is best.
(Source: European Environment Agency)
Wait, I thought there was some difference…
Combined MMR vaccination rate, 1994/95 to 2014/15, England
Take another look: the axis doesn’t start at zero.
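One way to see how much a truncated axis changes the impression is to draw the same series twice. The sketch below uses made-up percentages (not the actual MMR figures, which are not reproduced in these notes) and `expand_limits()` to force the y-axis to include zero.

```r
library(ggplot2)

# Hypothetical values for illustration only; NOT the real MMR data
rates <- data.frame(
  year = 2010:2014,
  rate = c(89, 91, 92, 93, 92)
)

# Default axis: ggplot2 zooms in on the range of the data
p_truncated <- ggplot(rates, aes(x = year, y = rate)) +
  geom_line() +
  geom_point()

# Same plot with the y-axis extended down to zero
p_zero <- p_truncated + expand_limits(y = 0)
```

Printing `p_truncated` and `p_zero` side by side shows how the same few percentage points can look like either a steep climb or a nearly flat line.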
Re-ranking categorical variables still matters!
Which category has higher count: SI1-Premium or VS2-Premium?
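This comparison can be tried with the `diamonds` data that ships with ggplot2 (an assumption based on the category names in the question). In stacked bars the two segments do not share a baseline, so their lengths are hard to compare; dodged bars put every segment on a common baseline.

```r
library(ggplot2)

# Stacked bars: clarity segments within a cut start at different heights
p_stacked <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar()

# Dodged bars: every clarity level starts at zero, so lengths are comparable
p_dodged <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")

# The counts behind the bars for the two categories in question
premium <- subset(diamonds, cut == "Premium" & clarity %in% c("SI1", "VS2"))
premium_counts <- table(droplevels(premium$clarity))
premium_counts
```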
Angle is #4 on the accuracy list; we can do better.
Don’t do this!
Do this instead!
| Label | Value |
|---|---|
| A | 25 |
| B | 60 |
| C | 15 |
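As one possible alternative to encoding these values as angles, the three values in the table can be drawn as bars, which are read by length on a common scale. This is a sketch; the object names `shares` and `p_bar` are illustrative.

```r
library(ggplot2)

# Values from the table above
shares <- data.frame(
  label = c("A", "B", "C"),
  value = c(25, 60, 15)
)

# Order bars from largest to smallest so the ranking is immediate
shares$label <- reorder(shares$label, -shares$value)

p_bar <- ggplot(shares, aes(x = label, y = value)) +
  geom_col() +
  labs(x = NULL, y = "Value")

levels(shares$label)
```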
Don’t do this!
Or this!
What is the relationship between Ozone concentrations and temperature?
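A scatterplot addresses this kind of question directly. The sketch below assumes the question refers to R's built-in `airquality` dataset, which has `Ozone` and `Temp` columns; the smoother and the correlation are additions for illustration.

```r
library(ggplot2)

# Drop days with missing ozone readings
aq <- airquality[!is.na(airquality$Ozone), ]

# Position along two common scales: temperature on x, ozone on y
p_ozone <- ggplot(aq, aes(x = Temp, y = Ozone)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Temperature (°F)", y = "Ozone (ppb)")

# Pearson correlation as a one-number summary of the linear association
ozone_temp_cor <- cor(aq$Ozone, aq$Temp)
ozone_temp_cor
```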
Which country has higher population growth: Nigeria or India?
Most countries’ population growth is slowing down, which wasn’t obvious in the previous graph.
“Gestalt (German for form, shape, or configuration). Gestalt psychology proposes that the human brain perceives objects as part of a greater whole rather than as isolated elements.”
Reification
Emergence
“The law of Prägnanz, also known as the law of good Gestalt. People tend to experience things as regular, orderly, symmetrical, and simple.”
Law of Continuity
Law of Similarity
Law of Closure
Law of Proximity
This hurts our brain.
This is much easier.
Take a Break
~ This is the end of part 1 ~
05:00
Graphical excellence is the well-designed presentation of interesting data - a matter of substance, of statistics, and of design.
Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.
Graphical excellence is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
Graphical excellence is nearly always multivariate.
Graphical excellence requires telling the truth about the data.
\[ \text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} \]
Can you calculate the lie factor in this graph?
Source: the Guardian, 2008
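The values behind the Guardian graphic are not reproduced in these notes, so the sketch below applies the formula to Tufte's well-known fuel-economy example instead: a standard that rises from 18 to 27.5 miles per gallon, drawn as lines 0.6 and 5.3 inches long. The function names are hypothetical helpers, and "size of effect" is taken as the relative change between the first and second value.

```r
# Relative change between two values: (second - first) / first
effect_size <- function(first, second) {
  (second - first) / first
}

# Lie factor = effect shown in the graphic / effect in the data
lie_factor <- function(graphic_first, graphic_second, data_first, data_second) {
  effect_size(graphic_first, graphic_second) /
    effect_size(data_first, data_second)
}

# Tufte's fuel-economy example: 18 -> 27.5 mpg drawn as 0.6 -> 5.3 inch lines
lf <- lie_factor(0.6, 5.3, 18, 27.5)
round(lf, 1)  # 14.8
```

A lie factor of 1 means the graphic shows the effect at its true size; values far above or below 1 indicate distortion.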
\[ \begin{aligned} \text{Data-Ink Ratio} &= \frac{\text{Data ink}}{\text{Total ink used in graphic}} \\ &= \text{proportion of a graphic's ink devoted to the} \\ &\quad \text{non-redundant display of data-information} \\ &= 1 - \frac{\text{Redundant ink}}{\text{Total ink used in graphic}} \end{aligned} \]
Office of Management and Budget
Social Indicators, 1973
\[ \text{data density of a graphic} = \frac{\text{number of entries in data matrix}}{\text{area of data graphic}} \]
\[ \begin{aligned} \text{data density} &= \frac{\text{4 entries (2 data points, 2 values each)}}{\text{graph covers 26.5 square inches}} \\ &\approx 0.15 \text{ numbers per square inch} \end{aligned} \]
Jacques Bertin, Sémiologie Graphique, 1973
\[ \text{data density of a graphic} = \frac{\text{number of entries in data matrix}}{\text{area of data graphic}} \]
\[ \begin{aligned} \text{data density} &= \frac{\text{240,000 data points}}{\text{graph covers 27 square inches}} \\ &\approx 9{,}000 \text{ numbers per square inch} \end{aligned} \]
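The arithmetic is a single division. A small sketch, with `data_density()` as a hypothetical helper name and the figures taken from the dense example (240,000 entries on 27 square inches):

```r
# Data density = number of entries in the data matrix / area of the graphic
data_density <- function(n_entries, area_sq_in) {
  n_entries / area_sq_in
}

dense <- data_density(240000, 27)
round(dense)  # 8889, i.e. roughly 9,000 numbers per square inch
```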
Graphics can be shrunk way down
Default size
Appropriate size
“Small multiples resemble the frames of a movie: a series of graphics, showing the same combination of variables, indexed by changes in another variable.”
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
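In ggplot2, small multiples are usually built with faceting. The sketch below is one possible illustration, using the built-in `airquality` data (an assumption, chosen because it needs no external file): the same ozone-versus-temperature scatterplot is repeated once per month.

```r
library(ggplot2)

# Drop days with missing ozone readings
aq <- airquality[!is.na(airquality$Ozone), ]

# Month is stored as 5..9; convert to an ordered set of month names
aq$month_name <- factor(month.name[aq$Month], levels = month.name[5:9])

# Same variables and same scales in every panel, indexed by month
p_multiples <- ggplot(aq, aes(x = Temp, y = Ozone)) +
  geom_point() +
  facet_wrap(~ month_name, nrow = 1) +
  labs(x = "Temperature (°F)", y = "Ozone (ppb)")
```

Because `facet_wrap()` keeps the axes fixed across panels by default, the reader learns the scales once and can then compare panels directly.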
Audience may vary by:
Most useful for analytical or technical audience, e.g. scientists, engineers, and data analysts. Less useful for the general public or media campaigns.
In-Class Activity:
Choose one of the three visualizations and answer:
What message is this chart trying to convey?
How do the visuals help (or hurt) comprehension?
If you removed the embellishments, what would be lost or gained?
05:00
Color blindness affects approximately 1 in 12 men and 1 in 200 women. To ensure your visualizations remain accessible:
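Use colorblind-friendly palettes such as viridis, Okabe-Ito, or Color Universal Design (CUD)
Check your plots with a color-vision-deficiency simulator such as colorblindr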
Designing with color blindness in mind improves clarity for everyone.
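A minimal sketch of two of these palettes in ggplot2, assuming R 4.0.0 or later (where `palette.colors()` provides the Okabe-Ito colors) and using the built-in `iris` data purely as a stand-in dataset.

```r
library(ggplot2)

# Okabe-Ito palette from base R; unname() so colors are matched by position
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))

p_base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# Okabe-Ito: skip the first color (black) and use the next three hues
p_okabe <- p_base + scale_colour_manual(values = okabe_ito[2:4])

# viridis: discrete version built into ggplot2
p_viridis <- p_base + scale_colour_viridis_d()

# If the colorblindr package is installed, colorblindr::cvd_grid(p_okabe)
# shows the plot under several simulated color-vision deficiencies.
```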
Fill out the end-of-class survey
~ This is the end of Lecture 2 ~
10:00