library(tidyverse)
library(nycflights13)
library(gapminder)
data(flights)
?flightsdata(gapminder)
?gapminder
Lab 1
Before your start the lab notebook
- Update the header - put your name in the
author
argument and put today’s date in thedate
argument. - Click the “Render” button in RStudio and then open the rendered
1-lab1.html
page. - Then go back and try changing the
theme
argument in the header to something else - you can see other available themes here. Notice the difference when you render now!
Overview of Lab1
There are two parts of this lab notebook. In first part, we will practice data transformation using the built-in dataset from the {nycflights13}
package. In the second part, we will practice data tidying using both a synthetic data and the built-in dataset from the {gapminder}
package.
Skills practiced:
- Use key
dplyr
verbs to manipulate and summarize data - Use
pivot_longer()
to tidy untidy data - Apply these skills in the context of climate and public health data
Load Required Packages
Part 1: Data Transformation with dplyr
Example Walkthrough
Let’s explore one example for each type of dplyr
operation — row, column, and group — using the flights
dataset.
Row Operations
What flights are on January 1st?
# filter(): Flights on January 1st
%>%
flights filter(month == 1, day == 1)
#> # A tibble: 842 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 832 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Which flights have the worst departure delay?
# arrange(): Flights ordered by descending departure delay
%>%
flights arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 9 641 900 1301 1242 1530
#> 2 2013 6 15 1432 1935 1137 1607 2120
#> 3 2013 1 10 1121 1635 1126 1239 1810
#> 4 2013 9 20 1139 1845 1014 1457 2210
#> 5 2013 7 22 845 1600 1005 1044 1815
#> 6 2013 4 10 1100 1900 960 1342 2211
#> 7 2013 3 17 2321 810 911 135 1020
#> 8 2013 6 27 959 1900 899 1236 2226
#> 9 2013 7 22 2257 759 898 121 1026
#> 10 2013 12 5 756 1700 896 1058 2020
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Where do flights between NYC (i.e. JFK, LGA or EWR) take you?
# distinct(): Unique origin-destination pairs
%>%
flights select(origin, dest) %>%
distinct()
#> # A tibble: 224 × 2
#> origin dest
#> <chr> <chr>
#> 1 EWR IAH
#> 2 LGA IAH
#> 3 JFK MIA
#> 4 JFK BQN
#> 5 LGA ATL
#> 6 EWR ORD
#> 7 EWR FLL
#> 8 LGA IAD
#> 9 JFK MCO
#> 10 LGA ORD
#> # ℹ 214 more rows
Column Operations
Only keep relevant columns.
# select(): Keep only flight number and delays
%>%
flights select(flight, arr_delay, dep_delay)
#> # A tibble: 336,776 × 3
#> flight arr_delay dep_delay
#> <int> <dbl> <dbl>
#> 1 1545 11 2
#> 2 1714 20 4
#> 3 1141 33 2
#> 4 725 -18 -1
#> 5 461 -25 -6
#> 6 1696 12 -4
#> 7 507 19 -5
#> 8 5708 -14 -3
#> 9 79 -8 -3
#> 10 301 8 -2
#> # ℹ 336,766 more rows
Rename arr_delay
to arrival_delay
.
# rename(): Rename arr_delay to arrival_delay
%>%
flights rename(arrival_delay = arr_delay)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arrival_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Calculate the total delay as the sum of arrival delay and departure delay.
Common mistake:
# What is wrong with this code?
%>%
flights mutate(total_delay = arrival_delay + dep_delay)
Correct solution:
# mutate(): Add a column for total delay
%>%
flights mutate(total_delay = arr_delay + dep_delay)
#> # A tibble: 336,776 × 20
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>, total_delay <dbl>
🔗 Group Operations
What is the average delay by carrier?
# group_by() + summarize(): Average delay by carrier
%>%
flights group_by(carrier) %>%
summarize(avg_delay = mean(arr_delay, na.rm = TRUE))
#> # A tibble: 16 × 2
#> carrier avg_delay
#> <chr> <dbl>
#> 1 9E 7.38
#> 2 AA 0.364
#> 3 AS -9.93
#> 4 B6 9.46
#> 5 DL 1.64
#> 6 EV 15.8
#> 7 F9 21.9
#> 8 FL 20.1
#> 9 HA -6.92
#> 10 MQ 10.8
#> 11 OO 11.9
#> 12 UA 3.56
#> 13 US 2.13
#> 14 VX 1.76
#> 15 WN 9.65
#> 16 YV 15.6
For each destination, what is the most delayed flight?
# slice_max(): Most delayed flight by destination
%>%
flights group_by(dest) %>%
slice_max(order_by = arr_delay, n = 1, with_ties = FALSE)
#> # A tibble: 105 × 19
#> # Groups: dest [105]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 7 22 2145 2007 98 132 2259
#> 2 2013 7 23 1139 800 219 1250 909
#> 3 2013 1 25 123 2000 323 229 2101
#> 4 2013 8 17 1740 1625 75 2042 2003
#> 5 2013 7 22 2257 759 898 121 1026
#> 6 2013 7 10 2056 1505 351 2347 1758
#> 7 2013 8 13 1156 832 204 1417 1029
#> 8 2013 2 21 1728 1316 252 1839 1413
#> 9 2013 12 1 1504 1056 248 1628 1230
#> 10 2013 4 10 25 1900 325 136 2045
#> # ℹ 95 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
Which airline carriers had the highest average arrival delays, and how do they compare overall? What if I want to keep all the rows for further data manipulation?
# ungroup(): Remove grouping structure
%>%
flights group_by(carrier) %>%
mutate(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
ungroup() %>%
arrange(desc(mean_delay))
#> # A tibble: 336,776 × 20
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 833 835 -2 1134 1102
#> 2 2013 1 1 1716 1730 -14 1947 1953
#> 3 2013 1 2 827 835 -8 1120 1102
#> 4 2013 1 2 1728 1730 -2 1952 1953
#> 5 2013 1 3 835 835 0 1102 1102
#> 6 2013 1 3 1933 1730 123 2131 1953
#> 7 2013 1 4 834 835 -1 1059 1102
#> 8 2013 1 4 1831 1730 61 2029 1953
#> 9 2013 1 5 835 835 0 1057 1102
#> 10 2013 1 5 1726 1730 -4 1948 1953
#> # ℹ 336,766 more rows
#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>, mean_delay <dbl>
Chaining Multiple Verbs Together
Which airlines had the highest average total delay (arrival delay and departure delay) on January 1st?
# Combine: filter → arrange → mutate → group_by → summarize
%>%
flights filter(month == 1, day == 1) %>%
arrange(desc(dep_delay)) %>%
mutate(total_delay = arr_delay + dep_delay) %>%
group_by(carrier) %>%
summarize(avg_delay = mean(total_delay, na.rm = TRUE))
#> # A tibble: 14 × 2
#> carrier avg_delay
#> <chr> <dbl>
#> 1 9E 28.6
#> 2 AA 19.4
#> 3 AS -18.5
#> 4 B6 19.2
#> 5 DL -7.64
#> 6 EV 74.7
#> 7 F9 5
#> 8 FL 0.2
#> 9 HA -17
#> 10 MQ 56.2
#> 11 UA 13.8
#> 12 US -0.938
#> 13 VX -12.9
#> 14 WN 19.7
Practice: Data Transformation
Use the gapminder
dataset to answer the following questions:
How does the average life expectancy differ across continents in 2007?
# Your code here
Which country had the highest life expectancy in each continent in 2007, and how much did it exceed the global benchmark of 60 years?
# Your code here
Which countries experienced the largest gain in life expectancy between 1952 and 2007?
# Your code here
Part 2: Tidying Data with pivot_longer()
Example: Climate Data
We’ll start by tidying a simulated climate dataset.
= 10
n_row <- tibble(
climate_raw id = 1:n_row,
date = rep(seq(as.Date("2022-07-01"), by = "1 day", length.out = n_row/2), 2),
city = rep(c("Phoenix", "Miami"), each = n_row/2),
zone = rep(c("urban", "coastal"), each = n_row/2),
temp_morning = sample(75:90, n_row, replace = TRUE),
temp_afternoon = sample(95:112, n_row, replace = TRUE),
humid_morning = sample(15:90, n_row, replace = TRUE),
humid_afternoon = sample(10:80, n_row, replace = TRUE)
)
head(climate_raw)
#> # A tibble: 6 × 8
#> id date city zone temp_morning temp_afternoon humid_morning
#> <int> <date> <chr> <chr> <int> <int> <int>
#> 1 1 2022-07-01 Phoenix urban 88 109 18
#> 2 2 2022-07-02 Phoenix urban 76 111 43
#> 3 3 2022-07-03 Phoenix urban 77 104 85
#> 4 4 2022-07-04 Phoenix urban 85 108 42
#> 5 5 2022-07-05 Phoenix urban 82 95 20
#> 6 6 2022-07-01 Miami coastal 88 104 52
#> # ℹ 1 more variable: humid_afternoon <int>
Using pivot_linger()
, we can tidy the dataset by transforming the wide format into a long format. This is useful for data analysis and visualization.
<- climate_raw %>%
climate_tidy pivot_longer(cols = starts_with(c("temp", "humid")),
names_to = c("variable", "time"),
names_sep = "_") %>%
pivot_wider(names_from = variable, values_from = value)
head(climate_tidy)
#> # A tibble: 6 × 7
#> id date city zone time temp humid
#> <int> <date> <chr> <chr> <chr> <int> <int>
#> 1 1 2022-07-01 Phoenix urban morning 88 18
#> 2 1 2022-07-01 Phoenix urban afternoon 109 17
#> 3 2 2022-07-02 Phoenix urban morning 76 43
#> 4 2 2022-07-02 Phoenix urban afternoon 111 24
#> 5 3 2022-07-03 Phoenix urban morning 77 85
#> 6 3 2022-07-03 Phoenix urban afternoon 104 76
Practice: Tidying Data using gapminder
data
Sometimes the tidy data is the wide format. For example, if we want to calculate the difference between the life expectancy in 1952 and 2007, we can create a wide version of the gapminder
dataset.
# Your code here
# gap_wide <-
Now imagine for a different task the long format of the data is better for the analytical task. Use pivot_longer()
to return this gap_wider
to long format. Expected columns are country
, continent
, year
, lifeExp
.
# Your code here
Save and Push Your Work
Remember to save your .qmd
and render the HTML output before committing to GitHub.
git add 1-lab1.qmd 1-lab1.html
git commit -m "Complete Lab 1"
git push