Lab 1

Author

Your Name

Published

Invalid Date

Before your start the lab notebook

  • Update the header - put your name in the author argument and put today’s date in the date argument.
  • Click the “Render” button in RStudio and then open the rendered 1-lab1.html page.
  • Then go back and try changing the theme argument in the header to something else - you can see other available themes here. Notice the difference when you render now!

Overview of Lab1

There are two parts of this lab notebook. In first part, we will practice data transformation using the built-in dataset from the {nycflights13} package. In the second part, we will practice data tidying using both a synthetic data and the built-in dataset from the {gapminder} package.

Skills practiced:

  • Use key dplyr verbs to manipulate and summarize data
  • Use pivot_longer() to tidy untidy data
  • Apply these skills in the context of climate and public health data

Load Required Packages

library(tidyverse)
library(nycflights13)
library(gapminder)
data(flights)
?flights
data(gapminder)
?gapminder

Part 1: Data Transformation with dplyr

Example Walkthrough

Let’s explore one example for each type of dplyr operation — row, column, and group — using the flights dataset.

Row Operations

What flights are on January 1st?

# filter(): Flights on January 1st
flights %>% 
  filter(month == 1, day == 1)
#> # A tibble: 842 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 832 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

Which flights have the worst departure delay?

# arrange(): Flights ordered by descending departure delay
flights %>% 
  arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     9      641            900      1301     1242           1530
#>  2  2013     6    15     1432           1935      1137     1607           2120
#>  3  2013     1    10     1121           1635      1126     1239           1810
#>  4  2013     9    20     1139           1845      1014     1457           2210
#>  5  2013     7    22      845           1600      1005     1044           1815
#>  6  2013     4    10     1100           1900       960     1342           2211
#>  7  2013     3    17     2321            810       911      135           1020
#>  8  2013     6    27      959           1900       899     1236           2226
#>  9  2013     7    22     2257            759       898      121           1026
#> 10  2013    12     5      756           1700       896     1058           2020
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

Where do flights between NYC (i.e. JFK, LGA or EWR) take you?

# distinct(): Unique origin-destination pairs
flights %>% 
  select(origin, dest) %>% 
  distinct()
#> # A tibble: 224 × 2
#>    origin dest 
#>    <chr>  <chr>
#>  1 EWR    IAH  
#>  2 LGA    IAH  
#>  3 JFK    MIA  
#>  4 JFK    BQN  
#>  5 LGA    ATL  
#>  6 EWR    ORD  
#>  7 EWR    FLL  
#>  8 LGA    IAD  
#>  9 JFK    MCO  
#> 10 LGA    ORD  
#> # ℹ 214 more rows

Column Operations

Only keep relevant columns.

# select(): Keep only flight number and delays
flights %>% 
  select(flight, arr_delay, dep_delay)
#> # A tibble: 336,776 × 3
#>    flight arr_delay dep_delay
#>     <int>     <dbl>     <dbl>
#>  1   1545        11         2
#>  2   1714        20         4
#>  3   1141        33         2
#>  4    725       -18        -1
#>  5    461       -25        -6
#>  6   1696        12        -4
#>  7    507        19        -5
#>  8   5708       -14        -3
#>  9     79        -8        -3
#> 10    301         8        -2
#> # ℹ 336,766 more rows

Rename arr_delay to arrival_delay.

# rename(): Rename arr_delay to arrival_delay
flights %>% 
  rename(arrival_delay = arr_delay)
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arrival_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

Calculate the total delay as the sum of arrival delay and departure delay.

Common mistake:

# What is wrong with this code?
flights %>% 
  mutate(total_delay = arrival_delay + dep_delay)

Correct solution:

# mutate(): Add a column for total delay
flights %>% 
  mutate(total_delay = arr_delay + dep_delay)
#> # A tibble: 336,776 × 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, total_delay <dbl>

🔗 Group Operations

What is the average delay by carrier?

# group_by() + summarize(): Average delay by carrier
flights %>% 
  group_by(carrier) %>% 
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE))
#> # A tibble: 16 × 2
#>    carrier avg_delay
#>    <chr>       <dbl>
#>  1 9E          7.38 
#>  2 AA          0.364
#>  3 AS         -9.93 
#>  4 B6          9.46 
#>  5 DL          1.64 
#>  6 EV         15.8  
#>  7 F9         21.9  
#>  8 FL         20.1  
#>  9 HA         -6.92 
#> 10 MQ         10.8  
#> 11 OO         11.9  
#> 12 UA          3.56 
#> 13 US          2.13 
#> 14 VX          1.76 
#> 15 WN          9.65 
#> 16 YV         15.6

For each destination, what is the most delayed flight?

# slice_max(): Most delayed flight by destination
flights %>% 
  group_by(dest) %>% 
  slice_max(order_by = arr_delay, n = 1, with_ties = FALSE)
#> # A tibble: 105 × 19
#> # Groups:   dest [105]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     7    22     2145           2007        98      132           2259
#>  2  2013     7    23     1139            800       219     1250            909
#>  3  2013     1    25      123           2000       323      229           2101
#>  4  2013     8    17     1740           1625        75     2042           2003
#>  5  2013     7    22     2257            759       898      121           1026
#>  6  2013     7    10     2056           1505       351     2347           1758
#>  7  2013     8    13     1156            832       204     1417           1029
#>  8  2013     2    21     1728           1316       252     1839           1413
#>  9  2013    12     1     1504           1056       248     1628           1230
#> 10  2013     4    10       25           1900       325      136           2045
#> # ℹ 95 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

Which airline carriers had the highest average arrival delays, and how do they compare overall? What if I want to keep all the rows for further data manipulation?

# ungroup(): Remove grouping structure
flights %>% 
  group_by(carrier) %>% 
  mutate(mean_delay = mean(arr_delay, na.rm = TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(mean_delay))
#> # A tibble: 336,776 × 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      833            835        -2     1134           1102
#>  2  2013     1     1     1716           1730       -14     1947           1953
#>  3  2013     1     2      827            835        -8     1120           1102
#>  4  2013     1     2     1728           1730        -2     1952           1953
#>  5  2013     1     3      835            835         0     1102           1102
#>  6  2013     1     3     1933           1730       123     2131           1953
#>  7  2013     1     4      834            835        -1     1059           1102
#>  8  2013     1     4     1831           1730        61     2029           1953
#>  9  2013     1     5      835            835         0     1057           1102
#> 10  2013     1     5     1726           1730        -4     1948           1953
#> # ℹ 336,766 more rows
#> # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, mean_delay <dbl>

Chaining Multiple Verbs Together

Which airlines had the highest average total delay (arrival delay and departure delay) on January 1st?

# Combine: filter → arrange → mutate → group_by → summarize
flights %>% 
  filter(month == 1, day == 1) %>% 
  arrange(desc(dep_delay)) %>% 
  mutate(total_delay = arr_delay + dep_delay) %>% 
  group_by(carrier) %>% 
  summarize(avg_delay = mean(total_delay, na.rm = TRUE))
#> # A tibble: 14 × 2
#>    carrier avg_delay
#>    <chr>       <dbl>
#>  1 9E         28.6  
#>  2 AA         19.4  
#>  3 AS        -18.5  
#>  4 B6         19.2  
#>  5 DL         -7.64 
#>  6 EV         74.7  
#>  7 F9          5    
#>  8 FL          0.2  
#>  9 HA        -17    
#> 10 MQ         56.2  
#> 11 UA         13.8  
#> 12 US         -0.938
#> 13 VX        -12.9  
#> 14 WN         19.7

Practice: Data Transformation

Use the gapminder dataset to answer the following questions:

How does the average life expectancy differ across continents in 2007?

# Your code here
Your Answer

Which country had the highest life expectancy in each continent in 2007, and how much did it exceed the global benchmark of 60 years?

# Your code here
Your Answer

Which countries experienced the largest gain in life expectancy between 1952 and 2007?

# Your code here
Your Answer

Part 2: Tidying Data with pivot_longer()

Example: Climate Data

We’ll start by tidying a simulated climate dataset.

n_row = 10
climate_raw <- tibble(
  id = 1:n_row,
  date = rep(seq(as.Date("2022-07-01"), by = "1 day", length.out = n_row/2), 2),
  city = rep(c("Phoenix", "Miami"), each = n_row/2),
  zone = rep(c("urban", "coastal"), each = n_row/2),
  temp_morning = sample(75:90, n_row, replace = TRUE),
  temp_afternoon = sample(95:112, n_row, replace = TRUE),
  humid_morning = sample(15:90, n_row, replace = TRUE),
  humid_afternoon = sample(10:80, n_row, replace = TRUE)
)

head(climate_raw)
#> # A tibble: 6 × 8
#>      id date       city    zone    temp_morning temp_afternoon humid_morning
#>   <int> <date>     <chr>   <chr>          <int>          <int>         <int>
#> 1     1 2022-07-01 Phoenix urban             88            109            18
#> 2     2 2022-07-02 Phoenix urban             76            111            43
#> 3     3 2022-07-03 Phoenix urban             77            104            85
#> 4     4 2022-07-04 Phoenix urban             85            108            42
#> 5     5 2022-07-05 Phoenix urban             82             95            20
#> 6     6 2022-07-01 Miami   coastal           88            104            52
#> # ℹ 1 more variable: humid_afternoon <int>

Using pivot_linger(), we can tidy the dataset by transforming the wide format into a long format. This is useful for data analysis and visualization.

climate_tidy <- climate_raw %>% 
  pivot_longer(cols = starts_with(c("temp", "humid")),
               names_to = c("variable", "time"),
               names_sep = "_") %>%
  pivot_wider(names_from = variable, values_from = value)

head(climate_tidy)
#> # A tibble: 6 × 7
#>      id date       city    zone  time       temp humid
#>   <int> <date>     <chr>   <chr> <chr>     <int> <int>
#> 1     1 2022-07-01 Phoenix urban morning      88    18
#> 2     1 2022-07-01 Phoenix urban afternoon   109    17
#> 3     2 2022-07-02 Phoenix urban morning      76    43
#> 4     2 2022-07-02 Phoenix urban afternoon   111    24
#> 5     3 2022-07-03 Phoenix urban morning      77    85
#> 6     3 2022-07-03 Phoenix urban afternoon   104    76

Practice: Tidying Data using gapminder data

Sometimes the tidy data is the wide format. For example, if we want to calculate the difference between the life expectancy in 1952 and 2007, we can create a wide version of the gapminder dataset.

# Your code here
# gap_wide <-

Now imagine for a different task the long format of the data is better for the analytical task. Use pivot_longer() to return this gap_wider to long format. Expected columns are country, continent, year, lifeExp.

# Your code here

Save and Push Your Work

Remember to save your .qmd and render the HTML output before committing to GitHub.

git add 1-lab1.qmd 1-lab1.html
git commit -m "Complete Lab 1"
git push