Lab 1. Data Wrangling

PUBH 6199: Visualizing Data with R, Summer 2025

Xindi (Cindy) Hu, ScD

2025-05-22

Outline for today

Introduction to GitHub and Git
Data transformation
Data tidying

About GitHub

GitHub is a cloud-based platform for version control and collaboration.
Storing your code in a “repository” on GitHub allows you to:
- Track changes to your code over time
- Collaborate with others on projects, including your “future self”
- Share your work with the world
All of this is made possible by Git, an open-source version control system.

About Git

Git is a version control system that allows you to track changes to files.
A typical Git-based workflow includes:
- Clone a repository from GitHub to your local machine
- Branch off the main copy of the files that you are working on
- Edit files independently and safely on your own branch
- Let Git keep track of the changes you and others make
- Let Git intelligently merge your changes back into the main copy of the files

How do Git and GitHub work together?

What is a Git repository?
- A collection of files and their history. It can be local (on your computer) or remote (on GitHub)
- When you commit changes to the files, Git records them in the repository’s history
Plenty to do in your browser
- Create a Git repository, create branches, upload and edit files
However, most people work locally and regularly sync their local changes with the remote repository on GitHub
- Use Git commands in the terminal or GitHub Desktop
- Pull the latest changes from the remote repository
- Push back your own changes to the same remote repository

In-Class Activity:

GitHub and RStudio tutorial

Prerequisites

You have a GitHub account
You have downloaded and installed Git
You have downloaded and installed RStudio

5 minutes to catch up on these if you haven’t done so already!

Create the remote repository on GitHub

Check your email and accept the invitation to join the GitHub Classroom
Accept the assignment titled lab 1, using the link in your email or this one: https://classroom.github.com/a/XXXX
Navigate to GitHub. Under the class organization, you should see a repository named lab1-<your-github-username>.

Clone the repository with RStudio

On GitHub, navigate to the Code tab of the repository
Click the green <> Code button
Click the Copy to clipboard button to copy the repository URL
Open RStudio on your local environment
Click File, New Project, Version Control, Git
Paste the URL you copied from GitHub into the Repository URL field and press Tab to move to the Project directory name field
Click Create Project

Edit the lab notebook in RStudio

In RStudio, click File, Open File, and select 1-lab1.qmd
Update the header: put your name in the author field and today’s date in the date field.
Save the file, and click Render to generate the HTML file.

Commit and push the changes to GitHub

In RStudio, click the Git tab in the upper right pane
Click Commit
In the Commit window, check the box next to each file you want to commit (1-lab1.qmd and 1-lab1.html)
Enter a commit message in the Commit message field (e.g., “Update lab notebook header”)
Click the Commit button
Click the Pull button to fetch any remote changes
Click the Push button to push your changes to GitHub
Navigate to your GitHub repository in your browser and check that the changes have been pushed successfully

Congratulations!

Introducing GitHub Flow

Image by Yan Min Thwin

Create local branches with Git

Tip

You can do all of this using the Git GUI in RStudio. I am showing you the command-line version so you can learn a different method and choose what you prefer.

In RStudio click the Terminal tab in the lower left pane, next to the Console tab

Note

If you cannot find the Terminal tab, you can also open a terminal window by clicking on the Tools menu and selecting Terminal > New Terminal. If that doesn’t work, check if your RStudio is out of date. Click Help, About RStudio to check the current version.

In the terminal, type the following command to create a new branch called feat/clean-data:

git checkout -b feat/clean-data

Type the following command to check that you are on the new branch:

git status

You should see a message that says “On branch feat/clean-data” and “nothing to commit, working tree clean”.

You are ready to start making changes to your files!

Make local changes with Git

In RStudio, open the 1-lab1.qmd file and make some changes to the text.

For example, you can add a new section called “Data Wrangling” and write a few sentences about what tidy data is.

You can also add a new code chunk to the file and write some R code to load the tidyverse package and read in a CSV file.

library(tidyverse)
raw_data <- read_csv("raw_data.csv")

After you are satisfied with your changes, save the file and render 1-lab1.qmd to generate the HTML file.

Commit local changes with Git

Determine your file’s status.

git status

You should see a message that says “On branch feat/clean-data” and “Changes not staged for commit”.

Add the changes to the staging area.

git add .

See your file’s current status.

git status

Your files should be listed under Changes to be committed.

Commit the changes with a message. Replace <COMMIT-MESSAGE> with a log message describing the changes.

git commit -m "<COMMIT-MESSAGE>"

Open a pull request on GitHub

Push the changes to the remote repository. Replace <BRANCH-NAME> with the name of your branch, in this case feat/clean-data.

git push origin <BRANCH-NAME>

Navigate to your GitHub repository in your browser
Click the Compare & pull request button. If you don’t see it, navigate to the “Pull requests” tab and click the New pull request button.
On the “Open a pull request” page, enter a title and description for your pull request. You can also add a reviewer, such as a teammate.

Merge your pull request on GitHub

Note

Since this is your repository, you probably don’t have anyone to collaborate with (yet). Go ahead and merge your pull request now. Later in the semester, you may want your teammate to look over your code before it is merged.

On GitHub, navigate to the Pull Request that you just opened.
Scroll down and click the big green Merge pull request button.
Click Confirm merge.
Delete the branch.

Reference: GitHub and RStudio

Take a Break

~ This is the end of part 1 ~

05:00

“80% of data scientists’ time is spent on data wrangling”

Data wrangling, also known as data cleaning or data preparation, is the process of collecting, cleaning, transforming, and organizing data from its “raw” form into a format better suited to analysis.

Source: R for Data Science

Manipulate data in R using `dplyr`

Commonality:

The first argument is always a data frame
The subsequent arguments typically describe which columns to operate on, using column names without quotes
The output is always a new data frame
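
A minimal sketch of these shared traits, using one verb. The `patients` data frame and its columns are made up for illustration and are not part of the lab data.

```r
library(dplyr)

# Hypothetical example data, for illustration only
patients <- tibble::tibble(
  id  = c("A", "B", "C"),
  age = c(34, 61, 47)
)

# First argument: the data frame
# Next argument: a condition written with an unquoted column name
older <- filter(patients, age > 40)

# The result is a new data frame; `patients` itself is unchanged
nrow(older)
nrow(patients)
```

The same pattern (data frame first, unquoted columns next, new data frame out) applies to the other verbs listed under Individuality.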

Individuality:

Rows

filter(): filter rows
arrange(): change the order of rows
distinct(): remove duplicate rows

Columns

select(): select columns
rename(): rename columns
mutate(): add new columns

Groups

group_by(): group rows by one or more columns
summarize(): summarize data by groups
slice_*(): extract specific rows
ungroup(): remove grouping
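
To see one verb from each family in action, here is a rough sketch on a tiny table. The `visits` data frame, its column names, and the cutoffs are all hypothetical, chosen only to keep the output easy to check by eye.

```r
library(dplyr)

# Hypothetical visit records, for illustration only
visits <- tibble::tribble(
  ~id, ~clinic, ~sbp,
  "A", "north",  100,
  "A", "north",  100,
  "B", "north",  140,
  "C", "south",  120,
  "D", "south",  125
)

# Rows: drop the duplicated row, keep sbp of at least 120, sort high to low
rows_out <- visits |>
  distinct() |>
  filter(sbp >= 120) |>
  arrange(desc(sbp))

# Columns: rename a column, add a derived one, keep a subset
cols_out <- visits |>
  rename(systolic = sbp) |>
  mutate(high = systolic >= 130) |>
  select(id, systolic, high)

# Groups: one summary row per clinic; ungroup() makes sure no grouping remains
groups_out <- visits |>
  distinct() |>
  group_by(clinic) |>
  summarize(mean_sbp = mean(sbp), n = n()) |>
  ungroup()

# Groups: the row with the highest sbp within each clinic
top_per_clinic <- visits |>
  group_by(clinic) |>
  slice_max(sbp, n = 1) |>
  ungroup()
```

Print each result in the console to compare it against `visits`.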

A word on pipe

%>% in {magrittr} or |> in base R

Pipe is a tool to combine multiple verbs.
It takes the thing on the left and passes it to the function on the right.
x |> f(y) is equivalent to f(x, y)
x |> f(y) |> g(z) is equivalent to g(f(x, y), z).
Pronounced “then”
Add pipe to your code using keyboard shortcut Ctrl/Cmd + Shift + M

library(dplyr)
library(nycflights13)  # provides the flights data frame

flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
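
The equivalence between piped and nested calls can be checked directly. This sketch uses two throwaway helper functions, `add()` and `times()`, which are hypothetical and defined here only for the demonstration. It assumes R 4.1 or later, where the native pipe `|>` is available.

```r
# Two small helper functions, hypothetical and for illustration only
add <- function(x, y) x + y
times <- function(x, z) x * z

# Nested call: read from the inside out
nested <- times(add(1, 2), 10)

# Piped call: read left to right, "1, then add 2, then times 10"
piped <- 1 |> add(2) |> times(10)

identical(nested, piped)
```

Both forms compute the same value; the pipe only changes the order in which you read the steps.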

Lots of verbs to remember!

Refer to this cheat sheet

Practice makes perfect!

~ Head over to lab1 notebook! ~

Introduction to tidy data

“Happy families are all alike; every unhappy family is unhappy in its own way.”

- Leo Tolstoy, Anna Karenina

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

- Hadley Wickham, Tidy Data

What is tidy data?

By Julia Lowndes and Allison Horst

What is an example of untidy data?

Source: National Center for Ecological Analysis & Synthesis

Multiple tables, not machine-readable

Inconsistent columns

Inconsistent rows

Marginal sums and statistics

A single untidy table, `climate_raw`

date	city	zone	temp_morning	temp_afternoon	humid_morning	humid_afternoon
2022-07-01	Phoenix	urban	83	112	58	47
2022-07-02	Phoenix	urban	78	98	85	80
2022-07-03	Phoenix	urban	81	110	77	55
2022-07-04	Phoenix	urban	78	100	41	67
2022-07-05	Phoenix	urban	78	104	69	78
2022-07-01	Miami	coastal	81	110	83	26
2022-07-02	Miami	coastal	89	98	67	24
2022-07-03	Miami	coastal	82	96	51	38
2022-07-04	Miami	coastal	78	97	23	25
2022-07-05	Miami	coastal	85	95	29	31
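
To experiment with this table outside the lab notebook, it can be typed in by hand. The sketch below copies the values shown above. The column types (dates stored as Date, measurements as numbers) are an assumption, and the lab notebook may load the data differently.

```r
# Hand-typed copy of the table above; column types are an assumption
climate_raw <- tibble::tribble(
  ~date,        ~city,     ~zone,     ~temp_morning, ~temp_afternoon, ~humid_morning, ~humid_afternoon,
  "2022-07-01", "Phoenix", "urban",   83, 112, 58, 47,
  "2022-07-02", "Phoenix", "urban",   78,  98, 85, 80,
  "2022-07-03", "Phoenix", "urban",   81, 110, 77, 55,
  "2022-07-04", "Phoenix", "urban",   78, 100, 41, 67,
  "2022-07-05", "Phoenix", "urban",   78, 104, 69, 78,
  "2022-07-01", "Miami",   "coastal", 81, 110, 83, 26,
  "2022-07-02", "Miami",   "coastal", 89,  98, 67, 24,
  "2022-07-03", "Miami",   "coastal", 82,  96, 51, 38,
  "2022-07-04", "Miami",   "coastal", 78,  97, 23, 25,
  "2022-07-05", "Miami",   "coastal", 85,  95, 29, 31
)

# Store the dates as Date values rather than text
climate_raw$date <- as.Date(climate_raw$date)

dim(climate_raw)
```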

In-Class Activity:

In pairs, discuss the following:

What makes climate_raw untidy?
Sketch out on paper what a tidy version of climate_raw would look like.

05:00

Why do untidy data exist and what to do about it?

Data is collected in a way that is convenient for the collector, not the analyst
Most people who aren’t data professionals are unfamiliar with the principles of tidy data

To tidy data:
- Begin by figuring out what the variables and observations are
- Talk to the data curator if needed
- Pivot your data into a tidy form

`pivot_longer()`

Suppose we have three patients with ids A, B, and C. Each patient has two blood pressure measurements: bp1 and bp2. The data is in wide format:

df <- tibble::tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we pivot df longer:

df |> 
  tidyr::pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )

# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

How does `pivot_longer()` work?

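Each id is repeated twice, once for each pivoted column

The column names bp1 and bp2 become values in the new measurement column

The number of values is preserved; they are unwound row by row into the new value column.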
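
These three steps can be reproduced by hand and compared with the real function. This is only an illustration of the idea in base R, not how tidyr implements `pivot_longer()` internally. The names `pivoted_cols`, `manual_long`, and `auto_long` are made up for this sketch.

```r
# Same wide data as in the pivot_longer() example
df <- tibble::tribble(
  ~id,  ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)

pivoted_cols <- c("bp1", "bp2")

manual_long <- tibble::tibble(
  # Step 1: each id is repeated once per pivoted column
  id = rep(df$id, each = length(pivoted_cols)),
  # Step 2: the column names become values, cycling within each id
  measurement = rep(pivoted_cols, times = nrow(df)),
  # Step 3: the cell values are read off row by row
  value = as.vector(t(df[pivoted_cols]))
)

auto_long <- tidyr::pivot_longer(
  df,
  cols = bp1:bp2,
  names_to = "measurement",
  values_to = "value"
)

# Compare the hand-built columns with the pivot_longer() result
identical(manual_long$id, auto_long$id)
identical(manual_long$measurement, auto_long$measurement)
identical(manual_long$value, auto_long$value)
```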

`pivot_wider()`

Suppose we have two patients with ids A and B. We have three blood pressure measurements on patient A and two on patient B. The data is in long format:

df <- tibble::tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We’ll take the values from the value column and the names from the measurement column:

df |> 
  tidyr::pivot_wider(
    names_from = measurement,
    values_from = value
  )

# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

pivot_wider() can create missing values: patient B has no bp3 measurement, so that cell is NA.

How does `pivot_wider()` work?

First, figure out what will be the new column names, taken from measurement.

library(tidyverse)
df |> 
  distinct(measurement) |> 
  pull()

[1] "bp1" "bp2" "bp3"

Then, figure out what will be the rows in the output, determined by all the variables that aren’t going into the new names or values. Can be one or many.

df |> 
  select(-measurement, -value) |> 
  distinct()

# A tibble: 2 × 1
  id   
  <chr>
1 A    
2 B

pivot_wider() then combines the columns and rows to generate an empty data frame, and fills it in with the values from the input.

df |> 
  select(-measurement, -value) |> 
  distinct() |> 
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)

# A tibble: 2 × 4
  id    bp1   bp2   bp3  
  <chr> <lgl> <lgl> <lgl>
1 A     NA    NA    NA   
2 B     NA    NA    NA

Why do we need `pivot_wider()`?

Isn’t tidy data long?

Yes — tidy data often means long format, especially for:
- plotting
- filtering
- grouping
But tidy ≠ always long!

Tidy = Structure

Each variable in a column, each observation in a row
Sometimes wide format is tidy — it depends on context.

When do we need pivot_wider()?

✅ For modeling:
- lm(bp1 ~ bp2) needs one column per variable
✅ For presentation:
- Easier to read tables with 1 row per subject
✅ For joining:
- Merge with spatial data or metadata
✅ To undo a pivot_longer()
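
Two of these uses can be sketched together: undoing a `pivot_longer()` and then fitting a model on the wide result. The data reuse the three-patient blood pressure example. The object names `wide`, `long`, `round_trip`, and `fit` are made up for this sketch, and a model on three rows is shown only for its syntax, not as a meaningful analysis.

```r
library(tidyr)

wide <- tibble::tribble(
  ~id,  ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)

# Wide to long
long <- wide |>
  pivot_longer(cols = bp1:bp2, names_to = "measurement", values_to = "value")

# Long back to wide: pivot_wider() reverses the step above
round_trip <- long |>
  pivot_wider(names_from = measurement, values_from = value)

# With one column per measurement, a model formula can refer to both
fit <- lm(bp1 ~ bp2, data = round_trip)
```

Printing `round_trip` next to `wide` shows the same columns and values.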

Let’s tidy climate_raw

~ Head over to lab1 notebook! ~

End-of-Class Survey

Fill out the end-of-class survey

~ This is the end of Lab 1 ~

10:00
