Lab 1. Data Wrangling

PUBH 6199: Visualizing Data with R, Summer 2025

Xindi (Cindy) Hu, ScD

2025-05-22

Outline for today

  • Introduction to GitHub and Git
  • Data transformation
  • Data tidying

About GitHub

  • GitHub is a cloud-based platform for version control and collaboration.
  • Storing your code in a “repository” on GitHub allows you to:
    • Track changes to your code over time
    • Collaborate with others on projects, including your “future self”
    • Share your work with the world
  • Made possible by the open-source software, Git

About Git

  • Git is a version control system that allows you to track changes to files.
  • A typical Git-based workflow includes:
    • Clone a repository from GitHub to your local machine
    • Branch off the main copy of the files that you are working on
    • Edit files independently and safely on your own branch
    • Let Git keep track of the changes you and others make
    • Let Git intelligently merge your changes back into the main copy of the files

How do Git and GitHub work together?

  • What is a Git repository?

    • A collection of files and their history; it can be local (on your computer) or remote (on GitHub)
    • When you commit changes to the files, Git records them in that history
  • Plenty to do in your browser

    • Create a Git repository, create branches, upload and edit files
  • But most people work locally and regularly sync their local changes with the remote repository on GitHub

    • Use Git commands in the terminal or GitHub Desktop
    • Pull the latest changes from the remote repository
    • Push back your own changes to the same remote repository

In-Class Activity:

GitHub and RStudio tutorial

Prerequisites

5 minutes to catch up on these if you haven’t done so already!


Create the remote repository on GitHub

  • Check your email and accept the invitation to join the GitHub Classroom
  • Accept the assignment titled lab 1, either from the email or using this link: https://classroom.github.com/a/XXXX
  • Navigate to GitHub. Under the class organization, you should see a repository named lab1-<your-github-username>.

Clone the repository with RStudio

  1. On GitHub, navigate to the Code tab of the repository
  2. Click the green <> Code button
  3. Click the Copy to clipboard button to copy the repository URL
  4. Open RStudio on your local environment
  5. Click File, New Project, Version Control, Git
  6. Paste the URL you copied from GitHub into the Repository URL field, then press Tab to move to the Project directory name field
  7. Click Create Project

Edit the lab notebook in RStudio

  1. In RStudio, click File, Open File, and select 1-lab1.qmd
  2. Update the header - put your name in the author argument and put today’s date in the date argument.
  3. Save the file, and click Render to generate the HTML file.

Commit and push the changes to GitHub

  1. In RStudio, click the Git tab in the upper right pane
  2. Click Commit
  3. In the Commit window, check the box next to each file you want to commit (1-lab1.qmd and 1-lab1.html)
  4. Enter a commit message in the Commit message field (e.g., “Update lab notebook header”)
  5. Click the Commit button
  6. Click the Pull button to fetch any remote changes
  7. Click the Push button to push your changes to GitHub
  8. Navigate to your GitHub repository in your browser and check that the changes have been pushed successfully

Congratulations!

Introducing GitHub Flow

Image by Yan Min Thwin

Create local branches with Git

Tip

You can do these steps using the Git GUI in RStudio. I am showing you the command-line version so you can learn a different method and choose what you prefer.

  1. In RStudio click the Terminal tab in the lower left pane, next to the Console tab

Note

If you cannot find the Terminal tab, you can also open a terminal window by clicking on the Tools menu and selecting Terminal > New Terminal. If that doesn’t work, check if your RStudio is out of date. Click Help, About RStudio to check the current version.

Create local branches with Git

  1. In the terminal, type the following command to create a new branch called feat/clean-data:
git checkout -b feat/clean-data
  2. Type the following command to check that you are on the new branch:
git status

You should see a message that says “On branch feat/clean-data” and “nothing to commit, working tree clean”.

You are ready to start making changes to your files!

Make local changes with Git

In RStudio, open the 1-lab1.qmd file and make some changes to the text.

For example, you can add a new section called “Data Wrangling” and write a few sentences about what tidy data is about.

You can also add a new code chunk to the file and write some R code to load the tidyverse package and read in a CSV file.

library(tidyverse)
raw_data <- read_csv("raw_data.csv")

After you are satisfied with your changes, save the file and render the 1-lab1.qmd file to generate the HTML file.

Commit local changes with Git

  1. Determine your file’s status.
git status

You should see a message that says “On branch feat/clean-data” and “Changes not staged for commit”.

  2. Add the changes to the staging area.
git add .
  3. See your file’s current status.
git status

Your files should be listed under Changes to be committed.

  4. Commit the changes with a message. Replace <COMMIT-MESSAGE> with a log message describing the changes.
git commit -m "<COMMIT-MESSAGE>"

Open a pull request on GitHub

  1. Push the changes to the remote repository. Replace <BRANCH-NAME> with the name of your branch, in this case feat/clean-data
git push origin <BRANCH-NAME>
  2. Navigate to your GitHub repository in your browser
  3. Click the Compare & pull request button. If you don’t see it, navigate to the “Pull requests” tab and click the New pull request button.
  4. On the “Open a pull request” page, enter a title and description for your pull request. You can also add a reviewer, for example your teammate.

Merge your pull request on GitHub

Note

Since this is your repository, you probably don’t have anyone to collaborate with (yet). Go ahead and merge your Pull Request now. Later in the semester you may want your teammate to look over your code before they merge.

  1. On GitHub, navigate to the Pull Request that you just opened.
  2. Scroll down and click the big green Merge Pull Request button.
  3. Click Confirm Merge.
  4. Delete the branch.




Reference: GitHub and RStudio




Take a Break

~ This is the end of part 1 ~

05:00

“80% of data scientists’ time is spent on data wrangling”

Data wrangling, also known as data cleaning or data preparation, is the process of collecting, cleaning, transforming, and organizing data from its “raw” form into a format better suited to analysis.

Source: R for Data Science

Manipulate data in R using dplyr

Commonality:

  • The first argument is always a data frame
  • The subsequent arguments typically describe which columns to operate on, using the column names without quotes
  • The output is always a new data frame

Individuality:

Rows

  • filter(): filter rows
  • arrange(): change the order of rows
  • distinct(): remove duplicate rows

Columns

  • select(): select columns
  • rename(): rename columns
  • mutate(): add new columns

Groups

  • group_by(): group rows by one or more columns
  • summarize(): summarize data by groups
  • slice_*(): extract specific rows
  • ungroup(): remove grouping
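
As a rough illustration of how the row, column, and group verbs listed above fit together, here is a minimal sketch. The visits data frame and its columns (patient, clinic, sbp) are hypothetical and invented only for this example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
visits <- tibble::tribble(
  ~patient, ~clinic, ~sbp,
  "A",      "north",  100,
  "B",      "north",  140,
  "C",      "south",  120,
  "C",      "south",  120,   # exact duplicate of the previous row
  "D",      "south",  130
)

clinic_summary <- visits |>
  distinct() |>                                 # rows: drop the duplicate
  filter(sbp >= 110) |>                         # rows: keep readings of 110 or more
  mutate(high = sbp >= 130) |>                  # columns: add a logical flag
  group_by(clinic) |>                           # groups: one group per clinic
  summarize(n = n(), mean_sbp = mean(sbp)) |>   # one output row per group
  arrange(desc(mean_sbp))                       # rows: highest mean first

clinic_summary
```

Each verb takes a data frame as its first argument and returns a new one, which is what lets the steps chain together.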

A word on pipe

%>% in {magrittr} or |> in base R

  • Pipe is a tool to combine multiple verbs.
  • It takes the thing on the left and passes it to the function on the right.
  • x |> f(y) is equivalent to f(x, y)
  • x |> f(y) |> g(z) is equivalent to g(f(x, y), z).
  • Pronounced “then”
  • Add pipe to your code using keyboard shortcut Ctrl/Cmd + Shift + M
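library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

# Mean arrival delay per day for flights into Houston (IAH)
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )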
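
The equivalences in the bullets above can be checked directly. This is a small sketch using only base R functions on a made-up numeric vector; the native pipe |> requires R 4.1 or later.

```r
# A made-up vector, used only to compare piped and nested calls
x <- c(4, 9, 16)

# x |> f() |> g() is the same as g(f(x))
piped  <- x |> sqrt() |> sum()
nested <- sum(sqrt(x))

# x |> f(y) is the same as f(x, y): the left-hand side becomes the first argument
rounded_piped  <- 3.14159 |> round(2)
rounded_nested <- round(3.14159, 2)

identical(piped, nested)
identical(rounded_piped, rounded_nested)
```

Both comparisons should come back TRUE; the piped form simply reads left to right instead of inside out.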

Lots of verbs to remember!

Refer to this cheat sheet




Practice makes perfect!

~ Head over to lab1 notebook! ~

Introduction to tidy data

“Happy families are all alike; every unhappy family is unhappy in its own way.”

- Leo Tolstoy, Anna Karenina

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

- Hadley Wickham, Tidy Data

What is tidy data?

By Julia Lowndes and Allison Horst

What is an example of untidy data?

Source: National Center for Ecological Analysis & Synthesis

Multiple tables, not machine-readable

Inconsistent columns

Inconsistent rows

Marginal sums and statistics

A single untidy table, climate_raw

date city zone temp_morning temp_afternoon humid_morning humid_afternoon
2022-07-01 Phoenix urban 83 112 58 47
2022-07-02 Phoenix urban 78 98 85 80
2022-07-03 Phoenix urban 81 110 77 55
2022-07-04 Phoenix urban 78 100 41 67
2022-07-05 Phoenix urban 78 104 69 78
2022-07-01 Miami coastal 81 110 83 26
2022-07-02 Miami coastal 89 98 67 24
2022-07-03 Miami coastal 82 96 51 38
2022-07-04 Miami coastal 78 97 23 25
2022-07-05 Miami coastal 85 95 29 31



In-Class Activity:

In pairs, discuss the following:

  1. What makes climate_raw untidy?
  2. Sketch out on paper what a tidy version of climate_raw would look like.
05:00

Why do untidy data exist and what to do about it?

  • Data is collected in a way that is convenient for the collector, not the analyst
  • Most people aren’t familiar with the principles of tidy data unless they are data professionals
  • To tidy data:
    • Begin by figuring out what the variables and observations are
    • Talk to the data curator if needed
    • Pivot your data into a tidy form

pivot_longer()

Suppose we have three patients with ids A, B, and C. Each patient has two blood pressure measurements: bp1 and bp2. The data is in wide format:

df <- tibble::tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we pivot df longer:

df |> 
  tidyr::pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

How does pivot_longer() work?

Repeat id twice

bp1 and bp2 become values in a new column




The number of values is preserved and unwound row-by-row.
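
To make the three steps above concrete, the sketch below rebuilds the long table by hand and compares it with the pivot_longer() result. It is only a conceptual reconstruction, not how tidyr is implemented; the names pivoted, by_hand, and built_in are introduced here for illustration.

```r
library(tidyr)

df <- tibble::tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

pivoted <- c("bp1", "bp2")  # the columns being pivoted

by_hand <- tibble::tibble(
  # each id is repeated once per pivoted column (twice here)
  id          = rep(df$id, each = length(pivoted)),
  # the column names become values in a new column
  measurement = rep(pivoted, times = nrow(df)),
  # the cell values are unwound row by row
  value       = as.vector(t(as.matrix(df[pivoted])))
)

built_in <- df |>
  pivot_longer(cols = bp1:bp2, names_to = "measurement", values_to = "value")

all.equal(as.data.frame(by_hand), as.data.frame(built_in))
```

Three rows times two pivoted columns gives six rows, so no values are lost or created.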

pivot_wider()

Suppose we have two patients with ids A and B. We have three blood measurements on patient A and two on patient B. The data is in long format:

df <- tibble::tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We’ll take the values from the value column and the names from the measurement column:

df |> 
  tidyr::pivot_wider(
    names_from = measurement,
    values_from = value
  )
# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

pivot_wider() can introduce missing values: patient B has no bp3 measurement, so that cell becomes NA.

How does pivot_wider() work?

First, figure out what will be the new column names, taken from measurement.

library(tidyverse)
df |> 
  distinct(measurement) |> 
  pull()
[1] "bp1" "bp2" "bp3"

Then, figure out what will be the rows in the output, determined by all the variables that aren’t going into the new names or values. There can be one or many such variables.

df |> 
  select(-measurement, -value) |> 
  distinct()
# A tibble: 2 × 1
  id   
  <chr>
1 A    
2 B    



pivot_wider() then combines these columns and rows into an empty data frame and fills it with the values from the input.

df |> 
  select(-measurement, -value) |> 
  distinct() |> 
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
# A tibble: 2 × 4
  id    bp1   bp2   bp3  
  <chr> <lgl> <lgl> <lgl>
1 A     NA    NA    NA   
2 B     NA    NA    NA   

pivot_wider() can make missing values.
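
The final fill step can be sketched by hand as well. The loop below is a conceptual illustration under the same toy data, not tidyr's actual implementation; filled, target_row, and target_col are names made up for this example. The empty columns use NA_real_ so they are numeric like the input values.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

# Empty frame: one row per id, one numeric NA column per new name
filled <- df |>
  select(-measurement, -value) |>
  distinct() |>
  mutate(bp1 = NA_real_, bp2 = NA_real_, bp3 = NA_real_)

# Drop each input value into the cell given by its id and measurement
for (i in seq_len(nrow(df))) {
  target_row <- filled$id == df$id[i]
  target_col <- df$measurement[i]
  filled[[target_col]][target_row] <- df$value[i]
}

built_in <- df |>
  pivot_wider(names_from = measurement, values_from = value)

all.equal(as.data.frame(filled), as.data.frame(built_in))
```

Cells that never receive a value, such as bp3 for patient B, simply stay NA.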

Why do we need pivot_wider()?

Isn’t tidy data long?

  • Yes — tidy data often means long format, especially for:
    • plotting
    • filtering
    • grouping
  • But tidy ≠ always long!

Tidy = Structure

  • Each variable in a column, each observation in a row
  • Sometimes wide format is tidy — it depends on context.

When do we need pivot_wider()?

  • ✅ For modeling:
    • lm(bp1 ~ bp2) needs one column per variable
  • ✅ For presentation:
    • Easier to read tables with 1 row per subject
  • ✅ For joining:
    • Merge with spatial data or metadata
  • ✅ To undo a pivot_longer()
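
As a small sketch of the last point, the two pivots can be chained into a round trip using the patient data from the pivot_wider() slide. The object names long, wide, and back are chosen here for illustration; values_drop_na = TRUE drops the NA that widening introduced for patient B.

```r
library(tidyr)

long <- tibble::tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

# Long to wide: one row per patient, one column per measurement
wide <- long |>
  pivot_wider(names_from = measurement, values_from = value)

# Wide back to long, dropping the cell that was filled with NA
back <- wide |>
  pivot_longer(
    cols = bp1:bp3,
    names_to = "measurement",
    values_to = "value",
    values_drop_na = TRUE
  )

back
```

The result holds the same five measurements as the input, although the rows come back ordered by patient rather than in their original order.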




Let’s tidy climate_raw

~ Head over to lab1 notebook! ~
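
One possible approach, offered as a sketch rather than the lab's official solution: each column name in climate_raw mixes a variable (temp or humid) with a time of day, so the names can be split on the underscore while pivoting. The example uses only the first two dates per city from the table above, and the column name time_of_day and the object climate_tidy are assumptions made for illustration.

```r
library(tidyr)

# First two dates per city, copied from the climate_raw table
climate_raw <- tibble::tribble(
  ~date,        ~city,     ~zone,     ~temp_morning, ~temp_afternoon, ~humid_morning, ~humid_afternoon,
  "2022-07-01", "Phoenix", "urban",   83,            112,             58,             47,
  "2022-07-02", "Phoenix", "urban",   78,             98,             85,             80,
  "2022-07-01", "Miami",   "coastal", 81,            110,             83,             26,
  "2022-07-02", "Miami",   "coastal", 89,             98,             67,             24
)

climate_tidy <- climate_raw |>
  pivot_longer(
    cols = temp_morning:humid_afternoon,
    # ".value" keeps the first name piece (temp, humid) as output columns;
    # the second piece (morning, afternoon) goes into time_of_day
    names_to = c(".value", "time_of_day"),
    names_sep = "_"
  )

climate_tidy
```

In this shape each row is one city on one date at one time of day, with temp and humid as separate variables.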

End-of-Class Survey




Fill out the end-of-class survey

~ This is the end of Lab 1 ~

10:00