PUBH 6199: Visualizing Data with R, Summer 2025
2025-05-22
What is a Git repository?
Plenty to do in your browser
But most people work locally and then sync their local changes with the remote repository on GitHub
In-Class Activity:
GitHub and RStudio tutorial
5 minutes to catch up on these if you haven’t done so already!
lab1-<your-github-username>.
author argument and put today’s date in the date argument.
1-lab1.qmd and 1-lab1.html
Image by Yan Min Thwin
Tip
You can do these steps using the Git GUI in RStudio. I am showing you the command-line version so you can learn a different method and choose the one you prefer.
Note
If you cannot find the Terminal tab, you can also open a terminal window by clicking on the Tools menu and selecting Terminal > New Terminal. If that doesn’t work, check whether your RStudio is out of date: click Help > About RStudio to see your current version.
feat/clean-data:
In RStudio, open the 1-lab1.qmd file and make some changes to the text.
For example, you can add a new section called “Data Wrangling” and write a few sentences about what tidy data is.
You can also add a new code chunk to the file and write some R code to load the tidyverse package and read in a CSV file.
After you are satisfied with your changes, save the file and render the 1-lab1.qmd file to generate the HTML file.
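A code chunk of the kind described above might look like the following minimal sketch. The CSV file and its contents are hypothetical: the chunk first writes a tiny file to a temporary location so that it runs on its own. In your lab you would point read_csv() at your own file instead.

```r
library(tidyverse)

# Hypothetical example data: write a tiny CSV to a temporary file
# so that this chunk is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,bp1,bp2", "A,100,120", "B,140,115"), csv_path)

# Read the CSV into a tibble.
# show_col_types = FALSE silences the column-type message.
patients <- read_csv(csv_path, show_col_types = FALSE)
patients
```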
You should see a message that says “On branch feat/clean-data” and “Changes not staged for commit”.
Your files should be listed under Changes to be committed.
Note
Since this is your repository, you probably don’t have anyone to collaborate with (yet). Go ahead and merge your Pull Request now. Later in the semester you may want a teammate to look over your code before it is merged.
Reference: GitHub and RStudio
Take a Break
~ This is the end of part 1 ~
Data wrangling, also known as data cleaning or data preparation, is the process of collecting, cleaning, transforming, and organizing data from one “raw” form into another format to make it more appropriate for analysis.
Source: R for Data Science
dplyr
Commonality:
Individuality:
Rows
filter(): filter rows
arrange(): change the order of rows
distinct(): remove duplicate rows
Columns
select(): select columns
rename(): rename columns
mutate(): add new columns
Groups
group_by(): group rows by one or more columns
summarize(): summarize data by groups
slice_*(): extract specific rows
ungroup(): remove grouping
%>% in {magrittr} or |> in base R
x |> f(y) is equivalent to f(x, y)
x |> f(y) |> g(z) is equivalent to g(f(x, y), z).
Refer to this cheat sheet
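The pipe equivalences above can be checked directly. This is a minimal sketch on a made-up table (the temps data is hypothetical, not from the course materials); it assumes R 4.1 or later, where the native |> pipe is available.

```r
library(dplyr)

# Hypothetical toy data, not from the course materials.
temps <- tibble(
  city = c("Phoenix", "Miami", "Phoenix"),
  temp = c(112, 110, 98)
)

# x |> f(y) is the same call as f(x, y)
piped_once  <- temps |> filter(temp > 100)
nested_once <- filter(temps, temp > 100)
identical(piped_once, nested_once)

# x |> f(y) |> g(z) is the same call as g(f(x, y), z)
piped_twice  <- temps |> filter(temp > 100) |> arrange(temp)
nested_twice <- arrange(filter(temps, temp > 100), temp)
identical(piped_twice, nested_twice)
```

The piped version reads left to right in the order the steps happen, which is the main reason to prefer it over nested calls.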
Practice makes perfect!
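As a warm-up, here is one possible way to try each dplyr verb on a small table. The visits data is made up for illustration, and the column names are assumptions, not part of the lab.

```r
library(dplyr)

# Hypothetical toy data for practice (values are made up).
visits <- tibble(
  patient = c("A", "A", "B", "B", "B"),
  clinic  = c("north", "north", "south", "south", "south"),
  bp      = c(100, 120, 140, 115, 115)
)

# Rows
visits |> filter(bp > 110)      # keep rows where bp exceeds 110
visits |> arrange(desc(bp))     # sort rows from highest to lowest bp
unique_visits <- visits |> distinct()  # drop exact duplicate rows

# Columns
visits |> select(patient, bp)          # keep only two columns
visits |> rename(blood_pressure = bp)  # give bp a longer name
visits |> mutate(high = bp >= 130)     # add a logical column

# Groups
bp_summary <- visits |>
  group_by(patient) |>
  summarize(mean_bp = mean(bp), n = n()) |>
  ungroup()
bp_summary

# Highest reading per patient, keeping a single row when there are ties
visits |>
  group_by(patient) |>
  slice_max(bp, n = 1, with_ties = FALSE) |>
  ungroup()
```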
~ Head over to lab1 notebook! ~
“Happy families are all alike; every unhappy family is unhappy in its own way.”
- Leo Tolstoy, Anna Karenina
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
- Hadley Wickham, Tidy Data
Multiple tables, not machine-readable
Inconsistent columns
Inconsistent rows
Marginal sums and statistics
climate_raw
date | city | zone | temp_morning | temp_afternoon | humid_morning | humid_afternoon |
---|---|---|---|---|---|---|
2022-07-01 | Phoenix | urban | 83 | 112 | 58 | 47 |
2022-07-02 | Phoenix | urban | 78 | 98 | 85 | 80 |
2022-07-03 | Phoenix | urban | 81 | 110 | 77 | 55 |
2022-07-04 | Phoenix | urban | 78 | 100 | 41 | 67 |
2022-07-05 | Phoenix | urban | 78 | 104 | 69 | 78 |
2022-07-01 | Miami | coastal | 81 | 110 | 83 | 26 |
2022-07-02 | Miami | coastal | 89 | 98 | 67 | 24 |
2022-07-03 | Miami | coastal | 82 | 96 | 51 | 38 |
2022-07-04 | Miami | coastal | 78 | 97 | 23 | 25 |
2022-07-05 | Miami | coastal | 85 | 95 | 29 | 31 |
In-Class Activity:
In pairs, discuss the following:
What makes climate_raw untidy?
What a tidy version of climate_raw would look like.
pivot_longer()
Suppose we have three patients with ids A, B, and C. Each patient has two blood pressure measurements: bp1 and bp2. The data is in wide format:
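A wide table like this could be built as follows. The blood pressure values are illustrative assumptions chosen for this sketch; only the column names id, bp1, and bp2 come from the text.

```r
library(tibble)

# Illustrative values (assumed): three patients, two readings each.
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)
df
```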
We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we pivot df longer:
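A minimal sketch of that pivot, with df rebuilt here so the block runs on its own (the values are illustrative assumptions, not taken from the slides):

```r
library(tibble)
library(tidyr)

# Illustrative wide data (assumed values).
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)

# Pivot bp1 and bp2 into name/value pairs.
df_long <- df |>
  pivot_longer(
    cols      = bp1:bp2,        # columns to pivot
    names_to  = "measurement",  # old column names go here
    values_to = "value"         # old cell values go here
  )
df_long
```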
How does pivot_longer() work?
Each id is repeated twice, once for each pivoted column
bp1 and bp2 become values in a new column
The number of values is preserved and unwound row-by-row.
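To make these mechanics concrete, the sketch below rebuilds the long table by hand with rep() and compares it with what pivot_longer() returns. The data values are illustrative assumptions, and the by-hand version is only a teaching device, not how tidyr is implemented.

```r
library(tibble)
library(tidyr)

# Illustrative wide data (assumed values).
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)

by_hand <- tibble(
  # each id is repeated once per pivoted column
  id = rep(df$id, each = 2),
  # the old column names become values, cycling within each id
  measurement = rep(c("bp1", "bp2"), times = nrow(df)),
  # transpose, then flatten, so values are unwound row by row
  value = as.vector(t(as.matrix(df[c("bp1", "bp2")])))
)

with_tidyr <- pivot_longer(
  df,
  cols = bp1:bp2,
  names_to = "measurement",
  values_to = "value"
)

all.equal(as.data.frame(by_hand), as.data.frame(with_tidyr))
```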
pivot_wider()
Suppose we have two patients with ids A and B. We have three blood pressure measurements on patient A and two on patient B. The data is in long format:
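A long table with this shape could be built as follows. The values are illustrative assumptions; only the column names and the three-versus-two split between patients come from the text.

```r
library(tibble)

# Illustrative values (assumed): patient A has three readings, patient B has two.
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1",         100,
  "B", "bp1",         140,
  "B", "bp2",         115,
  "A", "bp2",         120,
  "A", "bp3",         105
)
df
```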
How does pivot_wider() work?
First, figure out what the new column names will be, taken from measurement.
pivot_wider() then combines the columns and rows to generate an empty data frame and fills it in with the entries of value from the input.
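These steps can be walked through by hand with dplyr. This is a sketch of the idea rather than tidyr's actual implementation, and the data values are illustrative assumptions.

```r
library(tibble)
library(dplyr)

# Illustrative long data (assumed values).
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1",         100,
  "B", "bp1",         140,
  "B", "bp2",         115,
  "A", "bp2",         120,
  "A", "bp3",         105
)

# Step 1: the new column names are the distinct values of measurement.
new_cols <- df |> distinct(measurement) |> pull()
new_cols

# Step 2: the rows are the distinct combinations of the remaining columns.
ids <- df |> select(-measurement, -value) |> distinct()

# Step 3: combine rows and columns into an empty frame, ready to be filled.
empty <- ids |> mutate(bp1 = NA, bp2 = NA, bp3 = NA)
empty
```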
# A tibble: 2 × 4
  id    bp1   bp2   bp3
  <chr> <lgl> <lgl> <lgl>
1 A     NA    NA    NA
2 B     NA    NA    NA
pivot_wider() can make missing values.
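A short sketch shows where the missing value comes from. The data values are illustrative assumptions: patient B has no bp3 reading in the long table, so that cell has nothing to fill it after widening.

```r
library(tibble)
library(tidyr)

# Illustrative long data (assumed values); patient B has no bp3 row.
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1",         100,
  "B", "bp1",         140,
  "B", "bp2",         115,
  "A", "bp2",         120,
  "A", "bp3",         105
)

df_wide <- df |>
  pivot_wider(
    names_from  = measurement,  # new column names come from here
    values_from = value         # new cell values come from here
  )
df_wide

# The bp3 cell for patient B is NA because no such row existed.
is.na(df_wide$bp3[df_wide$id == "B"])
```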
Why pivot_wider()? Isn’t tidy data long?
Tidy = Structure
When do we need pivot_wider()?
lm(bp1 ~ bp2) needs one column per variable
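For instance, a model formula refers to columns by name, so bp1 and bp2 must each be a column before lm() can use them. The sketch below uses made-up values and only three patients, so the fitted numbers mean nothing; the point is the shape of the data.

```r
library(tibble)
library(tidyr)

# Illustrative long data (assumed values).
bp_long <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1",         100,
  "A", "bp2",         120,
  "B", "bp1",         140,
  "B", "bp2",         115,
  "C", "bp1",         120,
  "C", "bp2",         125
)

# lm(bp1 ~ bp2) cannot be fit on bp_long: there are no bp1 or bp2 columns yet.
bp_wide <- bp_long |>
  pivot_wider(names_from = measurement, values_from = value)

# Now each variable in the formula is its own column.
fit <- lm(bp1 ~ bp2, data = bp_wide)
coef(fit)
```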
Let’s tidy climate_raw
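One possible approach is sketched below; the lab notebook may take a different route. It assumes the tidy target is one row per date, city, and time of day, with temp and humid as separate columns. Only two rows of climate_raw are typed in to keep the sketch short, and the new column name time is an assumption.

```r
library(tibble)
library(tidyr)

# Two rows copied from the climate_raw table; other rows omitted for brevity.
# Dates are kept as character strings here.
climate_raw <- tribble(
  ~date,        ~city,     ~zone,     ~temp_morning, ~temp_afternoon, ~humid_morning, ~humid_afternoon,
  "2022-07-01", "Phoenix", "urban",   83,            112,             58,             47,
  "2022-07-01", "Miami",   "coastal", 81,            110,             83,             26
)

# Each column name holds two pieces of information: the variable (temp or humid)
# and the time of day. Split on "_"; the special ".value" name tells tidyr to
# turn the first piece into output columns instead of into values.
climate_tidy <- climate_raw |>
  pivot_longer(
    cols      = temp_morning:humid_afternoon,
    names_to  = c(".value", "time"),
    names_sep = "_"
  )
climate_tidy
```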
~ Head over to lab1 notebook! ~
Fill out the end-of-class survey
~ This is the end of Lab 1 ~