dplyr & tidy data

Why tidy data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Introducing the tidyverse:

library(tidyr)   # cleaning
library(dplyr)   # wrangling 
library(ggplot2) # plotting

Tibbles

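an opinionated reimplementation of data.frame
no row names
arbitrary column names allowed (wrap non-syntactic names in backticks)
consistent behavior: `[` always returns a tibble, `$` never partial-matches
compact printing: only the first rows, with column types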

tibble(A=1:1000, `log(TPM+1)` = log1p(1:1000))  # log1p(x) is the natural log of 1 + x

## # A tibble: 1,000 × 2
##       A `log(TPM+1)`
##   <int>        <dbl>
## 1     1        0.693
## 2     2        1.10 
## 3     3        1.39 
## # … with 997 more rows
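
A small sketch of what the tibble differences look like next to a plain data.frame. The objects `df` and `tb` are made up for illustration, and the tibble package (installed alongside dplyr) is assumed to be available.

```r
library(tibble)

df <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column subsetting with `[`: a data.frame drops to a vector,
# a tibble stays a tibble.
class(df[, "abc"])
class(tb[, "abc"])

# Partial matching with `$`: a data.frame silently matches `abc`,
# a tibble returns NULL and warns about the unknown column.
df$ab
suppressWarnings(tb$ab)
```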

Pipes

Use pipes

df |>
  select(A) |>
  filter(A == 42) |>
  mutate(B=A*2)

Don’t read your code backwards

mutate(filter(select(df, A), A == 42), B=A*2)

Don’t waste your time inventing variable names

df_subset <- select(df, A)
df_filtered <- filter(df_subset, A == 42)
final_df <- mutate(df_filtered, B=A*2)
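
The piped and nested styles above compute the same result. A minimal check, assuming a small made-up `df` (the slides never define one) and R 4.1 or later for the native `|>` pipe:

```r
library(dplyr)

# Hypothetical input: column A plus one column that select() will drop
df <- tibble(A = c(1, 42, 42), C = c("x", "y", "z"))

piped <- df |>
  select(A) |>
  filter(A == 42) |>
  mutate(B = A * 2)

nested <- mutate(filter(select(df, A), A == 42), B = A * 2)

# Both versions yield the same two-row tibble
all.equal(piped, nested)
```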

Tidy data

Tidying data: `pivot_longer`

table4a |> 
  pivot_longer(cols=-country, values_to="cases", names_to="year")
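
A sketch of what the call above does to the shape of the data, using `table4a` as shipped with tidyr. The variable name `long` is made up for illustration. Note that the former column names arrive as character strings, so the year is converted explicitly here.

```r
library(tidyr)

# table4a is wide: one row per country, one column per year
dim(table4a)

long <- table4a |>
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

# Former column names are character; convert when a number is needed
long$year <- as.integer(long$year)

# Now one row per country-year combination
dim(long)
```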

Tidying data: `pivot_wider`

table2 |>
  pivot_wider(names_from="type", values_from="count")
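
A sketch of the opposite direction, using `table2` as shipped with tidyr. Each value of `type` becomes its own column, which makes per-row arithmetic straightforward. The names `wide` and `rate` are made up for illustration.

```r
library(tidyr)

# table2 is too long: cases and population share one `count` column
wide <- table2 |>
  pivot_wider(names_from = "type", values_from = "count")

names(wide)

# With one column per variable, a rate is a simple column operation
rate <- wide$cases / wide$population * 10000
```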

Tidying data: `separate`

markers <- tibble(
  id=c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers=c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)

markers |> 
  separate(id, into=c("patient", "cell")) |>
  separate_rows("markers", sep="_")

## # A tibble: 7 × 3
##   patient  cell  markers
##   <chr>    <chr> <chr>  
## 1 patient1 cell1 CD8A   
## 2 patient1 cell2 CD45   
## 3 patient1 cell2 CD8A   
## # … with 4 more rows
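
In tidyr 1.3.0 and later, `separate()` and `separate_rows()` are marked as superseded in favor of `separate_wider_delim()` and `separate_longer_delim()`. A sketch of the same pipeline with the newer functions, assuming such a tidyr version is installed (`res` is a made-up name):

```r
library(tidyr)
library(tibble)

markers <- tibble(
  id = c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers = c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)

res <- markers |>
  # split one column into several columns
  separate_wider_delim(id, delim = "_", names = c("patient", "cell")) |>
  # split one column into several rows
  separate_longer_delim(markers, delim = "_")

res
```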

Select data

select: choose some columns
filter: choose some rows

iris2 <- iris |> 
  select(sepal_length=Sepal.Length, sepal_width=Sepal.Width, Species) |>
  filter(Species %in% c("setosa", "virginica"))

Modify data: `mutate`

iris2 |> 
  mutate(sepal_length_in = sepal_length / 2.54)  # iris is measured in cm; 1 in = 2.54 cm

## # A tibble: 100 × 4
##   sepal_length sepal_width Species sepal_length_in
##          <dbl>       <dbl> <fct>             <dbl>
## 1          5.1         3.5 setosa             2.01
## 2          4.9         3   setosa             1.93
## 3          4.7         3.2 setosa             1.85
## # … with 97 more rows

Modify data: `group_by` + `summarise`

iris2 |> 
  group_by(Species) |> 
  summarise(
    max_length = max(sepal_length),
    mean_length = mean(sepal_length),
    n_samples = n()
  )

## # A tibble: 2 × 4
##   Species   max_length mean_length n_samples
##   <fct>          <dbl>       <dbl>     <int>
## 1 setosa           5.8        5.01        50
## 2 virginica        7.9        6.59        50
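
When only group sizes are needed, `count()` is a shorthand for `group_by()` followed by `summarise(n = n())`. A small sketch that rebuilds `iris2` so it runs on its own (`by_species` and `counted` are made-up names):

```r
library(dplyr)

iris2 <- iris |>
  select(sepal_length = Sepal.Length, sepal_width = Sepal.Width, Species) |>
  filter(Species %in% c("setosa", "virginica"))

# Explicit version: one row per group
by_species <- iris2 |>
  group_by(Species) |>
  summarise(n_samples = n())

# Shorthand version: same counts
counted <- iris2 |>
  count(Species, name = "n_samples")
```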

Combining data

bind_rows: stack data frames on top of each other; columns are matched by name
bind_cols: put data frames side by side; rows are matched by position, so both need the same number of rows
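
A minimal sketch of both functions on made-up tibbles (`a`, `b` and `extra` are hypothetical):

```r
library(dplyr)

a <- tibble(id = 1:2, value = c("x", "y"))
b <- tibble(value = "z", id = 3L)        # same columns, different order
extra <- tibble(flag = c(TRUE, FALSE))   # same number of rows as `a`

# bind_rows() matches columns by name, so column order does not matter
stacked <- bind_rows(a, b)

# bind_cols() matches rows by position; it does not look at any key
side_by_side <- bind_cols(a, extra)
```

Because `bind_cols()` relies on row order alone, a join is usually the safer choice when the two tables share a key.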

Inner join

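inner_join(x, y)                         # join on all columns the tables share
inner_join(x, y, by="key")               # join on one named column
inner_join(x, y, by=c("key_x"="key_y"))  # key column is named differently in x and y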
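
A sketch of an inner join on two made-up tables (`x`, `y`, `y2` and the `val_*` columns are hypothetical). Only keys present in both tables survive.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Keys 3 and 4 appear in only one table and are dropped
joined <- inner_join(x, y, by = "key")

# Same join when the key column has a different name in the second table
y2 <- rename(y, key_y = key)
joined2 <- inner_join(x, y2, by = c("key" = "key_y"))
```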

More joins

left_join, right_join
full_join

See understanding joins for nice illustrations.
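
The join types differ in which unmatched rows they keep. A sketch on two made-up tables (`x` and `y` are hypothetical), where missing values are filled with `NA`:

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left  <- left_join(x, y, by = "key")    # all rows of x: keys 1, 2, 3
right <- right_join(x, y, by = "key")   # all rows of y: keys 1, 2, 4
full  <- full_join(x, y, by = "key")    # all rows of both: keys 1, 2, 3, 4

# Key 3 has no partner in y, so val_y is NA in the left join
left$val_y
```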
