Why tidy data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham


Introducing the tidyverse:

library(tidyr)   # cleaning
library(dplyr)   # wrangling 
library(ggplot2) # plotting

Tibbles

  • opinionated improvement of data.frame
  • does not have rownames
  • can have arbitrary column names
  • consistent behavior
  • pretty printing
tibble(A=1:1000, `log(TPM+1)` = log1p(1:1000))  # log1p(x) is the natural log of 1 + x
## # A tibble: 1,000 × 2
##       A `log(TPM+1)`
##   <int>        <dbl>
## 1     1        0.693
## 2     2        1.10 
## 3     3          1.39 
## # … with 997 more rows
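
The bullets on column names and consistent behavior can be checked directly. The following is a minimal sketch; the column name `gene name` and its values are made up for illustration.

```r
library(tibble)

# Base data.frame() rewrites non-syntactic names unless check.names = FALSE.
df_base <- data.frame(`gene name` = c("CD4", "CD8A"))
names(df_base)  # "gene.name"

# tibble() keeps the name exactly as written.
tb <- tibble(`gene name` = c("CD4", "CD8A"))
names(tb)       # "gene name"

# Single-column `[` subsetting: a data.frame drops to a vector,
# a tibble always stays a tibble.
is.data.frame(df_base[, 1])  # FALSE
is.data.frame(tb[, 1])       # TRUE
```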

Pipes

Use pipes

df |>
  select(A) |>
  filter(A == 42) |>
  mutate(B=A*2)

Don’t read your code backwards

mutate(filter(select(df, A), A == 42), B=A*2)

Don’t waste your time inventing variable names

df_subset <- select(df, A)
df_filtered <- filter(df_subset, A == 42)
final_df <- mutate(df_filtered, B=A*2)
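
To confirm that the piped and nested forms compute the same thing, here is a small runnable sketch. The slides leave `df` undefined, so the input below is a hypothetical stand-in. Note that the native pipe `|>` requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical input; any table with a column A would do.
df <- tibble(A = 40:44, C = letters[1:5])

# Piped form: read top to bottom.
piped <- df |>
  select(A) |>
  filter(A == 42) |>
  mutate(B = A * 2)

# Nested form: same calls, read inside out.
nested <- mutate(filter(select(df, A), A == 42), B = A * 2)

# Both forms should give the same one-row result.
all.equal(piped, nested)
```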

Tidy data

Tidying data: pivot_longer

table4a |> 
  pivot_longer(cols=-country, values_to="cases", names_to="year")
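
One detail worth knowing: the new `year` column is built from column names, so it comes out as character. A possible refinement, sketched here with the `names_transform` argument of `pivot_longer`, converts it to integer during the pivot.

```r
library(tidyr)

# table4a ships with tidyr: one row per country, one column per year.
long <- table4a |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "cases",
    names_transform = list(year = as.integer)  # "1999" -> 1999L
  )

long
```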

Tidying data: pivot_wider

table2 |>
  pivot_wider(names_from="type", values_from="count")
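
`pivot_wider` is the inverse of `pivot_longer`. The round trip below illustrates this on a small made-up table; the sample names, genes, and counts are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical long-format counts: one row per (sample, gene) pair.
long <- tibble(
  sample = c("s1", "s1", "s2", "s2"),
  gene   = c("CD4", "CD8A", "CD4", "CD8A"),
  count  = c(10, 0, 3, 7)
)

# Long -> wide: one column per gene.
wide <- long |>
  pivot_wider(names_from = gene, values_from = count)

# Wide -> long again: recovers the original layout.
back <- wide |>
  pivot_longer(cols = -sample, names_to = "gene", values_to = "count")
```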

Tidying data: separate

markers <- tibble(
  id=c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers=c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)
markers |> 
  separate(id, into=c("patient", "cell")) |>
  separate_rows("markers", sep="_")
## # A tibble: 7 × 3
##   patient  cell  markers
##   <chr>    <chr> <chr>  
## 1 patient1 cell1 CD8A   
## 2 patient1 cell2 CD45   
## 3 patient1 cell2 CD8A   
## # … with 4 more rows
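
Recent tidyr releases (1.3.0 and later) mark `separate` and `separate_rows` as superseded. Assuming such a version is installed, the same result can be sketched with `separate_wider_delim` and `separate_longer_delim`:

```r
library(tidyr)
library(tibble)

markers <- tibble(
  id = c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers = c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)

tidy_markers <- markers |>
  # Split one column into several columns (wider).
  separate_wider_delim(id, delim = "_", names = c("patient", "cell")) |>
  # Split one cell into several rows (longer).
  separate_longer_delim(markers, delim = "_")

tidy_markers
```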

Select data

  • select: choose some columns
  • filter: choose some rows
iris2 <- iris |> 
  as_tibble() |>  # iris ships as a plain data.frame; convert so results print as tibbles
  select(sepal_length=Sepal.Length, sepal_width=Sepal.Width, Species) |>
  filter(Species %in% c("setosa", "virginica"))

Modify data: mutate

iris2 |> 
  mutate(sepal_length_in = sepal_length / 2.54)  # iris is measured in cm; 1 in = 2.54 cm
## # A tibble: 100 × 4
##   sepal_length sepal_width Species sepal_length_in
##          <dbl>       <dbl> <fct>             <dbl>
## 1          5.1         3.5 setosa             2.01
## 2          4.9         3   setosa             1.93
## 3          4.7         3.2 setosa             1.85
## # … with 97 more rows

Modify data: group_by + summarise

iris2 |> 
  group_by(Species) |> 
  summarise(
    max_length = max(sepal_length),
    mean_length = mean(sepal_length),
    n_samples = n()
  )
## # A tibble: 2 × 4
##   Species   max_length mean_length n_samples
##   <fct>          <dbl>       <dbl>     <int>
## 1 setosa           5.8        5.01        50
## 2 virginica        7.9        6.59        50

Combining data

  • bind_rows: stack two data frames on top of each other; columns are matched by name
  • bind_cols: place two data frames side by side; rows are matched by position, so both need the same number of rows
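
A minimal sketch of both functions; the tables `a`, `b`, and `extra` are made up for illustration.

```r
library(dplyr)

a <- tibble(id = 1:2, value = c("x", "y"))
b <- tibble(value = "z", id = 3L)  # same columns as `a`, different order

# Rows are stacked; columns are matched by name, not by position.
stacked <- bind_rows(a, b)

# Columns are glued together; rows are matched purely by position,
# so this is only safe when both tables are in the same row order.
extra <- tibble(score = c(0.1, 0.9))
side_by_side <- bind_cols(a, extra)
```

When the row order of the two tables is not guaranteed to match, a join on a key column is usually the safer choice than `bind_cols`.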

Inner join

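  • inner_join(x, y): join on all columns that share a name in x and y
  • inner_join(x, y, by="key"): join on the column key, present in both tables
  • inner_join(x, y, by=c("key_x"="key_y")): join column key_x of x against column key_y of y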
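
As an illustration of the third form, here is a sketch with two hypothetical tables whose key columns have different names. Only rows with a match in both tables survive an inner join.

```r
library(dplyr)

# Hypothetical tables: the key is called patient_id in one, patient in the other.
patients <- tibble(patient_id = c("p1", "p2", "p3"), age = c(54, 61, 47))
samples  <- tibble(
  patient = c("p1", "p1", "p3", "p4"),
  tissue  = c("tumor", "blood", "tumor", "blood")
)

# p2 has no sample and p4 has no patient record, so both are dropped.
joined <- inner_join(patients, samples, by = c("patient_id" = "patient"))

joined
```

The result keeps the key under its name from the first table (`patient_id`).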

More joins

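  • left_join, right_join: keep every row of x (left) or of y (right); unmatched rows get NA in the other table's columns
  • full_join: keep every row of both x and y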
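
The difference between the join types is easiest to see on two small tables that only partly overlap. The tables `x` and `y` below are made up for illustration.

```r
library(dplyr)

x <- tibble(key = c("a", "b", "c"), x_val = 1:3)
y <- tibble(key = c("b", "c", "d"), y_val = c(TRUE, FALSE, TRUE))

inner <- inner_join(x, y, by = "key")  # keys present in both: b, c
left  <- left_join(x, y, by = "key")   # all keys of x; y_val is NA for "a"
right <- right_join(x, y, by = "key")  # all keys of y; x_val is NA for "d"
full  <- full_join(x, y, by = "key")   # all keys of either table
```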

See understanding joins for nice illustrations.

Further Reading & Credits