“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Introducing the tidyverse:
library(tidyr)   # cleaning
library(dplyr)   # wrangling
library(ggplot2) # plotting
data.frame
tibble(A=1:1000, `log10(TPM+1)` = log1p(1:1000))
## # A tibble: 1,000 × 2
##       A `log10(TPM+1)`
##   <int>          <dbl>
## 1     1          0.693
## 2     2          1.10
## 3     3          1.39
## # … with 997 more rows
Use pipes
df |> select(A) |> filter(A == 42) |> mutate(B=A*2)
Don’t read your code backwards
mutate(filter(select(df, A), A == 42), B=A*2)
Don’t waste your time inventing variable names
df_subset <- select(df, A)
df_filtered <- filter(df_subset, A == 42)
final_df <- mutate(df_filtered, B=A*2)
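To make the comparison concrete, here is a minimal sketch showing that the piped and the nested spelling produce the same result. The input `df` is a hypothetical one-column tibble invented for illustration; the native pipe `|>` needs R 4.1 or later.

```r
library(dplyr)

# Hypothetical input: a single integer column A (not from the slides).
df <- tibble(A = 1:100)

# Piped version: reads left to right, in the order the steps run.
piped <- df |> select(A) |> filter(A == 42) |> mutate(B = A * 2)

# Nested version: same calls, but read from the inside out.
nested <- mutate(filter(select(df, A), A == 42), B = A * 2)

# Both spellings yield one row with A = 42 and B = 84.
print(piped)
print(isTRUE(all.equal(piped, nested)))
```

The pipe passes the left-hand side as the first argument of the call on the right, which is why every dplyr verb takes the data frame first.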
pivot_longer
table4a |> pivot_longer(cols=-country, values_to="cases", names_to="year")
pivot_wider
table2 |> pivot_wider(names_from="type", values_from="count")
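`table4a` and `table2` are example datasets that ship with tidyr. As a self-contained sketch of what the two verbs do, the snippet below builds a small wide table by hand (the countries and counts are made up for illustration), reshapes it to long form, and then back again.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
wide <- tibble(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(30, 40)
)

# Wide -> long: each year column becomes a (year, cases) pair per country.
long <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

# Long -> wide: the inverse step spreads the years back into columns.
wide_again <- long |>
  pivot_wider(names_from = "year", values_from = "cases")

print(long)        # 4 rows: country, year, cases
print(wide_again)  # 2 rows: country, `1999`, `2000`
```

Note that the former column names end up as character values in `year`; convert them with `as.integer()` if numeric years are needed.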
separate
markers <- tibble(
  id=c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers=c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)
markers |> separate(id, into=c("patient", "cell")) |> separate_rows("markers", sep="_")
## # A tibble: 7 × 3
##   patient  cell  markers
##   <chr>    <chr> <chr>
## 1 patient1 cell1 CD8A
## 2 patient1 cell2 CD45
## 3 patient1 cell2 CD8A
## # … with 4 more rows
select: choose some columns
filter: choose some rows
iris2 <- iris |>
  as_tibble() |>  # iris is a plain data.frame; convert so results print as a tibble
  select(sepal_length=Sepal.Length, sepal_width=Sepal.Width, Species) |>
  filter(Species %in% c("setosa", "virginica"))
mutate
# iris lengths are in centimetres; 1 inch = 2.54 cm
iris2 |> mutate(sepal_length_in = sepal_length / 2.54)
## # A tibble: 100 × 4
##   sepal_length sepal_width Species sepal_length_in
##          <dbl>       <dbl> <fct>             <dbl>
## 1          5.1         3.5 setosa             2.01
## 2          4.9         3   setosa             1.93
## 3          4.7         3.2 setosa             1.85
## # … with 97 more rows
group_by + summarise
iris2 |> group_by(Species) |> summarise(max_length = max(sepal_length), mean_length = mean(sepal_length), n_samples=n())
## # A tibble: 2 × 4
##   Species   max_length mean_length n_samples
##   <fct>          <dbl>       <dbl>     <int>
## 1 setosa           5.8        5.01        50
## 2 virginica        7.9        6.59        50
bind_rows: concatenate two data frames with the same columns
bind_cols: concatenate two data frames with the same number of rows
inner_join(x, y)
inner_join(x, y, by="key")
inner_join(x, y, by=c("key_x"="key_y"))
left_join, right_join, full_join
See understanding joins for nice illustrations.
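The four join types differ only in which unmatched rows they keep. A minimal sketch, assuming two hypothetical tables `x` and `y` that share a `key` column (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical tables: key 1 appears only in x, key 4 only in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(2, 3, 4), val_y = c("y2", "y3", "y4"))

inner <- inner_join(x, y, by = "key")  # keys present in both: 2, 3
left  <- left_join(x, y, by = "key")   # every key of x: 1, 2, 3
right <- right_join(x, y, by = "key")  # every key of y: 2, 3, 4
full  <- full_join(x, y, by = "key")   # keys from either table: 1, 2, 3, 4

# Rows without a match are padded with NA in the other table's columns.
print(full)
```

Passing `by` explicitly avoids the message dplyr prints when it has to guess the join columns from the shared names.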
Examples are taken from the following resources:
For more details, I recommend reading: