“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Introducing the tidyverse:
library(tidyr)   # cleaning
library(dplyr)   # wrangling
library(ggplot2) # plotting
tibble (instead of data.frame)
tibble(A=1:1000, `log10(TPM+1)` = log1p(1:1000))
## # A tibble: 1,000 × 2
##       A `log10(TPM+1)`
##   <int>          <dbl>
## 1     1          0.693
## 2     2          1.10
## 3     3          1.39
## # … with 997 more rows
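One practical difference the example above hints at is column naming. The sketch below uses made-up three-row data (not from the original slides) to compare how `data.frame()` and `tibble()` treat a non-syntactic column name. It assumes the tibble package is installed.

```r
library(tibble)

# data.frame() passes names through make.names() by default
# (check.names = TRUE), so the column name is altered
base_df <- data.frame(A = 1:3, `log10(TPM+1)` = log10(1:3 + 1))
names(base_df)

# tibble() keeps the name exactly as written
tbl <- tibble(A = 1:3, `log10(TPM+1)` = log10(1:3 + 1))
names(tbl)

# Non-syntactic names are accessed with backticks
tbl$`log10(TPM+1)`
```

Backticks are needed wherever such a column is referenced, including inside dplyr verbs.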
Use pipes
df |> select(A) |> filter(A == 42) |> mutate(B=A*2)
Don’t read your code backwards
mutate(filter(select(df, A), A == 42), B=A*2)
Don’t waste your time inventing variable names
df_subset <- select(df, A)
df_filtered <- filter(df_subset, A == 42)
final_df <- mutate(df_filtered, B=A*2)
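The three styles above should give the same result. Here is a minimal check, assuming a hypothetical `df` (the slides never define one) and R >= 4.1 for the native `|>` pipe:

```r
library(dplyr)

# Hypothetical input; the original slides do not define df
df <- tibble(A = 1:100)

# Piped: reads top to bottom in the order the steps run
piped <- df |>
  select(A) |>
  filter(A == 42) |>
  mutate(B = A * 2)

# Nested: the same steps, read inside out
nested <- mutate(filter(select(df, A), A == 42), B = A * 2)

# Both approaches produce the same one-row table
all.equal(piped, nested)
```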
pivot_longer
table4a |> pivot_longer(cols=-country, values_to="cases", names_to="year")
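To see what `pivot_longer` does without loading `table4a`, here is a small sketch on a hand-made table with the same shape. The countries and counts are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per year
wide <- tibble(
  country = c("X", "Y"),
  `1999` = c(10, 20),
  `2000` = c(30, 40)
)

# Every column except country becomes a (year, cases) pair of rows
long <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

long
```

The former column names end up in `year` as character strings, so a conversion such as `as.integer()` may be needed before plotting.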
pivot_wider
table2 |> pivot_wider(names_from="type", values_from="count")
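The reverse direction can be sketched the same way. The table below mimics the layout of `table2` with invented values.

```r
library(tidyr)
library(tibble)

# Hypothetical long table: the variable name is stored in `type`
long <- tibble(
  country = c("X", "X", "Y", "Y"),
  type = c("cases", "population", "cases", "population"),
  count = c(10, 1000, 20, 2000)
)

# Each distinct value of `type` becomes its own column
wide <- long |>
  pivot_wider(names_from = "type", values_from = "count")

wide
```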
separate
markers = tibble(
  id=c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers=c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)
markers |>
  separate(id, into=c("patient", "cell")) |>
  separate_rows("markers", sep="_")
## # A tibble: 7 × 3
##   patient  cell  markers
##   <chr>    <chr> <chr>
## 1 patient1 cell1 CD8A
## 2 patient1 cell2 CD45
## 3 patient1 cell2 CD8A
## # … with 4 more rows
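Newer tidyr releases (1.3.0 and later) mark `separate()` and `separate_rows()` as superseded. A possible equivalent with the newer functions is sketched below. It assumes tidyr >= 1.3.0 is installed and is not taken from the original slides.

```r
library(tidyr)
library(tibble)

markers <- tibble(
  id = c("patient1_cell1", "patient1_cell2", "patient2_cell1", "patient2_cell2"),
  markers = c("CD8A", "CD45_CD8A", "CD45_CD4_FOXP3", "CD4")
)

result <- markers |>
  # split one column into several columns at a fixed delimiter
  separate_wider_delim(id, delim = "_", names = c("patient", "cell")) |>
  # split one column into several rows at a fixed delimiter
  separate_longer_delim(markers, delim = "_")

result
```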
select: choose some columns
filter: choose some rows
iris2 <- as_tibble(iris) |>  # convert to a tibble so results print as tibbles
  select(sepal_length=Sepal.Length, sepal_width=Sepal.Width, Species) |>
  filter(Species %in% c("setosa", "virginica"))
mutate
iris2 |> mutate(sepal_length_in = sepal_length / 2.54)  # iris lengths are in cm
## # A tibble: 100 × 4
##   sepal_length sepal_width Species sepal_length_in
##          <dbl>       <dbl> <fct>             <dbl>
## 1          5.1         3.5 setosa             2.01
## 2          4.9         3   setosa             1.93
## 3          4.7         3.2 setosa             1.85
## # … with 97 more rows
group_by + summarise
iris2 |>
  group_by(Species) |>
  summarise(max_length = max(sepal_length), mean_length = mean(sepal_length), n_samples=n())
## # A tibble: 2 × 4
##   Species   max_length mean_length n_samples
##   <fct>          <dbl>       <dbl>     <int>
## 1 setosa           5.8        5.01        50
## 2 virginica        7.9        6.59        50
bind_rows: concatenate two dataframes with same columns
bind_cols: concatenate two dataframes with same rows
inner_join(x, y)
inner_join(x, y, by="key")
inner_join(x, y, by=c("key_x"="key_y"))
left_join, right_join
full_join
See "understanding joins" for nice illustrations.
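The differences between the join types are easiest to see on two tiny tables. The sketch below uses invented `patients` and `samples` tables (hypothetical names and values) whose key columns have different names.

```r
library(dplyr)

# Hypothetical tables with differently named key columns
patients <- tibble(patient = c("p1", "p2", "p3"), age = c(34, 58, 41))
samples <- tibble(
  patient_id = c("p1", "p1", "p2", "p4"),
  tissue = c("blood", "tumor", "blood", "tumor")
)
key <- c("patient" = "patient_id")

inner <- inner_join(patients, samples, by = key) # only keys present in both
left  <- left_join(patients, samples, by = key)  # all patients, NA where no sample
right <- right_join(patients, samples, by = key) # all samples, NA where no patient
full  <- full_join(patients, samples, by = key)  # every key from either table

# Number of rows returned by each join
sapply(list(inner = inner, left = left, right = right, full = full), nrow)

# bind_rows stacks tables with the same columns
more_patients <- bind_rows(patients, tibble(patient = "p5", age = 29))

# bind_cols pastes tables side by side; rows are matched by position only
with_sex <- bind_cols(patients, tibble(sex = c("f", "m", "f")))
```

Because `bind_cols` matches rows purely by position, a join is usually the safer choice whenever a shared key exists.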
Examples are taken from the following resources:
For more details, I recommend reading: