Input data and configuration

import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

# Get parameters, either from rmarkdown (`r.params`) or from the papermill defaults below.
try:
    input_file = r.params["input_file"]
    output_file = r.params["output_file"]
except Exception:
    print("Could not access params from `r` object. Don't worry if you are running papermill.")
    input_file = "results/03_correct_data/adata.h5ad"
    output_file = "results/04_annotate_cell-types/adata.h5ad"
markers = pd.read_csv("tables/cell_type_markers.csv")
adata = sc.read_h5ad(input_file)

Determine optimal resolution

We use the Leiden algorithm (Traag et al.) to determine cell-type clusters.

The algorithm depends on a resolution parameter: the higher the resolution, the more clusters are found. We run a grid search over a range of resolutions and look for an interval in which the number of clusters stays stable, which would indicate a biologically meaningful clustering.
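
The grid search described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `leiden_cluster_counts` and `longest_plateau` are hypothetical, and the sketch assumes that a neighborhood graph has already been computed on `adata` (e.g. with `sc.pp.neighbors`) in a previous step.

```python
def leiden_cluster_counts(adata, resolutions):
    """Run Leiden once per resolution and return the number of clusters found.

    Hypothetical helper; assumes `sc.pp.neighbors` was already run on `adata`.
    """
    import scanpy as sc  # imported lazily so the plateau helper works without scanpy

    counts = []
    for res in resolutions:
        key = f"leiden_{res:.2f}"  # store each clustering under its own key
        sc.tl.leiden(adata, resolution=float(res), key_added=key)
        counts.append(adata.obs[key].nunique())
    return counts


def longest_plateau(resolutions, n_clusters):
    """Return (first_res, last_res, k) for the longest run of equal cluster counts."""
    if not n_clusters or len(resolutions) != len(n_clusters):
        raise ValueError("need two non-empty sequences of equal length")
    best = (0, 0)  # inclusive start and end index of the best run so far
    start = 0
    for i in range(1, len(n_clusters) + 1):
        # A run ends at the end of the list or when the cluster count changes.
        if i == len(n_clusters) or n_clusters[i] != n_clusters[start]:
            if (i - 1) - start > best[1] - best[0]:
                best = (start, i - 1)
            start = i
    return resolutions[best[0]], resolutions[best[1]], n_clusters[best[0]]


# Toy example with made-up cluster counts (not results from this dataset):
print(longest_plateau([0.1, 0.2, 0.3, 0.4, 0.5], [3, 5, 5, 5, 8]))
# -> (0.2, 0.4, 5)
```

With the grid defined in this notebook, one would call `leiden_cluster_counts(adata, resolutions)` and plot the returned counts against `resolutions` to look for a plateau by eye; `longest_plateau` only automates that inspection.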

resolutions = np.arange(0.1, 3, 0.05)

LEIDEN_RES = 1.0  # final resolution used for the annotation (r=1.0)
sc.tl.leiden(adata, resolution=LEIDEN_RES)
## running Leiden clustering
##     finished

There does not seem to be a clear plateau in the curve, except for an (arguably small) one around a resolution of ~1.0. The clustering with r=1.0 looks reasonable for assigning cell types, so we stick with it for this task.

fig, ax = plt.subplots(figsize=(14, 10))
sc.pl.umap(adata, color="leiden", ax=ax, legend_loc="on data")

Visualize cell-type markers

Plot the known marker genes of each cell type on the UMAP of the final clustering (resolution=1):

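# Assumption: iterate over every cell type listed in the marker table.
cell_types = markers["cell_type"].unique()
for ct in cell_types:
    marker_genes = markers.loc[markers["cell_type"] == ct, "gene_identifier"].tolist()
    sc.pl.umap(adata, color=marker_genes, title=[f"{ct}: {g}" for g in marker_genes])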

Assign cell types

fig, ax = plt.subplots(figsize=(14, 10))
sc.pl.umap(adata, legend_loc="on data", color="leiden", ax=ax)

Assign clusters to cell types using the following mapping:

annotation = {
    "B cell": [8, 6, 2, 12, 19, 10],
    "CAF": [17],
    "Endothelial cell": [16, 20],
    "Mast cell": [21],
    "NK cell": [5, 23],
    "T cell CD8+": [11, 13, 0, 7],
    "T cell regulatory": [4],
    "T cell CD4+ non-regulatory": [1, 9, 3],
    "myeloid": [14],
    "pDC": [22],
}
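
One way to apply such a mapping is to invert it into a cluster-to-cell-type lookup and to label every cluster that is not listed as "unknown". The sketch below is an assumption about how the assignment might be done, not the notebook's own code: `invert_annotation` and `assign_cell_types` are hypothetical helpers, and cluster labels are treated as strings because scanpy stores Leiden labels as strings. Clusters 15 and 18 do not appear in the mapping, so they would presumably end up in the "unknown" category that shows up in the results.

```python
annotation = {
    "B cell": [8, 6, 2, 12, 19, 10],
    "CAF": [17],
    "Endothelial cell": [16, 20],
    "Mast cell": [21],
    "NK cell": [5, 23],
    "T cell CD8+": [11, 13, 0, 7],
    "T cell regulatory": [4],
    "T cell CD4+ non-regulatory": [1, 9, 3],
    "myeloid": [14],
    "pDC": [22],
}


def invert_annotation(annotation):
    """Turn {cell_type: [clusters]} into {cluster (as str): cell_type}."""
    cluster_to_cell_type = {}
    for cell_type, clusters in annotation.items():
        for cluster in clusters:
            key = str(cluster)
            # Guard against a cluster being listed under two cell types.
            if key in cluster_to_cell_type:
                raise ValueError(f"cluster {key} is assigned to more than one cell type")
            cluster_to_cell_type[key] = cell_type
    return cluster_to_cell_type


def assign_cell_types(cluster_labels, cluster_to_cell_type, default="unknown"):
    """Map each cluster label to its cell type; unlisted clusters get `default`."""
    return [cluster_to_cell_type.get(str(c), default) for c in cluster_labels]


cluster_to_cell_type = invert_annotation(annotation)
print(assign_cell_types(["8", "15", "22"], cluster_to_cell_type))
# -> ['B cell', 'unknown', 'pDC']
```

On the AnnData object itself, a pandas equivalent would be along the lines of `adata.obs["cell_type"] = adata.obs["leiden"].astype(str).map(cluster_to_cell_type).fillna("unknown")`.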

Results

sc.pl.umap(adata, color=["cell_type_unknown", "cell_type_coarse", "cell_type"])
## ... storing 'cell_type' as categorical
## ... storing 'cell_type_unknown' as categorical
## ... storing 'cell_type_coarse' as categorical

display(adata.obs.groupby("cell_type")[["samples"]].count().sort_values("samples"), n=50)
##                             samples
## cell_type                          
## pDC                              70
## Mast cell                        97
## CAF                             235
## myeloid                         545
## Endothelial cell                546
## unknown                         686
## NK cell                        2780
## T cell regulatory              2832
## T cell CD8+                    8663
## T cell CD4+ non-regulatory     8750
## B cell                         9372

Cell-type distribution per sample

[Figure: cell-type distribution per sample]

display(cell_type_fractions, n=50)
##    samples  facs_purity_cd3  facs_purity_cd56  frac_t_cell  frac_nk_cell
## 0      H68            0.797             0.138     0.854556      0.107450
## 1     H141            0.288             0.025     0.468464      0.019890
## 2     H143            0.653             0.008     0.758050      0.004352
## 3     H149            0.644             0.033     0.803313      0.043478
## 4     H160            0.342             0.067     0.392867      0.029855
## 5     H176            0.558             0.108     0.730668      0.086555
## 6     H182            0.303             0.109     0.462692      0.090971
## 7     H185            0.493             0.163     0.746032      0.141696
## 8     H188            0.657             0.087     0.822634      0.064609
## 9     H197            0.271             0.171     0.495413      0.173853
## 10    H205            0.485             0.028     0.341246      0.034866
## 11    H208            0.336             0.323     0.586389      0.232017
## 12    H211            0.382             0.029     0.316289      0.026026

Compare annotations with FACS markers

The T cell and NK cell fractions derived from the single-cell annotations correlate strongly with the FACS purities for CD3 and CD56, respectively.
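
The strength of this relationship can be checked directly from the `cell_type_fractions` table shown above. The following sketch computes Pearson correlation coefficients from the printed values; it is an illustration rather than the notebook's own analysis, and the helper `pearson_r` is a hypothetical name.

```python
import math

# Values copied from the `cell_type_fractions` table (samples H68 ... H211).
facs_purity_cd3 = [0.797, 0.288, 0.653, 0.644, 0.342, 0.558, 0.303,
                   0.493, 0.657, 0.271, 0.485, 0.336, 0.382]
frac_t_cell = [0.854556, 0.468464, 0.758050, 0.803313, 0.392867, 0.730668, 0.462692,
               0.746032, 0.822634, 0.495413, 0.341246, 0.586389, 0.316289]
facs_purity_cd56 = [0.138, 0.025, 0.008, 0.033, 0.067, 0.108, 0.109,
                    0.163, 0.087, 0.171, 0.028, 0.323, 0.029]
frac_nk_cell = [0.107450, 0.019890, 0.004352, 0.043478, 0.029855, 0.086555, 0.090971,
                0.141696, 0.064609, 0.173853, 0.034866, 0.232017, 0.026026]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_t_cell = pearson_r(facs_purity_cd3, frac_t_cell)
r_nk_cell = pearson_r(facs_purity_cd56, frac_nk_cell)
print(f"CD3 purity vs. T cell fraction:   r = {r_t_cell:.2f}")
print(f"CD56 purity vs. NK cell fraction: r = {r_nk_cell:.2f}")
```

With only 13 samples, the coefficients are sensitive to individual samples, so they are best read alongside a scatter plot of the same columns.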

Save output

adata.write(output_file, compression="lzf")

Summary

The purpose of this notebook is to:

  • load the normalized and corrected data from the previous step
  • use the Leiden algorithm (Traag et al. 2019) to cluster the single-cell data
  • use known marker genes to annotate the clusters

Results

UMAP plot colored by cell type

Cells per cell type

##                             samples
## cell_type                          
## pDC                              70
## Mast cell                        97
## CAF                             235
## myeloid                         545
## Endothelial cell                546
## unknown                         686
## NK cell                        2780
## T cell regulatory              2832
## T cell CD8+                    8663
## T cell CD4+ non-regulatory     8750
## B cell                         9372

Cell-type distribution per sample

[Figure: cell-type distribution per sample]

Compare annotations with FACS markers:

Overall, the correlation is very strong: