Input data and configuration
Determine optimal resolution
Visualize cell-type markers
- Assign cell types
Results
- Cell-type distribution per sample
- Compare annotations with FACS markers
Save output
Summary
- Results

Input data and configuration

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc

# Get parameters, either from rmarkdown (via the `r` object) or from papermill.
try:
    input_file = r.params["input_file"]
    output_file = r.params["output_file"]
except Exception:
    print("Could not access params from `r` object. Don't worry if you are running papermill.")
    input_file = "results/03_correct_data/adata.h5ad"
    output_file = "results/04_annotate_cell-types/adata.h5ad"

markers = pd.read_csv("tables/cell_type_markers.csv")

adata = sc.read_h5ad(input_file)

Determine optimal resolution

We use the Leiden algorithm (Traag et al. 2019) to determine cell-type clusters.

The algorithm depends on a resolution parameter: the higher the resolution, the more clusters are found. We run a grid search over a range of resolutions and look for a stretch where the number of clusters stays stable, which would indicate a biologically meaningful clustering.
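
The search for a stable stretch can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper `longest_plateau` and the cluster counts are hypothetical. In practice the counts would come from running `sc.tl.leiden` once per resolution and counting the distinct labels.

```python
from itertools import groupby


def longest_plateau(resolutions, n_clusters):
    """Return (first_res, last_res, k) for the longest run of consecutive
    resolutions that all yield the same number of clusters k.
    Ties are resolved in favor of the earliest run."""
    if len(resolutions) != len(n_clusters):
        raise ValueError("resolutions and n_clusters must have equal length")
    if not n_clusters:
        raise ValueError("empty input")
    best_len, best_start, best_k = 0, 0, None
    start = 0
    for k, run in groupby(n_clusters):
        length = len(list(run))
        if length > best_len:
            best_len, best_start, best_k = length, start, k
        start += length
    return resolutions[best_start], resolutions[best_start + best_len - 1], best_k


# Hypothetical cluster counts, for illustration only (not the notebook's results).
resolutions = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
n_clusters = [8, 10, 11, 12, 12, 12, 14, 15]

plateau = longest_plateau(resolutions, n_clusters)
print(plateau)  # (0.8, 1.0, 12)
```

A long plateau suggests that the clustering is robust to the exact choice of resolution; a short one, as reported for this dataset, calls for a visual check of the UMAP.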

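# Grid of resolutions to test.
resolutions = np.arange(0.1, 3, 0.05)

# Final clustering at the resolution chosen from the grid (r=1.0).
LEIDEN_RES = 1.0
sc.tl.leiden(adata, resolution=LEIDEN_RES)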

## running Leiden clustering
##     finished

The curve shows no clear plateau apart from an (arguably small) one around a resolution of 1.0. The clustering with r=1.0 looks reasonable for assigning cell types, so we stick with it for this task.

fig, ax = plt.subplots(figsize=(14, 10))
sc.pl.umap(adata, color="leiden", ax=ax, legend_loc="on data")

Visualize cell-type markers

Using the final clustering (resolution=1), plot the marker genes of each cell type on the UMAP:

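# Assumption: `cell_types` holds the distinct cell types listed in the marker table.
cell_types = markers["cell_type"].unique()
for ct in cell_types:
    marker_genes = markers.loc[markers["cell_type"] == ct, "gene_identifier"]
    sc.pl.umap(adata, color=marker_genes, title=[f"{ct}: {g}" for g in marker_genes])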

Assign cell types

fig, ax = plt.subplots(figsize=(14, 10))
sc.pl.umap(adata, legend_loc="on data", color="leiden", ax=ax)

Assign clusters to cell types using the following mapping:

annotation = {
    "B cell": [8, 6, 2, 12, 19, 10],
    "CAF": [17],
    "Endothelial cell": [16, 20],
    "Mast cell": [21],
    "NK cell": [5, 23],
    "T cell CD8+": [11, 13, 0, 7],
    "T cell regulatory": [4],
    "T cell CD4+ non-regulatory": [1, 9, 3],
    "myeloid": [14],
    "pDC": [22],
}
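
One way to apply this mapping is to invert it into a cluster-to-cell-type lookup. The sketch below is an illustration under assumptions, not the notebook's own code: the helper `invert_annotation` is hypothetical, and labelling unmapped clusters as "unknown" is inferred from the "unknown" category in the results. Cluster IDs are converted to strings because scanpy stores Leiden labels as strings.

```python
annotation = {
    "B cell": [8, 6, 2, 12, 19, 10],
    "CAF": [17],
    "Endothelial cell": [16, 20],
    "Mast cell": [21],
    "NK cell": [5, 23],
    "T cell CD8+": [11, 13, 0, 7],
    "T cell regulatory": [4],
    "T cell CD4+ non-regulatory": [1, 9, 3],
    "myeloid": [14],
    "pDC": [22],
}


def invert_annotation(annotation):
    """Map each cluster label (as a string) to its cell type.
    Raises if a cluster is assigned to more than one cell type."""
    cluster_to_cell_type = {}
    for cell_type, clusters in annotation.items():
        for cluster in clusters:
            key = str(cluster)
            if key in cluster_to_cell_type:
                raise ValueError(f"cluster {key} is assigned twice")
            cluster_to_cell_type[key] = cell_type
    return cluster_to_cell_type


cluster_to_cell_type = invert_annotation(annotation)

# Toy Leiden labels for four cells; cluster 15 is not part of the mapping.
leiden_labels = ["8", "17", "15", "0"]
cell_types = [cluster_to_cell_type.get(c, "unknown") for c in leiden_labels]
print(cell_types)  # ['B cell', 'CAF', 'unknown', 'T cell CD8+']
```

With an AnnData object, the same lookup could be applied to `adata.obs["leiden"]`, for example with the pandas `Series.map` method, filling unmapped clusters with "unknown" afterwards.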

Results

sc.pl.umap(adata, color=["cell_type_unknown", "cell_type_coarse", "cell_type"])

## ... storing 'cell_type' as categorical
## ... storing 'cell_type_unknown' as categorical
## ... storing 'cell_type_coarse' as categorical

display(adata.obs.groupby("cell_type")[["samples"]].count().sort_values("samples"), n=50)

##                             samples
## cell_type                          
## pDC                              70
## Mast cell                        97
## CAF                             235
## myeloid                         545
## Endothelial cell                546
## unknown                         686
## NK cell                        2780
## T cell regulatory              2832
## T cell CD8+                    8663
## T cell CD4+ non-regulatory     8750
## B cell                         9372

Cell-type distribution per sample

display(cell_type_fractions, n=50)

##    samples  facs_purity_cd3  facs_purity_cd56  frac_t_cell  frac_nk_cell
## 0      H68            0.797             0.138     0.854556      0.107450
## 1     H141            0.288             0.025     0.468464      0.019890
## 2     H143            0.653             0.008     0.758050      0.004352
## 3     H149            0.644             0.033     0.803313      0.043478
## 4     H160            0.342             0.067     0.392867      0.029855
## 5     H176            0.558             0.108     0.730668      0.086555
## 6     H182            0.303             0.109     0.462692      0.090971
## 7     H185            0.493             0.163     0.746032      0.141696
## 8     H188            0.657             0.087     0.822634      0.064609
## 9     H197            0.271             0.171     0.495413      0.173853
## 10    H205            0.485             0.028     0.341246      0.034866
## 11    H208            0.336             0.323     0.586389      0.232017
## 12    H211            0.382             0.029     0.316289      0.026026
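
The notebook does not show how `frac_t_cell` and `frac_nk_cell` are derived. A plausible sketch is given below, assuming that the T-cell fraction pools the three T-cell categories and that the denominator is the total number of cells per sample; both are assumptions, and the helper `cell_type_fraction` and the toy data are hypothetical.

```python
from collections import Counter

# Assumption: these three annotated categories together count as "T cell".
T_CELL_TYPES = {"T cell CD8+", "T cell regulatory", "T cell CD4+ non-regulatory"}


def cell_type_fraction(samples, cell_types, selected):
    """For each sample, return the fraction of its cells whose cell type
    is in `selected`. `samples` and `cell_types` hold one entry per cell."""
    total = Counter(samples)
    hits = Counter(s for s, ct in zip(samples, cell_types) if ct in selected)
    return {s: hits[s] / n for s, n in total.items()}


# Toy data: six cells from two samples (not the notebook's data).
samples = ["H68", "H68", "H68", "H68", "H141", "H141"]
cell_types = [
    "T cell CD8+", "NK cell", "T cell regulatory", "B cell",
    "B cell", "T cell CD4+ non-regulatory",
]

frac_t_cell = cell_type_fraction(samples, cell_types, T_CELL_TYPES)
frac_nk_cell = cell_type_fraction(samples, cell_types, {"NK cell"})
print(frac_t_cell)   # {'H68': 0.5, 'H141': 0.5}
print(frac_nk_cell)  # {'H68': 0.25, 'H141': 0.0}
```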

Compare annotations with FACS markers

The cell-type fractions derived from the single-cell annotations correlate very strongly with the FACS purities (CD3 vs. T-cell fraction, CD56 vs. NK-cell fraction).
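
The correlation can be recomputed from the per-sample table above. The sketch below uses Pearson's r, which is an assumption: the notebook does not state which correlation measure it uses. Pairing CD3 purity with the T-cell fraction and CD56 purity with the NK-cell fraction is inferred from the column names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Values copied from the per-sample table (samples H68 ... H211, in order).
facs_purity_cd3 = [0.797, 0.288, 0.653, 0.644, 0.342, 0.558, 0.303,
                   0.493, 0.657, 0.271, 0.485, 0.336, 0.382]
frac_t_cell = [0.854556, 0.468464, 0.758050, 0.803313, 0.392867, 0.730668,
               0.462692, 0.746032, 0.822634, 0.495413, 0.341246, 0.586389,
               0.316289]
facs_purity_cd56 = [0.138, 0.025, 0.008, 0.033, 0.067, 0.108, 0.109,
                    0.163, 0.087, 0.171, 0.028, 0.323, 0.029]
frac_nk_cell = [0.107450, 0.019890, 0.004352, 0.043478, 0.029855, 0.086555,
                0.090971, 0.141696, 0.064609, 0.173853, 0.034866, 0.232017,
                0.026026]

r_t = pearson_r(facs_purity_cd3, frac_t_cell)
r_nk = pearson_r(facs_purity_cd56, frac_nk_cell)
print(f"CD3 vs. T-cell fraction:   r = {r_t:.2f}")
print(f"CD56 vs. NK-cell fraction: r = {r_nk:.2f}")
```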

Save output

adata.write(output_file, compression="lzf")

Summary

The purpose of this notebook is to:

- load the normalized and corrected data from the previous step
- use the Leiden algorithm (Traag et al. 2019) to cluster the single-cell data
- use known marker genes to annotate the clusters

Results

UMAP plot colored by cell type

Cells per cell type

##                             samples
## cell_type                          
## pDC                              70
## Mast cell                        97
## CAF                             235
## myeloid                         545
## Endothelial cell                546
## unknown                         686
## NK cell                        2780
## T cell regulatory              2832
## T cell CD8+                    8663
## T cell CD4+ non-regulatory     8750
## B cell                         9372

Cell-type distribution per sample

Compare annotations with FACS markers:

Overall, the single-cell annotations correlate very strongly with the FACS measurements.