Input data and configuration

Could not access params from `r` object. Don't worry if your are running papermill.

# Parameters
input_file = "input_adata.h5ad"
output_file = "adata.h5ad"

adata = sc.read_h5ad(input_file)

print(adata)

AnnData object with n_obs × n_vars = 9584640 × 33514 
    obs: 'batch', 'samples', 'patient', 'origin', 'replicate', 'dataset', 'tumor_type', 'platform', 'age', 'sex', 'facs_purity_cd3', 'facs_purity_cd56'
    var: 'feature_types', 'genome', 'gene_ids'

Quality Metrics

Quality control follows the new "Best practice" tutorial for single cell analysis (Luecken & Theis 2019) and the accompanying case study notebook.

Library size and number of detected genes.

I use the following cutoffs for filtering by QC metrics. The cutoffs are derived empirically from the plots below. I do not use a "MAX_GENES" cutoff for doublet filtering and use Scrublet (Wolock et al.) for computational detection of doublets.

MIN_COUNTS = 2000
MIN_GENES = 700
MAX_MITO = .11
MIN_CELLS = 50
DOUBLET_THRES = .4

Coarse filtering to reduce amount of data:

sc.pp.filter_cells(adata, min_genes=100)
compute_quality_metrics(adata)
print_dim(adata)

filtered out 9513239 cells that haveless than 100 genes expressed

Current dimensions of adata: 71401 cells x 33514 genes.

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,4))
sc.pl.violin(adata, 'n_counts', groupby='samples', log=True, cut=0, ax=ax1, show=False)
sc.pl.violin(adata, 'mt_frac', groupby='samples', ax=ax2, show=False)
ax1.set_title("Count depth (counts per barcode)")
ax2.set_title("Mitochondrial fraction per barcode")
fig.show()
os.makedirs("./tmp", exist_ok=True)
fig.savefig("./tmp/qc_dist_per_sample_before.png")

The sample quality looks sufficiently consistent, so that we apply global filtering cut-offs instead of per-sample filtering

Gene level filtering

Filter genes that occur in less than MIN_CELLS of cells. Relates to the minimal expected cluster size. Given the number of cells, I expect the smallest cluster of interest to contain at least 50 cells.

sc.pp.filter_genes(adata, min_cells=MIN_CELLS)
print_dim(adata)

filtered out 16522 genes that are detectedin less than 50 cells

Current dimensions of adata: 71401 cells x 16992 genes.

Ratio of counts to number of mitochondrial genes

ax1 = sc.pl.scatter(adata, x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth", show=False)
ax1.hlines(y=MIN_GENES, xmin=0, xmax=ax1.get_xlim()[1], linewidth=1, color="red")
ax1.vlines(x=MIN_COUNTS, ymin=0, ymax=ax1.get_ylim()[1], linewidth=1, color="red")
plt.show()

ax2 = sc.pl.scatter(adata[adata.obs['n_counts'] < 10000], x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth (zoomed in)", show=False)
ax2.hlines(y=MIN_GENES, xmin=0, xmax=ax2.get_xlim()[1], linewidth=1, color="red")
ax2.vlines(x=MIN_COUNTS, ymin=0, ymax=ax2.get_ylim()[1], linewidth=1, color="red")
plt.show()

Count depth and detected genes

ax1 = sns.distplot(adata.obs['n_genes'], kde=False, bins=60)
ax1.set_title("Distribution: detected genes")
ax1.vlines(x=MIN_GENES, color="red", linewidth=1, ymin=0, ymax=ax1.get_ylim()[1])
plt.show()

ax2 = sc.pl.scatter(adata, x='rk_n_genes', y='n_genes', color='mt_frac', legend_loc="none", title="Distribution: detected genes", show=False)
ax2.hlines(y=MIN_GENES, color="red", linewidth=1, xmin=0, xmax=ax2.get_xlim()[1])
plt.show()

ax3 = sc.pl.scatter(adata, x='rk_n_counts', y='n_counts', color='mt_frac', legend_loc="none", title="Distribution: read counts", show=False)
ax3.hlines(y=MIN_COUNTS, color="red", linewidth=1, xmin=0, xmax=ax3.get_xlim()[1])
ax3.set_yscale("log")
plt.show()

Mitochondrial reads

ax4 = sns.distplot(adata.obs['mt_frac'], kde=False, bins=60)
ax4.set_title("Distribution: fraction mito reads")
ax4.vlines(x=MAX_MITO, color="red", linewidth=1, ymin=0, ymax=ax4.get_ylim()[1])
plt.show()

ax5 = sc.pl.scatter(adata, x='rk_mt_frac', y='mt_frac', color='mt_frac', show=False, legend_loc="none", title="Distribution: fraction mito reads")
ax5.hlines(y=MAX_MITO, color="red", linewidth=1, xmin=0, xmax=ax5.get_xlim()[1])
plt.show()

Apply filtering by quality metrics

print_dim(adata)

Current dimensions of adata: 71401 cells x 16992 genes.

Apply MIN_GENES threshold:

sc.pp.filter_cells(adata, min_genes=MIN_GENES)
print_dim(adata)

filtered out 30847 cells that haveless than 700 genes expressed

Current dimensions of adata: 40554 cells x 16992 genes.

Apply MIN_COUNTS threshold:

sc.pp.filter_cells(adata, min_counts=MIN_COUNTS)
print_dim(adata)

filtered out 3793 cells that haveless than 2000 counts

Current dimensions of adata: 36761 cells x 16992 genes.

Apply MAX_MITO threshold:

adata = adata[adata.obs['mt_frac'] < MAX_MITO, :]
print_dim(adata)

Current dimensions of adata: 35052 cells x 16992 genes.

exclude ribosomal and mitochondrial genes

Ribosomal genes were downloaded from https://www.genenames.org/data/genegroup/#!/group/1054

adata = adata[:, ~adata.var_names.isin(np.append(mito_genes, ribo_genes))]
print_dim(adata)

Current dimensions of adata: 35052 cells x 16818 genes.

QC plots after filtering

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,4))
sc.pl.violin(adata, 'n_counts', groupby='samples', log=True, cut=0, ax=ax1, show=False)
sc.pl.violin(adata, 'mt_frac', groupby='samples', ax=ax2, show=False)
ax1.set_title("Count depth (counts per barcode)")
ax2.set_title("Mitochondrial fraction per barcode")
fig.show()
fig.savefig("./tmp/qc_dist_per_sample_after.png")

ax1 = sc.pl.scatter(adata, x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth", show=False)
ax1.hlines(y=MIN_GENES, xmin=0, xmax=ax1.get_xlim()[1], linewidth=1, color="red")
ax1.vlines(x=MIN_COUNTS, ymin=0, ymax=ax1.get_ylim()[1], linewidth=1, color="red")
plt.show()

ax2 = sc.pl.scatter(adata[adata.obs['n_counts'] < 10000], x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth (zoomed in)", show=False)
ax2.hlines(y=MIN_GENES, xmin=0, xmax=ax2.get_xlim()[1], linewidth=1, color="red")
ax2.vlines(x=MIN_COUNTS, ymin=0, ymax=ax2.get_ylim()[1], linewidth=1, color="red")
plt.show()

Doublet detection

Perform doublet-detection for each sample individually

We use scrublet for doublet detection.

Scrublet needs to run on each sample individually, because doublets can only arise from within the same sample
Scrublet simulates 'artificial' doublets from the cell populations and computes the similarity of each cell to these simulated doublets. Cells highly similar to the simulation are marked as doublets.

## working on sample H141
n_cells = 3947

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.29
Detected doublet rate = 4.9%
Estimated detectable doublet fraction = 44.8%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 10.9%
Elapsed time: 7.8 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 511 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

    finished




## working on sample H143
n_cells = 2311

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.45
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 25.0%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 1.7%
Elapsed time: 2.0 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 818 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

    finished




## working on sample H149
n_cells = 1454
Preprocessing...

Simulating doublets...
Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.47
Detected doublet rate = 0.2%
Estimated detectable doublet fraction = 22.2%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 0.9%
Elapsed time: 1.0 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 1086 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

    finished




## working on sample H160
n_cells = 3863

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.40
Detected doublet rate = 2.0%
Estimated detectable doublet fraction = 45.2%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 4.5%
Elapsed time: 3.7 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 166 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 3 separate connected components using meta-embedding (experimental)
  n_components

    finished




## working on sample H176
n_cells = 2290

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.48
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 20.2%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 2.6%
Elapsed time: 2.1 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 501 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

    finished




## working on sample H182
n_cells = 2969

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.43
Detected doublet rate = 1.0%
Estimated detectable doublet fraction = 32.9%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 3.2%
Elapsed time: 3.2 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 294 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

    finished




## working on sample H185
n_cells = 2611

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.43
Detected doublet rate = 0.9%
Estimated detectable doublet fraction = 21.5%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 4.1%
Elapsed time: 2.3 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 386 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental)
  n_components

    finished




## working on sample H188
n_cells = 2457
Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.51
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 12.1%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 3.7%
Elapsed time: 2.2 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 356 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

    finished




## working on sample H197
n_cells = 2202
Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.62
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 10.4%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 4.8%
Elapsed time: 2.0 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 165 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental)
  n_components

    finished




## working on sample H205
n_cells = 1357
Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.54
Detected doublet rate = 0.3%
Estimated detectable doublet fraction = 14.1%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 2.1%
Elapsed time: 1.2 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 170 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished
computing UMAP

    finished




## working on sample H208
n_cells = 2859

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.46
Detected doublet rate = 0.6%
Estimated detectable doublet fraction = 21.1%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 3.0%
Elapsed time: 3.4 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 308 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 3 separate connected components using meta-embedding (experimental)
  n_components

    finished




## working on sample H211
n_cells = 3323

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.42
Detected doublet rate = 1.7%
Estimated detectable doublet fraction = 38.1%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 4.5%
Elapsed time: 3.1 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 182 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

    finished




## working on sample H68
n_cells = 3409

Preprocessing...

Simulating doublets...

Embedding transcriptomes using PCA...

Calculating doublet scores...

Automatically set threshold at doublet score = 0.65
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 4.4%
Overall doublet rate:
	Expected   = 10.0%
	Estimated  = 3.3%
Elapsed time: 3.4 seconds

normalizing by total count per cell
    finished ({time_passed}): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
filtered out 711 genes that are detectedin less than 1 cells

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

    finished

adata.obs["is_doublet"] = adata.obs["doublet_score"] > DOUBLET_THRES

visualize confounders

visualize doublets and remove them

sc.tl.pca(adata_vis, svd_solver='arpack')
sc.pp.neighbors(adata_vis, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_vis)
sc.pl.umap(adata_vis, color=["is_doublet", "n_genes"])

computing PCA with n_comps = 50

    finished
computing neighbors
    using 'X_pca' with n_pcs = 40

    finished
computing UMAP

    finished

Dimensions before and after doublet filtering:

Current dimensions of adata: 35052 cells x 16818 genes.

filtered out 572 genes that are detectedin less than 50 cells
Trying to set attribute `.var` of view, making a copy.

Current dimensions of adata: 34576 cells x 16818 genes.

show confounding factors

We visualize various confounding factors on the UMAP plot. Additional we perform clustering using the Leiden algorithm (Traag et al.).

sc.tl.pca(adata_vis, svd_solver='arpack')
sc.pp.neighbors(adata_vis, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_vis)

sc.pl.umap(adata_vis, color=["n_genes", "n_counts", "mt_frac", "samples", "origin", "leiden"], ncols=2)

save result

adata.write(output_file, compression="lzf")

Summary

The purpose of this Notebook is to

load the single cell transcriptomics data from cellranger
check the quality of the data
filter out cells that don't match the quality criteria.

Quality control follows the new "Best practice" tutorial for single cell analysis (Luecken & Theis 2019)

Quality metrics

Common quality metrics are

the number of counts per cell
the number of detected genes per cell
the fraction of reads originating from mitochondrial genes.

Based on these criteria, we defined cutoffs to filter out low-quality cells. The cutoffs were designed empirically based on the distribution of the metric (see Quality Metrics).

The chosen cutoffs are:

Min #counts per cell:  2000
Min #genes per cell:  700
Max frac. mitochondrial reads:  0.11

Doublet detection

Doublets are technical artifacts that arise from two cells ending up in a single droplet. They may appear as 'rare, intermediate cell-types' in the analysis. Previously, it has been suggested to filter out cells with a suspiciously high number of counts. Now, more advanced computational methods for detecting doublets are available.

We use scrublet for doublet-detection. Scrublet simulates 'artificial' doublets from the cell populations and computes the similarity of each cell to these simulated doublets. Cells highly similar to the simulation are marked as doublets.

Filtering results

After filtering, we retained the following number of cells:

Number of cells after filtering:  34576

The following figure shows the number of cells per sample:

<matplotlib.axes._subplots.AxesSubplot at 0x2b7df8b39630>

Distribution of counts and frac. mitochondrial reads

before filtering:

dist before filtering

after filtering:

dist after filtering