Could not access params from `r` object. Don't worry if your are running papermill.
AnnData object with n_obs × n_vars = 9584640 × 33514
obs: 'batch', 'samples', 'patient', 'origin', 'replicate', 'dataset', 'tumor_type', 'platform', 'age', 'sex', 'facs_purity_cd3', 'facs_purity_cd56'
var: 'feature_types', 'genome', 'gene_ids'
Quality control follows the new "Best practice" tutorial for single cell analysis (Luecken & Theis 2019) and the accompanying case study notebook.
I use the following cutoffs for filtering by QC metrics. The cutoffs are derived empirically from the plots below. I do not use a "MAX_GENES" cutoff for doublet filtering and use Scrublet (Wolock et al.) for computational detection of doublets.
Coarse filtering to reduce amount of data:
filtered out 9513239 cells that haveless than 100 genes expressed
Current dimensions of adata: 71401 cells x 33514 genes.
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,4))
sc.pl.violin(adata, 'n_counts', groupby='samples', log=True, cut=0, ax=ax1, show=False)
sc.pl.violin(adata, 'mt_frac', groupby='samples', ax=ax2, show=False)
ax1.set_title("Count depth (counts per barcode)")
ax2.set_title("Mitochondrial fraction per barcode")
fig.show()
os.makedirs("./tmp", exist_ok=True)
fig.savefig("./tmp/qc_dist_per_sample_before.png")
The sample quality looks sufficiently consistent, so that we apply global filtering cut-offs instead of per-sample filtering
Filter genes that occur in less than MIN_CELLS of cells. Relates to the minimal expected cluster size. Given the number of cells, I expect the smallest cluster of interest to contain at least 50 cells.
filtered out 16522 genes that are detectedin less than 50 cells
Current dimensions of adata: 71401 cells x 16992 genes.
ax1 = sc.pl.scatter(adata, x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth", show=False)
ax1.hlines(y=MIN_GENES, xmin=0, xmax=ax1.get_xlim()[1], linewidth=1, color="red")
ax1.vlines(x=MIN_COUNTS, ymin=0, ymax=ax1.get_ylim()[1], linewidth=1, color="red")
plt.show()
ax2 = sc.pl.scatter(adata[adata.obs['n_counts'] < 10000], x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth (zoomed in)", show=False)
ax2.hlines(y=MIN_GENES, xmin=0, xmax=ax2.get_xlim()[1], linewidth=1, color="red")
ax2.vlines(x=MIN_COUNTS, ymin=0, ymax=ax2.get_ylim()[1], linewidth=1, color="red")
plt.show()
ax1 = sns.distplot(adata.obs['n_genes'], kde=False, bins=60)
ax1.set_title("Distribution: detected genes")
ax1.vlines(x=MIN_GENES, color="red", linewidth=1, ymin=0, ymax=ax1.get_ylim()[1])
plt.show()
ax2 = sc.pl.scatter(adata, x='rk_n_genes', y='n_genes', color='mt_frac', legend_loc="none", title="Distribution: detected genes", show=False)
ax2.hlines(y=MIN_GENES, color="red", linewidth=1, xmin=0, xmax=ax2.get_xlim()[1])
plt.show()
ax3 = sc.pl.scatter(adata, x='rk_n_counts', y='n_counts', color='mt_frac', legend_loc="none", title="Distribution: read counts", show=False)
ax3.hlines(y=MIN_COUNTS, color="red", linewidth=1, xmin=0, xmax=ax3.get_xlim()[1])
ax3.set_yscale("log")
plt.show()
ax4 = sns.distplot(adata.obs['mt_frac'], kde=False, bins=60)
ax4.set_title("Distribution: fraction mito reads")
ax4.vlines(x=MAX_MITO, color="red", linewidth=1, ymin=0, ymax=ax4.get_ylim()[1])
plt.show()
ax5 = sc.pl.scatter(adata, x='rk_mt_frac', y='mt_frac', color='mt_frac', show=False, legend_loc="none", title="Distribution: fraction mito reads")
ax5.hlines(y=MAX_MITO, color="red", linewidth=1, xmin=0, xmax=ax5.get_xlim()[1])
plt.show()
Apply MIN_GENES
threshold:
filtered out 30847 cells that haveless than 700 genes expressed
Current dimensions of adata: 40554 cells x 16992 genes.
Apply MIN_COUNTS
threshold:
filtered out 3793 cells that haveless than 2000 counts
Current dimensions of adata: 36761 cells x 16992 genes.
Apply MAX_MITO
threshold:
Current dimensions of adata: 35052 cells x 16992 genes.
Ribosomal genes were downloaded from https://www.genenames.org/data/genegroup/#!/group/1054
Current dimensions of adata: 35052 cells x 16818 genes.
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,4))
sc.pl.violin(adata, 'n_counts', groupby='samples', log=True, cut=0, ax=ax1, show=False)
sc.pl.violin(adata, 'mt_frac', groupby='samples', ax=ax2, show=False)
ax1.set_title("Count depth (counts per barcode)")
ax2.set_title("Mitochondrial fraction per barcode")
fig.show()
fig.savefig("./tmp/qc_dist_per_sample_after.png")
ax1 = sc.pl.scatter(adata, x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth", show=False)
ax1.hlines(y=MIN_GENES, xmin=0, xmax=ax1.get_xlim()[1], linewidth=1, color="red")
ax1.vlines(x=MIN_COUNTS, ymin=0, ymax=ax1.get_ylim()[1], linewidth=1, color="red")
plt.show()
ax2 = sc.pl.scatter(adata[adata.obs['n_counts'] < 10000], x='n_counts', y='n_genes', color='mt_frac', title="detected genes vs. count depth (zoomed in)", show=False)
ax2.hlines(y=MIN_GENES, xmin=0, xmax=ax2.get_xlim()[1], linewidth=1, color="red")
ax2.vlines(x=MIN_COUNTS, ymin=0, ymax=ax2.get_ylim()[1], linewidth=1, color="red")
plt.show()
We use scrublet for doublet detection.
## working on sample H141
n_cells = 3947
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.29
Detected doublet rate = 4.9%
Estimated detectable doublet fraction = 44.8%
Overall doublet rate:
Expected = 10.0%
Estimated = 10.9%
Elapsed time: 7.8 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 511 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H143
n_cells = 2311
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.45
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 25.0%
Overall doublet rate:
Expected = 10.0%
Estimated = 1.7%
Elapsed time: 2.0 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 818 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H149
n_cells = 1454
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.47
Detected doublet rate = 0.2%
Estimated detectable doublet fraction = 22.2%
Overall doublet rate:
Expected = 10.0%
Estimated = 0.9%
Elapsed time: 1.0 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 1086 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H160
n_cells = 3863
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.40
Detected doublet rate = 2.0%
Estimated detectable doublet fraction = 45.2%
Overall doublet rate:
Expected = 10.0%
Estimated = 4.5%
Elapsed time: 3.7 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 166 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 3 separate connected components using meta-embedding (experimental)
n_components
finished
## working on sample H176
n_cells = 2290
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.48
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 20.2%
Overall doublet rate:
Expected = 10.0%
Estimated = 2.6%
Elapsed time: 2.1 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 501 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H182
n_cells = 2969
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.43
Detected doublet rate = 1.0%
Estimated detectable doublet fraction = 32.9%
Overall doublet rate:
Expected = 10.0%
Estimated = 3.2%
Elapsed time: 3.2 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 294 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H185
n_cells = 2611
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.43
Detected doublet rate = 0.9%
Estimated detectable doublet fraction = 21.5%
Overall doublet rate:
Expected = 10.0%
Estimated = 4.1%
Elapsed time: 2.3 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 386 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental)
n_components
finished
## working on sample H188
n_cells = 2457
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.51
Detected doublet rate = 0.4%
Estimated detectable doublet fraction = 12.1%
Overall doublet rate:
Expected = 10.0%
Estimated = 3.7%
Elapsed time: 2.2 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 356 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H197
n_cells = 2202
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.62
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 10.4%
Overall doublet rate:
Expected = 10.0%
Estimated = 4.8%
Elapsed time: 2.0 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 165 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental)
n_components
finished
## working on sample H205
n_cells = 1357
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.54
Detected doublet rate = 0.3%
Estimated detectable doublet fraction = 14.1%
Overall doublet rate:
Expected = 10.0%
Estimated = 2.1%
Elapsed time: 1.2 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 170 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H208
n_cells = 2859
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.46
Detected doublet rate = 0.6%
Estimated detectable doublet fraction = 21.1%
Overall doublet rate:
Expected = 10.0%
Estimated = 3.0%
Elapsed time: 3.4 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 308 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
/home/sturm/scratch/projects/2020/kortekaas2020_paper/work/conda/run_notebook-b99176dfb05e0522988294512373169f/lib/python3.7/site-packages/umap/spectral.py:229: UserWarning: Embedding a total of 3 separate connected components using meta-embedding (experimental)
n_components
finished
## working on sample H211
n_cells = 3323
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.42
Detected doublet rate = 1.7%
Estimated detectable doublet fraction = 38.1%
Overall doublet rate:
Expected = 10.0%
Estimated = 4.5%
Elapsed time: 3.1 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 182 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
## working on sample H68
n_cells = 3409
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.65
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 4.4%
Overall doublet rate:
Expected = 10.0%
Estimated = 3.3%
Elapsed time: 3.4 seconds
normalizing by total count per cell
finished ({time_passed}): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
filtered out 711 genes that are detectedin less than 1 cells
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'.` This will likely become the Scanpy default in the future.
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
sc.tl.pca(adata_vis, svd_solver='arpack')
sc.pp.neighbors(adata_vis, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_vis)
sc.pl.umap(adata_vis, color=["is_doublet", "n_genes"])
computing PCA with n_comps = 50
finished
computing neighbors
using 'X_pca' with n_pcs = 40
finished
computing UMAP
finished
Dimensions before and after doublet filtering:
Current dimensions of adata: 35052 cells x 16818 genes.
filtered out 572 genes that are detectedin less than 50 cells
Trying to set attribute `.var` of view, making a copy.
Current dimensions of adata: 34576 cells x 16818 genes.
We visualize various confounding factors on the UMAP plot. Additional we perform clustering using the Leiden algorithm (Traag et al.).
sc.pl.umap(adata_vis, color=["n_genes", "n_counts", "mt_frac", "samples", "origin", "leiden"], ncols=2)
The purpose of this Notebook is to
Quality control follows the new "Best practice" tutorial for single cell analysis (Luecken & Theis 2019)
Common quality metrics are
Based on these criteria, we defined cutoffs to filter out low-quality cells. The cutoffs were designed empirically based on the distribution of the metric (see Quality Metrics).
The chosen cutoffs are:
Min #counts per cell: 2000
Min #genes per cell: 700
Max frac. mitochondrial reads: 0.11
Doublets are technical artifacts that arise from two cells ending up in a single droplet. They may appear as 'rare, intermediate cell-types' in the analysis. Previously, it has been suggested to filter out cells with a suspiciously high number of counts. Now, more advanced computational methods for detecting doublets are available.
We use scrublet for doublet-detection. Scrublet simulates 'artificial' doublets from the cell populations and computes the similarity of each cell to these simulated doublets. Cells highly similar to the simulation are marked as doublets.
After filtering, we retained the following number of cells:
Number of cells after filtering: 34576
The following figure shows the number of cells per sample:
<matplotlib.axes._subplots.AxesSubplot at 0x2b7df8b39630>
before filtering:
after filtering: