Normalize and compute highly variable genes

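In this notebook, we will load precomputed doublet calls, normalize the data, add cell-cycle scores, remove doublets, compute summary statistics, and compute highly variable genes.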

Input data

Load doublets precomputed by solo

We don't run solo as part of the pipeline, as the results are not reproducible on different systems. Instead, we load pre-computed results from the repository.

How solo was originally run is described in main.nf.

Normalize and scale

The raw data object will contain normalized, log-transformed values for visualization. The original, raw (UMI) counts are stored in adata.obsm["raw_counts"].

We use the straightforward normalization by library size as implemented in scanpy.
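For intuition, library-size normalization followed by a log transform can be sketched without any dependencies. In scanpy, this typically corresponds to `sc.pp.normalize_total` followed by `sc.pp.log1p`. The target sum of 10,000 and the toy counts below are assumptions for illustration, not values taken from this notebook.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        library_size = sum(cell)
        # Cells with zero counts stay all-zero to avoid division by zero.
        scale = target_sum / library_size if library_size > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Toy UMI count matrix (hypothetical): 2 cells x 3 genes.
raw_counts = [[1, 0, 3], [10, 10, 20]]
log_norm = normalize_log1p(raw_counts)
```

After this step, every cell has the same total on the linear scale, so differences in sequencing depth no longer dominate comparisons between cells.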

Add cell-cycle scores

Remove doublets

Summary statistics

Generate a summary table with

The first two metrics are based on the raw files (FASTQ and BAM), i.e. before UMI deduplication, and are read in from precomputed tables. The other columns are based on the counts produced by cellranger and are computed on the anndata object.

Get fraction of ribosomal genes

We need to revert to the unfiltered anndata object to get statistics on ribosomal genes, as we removed them earlier. The statistics need to be computed on "called cells" (after doublet filtering), so we can't compute them in the earlier notebook either.
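A minimal sketch of the per-cell ribosomal fraction, assuming ribosomal genes can be recognized by the gene-symbol prefixes `RPS` and `RPL` (a common convention for human gene symbols; the selection actually used in this notebook may differ). The counts come from the unfiltered object and are restricted to the barcodes that remain after doublet filtering. All names and values below are hypothetical.

```python
def ribosomal_fraction(gene_names, counts_by_cell, called_cells,
                       prefixes=("RPS", "RPL")):
    """Return {barcode: fraction of counts in ribosomal genes} for called cells."""
    # Column indices of genes whose symbol starts with a ribosomal prefix.
    ribo_idx = [i for i, g in enumerate(gene_names) if g.startswith(prefixes)]
    fractions = {}
    for barcode in called_cells:
        counts = counts_by_cell[barcode]
        total = sum(counts)
        ribo = sum(counts[i] for i in ribo_idx)
        fractions[barcode] = ribo / total if total > 0 else 0.0
    return fractions

# Hypothetical unfiltered counts: ribosomal genes are still present here.
genes = ["RPS3", "RPL5", "ACTB", "GAPDH"]
unfiltered = {
    "cell_a": [2, 2, 4, 2],   # 4 of 10 counts are ribosomal
    "cell_b": [0, 1, 3, 0],   # 1 of 4 counts is ribosomal
    "doublet": [5, 5, 5, 5],  # removed by doublet filtering, so ignored
}
called = ["cell_a", "cell_b"]
frac_ribo = ribosomal_fraction(genes, unfiltered, called)
```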

Compute stats by aggregating obs
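One way to sketch this aggregation without dependencies is to group per-cell records by sample and summarize each group. With the `obs` data frame of an anndata object, the same idea is usually expressed as a pandas `groupby`. The column names (`sample`, `n_counts`, `n_genes`) and the choice of the median as summary are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

def summarize_by_sample(obs_rows):
    """Aggregate per-cell records into one summary row per sample."""
    grouped = defaultdict(list)
    for row in obs_rows:
        grouped[row["sample"]].append(row)
    return {
        sample: {
            "n_cells": len(rows),
            "median_counts": median(r["n_counts"] for r in rows),
            "median_genes": median(r["n_genes"] for r in rows),
        }
        for sample, rows in grouped.items()
    }

# Hypothetical per-cell metadata, one dict per called cell.
obs = [
    {"sample": "S1", "n_counts": 1000, "n_genes": 500},
    {"sample": "S1", "n_counts": 3000, "n_genes": 900},
    {"sample": "S1", "n_counts": 2000, "n_genes": 700},
    {"sample": "S2", "n_counts": 4000, "n_genes": 1200},
    {"sample": "S2", "n_counts": 6000, "n_genes": 1400},
]
summary = summarize_by_sample(obs)
```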

Compute highly variable genes