Input data and configuration

# external libraries used throughout this notebook
import numpy as np
import pandas as pd
import scanpy as sc

# Get default parameters, either the papermill or the rmarkdown way.
try:
    input_file = r.params["input_file"]
    output_file = r.params["output_file"]
except Exception:
    print("Could not access params from the `r` object. Don't worry if you are running papermill.")
    input_file = "results/02_process_data/adata.h5ad"
    output_file = "results/03_correct_data/adata.h5ad"
# gene annotation (biomart) and cell-cycle marker gene tables
biomart = pd.read_csv("tables/biomart.tsv", sep="\t")
cell_cycle_regev = pd.read_csv("tables/cell_cycle_regev.tsv", sep="\t")
cell_cycle_regev = cell_cycle_regev[["hgnc_symbol", "phase"]].drop_duplicates()
adata = sc.read_h5ad(input_file)
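
When this notebook is run with papermill instead of rmarkdown, the defaults above are overridden by injected parameters. A hypothetical invocation via papermill's Python API (the notebook filenames are placeholders, not part of this repository):

import papermill as pm

pm.execute_notebook(
    "03_correct_data.ipynb",      # placeholder input notebook
    "03_correct_data.out.ipynb",  # placeholder executed copy
    parameters=dict(
        input_file="results/02_process_data/adata.h5ad",
        output_file="results/03_correct_data/adata.h5ad",
    ),
)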

Normalize and scale

The raw data object will contain normalized, log-transformed values for visualization. The original raw (UMI) counts are stored in adata.obsm["raw_counts"].

While more sophisticated size-factor normalization methods exist, we stick to a simple CPM normalization here.
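
The norm_log call below is a small helper defined earlier in this pipeline; a minimal sketch of what it presumably does, assuming plain CPM scaling followed by log transformation (the log output below matches scanpy's normalize_per_cell):

def norm_log(adata):
    # hypothetical sketch of the norm_log helper used below:
    # scale every cell to one million counts, then log-transform
    sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e6)
    sc.pp.log1p(adata)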

norm_log(adata)
## normalizing by total count per cell
##     finished: normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)
sc.pp.pca(adata)
## Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'`. This will likely become the Scanpy default in the future.
## computing PCA with n_comps = 50
##     finished
sc.pp.neighbors(adata)
## computing neighbors
##     using 'X_pca' with n_pcs = 50
##     finished
sc.tl.umap(adata)
## computing UMAP
##     finished
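
Since this PCA, neighbors, UMAP sequence is recomputed after every correction step in this notebook, it could be wrapped in a small convenience helper; a sketch:

def recompute_embedding(adata):
    # re-derive the low-dimensional embedding after adata.X has changed
    sc.pp.pca(adata)        # 50 components by default
    sc.pp.neighbors(adata)  # kNN graph on X_pca
    sc.tl.umap(adata)       # 2D embedding of the neighbor graph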

ComBat

We use ComBat (sc.pp.combat) for per-sample batch-effect correction.

UMAP plots before correction:

sc.pl.umap(adata, color=["samples", "n_genes", "mt_frac", "doublet_score"], ncols=2)

adata_before_combat = adata.copy()
sc.pp.combat(adata, key="samples")
## Standardizing Data across genes.
## Found 13 batches
## Found 0 numerical variables:
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting data
## .../scanpy/preprocessing/_combat.py:235: RuntimeWarning: divide by zero encountered in true_divide
##   b_prior[i],

UMAP plots after correction:

sc.pp.pca(adata)
## Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'`. This will likely become the Scanpy default in the future.
## computing PCA with n_comps = 50
##     finished
sc.pp.neighbors(adata)
## computing neighbors
##     using 'X_pca' with n_pcs = 50
##     finished
sc.tl.umap(adata)
## computing UMAP
##     finished
sc.pl.umap(adata, color=["samples", "n_genes", "mt_frac", "doublet_score"], ncols=2)
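
The uncorrected copy saved above (adata_before_combat) retains its pre-ComBat embedding, so the two states can also be compared directly; a minimal sketch:

# plot the same covariate on the uncorrected and the corrected embedding
sc.pl.umap(adata_before_combat, color="samples", title="before ComBat")
sc.pl.umap(adata, color="samples", title="after ComBat")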

Highly variable genes

We perform highly variable gene (HVG) filtering to reduce the feature space to genes that contain biologically relevant information. This step is required for most downstream analyses and reduces computation time.

A threshold of 4000 HVGs is commonly recommended (Luecken and Theis 2019).

sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=4000)
## extracting highly variable genes
##     finished
## added
##     'highly_variable', boolean vector (adata.var)
##     'means', float vector (adata.var)
##     'dispersions', float vector (adata.var)
##     'dispersions_norm', float vector (adata.var)
sc.pl.highly_variable_genes(adata)

print("Highly variable genes: ", np.sum(adata.var["highly_variable"]))
## Highly variable genes:  4000
print("Total genes:", adata.shape[1])
## Total genes: 16818
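
Note that the highly variable genes are only flagged in adata.var; all genes are kept in the object, and the PCA below automatically restricts itself to the flagged genes (see "computing PCA on highly variable genes" in the log). Physically subsetting the object would be an alternative, sketched here:

# alternative (not done here): restrict the object to the flagged HVGs
adata_hvg = adata[:, adata.var["highly_variable"]].copy()
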
sc.pp.pca(adata)
## Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose `svd_solver='arpack'`. This will likely become the Scanpy default in the future.
## computing PCA with n_comps = 50
## computing PCA on highly variable genes
##     finished
sc.pp.neighbors(adata)
## computing neighbors
##     using 'X_pca' with n_pcs = 50
##     finished
sc.tl.umap(adata)
## computing UMAP
##     finished
sc.tl.leiden(adata)
## running Leiden clustering
##     finished

Visualize confounders

Visualize confounding factors after batch-effect correction and HVG filtering:

Compute cell-cycle phase for each cell

sc.tl.score_genes_cell_cycle(
    adata,
    s_genes=cell_cycle_regev.loc[cell_cycle_regev["phase"] == "S", "hgnc_symbol"].values,
    g2m_genes=cell_cycle_regev.loc[cell_cycle_regev["phase"] == "G2M", "hgnc_symbol"].values,
)
## calculating cell cycle phase
## computing score 'S_score'
## gene: MLF1IP is not in adata.var_names and will be ignored
##     finished
## computing score 'G2M_score'
## gene: FAM64A is not in adata.var_names and will be ignored
## gene: HN1 is not in adata.var_names and will be ignored
##     finished
##     'phase', cell cycle phase (adata.obs)
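
The phase labels stored in adata.obs can be cross-tabulated against the Leiden clusters computed above, e.g. to spot proliferating clusters; a minimal sketch:

# cell-cycle phase composition per Leiden cluster
pd.crosstab(adata.obs["leiden"], adata.obs["phase"])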

Display confounding factors

sc.pl.umap(adata, color=["n_genes", "n_counts", "mt_frac", "samples", "origin", "leiden", "phase", "doublet_score"], ncols=2)
## ... storing 'phase' as categorical

Save results

adata.write(output_file, compression="lzf")
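
As an optional sanity check, the written object can be read back and inspected; a sketch (not part of the pipeline):

# round-trip check of the stored results
adata_check = sc.read_h5ad(output_file)
print(adata_check)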

Summary

The purpose of this notebook is to

  • load the filtered cells from the previous steps,
  • normalize the data, and
  • remove confounding factors.

Normalize data

The number of reads sequenced per cell is the result of a random sampling process. We therefore need to normalize for the number of detected reads per cell. A straightforward approach is the counts per million (CPM) normalization that we use here.
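
Concretely, CPM rescales each cell's counts to a common total of one million before the log transformation; a toy example on a dense array (values are made up):

import numpy as np

counts = np.array([[1, 3, 6], [10, 30, 60]], dtype=float)  # toy cells x genes
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6     # per-cell rescaling
log_cpm = np.log1p(cpm)  # both toy cells now have identical profiles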

Remove confounding effects

We use ComBat to address sample-specific batch effects. As the UMAP plots above show, before applying ComBat the cells group heavily by sample. After applying ComBat this effect is much less prevalent, while the main clusters are maintained, indicating that the biological variability is not removed by over-correction.

Highly-variable gene filtering

We filter for highly variable genes (HVGs) to reduce the feature space to genes that contain biologically relevant information. This step is required for most downstream analyses and reduces computation time.

A threshold of 4000 HVGs is commonly recommended (Luecken and Theis 2019).