Analysis of 140k T cells from cancer

In this notebook, we re-analyze single-cell TCR/RNA-seq data from Wu et al. (2020) generated on the 10x Genomics platform. The original dataset consists of >140k T cells from 14 treatment-naive patients across four different types of cancer. Roughly 100k of the 140k cells have T-cell receptors.

0. Setup

In [1]:
%load_ext autoreload
%autoreload 2
import sys

sys.path.insert(0, "../../scirpy/")
import scirpy as ir
import pandas as pd
import numpy as np
import scanpy as sc
import scipy as sp
from matplotlib import pyplot as plt
import matplotlib
from weblogo.seq import SeqList, unambiguous_protein_alphabet
from weblogo import png_formatter, svg_formatter, eps_formatter
from weblogo import LogoData, LogoOptions, LogoFormat
from IPython.display import Image, display
import warnings

sc.settings._vector_friendly = True

warnings.filterwarnings('ignore', category=FutureWarning)

# whether to run the alignment or use a cached version.
# If `False` and the cached version is not available, will raise an error.
run_alignment = True

This notebook was run with scirpy commit e07553f6b84515ea2c48d2955660a3d6f900f9eb and the following software versions:

In [2]:
sc.logging.print_versions()
scanpy==1.5.1 anndata==0.7.3 umap==0.4.2 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.2 leidenalg==0.8.0
In [3]:
def weblogo(seqs, title="", format="png"):
    """Draw a sequence logo from a list of amino acid sequences. """
    logodata = LogoData.from_seqs(SeqList(seqs, alphabet=unambiguous_protein_alphabet))
    logooptions = LogoOptions(logo_title=title)
    logoformat = LogoFormat(logodata, logooptions)
    if format == "png":
        display(Image(png_formatter(logodata, logoformat)))
    elif format == "eps":
        return eps_formatter(logodata, logoformat)
In [4]:
# colors from the paper
colors = {
    "Tumor+NAT expanded": "#9458a2",
    "Tumor singleton": "#ff8000",
    "NAT singleton": "#f7f719",
    "Tumor multiplet": "#eeb3cb",
    "NAT multiplet": "#9cd0de",
    "Blood singleton": "#cce70b",
    "Blood multiplet": "#beac83",
}

1. Preparing the data

The dataset ships with the scirpy package. We can conveniently load it from the scirpy.datasets module.

In [5]:
adata = ir.datasets.wu2020()
In [6]:
adata.shape
Out[6]:
(141623, 30727)

We only keep the cells with TCR. ~96k cells remain.

In [7]:
adata = adata[adata.obs["has_tcr"] == "True", :]
adata = adata[~(adata.obs["cluster_orig"] == "nan"), :]
In [8]:
adata.shape
Out[8]:
(96067, 30727)
In [9]:
adata.obs.columns
Out[9]:
Index(['TRA_1_c_gene', 'TRA_1_cdr3', 'TRA_1_cdr3_nt', 'TRA_1_d_gene',
       'TRA_1_expr', 'TRA_1_j_gene', 'TRA_1_junction_ins', 'TRA_1_v_gene',
       'TRA_2_c_gene', 'TRA_2_cdr3', 'TRA_2_cdr3_nt', 'TRA_2_d_gene',
       'TRA_2_expr', 'TRA_2_j_gene', 'TRA_2_junction_ins', 'TRA_2_v_gene',
       'TRB_1_c_gene', 'TRB_1_cdr3', 'TRB_1_cdr3_nt', 'TRB_1_d_gene',
       'TRB_1_expr', 'TRB_1_j_gene', 'TRB_1_junction_ins', 'TRB_1_v_gene',
       'TRB_2_c_gene', 'TRB_2_cdr3', 'TRB_2_cdr3_nt', 'TRB_2_d_gene',
       'TRB_2_expr', 'TRB_2_j_gene', 'TRB_2_junction_ins', 'TRB_2_v_gene',
       'batch', 'clonotype_orig', 'cluster_orig', 'has_tcr', 'multi_chain',
       'patient', 'sample', 'source'],
      dtype='object')
In [10]:
adata.obs["counts"] = adata.X.sum(axis=1).A1
Trying to set attribute `.obs` of view, copying.

Preprocess Transcriptomics data

Transcriptomics data needs to be filtered and preprocessed as with any other single-cell dataset. We recommend following the scanpy tutorial and the best practice paper by Luecken et al.. For the Wu et al. (2020) dataset, the authors already provide clusters and UMAP coordinates. Instead of performing clustering and cluster annotation ourselves, we will just use provided data. The clustering and annotation procedure used by the authors is described in their paper.

In [11]:
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)
In [12]:
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1000)
sc.pp.log1p(adata)
In [13]:
adata.obsm["X_umap"] = adata.obsm["X_umap_orig"]
In [14]:
mapping = {
    "3.1-MT": "other",
    "4.1-Trm": "CD4_Trm",
    "4.2-RPL32": "CD4_RPL32",
    "4.3-TCF7": "CD4_TCF7",
    "4.4-FOS": "CD4_FOSS",
    "4.5-IL6ST": "CD4_IL6ST",
    "4.6a-Treg": "CD4_Treg",
    "4.6b-Treg": "CD4_Treg",
    "8.1-Teff": "CD8_Teff",
    "8.2-Tem": "CD8_Tem",
    "8.3a-Trm": "CD8_Trm",
    "8.3b-Trm": "CD8_Trm",
    "8.3c-Trm": "CD8_Trm",
    "8.4-Chrom": "other",
    "8.5-Mitosis": "other",
    "8.6-KLRB1": "other",
    "nan": "nan",
}
adata.obs["cluster"] = [mapping[x] for x in adata.obs["cluster_orig"]]

Let's inspect the UMAP plots. The first three panels show the UMAP plot colored by sample, patient and cluster. We don't observe any clustering of samples or patients that could hint at batch effects.

The last three panels show the UMAP colored by the T cell markers CD8, CD4, and FOXP3. We can confirm that the markers correspond to their respective cluster labels.

In [15]:
sc.pl.umap(
    adata,
    color=["sample", "patient", "cluster", "CD8A", "CD4", "FOXP3"],
    ncols=2,
    wspace=0.5,
)
... storing 'cluster' as categorical

TCR Quality Control

While most of T cell receptors have exactly one pair of α and β chains, up to one third of T cells can have dual TCRs, i.e. two pairs of receptors originating from different alleles (Schuldt et al (2019)).

Using the scirpy.tl.chain_pairing function, we can add a summary about the T cell receptor compositions to adata.obs.

  • Orphan chain refers to cells that have either a single alpha or beta receptor chain.
  • Extra chain refers to cells that have a full alpha/beta receptor pair, and an additional chain.
  • Multichain refers to cells with more than two receptor pairs detected. These cells are likely doublets.
In [16]:
%%time
ir.tl.chain_pairing(adata)
CPU times: user 11.4 s, sys: 14.2 ms, total: 11.4 s
Wall time: 11.4 s
In [17]:
ax = ir.pl.group_abundance(
    adata, groupby="chain_pairing", target_col="has_tcr", normalize="has_tcr"
)
ax.set_ylabel("fraction of cells")
ax.set_xlabel("chain pairing")
ax.set_title("")
fig = ax.get_figure()
fig.savefig("figures/chain_pairing.svg")
... storing 'chain_pairing' as categorical
In [18]:
sc.pl.umap(
    adata,
    color="chain_pairing",
    groups=["Extra beta", "Extra alpha", "Two full chains"],
    size=[
        10 if x in ["Extra beta", "Extra alpha", "Two full chains"] else 3
        for x in adata.obs["chain_pairing"]
    ],
)

Indeed, in this dataset, ~7% of cells have more than one pair of productive T-cell receptors:

In [19]:
print(
    "Fraction of cells with more than one pair of TCRs: {:.2f}".format(
        np.sum(
            adata.obs["chain_pairing"].isin(
                ["Extra beta", "Extra alpha", "Two full chains"]
            )
        )
        / adata.n_obs
    )
)
Fraction of cells with more than one pair of TCRs: 0.07

Excluding multichain cells

Next, we visualize the Multichain cells on the UMAP plot and exclude them from downstream analysis. Multichain cells likely represent doublets. This is corroborated by the fact that they tend to have a very high number of detected transcripts.

In [20]:
_, pvalue = sp.stats.mannwhitneyu(
    adata.obs.loc[adata.obs["multi_chain"] == "True", "counts"].values,
    adata.obs.loc[adata.obs["multi_chain"] == "False", "counts"].values,
)
In [21]:
fig, (ax1, ax2) = plt.subplots(
    1, 2, figsize=(8, 4), gridspec_kw={"width_ratios": [2, 1]}
)
sc.pl.umap(
    adata,
    color="chain_pairing",
    groups="Multichain",
    size=[30 if x == "Multichain" else 3 for x in adata.obs["chain_pairing"]],
    ax=ax1,
    legend_loc="none",
    show=False,
    frameon=False,
    title="Multichain UMAP",
)
sc.pl.violin(
    adata, "counts", "multi_chain", ax=ax2, show=False, inner="box", stripplot=False
)
ax2.set_ylabel("detected reads")
ax2.set_xlabel("")
ax2.set_xticklabels(["other", "multichain"])
ax2.set_title("detected reads per cell")
ax2.text(0.3, 50000, f"p={pvalue:.2E}")
plt.subplots_adjust(wspace=0.5)
fig.savefig("figures/multichains.svg", dpi=600)
In [22]:
print(
    f"Median counts Multichain: {np.median(adata.obs.loc[adata.obs['multi_chain'] == 'True', 'counts'].values)}"
)
print(
    f"Median counts other: {np.median(adata.obs.loc[adata.obs['multi_chain'] == 'False', 'counts'].values)}"
)
Median counts Multichain: 8517.5
Median counts other: 2301.0
In [23]:
adata.shape
Out[23]:
(96060, 17690)
In [24]:
adata = adata[adata.obs["chain_pairing"] != "Multichain", :].copy()
In [25]:
adata.shape
Out[25]:
(95586, 17690)

2. Define clonotypes

Defining clonotypes in scirpy is a two-step procedure:

  1. Computing a neighborhood graph based on CDR3 sequences
  2. Finding connected submodules in the neighborhood graph and annotating them as clonotypes

scirpy provides several metrics for creating the neighborhood graph. For instance, it is possible to choose between using nucleotide or amino acid CDR3 sequences, or using a sequence similarity metric based on multiple sequence alignments instead of requiring sequences to be identical.

In [26]:
sc.settings.verbosity = 4

identical nucleotide sequences

The authors of the dataset define the clonotypes on the nucleotide sequences and require all sequences of both receptor arms (and multiple chains in case of dual TCRs) to match.

In [27]:
%%time
ir.pp.tcr_neighbors(
    adata, receptor_arms="all", dual_tcr="all",
)
Initializing TcrNeighbors object...
Finished initalizing TcrNeighbors object.  (0:00:17)
Computing TRA pairwise distances...
Finished computing TRA pairwise distances. (0:00:00)
Computing TRB pairwise distances...
Finished computing TRB pairwise distances. (0:00:00)
Started comstructing TRA coord-dictionary...
100%|██████████| 33557/33557 [00:19<00:00, 1764.91it/s]
Finished constructing TRA coord-dictionary (0:00:19)
Started comstructing TRB coord-dictionary...
100%|██████████| 44828/44828 [00:30<00:00, 1491.71it/s]
Finished constructing TRB coord-dictionary (0:00:30)
Constructing cell x cell distance matrix...
100%|██████████| 5708154/5708154 [00:56<00:00, 100641.24it/s]
Finished constructing cell x cell distance matrix.  (0:00:56)

CPU times: user 2min 15s, sys: 6.44 s, total: 2min 22s
Wall time: 2min 21s
In [28]:
%%time
ir.tl.define_clonotypes(adata)
CPU times: user 1.07 s, sys: 258 ms, total: 1.32 s
Wall time: 1.32 s

identical amino acid sequences

Alternatively, we can define "clonotype clusters" based on identical amino acid sequences. Other than clonotypes, these do not necessarily arise from the same antecedent cell, but they recognize the same epitope.

In [29]:
%%time
ir.pp.tcr_neighbors(
    adata, receptor_arms="all", dual_tcr="all", metric="identity", sequence="aa"
)
Initializing TcrNeighbors object...
Finished initalizing TcrNeighbors object.  (0:00:17)
Computing TRA pairwise distances...
Finished computing TRA pairwise distances. (0:00:00)
Computing TRB pairwise distances...
Finished computing TRB pairwise distances. (0:00:00)
Started comstructing TRA coord-dictionary...
100%|██████████| 31135/31135 [00:17<00:00, 1731.61it/s]
Finished constructing TRA coord-dictionary (0:00:17)
Started comstructing TRB coord-dictionary...
100%|██████████| 44074/44074 [00:22<00:00, 1972.65it/s]
Finished constructing TRB coord-dictionary (0:00:22)
Constructing cell x cell distance matrix...
100%|██████████| 5828284/5828284 [01:00<00:00, 95584.87it/s] 
Finished constructing cell x cell distance matrix.  (0:01:00)

CPU times: user 2min 7s, sys: 7.08 s, total: 2min 14s
Wall time: 2min 14s
In [30]:
%%time
ir.tl.define_clonotype_clusters(adata, metric="identity", sequence="aa")
CPU times: user 1.02 s, sys: 170 ms, total: 1.19 s
Wall time: 1.19 s

similar amino acid sequences

With scirpy, it is possible to to one step further and summarize cells into the same clonotype cluster that might recognize the same epitope because they have similar amino acid sequences. This can be done by leveraging Levenshtein or alignment distances. Here, we compute the alignment distance with a cutoff of 15, which is equivalent of three As mutating into R.

In [31]:
# adata.write_h5ad("./adata_in.h5ad")
In [32]:
%%time
if run_alignment:
    ir.pp.tcr_neighbors(
        adata,
        receptor_arms="all",
        dual_tcr="all",
        cutoff=15,
        sequence="aa",
        metric="alignment",
        n_jobs=32
    )
    adata.write_h5ad("./adata_alignment.h5ad")
else:
    # read cached version
    adata = sc.read_h5ad("./adata_alignment.h5ad")
Initializing TcrNeighbors object...
Finished initalizing TcrNeighbors object.  (0:00:16)
Computing TRA pairwise distances...
100%|██████████| 31135/31135 [07:22<00:00, 70.34it/s] 
Finished computing TRA pairwise distances. (0:07:25)
Computing TRB pairwise distances...
100%|██████████| 44074/44074 [14:59<00:00, 48.99it/s] 
Finished computing TRB pairwise distances. (0:15:03)
Started comstructing TRA coord-dictionary...
100%|██████████| 522682/522682 [00:59<00:00, 8819.42it/s]
Finished constructing TRA coord-dictionary (0:00:59)
Started comstructing TRB coord-dictionary...
100%|██████████| 370851/370851 [00:48<00:00, 7627.50it/s]
Finished constructing TRB coord-dictionary (0:00:48)
Constructing cell x cell distance matrix...
100%|██████████| 16899128/16899128 [02:27<00:00, 114626.41it/s]
Finished constructing cell x cell distance matrix.  (0:02:27)

    Started converting distances to connectivities. 
    Finished converting distances to connectivities.  (0:00:00)
... storing 'clonotype' as categorical
... storing 'ct_cluster_aa_identity' as categorical
CPU times: user 7min 50s, sys: 1min 54s, total: 9min 45s
Wall time: 27min 42s
In [33]:
%%time
ir.tl.define_clonotype_clusters(
    adata, metric="alignment", sequence="aa"
)
CPU times: user 1.14 s, sys: 147 ms, total: 1.28 s
Wall time: 1.28 s

Visualizing clonotype networks

To visualize the network we first call scirpy.tl.clonotype_network to compute the layout. We can then visualize it using scirpy.pl.clonotype_network.

The following plot visualizes all clonotypes with at least 50 cells. Each dot represents a cell, and each blob a clonotype. In the left panel each clonotype is labelled, in the right panel the clonotypes are colored by patient.

In [34]:
%%time
ir.tl.clonotype_network(adata, min_size=50, layout="components")
CPU times: user 10.8 s, sys: 108 ms, total: 10.9 s
Wall time: 10.9 s
In [35]:
ir.pl.clonotype_network(
    adata,
    color=["clonotype", "patient"],
    edges=False,
    size=50,
    ncols=2,
    legend_fontoutline=2,
    legend_loc=["on data", "right margin"],
)
... storing 'ct_cluster_aa_alignment' as categorical
Out[35]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2adeca101490>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2adeca0b0190>],
      dtype=object)

3. Clonotype consistency

Before we dive into the analysis of clonal expansion, we compare the different approaches of clonotype definition.

3.1 nucleotide based (scirpy vs. Wu et al.)

In this section, we compare the clonotypes assigned by scirpy assigned to the clonotypes assigned by the authors of the study. The original clonotypes are stored in the clonotype_orig column of obs.

According to the paper, the clonotypes are defined on nucleotide sequences and require sequence identity of all available chains. This should be equivalent of running scirpy.tcr_neighbors with dual_tcr="all" and receptor_arms="all".

To assess if the clonotype definitions are equivalent, we first check if there are clonotypes in clonotype_orig that contain multiple clonotypes according to scirpy's definition.

In [36]:
# the clonotypes in clonotype_orig that contain multiple clonotype_nt
nt_in_orig = (
    adata.obs.groupby(["clonotype_orig", "clonotype"], observed=True)
    .size()
    .reset_index()
    .groupby("clonotype_orig")
    .size()
    .reset_index()
)
nt_in_orig[nt_in_orig[0] > 1]
Out[36]:
clonotype_orig 0

There are none!

Next, we check if there are clonotypes in clonotype_nt that contain multiple clonotypes according to the authors' definition.

In [37]:
# the clonotypes in clonotype_nt that contain multiple clonotype_orig
orig_in_nt = (
    adata.obs.groupby(["clonotype", "clonotype_orig"], observed=True)
    .size()
    .reset_index()
    .groupby("clonotype")
    .size()
    .reset_index()
)
with pd.option_context("display.max_rows", 8):
    display(orig_in_nt[orig_in_nt[0] > 1])
clonotype 0
70 70 2
565 565 5
569 569 2
1548 1548 4
... ... ...
22351 22351 2
25854 25854 2
26240 26240 2
30141 30141 2

22 rows × 2 columns

There are a few!

Let's investigate that further:

In [38]:
with pd.option_context("display.max_rows", 10):
    display(
        adata.obs.loc[
            adata.obs["clonotype"].isin(
                orig_in_nt[orig_in_nt[0] > 1]["clonotype"]
            ),
            [
                "clonotype",
                "clonotype_orig",
                "patient",
                "TRA_1_cdr3_nt",
                "TRA_2_cdr3_nt",
                "TRB_1_cdr3_nt",
                "TRB_2_cdr3_nt",
            ],
        ].sort_values(["clonotype", "clonotype_orig"])
    )
clonotype clonotype_orig patient TRA_1_cdr3_nt TRA_2_cdr3_nt TRB_1_cdr3_nt TRB_2_cdr3_nt
RT2_ACACCCTAGATATACG-1-0 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RB2_TAAGTGCTCAGTTTGG-1-12 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RN2_TACGGTATCTAACGGT-1-27 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RT3_GAACGGATCGTCTGCT-1-6 70 renal3.tnb.C320 Renal3 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
CT1_AGTGGGAAGGTTCCTA-1-13 565 colon1.tn.C372 Colon1 TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG None None None
... ... ... ... ... ... ... ...
LT4_ATAACGCTCTATCCCG-2-17 25854 lung4.tn.C743 Lung4 TGTGCTGTGAGTGGTACCGGCACTGCCAGTAAACTCACCTTT None None None
ET1_GTTAAGCCACCACGTG-2-16 26240 endo1.tn.C3434 Endo1 None None TGTGCCAGCAGCTTAGTTGGGGCCAACGTCCTGACTTTC None
LN6_GGACGTCGTAAGAGGA-1-20 26240 lung6.tnb.C6956 Lung6 None None TGTGCCAGCAGCTTAGTTGGGGCCAACGTCCTGACTTTC None
LT6_AGCGTATGTAGAGCTG-1-18 30141 lung6.tnb.C829 Lung6 None None TGTGCCAGCTCACCACAACAAGAGACCCAGTACTTC None
RN2_TGGCTGGAGCTAGTCT-1-27 30141 renal2.tnb.C3707 Renal2 None None TGTGCCAGCTCACCACAACAAGAGACCCAGTACTTC None

102 rows × 7 columns

All cells of the same clonotype_nt have the same nucleotide sequences (apart from swapping between TRA_1 and TRA_2 or TRB_1 and TRB_2, respectively). Our method appears to work as expected. The inconsistencies seem to arise from the fact that the clonotypes have the same sequences, but originate from different patients. Apparently, the authors did not allow clonotypes from different patients (which makes sense when approaching clonotypes from a genomic point of view, not from an epitope recognition point of view).

When checking the number of clonotypes_orig per clonotype_nt and patient, we eradicate the differences:

In [39]:
# the clonotypes in clonotype_nt that contain multiple clonotype_orig
orig_in_nt = (
    adata.obs.groupby(["clonotype", "patient", "clonotype_orig"], observed=True)
    .size()
    .reset_index()
    .groupby(["clonotype", "patient"])
    .size()
    .reset_index()
)
with pd.option_context("display.max_rows", 8):
    display(orig_in_nt[orig_in_nt[0] > 1])
clonotype patient 0

Now, also the number of clonotypes is consistent:

In [40]:
assert (
    adata.obs["clonotype_orig"].unique().size
    == adata.obs.groupby(["patient", "clonotype"], observed=True).size().reset_index().shape[0]
)
print(
    f"""Number of clonotypes according to Wu et al.: {adata.obs["clonotype_orig"].unique().size}"""
)
print(
    f"""Number of clonotypes according to scirpy: {adata.obs.groupby(["clonotype"]).size().reset_index().shape[0]}"""
)
print(
    f"""Number of clonotypes according to scirpy, within patient: {adata.obs.groupby(["patient", "clonotype"], observed=True).size().reset_index().shape[0]}"""
)
Number of clonotypes according to Wu et al.: 54751
Number of clonotypes according to scirpy: 54719
Number of clonotypes according to scirpy, within patient: 54751

3.2 amino-acid vs. nucleotide-based

For a few clonotypes, we observe differences comparing the amino-acid vs. nucleotide-based approach for defining clonotypes. Clonotypes with the same amino acid sequence but different nucleotide sequences recognize the same antigen - but derive from different antedescent cells. This can be an example of convergent evolution.

In [41]:
ir.tl.clonotype_network(adata, sequence='aa', metric='identity', min_size=50)
In [42]:
ir.pl.clonotype_network(
    adata,
    color=["ct_cluster_aa_identity", "clonotype"],
    edges=False,
    size=50,
    ncols=2,
    legend_fontoutline=2,
    legend_loc=["on data", "none"],
)
Out[42]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2aded2aec950>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2aded72d4150>],
      dtype=object)

Let's find the cells that belong to clonotypes that have different nucleotide sequences but the same amino acid sequences:

In [43]:
nt_in_aa = (
    adata.obs.groupby(["ct_cluster_aa_identity", "clonotype"], observed=True)
    .size()
    .reset_index()
    .groupby(["ct_cluster_aa_identity"])
    .size()
    .reset_index()
)
convergent_clonotypes = nt_in_aa.loc[nt_in_aa[0] > 1, "ct_cluster_aa_identity"]
In [44]:
adata.obs["is_convergent"] = [
    "convergent" if x else "-"
    for x in adata.obs["ct_cluster_aa_identity"].isin(convergent_clonotypes)
]
In [45]:
sc.pl.umap(
    adata,
    color="is_convergent",
    groups=["convergent"],
    size=[10 if x == "convergent" else 3 for x in adata.obs["is_convergent"]],
)
... storing 'is_convergent' as categorical

The phonomenon appears to primarily occur in CD8+ effector and tissue resident T cells:

In [46]:
ir.pl.group_abundance(
    adata, groupby="cluster", target_col="is_convergent", normalize=True
)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x2adeca069510>

3.3 amino-acid identity vs. alignment distance

When computing the alignment distance, we allowed a distance of 15, based on the BLOSUM62 matrix. This is eqivalent of three As mutating into R.

The alignment distance allows us to identify similar CDR3 sequences, that likely recognize the same epitope.

For the following visualization, we exclude cells with Orphan chains. Having only one chain that needs to match there tend to be a lot of similar cells within a distance of 15 resulting in large, uninformative networks for those cells.

In [47]:
%%time
adata_sub = adata[~adata.obs["chain_pairing"].str.startswith("Orphan"), :]
CPU times: user 1min, sys: 49.2 ms, total: 1min
Wall time: 1min
In [48]:
%%time
ir.tl.clonotype_network(
    adata_sub,
    min_size=50,
    layout="components",
    sequence="aa",
    metric="alignment"
)
CPU times: user 9.24 s, sys: 417 ms, total: 9.66 s
Wall time: 9.66 s

We'll have a closer look at the following clonotypes, as they seem interesting:

In [49]:
selected_clonotypes = ["1312", "5499", "1688", "644"]
In [50]:
ir.pl.clonotype_network(
    adata_sub,
    color=["ct_cluster_aa_alignment"],
    groups=selected_clonotypes,
    edges=False,
    size=50,
    ncols=2,
    legend_loc="on data",
    legend_fontoutline=3,
)
Out[50]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2adeca159e90>],
      dtype=object)
In [51]:
np.random.seed(123)
np.random.shuffle(adata_sub.uns["clonotype_colors"])
In [52]:
ir.pl.clonotype_network(
    adata_sub,
    color=["clonotype", "patient"],
    edges=False,
    size=50,
    ncols=2,
    legend_loc=["none", "right margin"],
    legend_fontoutline=2,
)
Out[52]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2adecaffe850>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2adf01acacd0>],
      dtype=object)
In [53]:
%%capture
fig, ax = plt.subplots(figsize=(10, 10), dpi=300)
ir.pl.clonotype_network(
    adata_sub,
    color="clonotype",
    edges=False,
    size=50,
    legend_loc="none",
    ax=ax,
    frameon=False,
)
ax.set_title("clonotypes")
fig.savefig("figures/clonotype_network.svg")

The patient distribution of the four clonotypes:

In [54]:
ax = ir.pl.group_abundance(
    adata_sub[adata_sub.obs["ct_cluster_aa_alignment"].isin(selected_clonotypes), :],
    groupby="ct_cluster_aa_alignment",
    sort=selected_clonotypes,
    target_col="patient",
)
ax.xaxis.set_tick_params(rotation=0)
for tick in ax.xaxis.get_majorticklabels():
    tick.set_horizontalalignment("center")
ax.tick_params(labelsize=10)
fig = ax.get_figure()
ax.set_title("")
ax.set_xlabel("clonotype")
fig.savefig("figures/clonotype_network_patients.svg")

Sequence logos

In [55]:
for ct in selected_clonotypes:
    for chain, chain_label in zip(["TRA_1_cdr3", "TRB_1_cdr3"], ["alpha", "beta"]):
        with open(f"figures/logo_ct_{ct}_{chain_label}.eps", "wb") as f:
            f.write(
                weblogo(
                    adata.obs.loc[adata.obs["ct_cluster_aa_alignment"] == ct, chain].values,
                    title=f"clonotype {ct}: {chain_label}-chain",
                    format="eps",
                )
            )
In [56]:
for ct in selected_clonotypes:
    for chain, chain_label in zip(["TRA_1_cdr3", "TRB_1_cdr3"], ["alpha", "beta"]):
        weblogo(
            adata.obs.loc[adata.obs["ct_cluster_aa_alignment"] == ct, chain].values,
            title=f"clonotype {ct}: {chain_label}-chain",
            format="png",
        )

Clonotype 644 is shared among multiple patients and could be specific for a common, viral antigen.

In case of the clonotype 1312 we find evidence in vdjdb that it could be specific for an Human Cytomegalievirus (CMV) antigen:

  • In vdjdb, we find a CMV-specific alpha-chain searching with the CAV[STR][LG]QAGTALIF pattern.
  • Searching for the beta-chain pattern does not yield a direct result. However, allowing up to two substitutions, we find a CMV-specific chain as well.

convergent clonotypes

We define a clonotype as being convergent, if there are different versions of nucleotide sequences for similar clonotypes, within the same patient.

We define convergence

  • on the level of amino-acid identity vs nucleotide identity
  • on the level of amino-acid similarity vs nucleotide identity
In [57]:
convergent_aa = (
    adata.obs.groupby(["ct_cluster_aa_identity", "ct_cluster_aa_alignment", "patient"], observed=True)
    .size()
    .reset_index()
    .groupby(["ct_cluster_aa_alignment", "patient"], observed=True)
    .size()
    .reset_index()
)
convergent_nt = (
    adata.obs.groupby(["clonotype", "ct_cluster_aa_alignment", "patient"], observed=True)
    .size()
    .reset_index()
    .groupby(["ct_cluster_aa_alignment", "patient"], observed=True)
    .size()
    .reset_index()
)
convergent_clonotypes_aa = convergent_aa.loc[
    convergent_aa[0] > 1, "ct_cluster_aa_alignment"
].values
convergent_clonotypes_nt = convergent_nt.loc[
    convergent_nt[0] > 1, "ct_cluster_aa_alignment"
].values
In [58]:
# convergent clonotypes, AA identity vs alignment
adata.obs["convergent_aa"] = [
    str(x)
    for x in adata.obs["ct_cluster_aa_alignment"].isin(convergent_clonotypes_aa)
    & ~adata.obs["chain_pairing"].str.startswith("Orphan")
]
# convergent clonotypes, NN identity vs AA alignment
adata.obs["convergent_nt"] = [
    str(x)
    for x in adata.obs["ct_cluster_aa_alignment"].isin(convergent_clonotypes_nt)
    & ~adata.obs["chain_pairing"].str.startswith("Orphan")
]
In [59]:
sc.pl.umap(
    adata,
    color=["convergent_aa"],
    groups="True",
    size=[10 if x == "True" else 3 for x in adata.obs["convergent_aa"]],
)
... storing 'convergent_aa' as categorical
... storing 'convergent_nt' as categorical
In [60]:
sc.pl.umap(
    adata,
    color=["convergent_nt"],
    groups="True",
    size=[10 if x == "True" else 3 for x in adata.obs["convergent_nt"]],
)
In [61]:
%%capture
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
sc.pl.umap(
    adata,
    color=["convergent_nt"],
    groups="True",
    size=[10 if x == "True" else 3 for x in adata.obs["convergent_nt"]],
    ax=ax,
    frameon=False,
    legend_loc="none",
    show=False,
)
ax.set_title("convergent clonotypes")
fig.savefig("figures/convergent_clonotypes_umap.svg")
In [62]:
ir.pl.group_abundance(
    adata,
    groupby="cluster_orig",
    target_col="convergent_nt",
    normalize=True,
    sort="alphabetical",
    fig_kws={"dpi": 120},
)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ae1c9a6e490>

Inspecting only converged clonotypes

In [63]:
adata_converged = adata[
    (adata.obs["convergent_nt"] == "True")
    & ~adata.obs["chain_pairing"].str.startswith("Orphan"),
    :,
]
In [64]:
%%time
ir.tl.clonotype_network(
    adata_converged,
    min_size=1,
    layout="components",
    metric="alignment",
    sequence="aa"
)
CPU times: user 2.82 s, sys: 21.9 ms, total: 2.84 s
Wall time: 2.84 s
In [65]:
ir.pl.clonotype_network(
    adata_converged,
    color=["ct_cluster_aa_alignment", "clonotype"],
    edges=False,
    size=50,
    ncols=2,
    legend_loc=["none", "none"],
    legend_fontoutline=2,
)
Out[65]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2adec83fe410>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2adecbb15910>],
      dtype=object)

4. Clonal expansion

In this section, we assess the clonal expansion

  • across patients
  • across clusters
  • across tissue sources

With the pl.clonal_expansion function, we can easily visualize the clonal expansion by different variables. Per default, the plot shows the fraction of cells belonging to an expanded clonotype.

clonal expansion across patients

In [66]:
fig, ax = plt.subplots(dpi=100)
ir.pl.clonal_expansion(
    adata, groupby="patient", summarize_by="cell", show_nonexpanded=True, ax=ax
)
ax.tick_params(labelsize=10)
fig.savefig("figures/expansion_patients_cell.svg")

Alternatively, we can show the fraction of expanded clonotypes.

In [67]:
fig, ax = plt.subplots(dpi=100)
ir.pl.clonal_expansion(
    adata,
    groupby="patient",
    summarize_by="clonotype",
    show_nonexpanded=False,
    color=["#FF7F0E", "#2CA02C"],
    ax=ax,
)
ax.tick_params(labelsize=10)
fig.savefig("figures/expansion_patients_clonotype.svg")

In the paper, the authors state that depending on the patient, between 9 and 18% of clonotypes are expanded. Our results are highly consistent with these results:

In [68]:
fraction_expanded = (
    ir.tl.summarize_clonal_expansion(
        adata, groupby="patient", summarize_by="clonotype", normalize=True
    )
    .drop("1", axis="columns")
    .sum(axis=1)
)
In [69]:
print(
    f"Between {min(fraction_expanded):.1%} and {max(fraction_expanded):.1%} of clonotypes are expanded."
)
Between 9.0% and 18.4% of clonotypes are expanded.

clonal expansion across cell-type clusters

We observe that clonal expansion is highest among the CD8+ T-cell clusters, in particular effector and tissue-resident T cells.

In [70]:
ir.pl.clonal_expansion(
    adata,
    groupby="cluster_orig",
    summarize_by="cell",
    show_nonexpanded=True,
    fig_kws={"dpi": 120},
)
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ae035f22550>

clonal expansion across tissue sources

The fraction of expanded cell is roughly equivalent among tumor and ajacent normal tissue, but lower in blood.

In [71]:
ir.pl.clonal_expansion(
    adata, groupby="source", summarize_by="cell", show_nonexpanded=True
)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ae13e4b7c50>

5. Dual- and blood expanded clonotypes

Finally, we will divide the clonotypes into different categories, based on their expansion in blood, tissue and tumor samples.

In particular, we will

  • identify dual-expanded clonotypes, which are expanded in both adjacent tissue and tumor samples
  • identify blood-expanded clonotypes

and compare their abundance across cell-types and patients.

For an illustration, please refer to Fig. 1b of the Wu et al. (2020) paper.

In [72]:
clonotype_size_by_source = (
    adata.obs.groupby(["patient", "source", "clonotype"], observed=True)
    .size()
    .reset_index(name="clonotype_count_by_source")
)
adata.obs = (
    adata.obs.reset_index()
    .merge(clonotype_size_by_source, how="left")
    .set_index("index")
)
In [73]:
blood_expanded = []
for is_expanded, source in zip(
    (adata.obs["source"] == "Blood") & (adata.obs["clonotype_count_by_source"] > 1),
    adata.obs["source"],
):
    if source == "Blood":
        if is_expanded:
            blood_expanded.append("expanded")
        else:
            blood_expanded.append("not expanded")
    else:
        blood_expanded.append("independent")
In [74]:
adata.obs["blood_expanded"] = blood_expanded
In [75]:
clonotype_membership = {ct: list() for ct in adata.obs["clonotype"]}
for clonotype, source in zip(adata.obs["clonotype"], adata.obs["source"]):
    clonotype_membership[clonotype].append(source)
clonotype_membership = {
    ct: set(sources) for ct, sources in clonotype_membership.items()
}
In [76]:
expansion_category = []
for clonotype, clonotype_size, source in zip(
    adata.obs["clonotype"], adata.obs["clonotype_size"], adata.obs["source"]
):
    if clonotype_size == 1:
        if source == "Blood":
            expansion_category.append("Blood singleton")
        elif source == "NAT":
            expansion_category.append("NAT singleton")
        elif source == "Tumor":
            expansion_category.append("Tumor singleton")
    elif clonotype_size > 1:
        membership = clonotype_membership[clonotype]
        if "Tumor" in membership and "NAT" in membership:
            expansion_category.append("Tumor+NAT expanded")
        elif "Tumor" in membership:
            expansion_category.append("Tumor multiplet")
        elif "NAT" in membership:
            expansion_category.append("NAT multiplet")
        elif "Blood" in membership:
            # these are *only* expanded in blood
            expansion_category.append("Blood multiplet")

assert len(expansion_category) == adata.n_obs
In [77]:
adata.obs["cell_type"] = adata.obs["cluster_orig"].str[0]
adata.obs["tumor_type"] = adata.obs["patient"].str[:-1]
In [78]:
adata.obs["expansion_category"] = expansion_category
In [79]:
# make categorical and store colors
adata._sanitize()
adata.uns["expansion_category_colors"] = [
    colors[x] for x in adata.obs["expansion_category"].cat.categories
]
... storing 'blood_expanded' as categorical
... storing 'cell_type' as categorical
... storing 'tumor_type' as categorical
... storing 'expansion_category' as categorical

Mostly CD8+ effector and tissue-resident T cells are blood-expanded and dual-expanded:

In [80]:
sc.pl.umap(
    adata,
    color="blood_expanded",
    groups=["expanded", "not expanded"],
    size=[
        15 if x in ["expanded", "not expanded"] else 3
        for x in adata.obs["blood_expanded"]
    ],
)
In [81]:
ir.pl.embedding(
    adata,
    basis="umap",
    color=["expansion_category", "cluster_orig"],
    legend_loc=["right margin", "on data"],
    show=True,
    legend_fontoutline=2,
    frameon=False,
    wspace=0.5,
    panel_size=(6, 4),
)
In [82]:
%%capture
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8.5, 4), dpi=300)
sc.pl.umap(
    adata,
    color="expansion_category",
    legend_loc="none",
    show=False,
    ax=ax1,
    frameon=False,
)
sc.pl.umap(
    adata,
    color="cluster_orig",
    legend_loc="on data",
    show=False,
    ax=ax2,
    legend_fontoutline=2,
    frameon=False,
)
ax1.set_title("clonal expansion pattern")
ax2.set_title("cell-type cluster")
plt.subplots_adjust(wspace=0.1)
fig.savefig("figures/clonal_expansion_umap.svg")
In [83]:
%%capture
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
sc.pl.umap(
    adata,
    color="cluster_orig",
    legend_loc="none",
    show=False,
    ax=ax,
    legend_fontoutline=2,
    frameon=False,
)
ax.set_title("")
fig.savefig("figures/cluster_orig.png")

Repertoire overlap

In [84]:
ir.tl.repertoire_overlap(adata, groupby="sample", target_col="ct_cluster_aa_alignment")
ax = ir.pl.repertoire_overlap(adata, groupby="sample", heatmap_cats=["patient", "tumor_type"])
ax.fig.savefig("figures/repertoire_overlap.svg")

Categories by cell type (fractions and count):

In [85]:
fig, ax = plt.subplots(dpi=120)
ir.pl.group_abundance(
    adata,
    groupby="cluster_orig",
    target_col="expansion_category",
    normalize=True,
    ax=ax,
    sort="alphabetical",
)
ax.set_title("")
ax.set_xlabel("cluster")
fig.savefig("figures/expansion_category_cluster_norm.svg")
In [86]:
fig, ax = plt.subplots(dpi=120)
ir.pl.group_abundance(
    adata,
    groupby="cluster_orig",
    target_col="expansion_category",
    normalize=False,
    ax=ax,
    sort="alphabetical",
)
ax.set_title("")
ax.set_xlabel("cluster")
fig.savefig("figures/expansion_category_cluster.svg")

Categories by tumor type:

In [87]:
ir.pl.group_abundance(
    adata, groupby="tumor_type", target_col="expansion_category", normalize=True
)
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ae0312efbd0>

The fraction of blood-expanded, blood non-expanded and blood-independent cells for the four patients with blood samples

In [88]:
ir.pl.group_abundance(
    adata[adata.obs["patient"].isin(["Lung6", "Renal1", "Renal2", "Renal3"]), :],
    groupby="patient",
    target_col="blood_expanded",
    normalize=True,
)
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x2adec9adaad0>

Tissue infiltration patterns.

  • The bar plots show the distribution of cells by tissue expansion patterns from blood-indpendned, non-expanded and expanded clones.
In [89]:
for patient in ["Lung6", "Renal1", "Renal2", "Renal3"]:
    ax = ir.pl.group_abundance(
        adata[adata.obs["patient"] == patient, :],
        groupby="blood_expanded",
        target_col="expansion_category",
        normalize=True,
    )
    ax.set_title(patient)

Tissue expansion patterns of T cell by cluster.

In [90]:
for subset in [
    ["NAT singleton", "NAT multiplet"],
    ["Tumor singleton", "Tumor multiplet"],
]:
    ax = ir.pl.group_abundance(
        adata[adata.obs["expansion_category"].isin(subset), :],
        groupby="cluster_orig",
        target_col="expansion_category",
        normalize=False,
        fig_kws={"dpi": 120},
        sort="alphabetical",
    )
    ax.set_title(subset[0].split()[0])
In [91]:
for source in ["NAT", "Tumor"]:
    ax = ir.pl.group_abundance(
        adata[
            (adata.obs["source"] == source)
            & (adata.obs["expansion_category"] == "Tumor+NAT expanded"),
            :,
        ],
        groupby="cluster_orig",
        target_col="expansion_category",
        normalize=False,
        fig_kws={"dpi": 120},
        sort="alphabetical",
    )
    ax.set_title(source)

6. Gene usage

In [92]:
ax = ir.pl.vdj_usage(adata, full_combination=False, top_n=30)
ax.set_ylabel("cells")
ax.set_xlabel("VDJ segment")
ax.set_title("VDJ usage")
fig = ax.get_figure()
fig.savefig("figures/vdj_usage.svg")
In [93]:
ax = ir.pl.spectratype(
    adata[adata.obs["TRB_1_j_gene"] != "None", :],
    cdr3_col="TRB_1_cdr3",
    color="TRB_1_j_gene",
    fig_kws={"dpi": 120},
)
ax.xaxis.set_tick_params(rotation=90)
for tick in ax.xaxis.get_majorticklabels():
    tick.set_horizontalalignment("center")
ax.set_title("Spectratype of primary TCR-β chain")
ax.set_xlabel("CDR3 sequence length")
fig = ax.get_figure()
fig.savefig("figures/spectratype.svg")
Trying to set attribute `.uns` of view, copying.
In [ ]: