Processing and integrating 5k PBMCs CITE-seq data#

This notebooks provides an example for CITE-seq data processing in Python.

CITE-seq is a method for cellular indexing of transcriptomes and epitopes by sequencing. CITE-seq data is single-cell data comprising transcriptome-wide measurements for each cell (gene expression) as well as surface protein level information, typically for a few dozens of proteins. The method is described in Stoeckius et al., 2017 and also on the cite-seq.com website.

The data used in this notebook is on peripheral blood mononuclear cells (PBMCs) and has been provided by 10x Genomics.

[1]:
# Change directory to the root folder of the repository
import os
os.chdir("../")

Download data#

The data is available here. Both filtered and raw matrices are required for this notebook.

[2]:
# This is the directory where those files are downloaded to
data_dir = "data/pbmc5k_protein"
[3]:
for file in os.listdir(data_dir):
    print(file)
filtered_feature_bc_matrix
raw_feature_bc_matrix
5k_pbmc_protein_v3_raw_feature_bc_matrix.tar.gz
5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz

Load libraries and data#

Import libraries:

[4]:
import numpy as np
import pandas as pd
import scanpy as sc
[5]:
import muon as mu
from muon import prot as pt

First, we will load filtered matrices that contain counts for cell but not empty droplets. These cells will be used for the downstream analysis.

[6]:
mdata = mu.read_10x_mtx(os.path.join(data_dir, "filtered_feature_bc_matrix"))
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

Next, we will also load count matrices for all the droplets, whether containing cells or empty.

[7]:
mdata_raw = mu.read_10x_mtx(os.path.join(data_dir, "raw_feature_bc_matrix"))

Protein#

[8]:
prot = mdata.mod['prot']
prot
[8]:
AnnData object with n_obs × n_vars = 5247 × 32
    var: 'gene_ids', 'feature_types'

DSB normalisation#

This normalisation method developed for CITE-seq data uses background droplets defined by low RNA content in order to estimate background protein signal and remove it from the data. More details and its original implementation are available in the ``dsb` GitHub repository <https://github.com/niaid/dsb>`__.

muon brings dsb normalisation method to Python CITE-seq workflows.

pt.pp.dsb(mdata, raw=mdata_raw, empty_droplets=droplets)

Please note due to the nature of the dsb method it behaves in a slightly different way than many other preprocessing and normalisation methods. Its implementation in muon offers 2 ways to use the method.

  1. Operate on 2 AnnData / MuData objects — filtered and raw. Cell calling from the filtered data is used to define cells in the raw data then.

  2. As in the original implementation, operate on raw matrices and then return a copy of the data where only specified cells are kept.

Here, we use the first way of using the method. We would still need to calculate log10umi value for each droplet to define which droplets are empty.

[9]:
mdata_raw['rna'].obs["log10umi"] = np.array(np.log10(mdata_raw['rna'].X.sum(axis=1) + 1)).reshape(-1)
[10]:
mu.pl.histogram(mdata_raw['rna'], ['log10umi'], bins=50)
/usr/local/lib/python3.8/site-packages/seaborn/distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
../_images/cite-seq_1-CITE-seq-PBMC-5k_22_1.png

For clarity, let’s zoom into the part of this data:

[11]:
mu.pl.histogram(mdata_raw['rna'][mdata_raw['rna'].obs.log10umi >= 1], ['log10umi'], bins=50)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):
../_images/cite-seq_1-CITE-seq-PBMC-5k_24_1.png

In this dataset, isotype antibody controls are available that also can be taken into account by the normalisation method:

[12]:
isotypes = mdata_raw['prot'].var_names[29:32].values
isotypes
[12]:
array(['IgG1_control_TotalSeqB', 'IgG2a_control_TotalSeqB',
       'IgG2b_control_TotalSeqB'], dtype=object)

Preserve original counts in a layer before the normalisation:

[13]:
prot.layers['counts'] = prot.X

Normalise counts in mdata:

[14]:
pt.pp.dsb(mdata, mdata_raw, empty_counts_range=(1.5, 2.8), isotype_controls=isotypes, random_state=1)
/Users/bredikhi/git/compbio/muon/muon/_prot/preproc.py:129: UserWarning: Dropping 49 empty droplets as they are already defined as cells
  warn(
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

Plot values to visualise the effect of normalisation:

[15]:
sc.pl.scatter(mdata['prot'], x="CD3_TotalSeqB", y="CD19_TotalSeqB", layers='counts')
sc.pl.scatter(mdata['prot'], x="CD3_TotalSeqB", y="CD19_TotalSeqB")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'feature_types' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_32_1.png
../_images/cite-seq_1-CITE-seq-PBMC-5k_32_2.png

Downstream analysis#

We can run conventional methods like PCA on the normalised protein counts:

[18]:
sc.tl.pca(prot)
[19]:
sc.pl.pca(prot, color='CD3_TotalSeqB')
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_36_1.png
[20]:
sc.pp.neighbors(prot)
sc.tl.umap(prot, random_state=1)
[21]:
sc.pl.umap(prot, color=['CD3_TotalSeqB', 'CD14_TotalSeqB'])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_38_1.png

RNA#

[22]:
rna = mdata.mod['rna']
rna
[22]:
AnnData object with n_obs × n_vars = 5247 × 33538
    var: 'gene_ids', 'feature_types'

QC#

Perform some quality control. For now, we will filter out cells that do not pass QC.

[23]:
rna.var['mt'] = rna.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(rna, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
[24]:
sc.pl.violin(rna, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'feature_types' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_44_1.png

Filter genes which expression is not detected:

[25]:
mu.pp.filter_var(rna, 'n_cells_by_counts', lambda x: x >= 3)
# Same as the following but doesn't copy the object:
#   sc.pp.filter_genes(rna, min_cells=3)

Filter cells:

[26]:
mu.pp.filter_obs(rna, 'n_genes_by_counts', lambda x: (x >= 200) & (x < 5000))
# Same as the following but doesn't copy the object
#   sc.pp.filter_cells(rna, min_genes=200)
#   rna = rna[rna.obs.n_genes_by_counts < 5000, :]

mu.pp.filter_obs(rna, 'total_counts', lambda x: (x > 1500) & (x < 15000))
mu.pp.filter_obs(rna, 'pct_counts_mt', lambda x: x < 20)

Let’s see how the data looks after filtering:

[27]:
sc.pl.violin(rna, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_50_1.png

Normalisation#

We’ll normalise the data so that we get log-normalised counts to work with.

[28]:
sc.pp.normalize_total(rna, target_sum=1e4)
[29]:
sc.pp.log1p(rna)

Feature selection#

We will label highly variable genes that we’ll use for downstream analysis.

[30]:
sc.pp.highly_variable_genes(rna, min_mean=0.02, max_mean=4, min_disp=0.5)
[31]:
sc.pl.highly_variable_genes(rna)
../_images/cite-seq_1-CITE-seq-PBMC-5k_58_0.png
[32]:
np.sum(rna.var.highly_variable)
[32]:
1734

Scaling#

We’ll save log-normalised counts in a .raw slot:

[33]:
rna.raw = rna

… and scale the log-normalised counts to zero mean and unit variance:

[34]:
sc.pp.scale(rna, max_value=10)

Analysis#

Having filtered low-quality cells, normalised the counts matrix, and performed feature selection, we can already use this data for multimodal integration.

However it is usually a good idea to study individual modalities as well. Below we run PCA on the scaled matrix, compute cell neighbourhood graph, and perform clustering to define cell types.

PCA and neighbourhood graph#

[35]:
sc.tl.pca(rna, svd_solver='arpack')
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

To visualise the result, we will use some markers for (large-scale) cell populations we expect to see such as T cells and NK cells (CD2), B cells (CD79A), and KLF4 (monocytes).

[36]:
sc.pl.pca(rna, color=['CD2', 'CD79A', 'KLF4', 'IRF8'])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_70_1.png

The first principal component (PC1) is separating myeloid (monocytes) and lymphoid (T, B, NK) cells while B cells-related features seem to drive the second one. Also we see plasmocytoid dendritic cells (marked by IRF8) being close to B cells along the PC2.

[37]:
sc.pl.pca_variance_ratio(rna, log=True)
../_images/cite-seq_1-CITE-seq-PBMC-5k_72_0.png

Now we can compute a neighbourhood graph for cells:

[38]:
sc.pp.neighbors(rna, n_neighbors=10, n_pcs=20)

Non-linear dimensionality reduction and clustering#

With the neighbourhood graph computed, we can now perform clustering. We will use leiden clustering as an example.

[39]:
sc.tl.leiden(rna, resolution=.75)

To visualise the results, we’ll first generate a 2D latent space with cells that we can colour according to their cluster assignment.

[40]:
sc.tl.umap(rna, spread=1., min_dist=.5, random_state=11)
[41]:
sc.pl.umap(rna, color="leiden", legend_loc="on data")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_80_1.png

Cell type annotation#

[42]:
sc.tl.rank_genes_groups(rna, 'leiden', method='t-test_overestim_var')
[43]:
result = rna.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.set_option('display.max_columns', 50)
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(10)
[43]:
0_n 0_p 1_n 1_p 2_n 2_p 3_n 3_p 4_n 4_p 5_n 5_p 6_n 6_p 7_n 7_p 8_n 8_p 9_n 9_p 10_n 10_p 11_n 11_p 12_n 12_p 13_n 13_p
0 RPL32 1.380112e-247 S100A8 0.000000e+00 IL7R 4.827268e-175 CST3 8.457789e-138 CCL5 1.614990e-191 GNLY 1.868870e-151 CD79A 6.409675e-113 IGHM 1.284390e-142 GNLY 2.305955e-92 TYMP 4.113971e-44 FCGR3A 7.931312e-73 CD8B 1.344611e-28 IL32 5.583487e-09 ITM2C 1.106098e-36
1 RPS3A 1.627373e-256 S100A9 4.101797e-317 LTB 1.387967e-107 CPVL 4.945355e-162 NKG7 2.032769e-140 NKG7 5.122287e-96 MS4A1 1.513867e-103 CD79A 6.570059e-124 FGFBP2 1.190382e-102 LYZ 5.675880e-35 LST1 3.169949e-44 CD8A 1.179723e-20 TIGIT 7.622707e-07 CCDC50 3.365550e-34
2 RPL30 9.687924e-237 VCAN 3.394343e-312 IL32 2.100285e-104 LYZ 2.891312e-127 GZMA 1.524691e-106 KLRD1 4.177880e-111 HLA-DQA1 3.751459e-106 CD79B 2.500853e-106 GZMH 7.659337e-98 FCN1 4.341500e-35 TCF7L2 1.098160e-45 RPS3A 3.524939e-13 PBXIP1 6.263715e-07 IL3RA 2.391788e-23
3 RPL13 2.673413e-227 LYZ 1.467838e-237 TRAC 7.044805e-82 FCN1 3.664234e-123 CST7 2.410078e-106 KLRF1 2.123803e-102 BANK1 1.331708e-84 IGHD 9.719686e-72 PRF1 1.626375e-91 XAF1 7.801962e-43 SMIM25 3.171766e-49 LINC02446 7.812095e-13 IKZF2 1.872429e-06 PLD4 1.246058e-27
4 RPL11 5.299188e-232 MNDA 1.131567e-278 CD2 2.779068e-67 FGL2 1.029085e-127 IL32 2.124480e-90 PRF1 1.035347e-103 CD79B 7.403563e-79 MS4A1 5.945622e-89 KLRC2 3.624602e-75 FGL2 1.961404e-38 LYN 1.090891e-46 RPS2 1.307254e-12 TRBC2 1.364242e-06 IRF8 2.272772e-29
5 RPS12 2.484028e-223 S100A12 3.040667e-307 AQP3 1.363019e-56 KLF4 5.745357e-131 GZMM 3.587211e-83 CTSW 2.070524e-100 HLA-DQB1 6.762216e-69 TCL1A 5.585904e-65 NKG7 5.237177e-70 CST3 3.191349e-34 MS4A7 6.730315e-51 RPS6 1.514270e-12 DUSP4 4.791963e-06 TCF4 2.379937e-28
6 RPS6 1.943468e-227 FCN1 3.913932e-244 CD3E 4.321252e-53 AIF1 3.486286e-117 KLRG1 1.147997e-74 GZMA 8.167419e-91 CD37 1.360556e-57 HLA-DQA1 1.723135e-87 CST7 5.730287e-82 CTSS 1.003140e-33 CDKN1C 2.347009e-39 RPL13 1.869602e-12 HACD1 1.189909e-05 UGCG 2.010952e-26
7 RPL9 3.701542e-226 CD14 5.684490e-266 KLRB1 8.857507e-50 HLA-DRA 1.229331e-117 CD3D 3.637893e-73 GZMB 2.485553e-88 RALGPS2 1.241454e-54 CD37 6.685736e-66 GZMB 1.041564e-90 MNDA 1.065729e-35 FCER1G 4.274695e-38 RPS12 2.277374e-12 IL2RA 1.689073e-05 IRF7 3.697920e-27
8 TPT1 4.096112e-227 CTSS 1.729932e-207 RPS18 1.695589e-47 PSAP 1.832311e-117 B2M 7.472893e-70 CLIC3 3.298338e-75 HLA-DPB1 4.220414e-51 HLA-DPB1 4.708672e-47 CCL5 2.700983e-64 PSAP 1.983579e-33 LRRC25 3.008410e-45 RPL32 4.908655e-12 FOXP3 2.616250e-05 JCHAIN 2.922103e-22
9 RPL34 4.347625e-218 CSF3R 9.101933e-243 SPOCK2 6.018993e-48 HLA-DRB1 1.410033e-114 GZMK 1.345830e-53 HOPX 2.743326e-83 HLA-DRA 2.174338e-45 CD74 2.262129e-43 FCGR3A 3.320130e-76 CPVL 3.518379e-38 CFD 5.047007e-41 RPL28 4.009718e-12 TRAC 1.275863e-05 MZB1 4.527073e-20
[44]:
new_cluster_names = {
    "0": "CD4+ naïve T", "12": "Treg", "2": "CD4+ memory T",
    "11": "CD8+ naïve T", "4": "CD8+ memory T",
    "5": "NK", "8": "memory-like NK",
    "6": "mature B", "7": "pre-B",
    "1": "CD14 mono", "3": "intermediate mono", "9": "intermediate mono", "10": "CD16 mono",
    "13": "pDC",
}

rna.obs['celltype'] = rna.obs.leiden.astype("str").values
rna.obs.celltype = rna.obs.celltype.replace(new_cluster_names)
rna.obs.celltype = rna.obs.celltype.astype("category")

We will also re-order categories for the next plots:

[45]:
rna.obs.celltype.cat.reorder_categories([
    'CD4+ naïve T', 'CD4+ memory T', 'Treg',
    'CD8+ naïve T', 'CD8+ memory T', 'NK', 'memory-like NK',
    'pre-B', 'mature B',
    'CD14 mono', 'intermediate mono', 'CD16 mono',
    'pDC'], inplace=True)

mdata.update()

… and take colours from a palette:

[46]:
import matplotlib
import matplotlib.pyplot as plt

cmap = plt.get_cmap('rainbow')
colors = cmap(np.linspace(0, 1, len(rna.obs.celltype.cat.categories)))

rna.uns["celltype_colors"] = list(map(matplotlib.colors.to_hex, colors))
[47]:
sc.pl.umap(rna, color="celltype", legend_loc="on data", frameon=False)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../_images/cite-seq_1-CITE-seq-PBMC-5k_89_1.png
[48]:
sc.pl.umap(rna, color="celltype", legend_loc="on data", frameon=False,
          title="RNA celltype annotation")
../_images/cite-seq_1-CITE-seq-PBMC-5k_90_0.png

Finally, we’ll visualise some marker genes across cell types.

[49]:
marker_genes = ['IL7R', 'TRAC',
                'ITGB1', # CD29
                'FOXP3', 'IL2RA',
                'CD8A', 'CD8B', 'CCL5',
                'GNLY', 'NKG7', 'KLRC2',
                'CD79A', 'MS4A1', 'TCL1A', 'IGHM', 'IGHD',
                'IL4R', 'TCL1A',
                'KLF4', 'LYZ', 'S100A8', 'ITGAM', # CD11b
                'CD14', 'FCGR3A', 'MS4A7',
                'CST3', 'IRF8', 'TCF4']
[50]:
sc.pl.dotplot(rna, marker_genes, groupby='celltype');
../_images/cite-seq_1-CITE-seq-PBMC-5k_93_0.png

Plotting#

With mu.pl.embedding interface we can display an embedding from an individual modality (e.g. 'rna') and colour cells by a feature (variable) from another modality.

While variables names should be unique across all modalities, all individual modalities as well as the mdata object itself can have e.g. UMAP basis computed. To point to a basis from a specific modality, use mod:basis syntax, e.g. with "rna:X_umap" in the example below mdata.mod['rna'].obsm["X_umap"] basis is going to be used.

[51]:
mu.pl.embedding(mdata, basis="rna:X_umap", color=["CD3_TotalSeqB"])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):
../_images/cite-seq_1-CITE-seq-PBMC-5k_96_1.png

Multi-omics integration#

Having filtered cells at the RNA QC step, we will now subset them in the protein modality as well so that we can work with the union of cells:

[52]:
mu.pp.intersect_obs(mdata)

Multiplex clustering#

We can run clustering based on neighbours information from each modality taking advantage of multiplex versions of algorithms such as leiden or louvain.

[53]:
mu.tl.louvain(mdata, resolution=[2, .1], random_state=1)
[54]:
mu.tl.leiden(mdata, resolution=[2, .1], random_state=1)
[55]:
mdata.uns['louvain']
[55]:
{'params': {'resolution': [2, 0.1],
  'random_state': 1,
  'partition_improvement': 91323.3393340929}}
[56]:
mdata.uns['leiden']
[56]:
{'params': {'resolution': [2, 0.1],
  'random_state': 1,
  'partition_improvement': 91166.75577710925}}
[57]:
mu.pl.embedding(mdata, basis="rna:X_umap", color=["CD3_TotalSeqB", "louvain", "leiden"])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):
../_images/cite-seq_1-CITE-seq-PBMC-5k_106_1.png

Multi-omics factor analysis#

To generate an interpretable latent space for both 'rna' and 'prot' modalities we will now run multi-omic factor analysis — a group factor analysis method that will allow us to learn an interpretable latent space jointly on both modalities. Intuitively, it can be viewed as a generalisation of PCA for multi-omics data. More information about this method can be found on the MOFA website.

[58]:
prot.var["highly_variable"] = True
mdata.update()
[21]:
mu.tl.mofa(mdata, outfile="models/pbmc5k_citeseq.hdf5",
           n_factors=30)

        #########################################################
        ###           __  __  ____  ______                    ###
        ###          |  \/  |/ __ \|  ____/\    _             ###
        ###          | \  / | |  | | |__ /  \ _| |_           ###
        ###          | |\/| | |  | |  __/ /\ \_   _|          ###
        ###          | |  | | |__| | | / ____ \|_|            ###
        ###          |_|  |_|\____/|_|/_/    \_\              ###
        ###                                                   ###
        #########################################################



Loaded view='prot' group='group1' with N=3891 samples and D=32 features...
Loaded view='rna' group='group1' with N=3891 samples and D=1734 features...


Model options:
- Automatic Relevance Determination prior on the factors: True
- Automatic Relevance Determination prior on the weights: True
- Spike-and-slab prior on the factors: False
- Spike-and-slab prior on the weights: True

Likelihoods:
- View 0 (prot): gaussian
- View 1 (rna): gaussian


######################################
## Training the model with seed 1 ##
######################################



Converged!



#######################
## Training finished ##
#######################


Warning: Output file models/pbmc5k_citeseq_v2.hdf5 already exists, it will be replaced
Saving model in models/pbmc5k_citeseq_v2.hdf5...
Saved MOFA embeddings in .obsm['X_mofa'] slot and their loadings in .varm['LFs'].
[62]:
# # NOTE: if you wish to load the trained model,
# #       use mofax library to quickly add
# #       factors and weights matrices
# #       to the mdata object
# #
# import mofax as mfx
# model = mfx.mofa_model('models/pbmc5k_citeseq_v2.hdf5')
# mdata.obsm["X_mofa"] = model.get_factors()

# # If only highly variable features were used
# w = model.get_weights()
# # Set the weights of features that were not used to zero
# mdata.varm["LFs"] = np.zeros(shape=(mdata.n_vars, w.shape[1]))
# mdata.varm["LFs"][mdata.var["highly_variable"]] = w

# model.close()
[63]:
mu.pl.mofa(mdata, color=['CD3_TotalSeqB', 'CD14_TotalSeqB'])
... storing 'rna:leiden' as categorical
... storing 'rna:celltype' as categorical
... storing 'louvain' as categorical
... storing 'leiden' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_112_1.png
[64]:
sc.pp.neighbors(mdata, use_rep="X_mofa")
sc.tl.umap(mdata, random_state=1)
[65]:
mdata.uns["rna:celltype_colors"] = list(map(matplotlib.colors.to_hex, colors))
[73]:
mu.pl.umap(mdata, color=['rna:celltype'], frameon=False,
           title="UMAP(MOFA) embedding with RNA celltype annotation")
... storing 'louvain' as categorical
... storing 'leiden' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_115_1.png
[74]:
mu.pl.umap(mdata, color=['CD3_TotalSeqB', 'CD4_TotalSeqB', 'CD8a_TotalSeqB'], frameon=False, cmap='viridis')
... storing 'louvain' as categorical
... storing 'leiden' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_116_1.png

Proteins highlighting the memory — naïve axis will also help to refine memory/naïve T cells annotation:

[75]:
mu.pl.umap(mdata, color=['CD45RA_TotalSeqB', 'CD45RO_TotalSeqB'], frameon=False, cmap='viridis')
... storing 'louvain' as categorical
... storing 'leiden' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_118_1.png
[76]:
mu.pl.mofa(mdata, color=['CD45RA_TotalSeqB', 'CD45RO_TotalSeqB'])
... storing 'louvain' as categorical
... storing 'leiden' as categorical
../_images/cite-seq_1-CITE-seq-PBMC-5k_119_1.png

Interpreting the model#

[80]:
from matplotlib import rcParams
rcParams['figure.dpi'] = 150
[78]:
import mofax as mofa
model = mofa.mofa_model("models/pbmc5k_citeseq.hdf5")
[81]:
mofa.plot_weights(model, factors=range(3), n_features=10, sharex=False)
[81]:
<AxesSubplot:title={'center':'rna'}, xlabel='Feature weight'>
../_images/cite-seq_1-CITE-seq-PBMC-5k_123_1.png

While Factor1 corresponds to lymphoid — myeloid axis, Factor2 seems to highlight T — B & NK axis and, at least partly, memory — naïve variability: CD45RO and CD127 are memory markers while CD45RA is a naïveté marker.

[84]:
model.metadata["rna:celltype"] = model.metadata["rna:celltype"].astype('category').cat.reorder_categories(mdata['rna'].obs.celltype.cat.categories)
[85]:
mofa.plot_factors_violin(model, color='rna:celltype', factors=range(3), dots=True, violins=False, palette='rainbow')
[85]:
<AxesSubplot:ylabel='Factor value'>
../_images/cite-seq_1-CITE-seq-PBMC-5k_126_1.png
[90]:
mofa.plot_factors(model, x=1, y=2, color=["CD45RA_TotalSeqB", "CD45RO_TotalSeqB", "CD19_TotalSeqB"], palette="RdPu")
[90]:
<AxesSubplot:title={'center':'CD19_TotalSeqB'}, xlabel='Factor2 value', ylabel='Factor3 value'>
../_images/cite-seq_1-CITE-seq-PBMC-5k_127_1.png
[91]:
mfx.plot_factors(model, color=["CD14_TotalSeqB", "CD16_TotalSeqB"], palette="RdPu")
[91]:
<AxesSubplot:title={'center':'CD16_TotalSeqB'}, xlabel='Factor1 value', ylabel='Factor2 value'>
../_images/cite-seq_1-CITE-seq-PBMC-5k_128_1.png
[95]:
model.close()

Saving multimodal data on disk#

We will now write mdata object to an .h5mu file.

[96]:
mdata.write("data/pbmc5k_citeseq.h5mu")
... storing 'louvain' as categorical
... storing 'leiden' as categorical
... storing 'feature_types' as categorical