Integrating gene expression and chromatin accessibility of 10k PBMCs#

Please see the first chapter where getting the data and processing RNA modality are described and the second chapter with ATAC modality processing

This is the third chapter of the multimodal single-cell gene expression and chromatin accessibility analysis. In this notebook, we will see how to learn a latent space jointly on two omics.

[1]:

# Change directory to the root folder of the repository
import os
os.chdir("../../")

Load libraries and data#

Import libraries:

[2]:

import numpy as np
import pandas as pd
import scanpy as sc

from matplotlib import pyplot as plt
import seaborn as sns

[3]:

import muon as mu

Load the MuData object from the .h5mu file:

[4]:

mdata = mu.read("data/pbmc10k.h5mu")
mdata

[4]:

MuData object with n_obs × n_vars = 11168 × 132435
  var:  'dispersions', 'dispersions_norm', 'feature_types', 'gene_ids', 'genome', 'highly_variable', 'interval', 'mean', 'mean_counts', 'means', 'n_cells_by_counts', 'pct_dropout_by_counts', 'std', 'total_counts'
  2 modalities
    atac:       9765 x 106086
      obs:      'n_genes_by_counts', 'total_counts', 'NS', 'nucleosome_signal', 'tss_score', 'n_counts', 'leiden', 'celltype'
      var:      'gene_ids', 'feature_types', 'genome', 'interval', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:      'atac', 'celltype_colors', 'files', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm:     'X_pca', 'X_umap'
      varm:     'PCs'
      layers:   'counts'
      obsp:     'connectivities', 'distances'
    rna:        10915 x 26349
      obs:      'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:      'gene_ids', 'feature_types', 'genome', 'interval', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:      'celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm:     'X_pca', 'X_umap'
      varm:     'PCs'
      obsp:     'connectivities', 'distances'

In each modality, only cells passing respective QC are retained. For the multimodal data integration, we will use only cells that are present for both modalities:

[5]:

mu.pp.intersect_obs(mdata)

[6]:

mdata.shape

[6]:

(9512, 132435)

Compare cell type annotation#

Before performing the integration, we can compare how the cell type annotations in individual modalities match each other.

We can calculate a general score that compares those clustering solutions, e.g. the adjusted Rand index.

[7]:

from sklearn.metrics import adjusted_rand_score as ari
ari(mdata.obs['rna:celltype'], mdata.obs['atac:celltype'])

[7]:

0.7457983340058714

We can also look into how those cell type annotations match per cell type:

[8]:

# Calculate total number of cells of respective ATAC celltypes
df_total = (
    mdata.obs.groupby("atac:celltype").
        size().
        reset_index(name="n_total").
        set_index("atac:celltype")
)

# Calculate number of cells for each pair of RNA-ATAC celltype annotation
df = (
    mdata.obs.groupby(["atac:celltype", "rna:celltype"]).
        size().
        reset_index(name="n").
        set_index("atac:celltype").
        join(df_total).
        reset_index()
)

# Calculate a fraction of cells of each RNA celltype (n)
# for each ATAC celltype (/ n_total)
df_frac = df.assign(frac = lambda x: x.n / x.n_total)

We can now make a wide table and visualise it with a heatmap.

[9]:

df_wide = df_frac.set_index("atac:celltype").pivot(columns="rna:celltype", values="frac")

[10]:

import seaborn as sns
sns.heatmap(df_wide, cmap="Greys")

[10]:

<AxesSubplot:xlabel='rna:celltype', ylabel='atac:celltype'>

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_21_1.png

It seems that cell types are highly reproducible across modalities with the only exception of the CD14/intermediate monocytes annotation that doesn’t seem particularly confident — most probably in the ATAC modality. One can check that the adjusted Rand index goes up to 0.92 when those two groups of cells are treated as one cell types.

Perform integration#

We will now run multi-omic factor analysis — a group factor analysis method that will allow us to learn an interpretable latent space jointly on both modalities. Intuitively, it can be viewed as a generalisation of PCA for multi-omics data. More information about this method can be found on the MOFA website.

The time required to train the model depends on the number of cells and features as well as on hardware specs. For the current dataset, it takes 4 minutes on the GeForce RTX 2080 Ti NVIDIA card. Only highly variable features are used by default.

[11]:

mu.tl.mofa(mdata, outfile="models/pbmc10k_rna_atac.hdf5")

[12]:

# # NOTE: if you wish to load the trained model,
# #       use mofax library to quickly add
# #       factors and weights matrices
# #       to the mdata object
# #
# import mofax as mfx
# model = mfx.mofa_model('models/pbmc10k_rna_atac.hdf5')
# mdata.obsm["X_mofa"] = model.get_factors()

# # If only highly variable features were used
# w = model.get_weights()
# # Set the weights of features that were not used to zero
# mdata.varm["LFs"] = np.zeros(shape=(mdata.n_vars, w.shape[1]))
# mdata.varm["LFs"][mdata.var["highly_variable"]] = w

# model.close()

[13]:

mdata

[13]:

MuData object with n_obs × n_vars = 9512 × 132435
  var:  'dispersions', 'dispersions_norm', 'feature_types', 'gene_ids', 'genome', 'highly_variable', 'interval', 'mean', 'mean_counts', 'means', 'n_cells_by_counts', 'pct_dropout_by_counts', 'std', 'total_counts'
  obsm: 'X_mofa'
  varm: 'LFs'
  2 modalities
    atac:       9512 x 106086
      obs:      'n_genes_by_counts', 'total_counts', 'NS', 'nucleosome_signal', 'tss_score', 'n_counts', 'leiden', 'celltype'
      var:      'gene_ids', 'feature_types', 'genome', 'interval', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:      'atac', 'celltype_colors', 'files', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm:     'X_pca', 'X_umap'
      varm:     'PCs'
      layers:   'counts'
      obsp:     'connectivities', 'distances'
    rna:        9512 x 26349
      obs:      'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:      'gene_ids', 'feature_types', 'genome', 'interval', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:      'celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm:     'X_pca', 'X_umap'
      varm:     'PCs'
      obsp:     'connectivities', 'distances'

After the training, the embedding will be added to the obsm slot of the mdata:

[14]:

mdata.obsm['X_mofa'].shape

[14]:

(9512, 10)

We can directly use it for plotting with mu.pl or sc.pl plotting functions — mdata has both .obs and .obsm slots that are needed for plotting with the latter one.

[15]:

# Copy colours that were defined previously
mdata.uns = mdata.uns or dict()
mdata.uns['rna:celltype_colors'] = mdata['rna'].uns['celltype_colors']
mdata.uns['atac:celltype_colors'] = mdata['atac'].uns['celltype_colors']

[16]:

mu.pl.mofa(mdata, color="rna:celltype", components=["1,2", "3,4"])
# 'rna:celltype' is a column in mdata.obs
# derived from the 'celltype' column of mdata['rna'].obs

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_33_0.png

[17]:

mu.pl.mofa(mdata, color="atac:celltype", components=["1,2", "3,4"])
# 'atac:celltype' is a column in mdata.obs
# derived from the 'celltype' column of mdata['atac'].obs

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_34_0.png

We can then use this embedding not only to compare cell type annotation performed individually on each modality but also to annotate cell types based on both omics jointly.

To visualise all the factors together, we’ll use a non-linear dimensionality reduction method such as UMAP to display the embedding in 2D:

[18]:

sc.pp.neighbors(mdata, use_rep="X_mofa")
sc.tl.umap(mdata)

[19]:

sc.tl.umap(mdata, min_dist=.2, spread=1., random_state=10)

For instance, we can then visualise how cell type annotations performed on individual modalities correspond to this 2D projection of the joint MOFA embeddings:

[20]:

sc.pl.umap(mdata, color=["rna:celltype", "atac:celltype"])

/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'rna:mt' as categorical
... storing 'feature_types' as categorical
... storing 'interval' as categorical

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_41_1.png

Conventional clustering can now be performed based on the MOFA embeddings and also can be visualised on the same UMAP:

[21]:

sc.tl.leiden(mdata, key_added='leiden_joint')

[22]:

sc.pl.umap(mdata, color="leiden_joint", legend_loc='on data')

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_44_0.png

Individual features from modalities are also available when plotting embeddings:

[23]:

mu.pl.umap(mdata, color=["KLF4", "chr9:107480158-107492721"])

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_46_0.png

We can also generate custom plots with matplotlib and seaborn:

[24]:

df = pd.DataFrame(mdata.obsm["X_mofa"])
df.columns = [f"Factor {i+1}" for i in range(df.shape[1])]

plot_scatter = lambda i, ax: sns.scatterplot(data=df, x=f"Factor {i+1}", y=f"Factor {i+2}", color="black", linewidth=0, s=3, ax=ax)

fig, axes = plt.subplots(2, 2)
for i in range(4):
    plot_scatter(i, axes[i%2][i//2])

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_49_0.png

Annotate cell types#

We will now define cell types based on both omics.

[25]:

mdata['rna'].obs['leiden_joint'] = mdata.obs.leiden_joint
mdata['atac'].obs['leiden_joint'] = mdata.obs.leiden_joint

Ranking genes and peaks#

[26]:

sc.tl.rank_genes_groups(mdata['rna'], 'leiden_joint', method='t-test_overestim_var')

/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])

[27]:

from muon import atac as ac
ac.tl.rank_peaks_groups(mdata['atac'], 'leiden_joint', method='t-test_overestim_var')

Listing differentially expressed genes and differentially accessible peaks#

[28]:

result = {}
result['rna'] = mdata['rna'].uns['rank_genes_groups']
result['rna']['genes'] = result['rna']['names']
result['atac'] = mdata['atac'].uns['rank_genes_groups']

groups = result['rna']['names'].dtype.names
pd.set_option("max_columns", 200)
pd.DataFrame(
    {mod + ':' + group + '_' + key[:1]: result[mod][key][group][:10]
    for group in groups for key in ['names', 'genes', 'pvals']
    for mod in mdata.mod.keys()})

[28]:

	atac:0_n	rna:0_n	atac:0_g	rna:0_g	atac:0_p	rna:0_p	atac:1_n	rna:1_n	atac:1_g	rna:1_g	atac:1_p	rna:1_p	atac:2_n	rna:2_n	atac:2_g	rna:2_g	atac:2_p	rna:2_p	atac:3_n	rna:3_n	atac:3_g	rna:3_g	atac:3_p	rna:3_p	atac:4_n	rna:4_n	atac:4_g	rna:4_g	atac:4_p	rna:4_p	atac:5_n	rna:5_n	atac:5_g	rna:5_g	atac:5_p	rna:5_p	atac:6_n	rna:6_n	atac:6_g	rna:6_g	atac:6_p	rna:6_p	atac:7_n	rna:7_n	atac:7_g	rna:7_g	atac:7_p	rna:7_p	atac:8_n	rna:8_n	atac:8_g	rna:8_g	atac:8_p	rna:8_p	atac:9_n	rna:9_n	atac:9_g	rna:9_g	atac:9_p	rna:9_p	atac:10_n	rna:10_n	atac:10_g	rna:10_g	atac:10_p	rna:10_p	atac:11_n	rna:11_n	atac:11_g	rna:11_g	atac:11_p	rna:11_p	atac:12_n	rna:12_n	atac:12_g	rna:12_g	atac:12_p	rna:12_p	atac:13_n	rna:13_n	atac:13_g	rna:13_g	atac:13_p	rna:13_p	atac:14_n	rna:14_n	atac:14_g	rna:14_g	atac:14_p	rna:14_p	atac:15_n	rna:15_n	atac:15_g	rna:15_g	atac:15_p	rna:15_p	atac:16_n	rna:16_n	atac:16_g	rna:16_g	atac:16_p	rna:16_p	atac:17_n	rna:17_n	atac:17_g	rna:17_g	atac:17_p	rna:17_p	atac:18_n	rna:18_n	atac:18_g	rna:18_g	atac:18_p	rna:18_p	atac:19_n	rna:19_n	atac:19_g	rna:19_g	atac:19_p	rna:19_p	atac:20_n	rna:20_n	atac:20_g	rna:20_g	atac:20_p	rna:20_p	atac:21_n	rna:21_n	atac:21_g	rna:21_g	atac:21_p	rna:21_p	atac:22_n	rna:22_n	atac:22_g	rna:22_g	atac:22_p	rna:22_p
0	chr14:99255246-99275454	LEF1	BCL11B, AL109767.1	LEF1	8.438309e-63	2.442371e-130	chr6:137715217-137718150	INPP4B	AL356234.2, LINC02539	INPP4B	1.341573e-65	3.133586e-142	chr2:86783559-86792275	LEF1	CD8A	LEF1	2.991441e-130	7.030922e-148	chr9:107480158-107492721	VCAN	KLF4	VCAN	1.514428e-125	1.684183e-264	chr9:107480158-107492721	SLC8A1	KLF4	SLC8A1	2.417349e-96	3.736816e-192	chr2:86783559-86792275	BACH2	CD8A	BACH2	6.215390e-92	3.212512e-114	chr14:99255246-99275454	INPP4B	BCL11B, AL109767.1	INPP4B	4.790166e-45	6.537422e-139	chr20:50269694-50277398	VCAN	SMIM25	VCAN	8.811276e-80	1.446464e-194	chr14:99255246-99275454	LEF1	BCL11B, AL109767.1	LEF1	4.219380e-49	1.268213e-87	chr9:107480158-107492721	PLXDC2	KLF4	PLXDC2	1.360803e-85	4.643975e-115	chr1:184386243-184389335	CCL5	C1orf21, AL445228.2	CCL5	2.674621e-73	3.779073e-257	chr2:86783559-86792275	CCL5	CD8A	CCL5	9.833715e-48	1.683454e-93	chr17:83076201-83103570	GNLY	METRNL, AC130371.2	GNLY	1.464539e-95	2.512159e-300	chr2:16653069-16660704	FCGR3A	FAM49A	FCGR3A	6.194099e-62	5.207895e-237	chr9:107480158-107492721	PLXDC2	KLF4	PLXDC2	4.280707e-38	3.497046e-103	chr2:231669797-231676530	BANK1	PTMA	BANK1	1.637270e-119	1.245263e-299	chr22:41917087-41929835	IGHM	TNFRSF13C	IGHM	2.367636e-92	6.200262e-302	chr10:33135632-33141841	IL32	IATPR	IL32	1.571908e-18	1.061190e-43	chr1:220876295-220883526	CST3	HLX, HLX-AS1	CST3	8.837247e-29	3.276288e-77	chr11:114072228-114076352	NKG7	ZBTB16	NKG7	5.263514e-32	1.790101e-56	chr2:86783559-86792275	LEF1	CD8A	LEF1	0.000002	8.856198e-22	chr17:81425658-81431769	TCF4	BAHCC1	TCF4	3.739753e-36	1.641340e-99	chr20:44066714-44073458	FHIT	TOX2, AL035447.1	FHIT	0.001339	6.461786e-14
1	chr14:99223600-99254668	FHIT	BCL11B, AL109767.1	FHIT	5.960695e-54	6.110806e-126	chr1:1210271-1220028	IL7R	SDF4, TNFRSF4	IL7R	2.685949e-64	6.501053e-135	chr12:10552886-10555668	NELL2	LINC02446	NELL2	7.674191e-114	5.043340e-146	chr6:41268623-41279829	PLXDC2	TREM1	PLXDC2	4.615210e-103	1.179676e-227	chr20:50269694-50277398	TYMP	SMIM25	TYMP	6.595060e-80	1.036221e-166	chr12:10552886-10555668	LEF1	LINC02446	LEF1	6.824578e-60	7.217206e-117	chr14:22536559-22563070	ANK3	TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ...	ANK3	7.744505e-40	1.133036e-104	chr20:1943201-1947850	PLXDC2	PDYN-AS1	PLXDC2	2.451617e-76	1.028900e-153	chr14:99181080-99219442	FHIT	BCL11B, AL162151.1	FHIT	1.741674e-45	4.523144e-90	chr1:212604203-212626574	NAMPT	FAM71A, ATF3, AL590648.2	NAMPT	3.476135e-73	4.237109e-115	chr4:6198055-6202103	NKG7	JAKMIP1, C4orf50	NKG7	2.087277e-69	1.109701e-229	chr1:24909406-24919504	PRKCH	RUNX3	PRKCH	4.776232e-34	3.791643e-53	chr1:184386243-184389335	PRF1	C1orf21, AL445228.2	PRF1	1.508249e-73	1.206868e-219	chr5:1476663-1483241	TCF7L2	SLC6A3, LPCAT1	TCF7L2	5.632560e-61	1.252915e-220	chr6:41268623-41279829	DPYD	TREM1	DPYD	7.767020e-28	1.984950e-83	chr22:41917087-41929835	RALGPS2	TNFRSF13C	RALGPS2	5.062678e-122	1.751230e-246	chr22:41931503-41942227	BANK1	CENPM	BANK1	1.075287e-79	2.367819e-273	chr14:22536559-22563070	INPP4B	TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ...	INPP4B	3.409747e-15	2.751113e-40	chr9:107480158-107492721	HDAC9	KLF4	HDAC9	1.021219e-26	3.550487e-60	chr11:114065160-114066911	CCL5	ZBTB16	CCL5	2.871598e-26	2.320116e-47	chr12:10552886-10555668	PDE3B	LINC02446	PDE3B	0.000016	8.262455e-18	chr7:98641522-98642532	RHEX	NPTX2	RHEX	1.711629e-26	4.041783e-93	chr11:128300590-128306168	MALAT1	LINC02098	MALAT1	0.003834	1.109931e-10
2	chr17:40601555-40611036	MALAT1	SMARCE1	MALAT1	3.691809e-46	3.588622e-109	chr14:22536559-22563070	ITGB1	TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ...	ITGB1	1.239273e-62	5.749681e-133	chr14:99181080-99219442	CD8B	BCL11B, AL162151.1	CD8B	1.738567e-82	1.909060e-136	chr6:41280331-41287503	CSF3R	TREM1	CSF3R	7.386577e-96	1.244160e-263	chr5:1476663-1483241	FCN1	SLC6A3, LPCAT1	FCN1	6.242010e-73	1.136550e-153	chr14:99255246-99275454	PDE3B	BCL11B, AL109767.1	PDE3B	1.781287e-56	1.697447e-110	chr7:142782798-142813716	AC139720.1	TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR...	AC139720.1	1.443398e-35	2.249938e-59	chr2:218361880-218373051	LRMDA	CATIP, CATIP-AS1	LRMDA	2.254083e-75	1.393370e-150	chr17:82125073-82129615	CAMK4	CCDC57	CAMK4	7.944329e-43	3.595335e-71	chr22:38950570-38958424	SLC8A1	APOBEC3A	SLC8A1	2.768292e-66	7.915051e-113	chr2:144507361-144525092	GZMH	ZEB2, LINC01412, ZEB2-AS1, AC009951.6	GZMH	2.679995e-71	2.847630e-178	chr6:111757055-111766675	SYNE2	FYN	SYNE2	2.893507e-30	1.773567e-53	chr1:24909406-24919504	NKG7	RUNX3	NKG7	1.569805e-74	2.455912e-219	chr20:40686526-40691350	MTSS1	MAFB	MTSS1	6.205222e-59	2.266043e-203	chr2:47067863-47077814	LRMDA	TTC7A, AC073283.1	LRMDA	9.658957e-26	4.656471e-90	chr22:41931503-41942227	MS4A1	CENPM	MS4A1	7.979511e-96	5.261627e-229	chr19:5125450-5140568	AFF3	KDM4B, AC022517.1	AFF3	4.475093e-78	3.061064e-228	chr20:59157931-59168100	SYNE2	ZNF831	SYNE2	7.611764e-14	1.021233e-38	chr3:159762787-159765472	HLA-DRB1	IQCJ-SCHIP1-AS1, IQCJ-SCHIP1	HLA-DRB1	2.839047e-23	6.167329e-62	chr4:6198055-6202103	A2M	JAKMIP1, C4orf50	A2M	1.508503e-27	1.831348e-41	chr14:99255246-99275454	BACH2	BCL11B, AL109767.1	BACH2	0.000016	1.329808e-16	chr22:50281096-50284890	PLD4	PLXNB2	PLD4	1.265134e-27	5.019379e-78	chr5:75049736-75054761	BCL11B	GCNT4	BCL11B	0.004047	3.662743e-10
3	chr14:99181080-99219442	BCL11B	BCL11B, AL162151.1	BCL11B	7.800328e-46	4.885671e-104	chr21:45223468-45225379	IL32	LINC00334, ADARB1	IL32	3.121999e-61	7.805209e-125	chr14:99255246-99275454	THEMIS	BCL11B, AL109767.1	THEMIS	2.015437e-73	1.362464e-137	chr19:13824929-13854962	LRMDA	ZSWIM4, AC020916.1	LRMDA	2.398127e-93	9.014281e-222	chr9:134369462-134387253	PID1	RXRA	PID1	8.660699e-72	1.828607e-164	chr11:66311352-66319301	CD8B	CD248, AP001107.3	CD8B	2.436022e-55	4.060815e-94	chr1:89605612-89612338	LTB	AC093423.2, LRRC8C-DT	LTB	4.194887e-33	1.383628e-55	chr1:109820879-109824541	FCN1	LINC01768	FCN1	1.193126e-72	2.934203e-143	chr14:99223600-99254668	BCL11B	BCL11B, AL109767.1	BCL11B	8.332159e-38	4.551354e-68	chr5:150385442-150415310	VCAN	CD74, TCOF1	VCAN	7.276433e-64	2.623718e-105	chr12:52566432-52567532	GNLY	KRT74	GNLY	4.318250e-64	1.216602e-187	chr2:241762278-241764383	GZMK	D2HGDH, AC114730.2	GZMK	2.154760e-29	1.329633e-39	chr19:10513768-10519594	KLRD1	S1PR5	KLRD1	5.259143e-63	3.873001e-148	chr19:18167172-18177541	LST1	PIK3R2, IFI30	LST1	1.169732e-59	7.935218e-212	chr9:134369462-134387253	SLC8A1	RXRA	SLC8A1	1.395050e-25	6.519244e-89	chr10:48671056-48673240	OSBPL10	AC068898.1	OSBPL10	1.658282e-81	2.182868e-173	chr6:150598919-150601318	FCRL1	AL450344.3	FCRL1	3.760302e-72	7.420044e-185	chr10:33131993-33134879	ITGB1	IATPR	ITGB1	4.133223e-13	1.399717e-37	chr8:132895117-132896351	HLA-DRA	TG	HLA-DRA	6.543880e-22	1.647421e-56	chr16:81519063-81525049	RORA	CMIP	RORA	2.091597e-27	7.922389e-38	chr11:66311352-66319301	THEMIS	CD248, AP001107.3	THEMIS	0.000026	1.154498e-16	chr6:4281598-4282429	LINC01374	AL159166.1	LINC01374	1.502998e-25	4.623344e-56	chr11:119304967-119307621	MLLT3	MCAM, CBL	MLLT3	0.004515	4.185940e-09
4	chr6:90290787-90299045	CAMK4	BACH2, AL132996.1	CAMK4	6.755553e-45	3.727964e-101	chr2:9777921-9782964	LTB	AC082651.3	LTB	2.413979e-61	4.746335e-125	chr11:66311352-66319301	BACH2	CD248, AP001107.3	BACH2	8.695469e-72	1.474765e-123	chr11:61953652-61974246	LRRK2	BEST1, FTH1	LRRK2	6.153565e-87	6.000683e-218	chr5:150385442-150415310	JAK2	CD74, TCOF1	JAK2	5.511765e-71	2.947789e-151	chr17:82125073-82129615	NELL2	CCDC57	NELL2	1.822793e-47	3.484500e-95	chr14:99223600-99254668	TSHZ2	BCL11B, AL109767.1	TSHZ2	5.453597e-33	6.659819e-54	chr9:107480158-107492721	LYZ	KLF4	LYZ	1.323135e-71	1.232026e-138	chr7:1925312-1929606	BACH2	MAD1L1, AC104129.1	BACH2	1.236814e-35	2.927671e-67	chr19:6059651-6068090	LRMDA	RFX2, AC011444.2	LRMDA	1.757537e-61	1.742780e-104	chr16:81519063-81525049	GZMA	CMIP	GZMA	5.663960e-68	5.518277e-123	chr14:22536559-22563070	KLRK1	TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ...	KLRK1	2.985850e-28	2.445739e-38	chr20:24948563-24956577	GZMA	CST7	GZMA	2.057619e-62	6.039948e-139	chr18:13562096-13569511	IFITM3	LDLRAD4, AP005131.3	IFITM3	1.280635e-56	7.007801e-207	chr22:38950570-38958424	NEAT1	APOBEC3A	NEAT1	1.810800e-24	1.618015e-80	chr6:167111604-167115345	AFF3	CCR6, AL121935.1	AFF3	5.092136e-86	1.802127e-181	chr6:167111604-167115345	PAX5	CCR6, AL121935.1	PAX5	2.686190e-70	1.329193e-172	chr14:105856766-105860740	TTC39C	IGHM	TTC39C	1.969771e-12	3.865242e-33	chr17:81044486-81052618	CD74	BAIAP2	CD74	4.705331e-22	7.588303e-53	chr20:33369063-33370940	ZBTB16	CDK5RAP1	ZBTB16	1.162533e-23	2.661080e-35	chr1:59813997-59815790	NELL2	HOOK1	NELL2	0.000137	2.043361e-16	chr20:43681305-43682881	AC023590.1	MYBL2	AC023590.1	1.418595e-27	7.584299e-59	chr10:35011628-35012338	TCF7	CUL2, LINC02635	TCF7	0.005693	6.010186e-09
5	chr17:82125073-82129615	BACH2	CCDC57	BACH2	3.902213e-43	3.292870e-99	chr5:756911-759750	RORA	ZDHHC11B, AC026740.3	RORA	3.899329e-59	3.427174e-123	chr2:86805831-86810178	PDE3B	CD8A	PDE3B	4.741051e-64	2.098156e-121	chr10:97376057-97403290	RBM47	RRP12, AL355490.2	RBM47	6.487902e-88	8.502875e-206	chr11:61953652-61974246	MCTP1	BEST1, FTH1	MCTP1	1.013020e-69	1.227565e-155	chr2:136122469-136138482	THEMIS	CXCR4	THEMIS	4.146569e-46	3.375008e-94	chr6:149492867-149497093	BCL11B	ZC3H12D	BCL11B	4.263075e-32	1.067813e-53	chr1:220876295-220883526	DPYD	HLX, HLX-AS1	DPYD	1.125513e-70	4.179948e-118	chr17:40601555-40611036	TCF7	SMARCE1	TCF7	5.215166e-36	3.409635e-53	chr7:106256272-106286624	DPYD	NAMPT, AC007032.1	DPYD	1.637974e-60	7.231443e-93	chr20:24948563-24956577	PRF1	CST7	PRF1	2.051183e-67	3.843459e-116	chr4:6198055-6202103	IL32	JAKMIP1, C4orf50	IL32	3.059577e-26	2.299498e-38	chr10:70600535-70604482	CD247	PRF1	CD247	6.527049e-61	2.309114e-129	chr16:83744270-83745221	SERPINA1	CDH13, AC009063.2	SERPINA1	1.250536e-50	1.131737e-183	chr6:41280331-41287503	ARHGAP26	TREM1	ARHGAP26	8.062107e-24	5.289995e-82	chr8:11491635-11494865	PAX5	BLK	PAX5	1.471368e-79	1.253291e-156	chr2:231669797-231676530	RALGPS2	PTMA	RALGPS2	6.268393e-69	3.787194e-187	chr16:79598130-79602898	LTB	MAF	LTB	1.009806e-11	1.109585e-29	chr10:75399596-75404660	HLA-DPB1	ZNF503, ZNF503-AS2, AC010997.3	HLA-DPB1	8.032085e-22	1.694404e-57	chr15:39623205-39625650	SYNE2	FSIP1, AC037198.2	SYNE2	1.091165e-23	1.105071e-34	chr5:142094537-142099713	CCR7	NDFIP1	CCR7	0.000213	6.501597e-16	chr17:16986865-16990016	ZFAT	LINC02090	ZFAT	1.159140e-27	3.849926e-69	chr9:128898308-128904486	BACH2	LRRC8A	BACH2	0.006479	4.565414e-08
6	chr14:22536559-22563070	TCF7	TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ...	TCF7	3.614154e-42	9.292790e-80	chr12:12464985-12469154	SYNE2	BORCS5, AC007619.1	SYNE2	1.274032e-56	3.940893e-111	chr17:40601555-40611036	APBA2	SMARCE1	APBA2	1.376667e-63	1.794472e-95	chr22:38950570-38958424	NAMPT	APOBEC3A	NAMPT	1.558022e-83	7.064345e-196	chr1:220876295-220883526	PSAP	HLX, HLX-AS1	PSAP	1.401260e-69	3.821859e-149	chr19:50723222-50725949	OXNAD1	CLEC11A	OXNAD1	1.813057e-44	5.791598e-82	chr14:99181080-99219442	TCF7	BCL11B, AL162151.1	TCF7	6.841236e-32	7.870449e-52	chr1:182143071-182150314	CSF3R	LINC01344, AL390856.1	CSF3R	9.404967e-70	9.180976e-138	chr20:44066714-44073458	MLLT3	TOX2, AL035447.1	MLLT3	1.437998e-34	6.534084e-50	chr6:44057321-44060655	ARHGAP26	AL109615.2, AL109615.3	ARHGAP26	4.561375e-59	7.186789e-95	chr1:24909406-24919504	TGFBR3	RUNX3	TGFBR3	8.747916e-68	4.423262e-109	chr14:99223600-99254668	SLFN12L	BCL11B, AL109767.1	SLFN12L	1.682337e-26	9.876600e-37	chr11:114072228-114076352	KLRF1	ZBTB16	KLRF1	2.042013e-58	4.219011e-113	chr21:43344646-43351817	CDKN1C	LINC01679	CDKN1C	7.539208e-53	1.537383e-141	chr3:72092464-72103763	VCAN	LINC00877	VCAN	4.381333e-23	5.609419e-85	chr19:5125450-5140568	CD79A	KDM4B, AC022517.1	CD79A	1.372673e-83	3.113845e-154	chr6:14093828-14096232	MS4A1	CD83	MS4A1	1.434184e-66	1.927510e-153	chr12:22332848-22336188	SKAP1	ST8SIA1	SKAP1	1.294966e-11	7.253496e-29	chr3:4975862-4990757	LYZ	BHLHE40, BHLHE40-AS1	LYZ	1.523831e-21	4.167852e-52	chr20:35250675-35252861	SLC4A10	MMP24	SLC4A10	1.974890e-22	1.718142e-29	chr20:53871922-53873963	APBA2	AC006076.1	APBA2	0.000455	5.082542e-13	chr19:15197159-15198290	EPHB1	NOTCH3	EPHB1	4.099613e-23	2.330743e-54	chr10:26682168-26683639	RPL19	PDSS1	RPL19	0.007781	2.671981e-06
7	chr7:142782798-142813716	ANK3	TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR...	ANK3	1.078833e-40	2.184718e-79	chr14:64755519-64756581	ANK3	SPTB, PLEKHG3	ANK3	1.465280e-53	3.822445e-108	chr1:24500773-24509089	CCR7	RCAN3, RCAN3AS	CCR7	8.828176e-63	6.728310e-94	chr7:106256272-106286624	AC020916.1	NAMPT, AC007032.1	AC020916.1	5.194756e-82	3.182772e-201	chr1:182143071-182150314	AOAH	LINC01344, AL390856.1	AOAH	3.187238e-68	2.213631e-134	chr2:86805831-86810178	TXK	CD8A	TXK	1.116431e-43	3.455977e-73	chr10:102610246-102613789	CAMK4	SUFU, TRIM8	CAMK4	2.782567e-31	1.744012e-50	chr4:184850175-184858076	RBM47	MIR3945HG, AC084871.3	RBM47	1.054152e-68	1.555535e-136	chr11:118906756-118931379	CCR7	BCL9L	CCR7	9.716782e-33	2.919129e-48	chr3:72092464-72103763	NEAT1	LINC00877	NEAT1	1.425993e-58	1.139761e-92	chr14:101709778-101714256	SYNE1	LINC02320, LINC00239	SYNE1	8.434929e-63	2.249528e-106	chr17:35889026-35891507	IFNG-AS1	CCL5	IFNG-AS1	9.189996e-26	1.596513e-35	chr12:52566432-52567532	SPON2	KRT74	SPON2	1.407997e-57	1.424799e-114	chr1:182143071-182150314	AIF1	LINC01344, AL390856.1	AIF1	2.856652e-53	2.067843e-176	chr11:61953652-61974246	CSF3R	BEST1, FTH1	CSF3R	9.522027e-23	7.652856e-85	chr6:150598919-150601318	EBF1	AL450344.3	EBF1	5.614446e-80	1.356953e-140	chr14:105817737-105818927	LINC00926	IGHD	LINC00926	2.299110e-60	5.362189e-133	chr8:29347324-29353460	BCL11B	DUSP4, AC084262.1, AC084262.2	BCL11B	1.384822e-11	3.239137e-28	chr7:132390553-132391253	HLA-DPA1	PLXNA4, AC011625.1	HLA-DPA1	1.129696e-19	1.807182e-55	chr17:3794770-3797216	KLRG1	ITGAE	KLRG1	1.040331e-21	1.211021e-31	chr21:46423263-46427109	RPS5	PCNT, DIP2A	RPS5	0.000441	1.327628e-11	chr6:11730442-11733107	UGCG	ADTRP, AL022724.3	UGCG	1.117509e-26	9.265508e-75	chr4:54940758-54943697	RPS27	AC111194.1	RPS27	0.009565	4.603503e-06
8	chr2:109030596-109033601	CCR7	EDAR	CCR7	2.197680e-40	1.504801e-77	chr15:69467756-69470179	CDC14A	DRAIC	CDC14A	5.258900e-54	4.462095e-107	chr17:82125073-82129615	CAMK4	CCDC57	CAMK4	7.367503e-61	1.780769e-91	chr20:50269694-50277398	DPYD	SMIM25	DPYD	6.306425e-81	1.214915e-164	chr1:212484685-212489524	WARS	LINC02771	WARS	1.451334e-67	3.838958e-152	chr14:99223600-99254668	CAMK4	BCL11B, AL109767.1	CAMK4	3.311882e-43	1.052840e-70	chr11:60977201-60989782	RPL13	CD6	RPL13	1.262988e-30	8.999545e-49	chr1:212484685-212489524	ARHGAP26	LINC02771	ARHGAP26	2.711090e-68	1.728856e-121	chr19:49551445-49563522	ANK3	NOSIP	ANK3	6.416963e-33	7.760573e-47	chr6:41268623-41279829	TYMP	TREM1	TYMP	1.975339e-58	1.885466e-96	chr15:39623205-39625650	A2M	FSIP1, AC037198.2	A2M	1.028597e-58	4.716177e-96	chr2:29009215-29017012	CBLB	TOGARAM2	CBLB	1.258743e-24	1.134928e-34	chr2:231425358-231427497	MCTP2	AC017104.3	MCTP2	3.888965e-57	5.003292e-129	chr9:134369462-134387253	MS4A7	RXRA	MS4A7	2.108124e-53	2.519311e-158	chr1:212604203-212626574	JAK2	FAM71A, ATF3, AL590648.2	JAK2	1.642385e-21	1.467037e-81	chr19:50414004-50420358	BLK	POLD1, SPIB	BLK	2.610839e-80	4.866240e-139	chr10:48671056-48673240	IGHD	AC068898.1	IGHD	1.656861e-59	1.746471e-118	chr7:142782798-142813716	TRAC	TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR...	TRAC	4.551260e-11	1.653127e-26	chr18:9707302-9713342	HLA-DQA1	RAB31	HLA-DQA1	3.051259e-21	3.095756e-54	chr21:45226467-45228792	KLRB1	LINC00334, ADARB1	KLRB1	8.993120e-21	4.190163e-30	chr13:40185436-40188814	CAMK4	LINC00598, LINC00332	CAMK4	0.000496	1.402814e-11	chr19:2446191-2447956	PTPRS	LMNB2	PTPRS	3.049988e-23	9.656860e-55	chr14:99181080-99219442	LEF1	BCL11B, AL162151.1	LEF1	0.009794	3.335552e-06
9	chr19:16363226-16378669	INPP4B	EPS15L1, AC020917.3	INPP4B	4.913659e-40	4.645049e-76	chr1:6459432-6463105	MDFIC	ESPN, TNFRSF25	MDFIC	5.836921e-53	5.501357e-84	chr1:8924588-8928112	TXK	CA6	TXK	2.093080e-58	1.422387e-91	chr6:144149410-144160995	ARHGAP26	STX11	ARHGAP26	1.423961e-77	1.644443e-172	chr1:212604203-212626574	CPVL	FAM71A, ATF3, AL590648.2	CPVL	1.061151e-63	3.088902e-151	chr14:99181080-99219442	CD8A	BCL11B, AL162151.1	CD8A	7.665869e-43	2.372229e-68	chr17:82125073-82129615	IL7R	CCDC57	IL7R	2.906530e-30	7.763317e-50	chr19:47388419-47392063	CD36	MEIS3	CD36	4.112903e-68	2.538275e-134	chr9:133700382-133704361	IL7R	SARDH	IL7R	9.256986e-33	3.162917e-46	chr2:47067863-47077814	AC020916.1	TTC7A, AC073283.1	AC020916.1	3.881064e-58	3.254648e-102	chr2:241762278-241764383	KLRD1	D2HGDH, AC114730.2	KLRD1	3.716237e-58	5.267267e-92	chr14:99255246-99275454	TC2N	BCL11B, AL109767.1	TC2N	6.850508e-24	2.520548e-33	chr22:37161134-37168690	CTSW	Z82188.2	CTSW	1.424692e-58	2.142344e-119	chr8:55878674-55886699	WARS	LYN	WARS	6.500344e-51	9.164220e-162	chr2:26008387-26009976	DENND1A	RAB10	DENND1A	3.955349e-21	2.902043e-80	chr14:104656327-104658929	CD74	AL583722.3	CD74	8.519442e-74	4.704890e-110	chr16:88043944-88046683	COL19A1	BANP, AC134312.3	COL19A1	8.838689e-61	1.094404e-108	chr10:8041366-8062418	TNIK	GATA3, GATA3-AS1, AL390294.1	TNIK	8.872703e-11	7.982392e-26	chr22:39299385-39300930	CCDC88A	AL022326.1	CCDC88A	1.937375e-19	1.917706e-53	chr2:29009215-29017012	PRF1	TOGARAM2	PRF1	1.895675e-21	1.460752e-30	chr2:136122469-136138482	TCF7	CXCR4	TCF7	0.000576	1.413410e-11	chr4:791795-793496	IRF8	CPLX1, AC139887.2	IRF8	8.422711e-23	8.381387e-68	chr13:40483522-40487100	CAMK4	LINC00598	CAMK4	0.010548	3.378319e-06

Assigning cell type labels#

Having studied markers of individual clusters, we can add a new cell type annotation to the object.

[29]:

new_cluster_names = {
    "0": "CD4+ naïve T", "8": "CD4+ naïve T", "22": "CD4+ naïve T",
    "1": "CD4+ memory T", "6": "CD4+ memory T", "17": "CD4+ memory T",
    "2": "CD8+ naïve T", "5": "CD8+ naïve T", "20": "CD8+ naïve T",
    "10": "CD8+ cytotoxic effector T", "11": "CD8+ transitional effector T",
    "19": "MAIT", "12": "NK",

    "16": "naïve B", "15": "memory B",

    "3": "classical mono", "7": "classical mono", "9": "classical mono", "14": "classical mono",
    "4": "intermediate mono", "13": "non-classical mono",
    "18": "mDC", "21": "pDC",
}

mdata.obs['celltype'] = mdata.obs.leiden_joint.astype("str")
mdata.obs.celltype = mdata.obs.celltype.map(new_cluster_names).astype("category")

We will also re-order categories for the next plots:

[30]:

mdata.obs.celltype.cat.reorder_categories([
    'CD4+ naïve T', 'CD4+ memory T',
    'CD8+ naïve T', 'CD8+ transitional effector T', 'CD8+ cytotoxic effector T',
    'MAIT', 'NK',
    'naïve B', 'memory B',
    'classical mono', 'intermediate mono', 'non-classical mono',
    'mDC', 'pDC'], inplace=True)

… and take colours from a palette:

[31]:

import matplotlib
import matplotlib.pyplot as plt

cmap = plt.get_cmap('rainbow')
colors = cmap(np.linspace(0, 1, len(mdata.obs.celltype.cat.categories)))

mdata.uns["celltype_colors"] = list(map(matplotlib.colors.to_hex, colors))

[32]:

mu.pl.umap(mdata, color="celltype", legend_loc="on data", frameon=False)

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_65_0.png

Visualising markers#

[33]:

mdata['rna'].obs['celltype_joint'] = mdata.obs.celltype
mdata['atac'].obs['celltype_joint'] = mdata.obs.celltype

Finally, we’ll visualise some marker genes across cell types.

[34]:

marker_genes = [
    'IL7R', 'TRAC', 'GATA3',                                # CD4+ T
    'LEF1', 'FHIT', 'RORA', 'ITGB1',                        # naïve/memory
    'CD8A', 'CD8B', 'CD248', 'CCL5',                        # CD8+ T
    'GZMH', 'GZMK',                                         # cytotoxic/transitional effector T cells
    'KLRB1', 'SLC4A10',                                     # MAIT
    'IL32',                                                 # T/NK
    'GNLY', 'NKG7',                                         # NK
    'CD79A', 'MS4A1', 'IGHD', 'IGHM', 'IL4R', 'TNFRSF13C',  # B
    'KLF4', 'LYZ', 'S100A8', 'ITGAM', 'CD14',               # mono
    'DPYD', 'ITGAM',                                        # classical/intermediate/non-classical mono
    'FCGR3A', 'MS4A7', 'CST3',                              # non-classical mono
    'CLEC10A', 'IRF8', 'TCF4'                               # DC
]

[35]:

sc.pl.dotplot(mdata['rna'], marker_genes, 'celltype_joint')

/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_70_1.png

[36]:

marker_peaks = [
    'chr14:99255246-99275454', 'chr10:33135632-33141841',                              # T/NK
    'chr1:1210271-1220028',                                                            # memory T/NK
    'chr2:86783559-86792275',                                                          # CD8+ T/NK
    'chr12:10552886-10555668',                                                         # naïve CD8+ T
    'chr11:114072228-114076352',                                                       # MAIT/NK
    'chr5:150385442-150415310',                                                        # B and mono (CD74)
    'chr22:41931503-41942227', 'chr22:41917087-41929835', 'chr6:167111604-167115345',  # B
    'chr9:107480158-107492721', 'chr5:1476663-1483241',                                # mono
    'chr10:75399596-75404660', 'chr1:220876295-220883526',                             # mDC
    'chr17:81425658-81431769', 'chr7:98641522-98642532',                               # pDC
]

[37]:

sc.pl.dotplot(mdata['atac'], marker_peaks, 'celltype_joint')

/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])

../../_images/single-cell-rna-atac_pbmc10k_3-Multimodal-Omics-Data-Integration_72_1.png

Saving progress on disk#

In this notebook we have filtered cells and added MOFA factors & things based on them (neighbourhood graph, clusters, UMAP) to the mdata object, and we’ll finally save our progress on disk:

[38]:

mdata.write("data/pbmc10k.h5mu")

/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])

For a more detailed exploration of the trained MOFA model in Python, please see this notebook that demonstrated how to use mofa× to interpret the MOFA model that we’ve trained on the ATAC+RNA multi-omics data.

muon-tutorials documentation

Integrating gene expression and chromatin accessibility of 10k PBMCs

Contents