Processing gene expression of 10k PBMCs#

This is the first chapter of the multimodal single-cell gene expression and chromatin accessibility analysis. In this notebook, scRNA-seq data processing is described, largely following this scanpy notebook on processing and clustering PBMCs.

[1]:
# Change directory to the root folder of the repository
import os
os.chdir("../../")

Download data#

Download the data that we will use for this series of notebooks. The data is available here.

For the tutorial, we will use the following files:

  1. Filtered feature barcode matrix (HDF5)

  2. ATAC peak annotations based on proximal genes (TSV)

  3. ATAC Per fragment information file (TSV.GZ)

  4. ATAC Per fragment information index (TSV.GZ index)

[2]:
# This is the directory where those files are downloaded to
data_dir = "data/pbmc10k"
[3]:
# Remove file prefixes if any
prefix = "pbmc_granulocyte_sorted_10k_"
for file in os.listdir(data_dir):
    if file.startswith(prefix):
        new_filename = file[len(prefix):]
        os.rename(os.path.join(data_dir, file), os.path.join(data_dir, new_filename))

Load libraries and data#

Import libraries:

[4]:
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
[5]:
import muon as mu

We will use an HDF5 file containing gene and peak counts as input. In addition to that, when loading this data, muon will look for default files like atac_peak_annotation.tsv and atac_fragments.tsv.gz in the same folder and will load peak annotation table and remember the path to the fragments file if they exist.

[6]:
mdata = mu.read_10x_h5(os.path.join(data_dir, "filtered_feature_bc_matrix.h5"))
mdata.var_names_make_unique()
mdata
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Added `interval` annotation for features from data/pbmc10k/filtered_feature_bc_matrix.h5
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Added peak annotation from data/pbmc10k/atac_peak_annotation.tsv to .uns['atac']['peak_annotation']
Added gene names to peak annotation in .uns['atac']['peak_annotation']
Located fragments file: data/pbmc10k/atac_fragments.tsv.gz
[6]:
MuData object with n_obs × n_vars = 11909 × 144978
  2 modalities
    rna:        11909 x 36601
      var:      'gene_ids', 'feature_types', 'genome', 'interval'
    atac:       11909 x 108377
      var:      'gene_ids', 'feature_types', 'genome', 'interval'
      uns:      'atac', 'files'

Muon uses multimodal data (MuData) objects as containers for multimodal data.

mdata here is a MuData object that has been created directly from an AnnData object with multiple features types.

RNA#

In this notebook, we will only work with the Gene Expression modality.

We can refer to an individual AnnData inside the MuData by defining a respective variable. All the operations will be performed on the respective AnnData object inside the MuData as you would expect.

Please note that when AnnData is copied (e.g. rna = rna.copy()), there is no way for mdata to have a reference to this new object. It should be assigned back to a respective modality then (mdata.mod['rna'] = rna).

[7]:
rna = mdata.mod['rna']
rna
[7]:
AnnData object with n_obs × n_vars = 11909 × 36601
    var: 'gene_ids', 'feature_types', 'genome', 'interval'

Preprocessing#

QC#

Perform some quality control. For now, we will filter out cells that do not pass QC.

[8]:
rna.var['mt'] = rna.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(rna, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
[9]:
sc.pl.violin(rna, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'interval' as categorical
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_21_1.png

Filter genes which expression is not detected:

[10]:
mu.pp.filter_var(rna, 'n_cells_by_counts', lambda x: x >= 3)
# This is analogous to
#   sc.pp.filter_genes(rna, min_cells=3)
# but does in-place filtering and avoids copying the object

Filter cells:

[11]:
mu.pp.filter_obs(rna, 'n_genes_by_counts', lambda x: (x >= 200) & (x < 5000))
# This is analogous to
#   sc.pp.filter_cells(rna, min_genes=200)
#   rna = rna[rna.obs.n_genes_by_counts < 5000, :]
# but does in-place filtering avoiding copying the object

mu.pp.filter_obs(rna, 'total_counts', lambda x: x < 15000)
mu.pp.filter_obs(rna, 'pct_counts_mt', lambda x: x < 20)

Let’s see how the data looks after filtering:

[12]:
sc.pl.violin(rna, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_27_1.png

Normalisation#

We’ll normalise the data so that we get log-normalised counts to work with.

[13]:
sc.pp.normalize_total(rna, target_sum=1e4)
[14]:
sc.pp.log1p(rna)

Feature selection#

We will label highly variable genes that we’ll use for downstream analysis.

[15]:
sc.pp.highly_variable_genes(rna, min_mean=0.02, max_mean=4, min_disp=0.5)
[16]:
sc.pl.highly_variable_genes(rna)
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_35_0.png
[17]:
np.sum(rna.var.highly_variable)
[17]:
3026

Scaling#

We’ll save log-normalised counts in a .raw slot:

[18]:
rna.raw = rna

… and scale the log-normalised counts to zero mean and unit variance:

[19]:
sc.pp.scale(rna, max_value=10)

Analysis#

Having filtered low-quality cells, normalised the counts matrix, and performed feature selection, we can already use this data for multimodal integration.

However it is usually a good idea to study individual modalities as well. Below we run PCA on the scaled matrix, compute cell neighbourhood graph, and perform clustering to define cell types.

PCA and neighbourhood graph#

[20]:
sc.tl.pca(rna, svd_solver='arpack')
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

To visualise the result, we will use some markers for (large-scale) cell populations we expect to see such as T cells and NK cells (CD2), B cells (CD79A), and KLF4 (monocytes).

[21]:
sc.pl.pca(rna, color=['CD2', 'CD79A', 'KLF4', 'IRF8'])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_47_1.png

The first principal component (PC1) is separating myeloid (monocytes) and lymphoid (T, B, NK) cells while B cells-related features seem to drive the second one. Also we see plasmocytoid dendritic cells (marked by IRF8) being close to B cells along the PC2.

[22]:
sc.pl.pca_variance_ratio(rna, log=True)
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_49_0.png

Now we can compute a neighbourhood graph for cells:

[23]:
sc.pp.neighbors(rna, n_neighbors=10, n_pcs=20)

Non-linear dimensionality reduction and clustering#

With the neighbourhood graph computed, we can now perform clustering. We will use leiden clustering as an example.

[24]:
sc.tl.leiden(rna, resolution=.5)

To visualise the results, we’ll first generate a 2D latent space with cells that we can colour according to their cluster assignment.

[25]:
sc.tl.umap(rna, spread=1., min_dist=.5, random_state=11)
[26]:
sc.pl.umap(rna, color="leiden", legend_loc="on data")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_57_1.png

Marker genes and celltypes#

[27]:
sc.tl.rank_genes_groups(rna, 'leiden', method='t-test')
[28]:
result = rna.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.set_option('display.max_columns', 50)
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(10)
[28]:
0_n 0_p 1_n 1_p 2_n 2_p 3_n 3_p 4_n 4_p 5_n 5_p 6_n 6_p 7_n 7_p 8_n 8_p 9_n 9_p 10_n 10_p 11_n 11_p 12_n 12_p 13_n 13_p 14_n 14_p 15_n 15_p 16_n 16_p
0 INPP4B 0.000000e+00 LEF1 0.0 SLC8A1 0.0 LEF1 0.000000e+00 VCAN 0.0 CCL5 0.000000e+00 BANK1 0.000000e+00 GNLY 0.000000e+00 TCF7L2 0.000000e+00 NEAT1 3.539470e-211 IGHM 1.717810e-288 CD74 6.009640e-157 RPS27 3.403327e-23 SLC4A10 2.139839e-53 TCF4 2.131726e-108 BANK1 3.763632e-47 CDK6 1.042747e-13
1 IL32 0.000000e+00 BACH2 0.0 PLXDC2 0.0 BCL11B 0.000000e+00 PLXDC2 0.0 NKG7 5.553269e-264 CD74 0.000000e+00 NKG7 2.101678e-278 PSAP 0.000000e+00 PLXDC2 5.837108e-182 AFF3 7.544332e-270 CST3 5.713597e-115 RPL13 1.627878e-22 NKG7 4.390558e-45 RHEX 9.108035e-85 AFF3 3.540131e-43 MAML3 6.821403e-13
2 LTB 0.000000e+00 PDE3B 0.0 NEAT1 0.0 BACH2 0.000000e+00 DPYD 0.0 IL32 6.837879e-285 RALGPS2 2.564288e-316 CD247 1.184389e-282 LYN 0.000000e+00 ZEB2 3.822000e-166 BANK1 4.769351e-241 LYZ 1.477017e-119 RPS26 1.102095e-21 PHACTR2 1.706919e-44 IRF8 2.498884e-83 CD74 4.056816e-40 SOX4 1.456599e-12
3 SYNE2 0.000000e+00 THEMIS 0.0 AOAH 0.0 CAMK4 0.000000e+00 ARHGAP26 0.0 SYNE2 2.782505e-216 AFF3 1.996288e-313 PRF1 1.691605e-223 SAT1 0.000000e+00 LYST 2.413644e-152 CD74 1.556806e-232 HLA-DRA 1.555526e-114 RPS14 2.776577e-21 RORA 8.668969e-42 FCHSD2 1.102102e-83 MARCH1 3.041281e-38 NKAIN2 4.956515e-12
4 ANK3 0.000000e+00 NELL2 0.0 TYMP 0.0 FHIT 0.000000e+00 LRMDA 0.0 PRKCH 4.704978e-203 MS4A1 1.361493e-278 KLRD1 9.015533e-175 MTSS1 3.601402e-298 DPYD 2.703448e-156 FCRL1 6.639306e-170 HLA-DRB1 1.612832e-105 RPS12 3.946688e-20 SYNE2 2.249597e-38 UGCG 2.849758e-78 LYN 1.349819e-37 RPLP0 1.187223e-11
5 CDC14A 0.000000e+00 BCL11B 0.0 DPYD 0.0 MALAT1 0.000000e+00 NEAT1 0.0 GZMA 2.019015e-175 OSBPL10 8.911800e-228 GZMA 4.081546e-171 CST3 5.585083e-296 SLC8A1 3.767648e-141 RALGPS2 3.628682e-166 HDAC9 1.065920e-105 RPS29 1.077821e-18 CCL5 1.692335e-34 AFF3 3.814395e-81 HLA-DRA 1.141581e-35 RACK1 2.365493e-11
6 IL7R 0.000000e+00 CAMK4 0.0 JAK2 0.0 RPS27A 0.000000e+00 FCN1 0.0 SKAP1 4.021151e-184 HLA-DRA 6.100705e-255 SYNE1 6.986485e-165 WARS 7.996632e-289 TYMP 3.349482e-145 HLA-DRA 5.218855e-177 SAMHD1 7.086573e-110 RPL28 2.123781e-14 IL7R 8.878099e-35 CD74 5.871105e-85 PLXDC2 9.282713e-33 RNF220 6.095781e-11
7 ITGB1 1.456236e-313 OXNAD1 0.0 FCN1 0.0 RPS27 2.318139e-298 ZEB2 0.0 A2M 4.303104e-163 CD79A 1.017956e-225 CTSW 3.623371e-158 CTSS 2.783901e-316 LYN 4.655651e-143 BACH2 8.734895e-164 HLA-DPA1 1.586961e-93 RPS21 6.182699e-14 SKAP1 4.837739e-34 PLD4 6.527479e-73 MEF2C 5.807980e-31 ZNRF1 9.125704e-11
8 BCL11B 0.000000e+00 CD8B 0.0 ZEB2 0.0 TCF7 1.617994e-267 CSF3R 0.0 HLA-B 1.581266e-169 PAX5 4.409159e-221 PRKCH 1.297229e-169 COTL1 1.043765e-282 FCN1 8.937256e-135 PAX5 3.466932e-149 ACTB 2.218584e-92 RPL31 7.068589e-14 A2M 4.454185e-33 BCL11A 1.199667e-67 MS4A1 2.406744e-30 RPS3A 1.477150e-10
9 ARHGAP15 0.000000e+00 TXK 0.0 LYN 0.0 RPL3 9.427121e-287 RBM47 0.0 SYNE1 5.753746e-161 BLK 1.280630e-195 MCTP2 3.220710e-153 TYMP 4.803234e-299 AOAH 4.788465e-141 MS4A1 3.121565e-133 HLA-DPB1 1.917279e-88 RPS23 1.052583e-13 KLRB1 9.260931e-33 RABGAP1L 1.325866e-70 PRKCB 5.573921e-30 EEF2 3.706140e-10
[29]:
sc.pl.rank_genes_groups(rna, n_genes=20, sharey=False)
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_61_0.png

Exploring the data we notice clusters 9 and 15 seem to be composed of cells bearing markers for different cell lineages so likely to be noise (e.g. doublets). Cluster 12 has higher ribosomal gene expression when compared to other clusters. Cluster 16 seem to consist of proliferating cells.

We will remove cells from these clusters before assigning cell types names to clusters.

[30]:
mu.pp.filter_obs(rna, "leiden", lambda x: ~x.isin(["9", "15", "12", "16"]))
# Analogous to
#   rna = rna[~rna.obs.leiden.isin(["9", "15", "12", "16"])]
# but doesn't copy the object
[31]:
new_cluster_names = {
    "0": "CD4+ memory T", "1": "CD8+ naïve T", "3": "CD4+ naïve T",
    "5": "CD8+ activated T", "7": "NK", "13": "MAIT",
    "6": "memory B", "10": "naïve B",
    "4": "CD14 mono", "2": "intermediate mono", "8": "CD16 mono",
    "11": "mDC", "14": "pDC",
}

rna.obs['celltype'] = rna.obs.leiden.astype("str").values
rna.obs.celltype = rna.obs.celltype.astype("category")
rna.obs.celltype = rna.obs.celltype.cat.rename_categories(new_cluster_names)

We will also re-order categories for the next plots:

[32]:
rna.obs.celltype.cat.reorder_categories([
    'CD4+ naïve T', 'CD4+ memory T', 'MAIT',
    'CD8+ naïve T', 'CD8+ activated T', 'NK',
    'naïve B', 'memory B',
    'CD14 mono', 'intermediate mono', 'CD16 mono',
    'mDC', 'pDC'], inplace=True)

… and take colours from a palette:

[33]:
import matplotlib
import matplotlib.pyplot as plt

cmap = plt.get_cmap('rainbow')
colors = cmap(np.linspace(0, 1, len(rna.obs.celltype.cat.categories)))

rna.uns["celltype_colors"] = list(map(matplotlib.colors.to_hex, colors))
[34]:
sc.pl.umap(rna, color="celltype", legend_loc="on data")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_69_1.png

Finally, we’ll visualise some marker genes across cell types.

[35]:
marker_genes = ['IL7R', 'TRAC',
                'ITGB1', # CD29
                'SLC4A10',
                'CD8A', 'CD8B', 'CCL5',
                'GNLY', 'NKG7',
                'CD79A', 'MS4A1', 'IGHM', 'IGHD',
                'IL4R', 'TCL1A',
                'KLF4', 'LYZ', 'S100A8', 'ITGAM', # CD11b
                'CD14', 'FCGR3A', 'MS4A7',
                'CST3', 'CLEC10A', 'IRF8', 'TCF4']
[36]:
sc.pl.dotplot(rna, marker_genes, groupby='celltype');
../../_images/single-cell-rna-atac_pbmc10k_1-Gene-Expression-Processing_72_0.png

Saving multimodal data on disk#

We will now write mdata object to an .h5mu file. It will contain all the changes we’ve done to the RNA modality (mdata.mod['rna']) inside it.

[37]:
mdata.write("data/pbmc10k.h5mu")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'feature_types' as categorical
... storing 'genome' as categorical

Next, we’ll look into processing the second modality — chromatin accessibility.