Processing chromatin accessibility of 10k PBMCs#

Please see the first chapter where getting the data and processing RNA modality are described

This is the second chapter of the multimodal single-cell gene expression and chromatin accessibility analysis. In this notebook, scATAC-seq data processing is described.

The flow of this notebook is similar to the scRNA-seq one, and we use rather similar data normalisation strategy to process the cells by peaks matrix. Alternative normalisation strategies are discussed elsewhere.

[1]:
# Change directory to the root folder of the repository
import os
os.chdir("../../")

Load libraries and data#

Import libraries:

[2]:
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
[3]:
import muon as mu

# Import a module with ATAC-seq-related functions
from muon import atac as ac

Load the MuData object from the .h5mu file that was saved at the end of the previous chapter:

[4]:
mdata = mu.read("data/pbmc10k.h5mu")
mdata
[4]:
MuData object with n_obs × n_vars = 11909 × 134726
  var:  'feature_types', 'gene_ids', 'genome', 'interval'
  2 modalities
    atac:       11909 x 108377
      var:      'gene_ids', 'feature_types', 'genome', 'interval'
      uns:      'atac', 'files'
    rna:        10915 x 26349
      obs:      'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:      'gene_ids', 'feature_types', 'genome', 'interval', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:      'celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
      obsm:     'X_pca', 'X_umap'
      varm:     'PCs'
      obsp:     'connectivities', 'distances'

ATAC#

In this notebook, we will only work with the Peaks modality and will use the ATAC module of muon.

We will refer to the atac AnnData inside the MuData by defining a respective variable:

[5]:
atac = mdata.mod['atac']
atac  # an AnnData object
[5]:
AnnData object with n_obs × n_vars = 11909 × 108377
    var: 'gene_ids', 'feature_types', 'genome', 'interval'
    uns: 'atac', 'files'

Preprocessing#

To filter and to normalise the data, we are going to use the same scanpy functionality as we use when working with gene expression. The only thing to bear in mind here that a gene would mean a peak in the context of the AnnData object with ATAC-seq data.

QC#

Perform some quality control filtering out cells with too few peaks and peaks detected in too few cells. For now, we will filter out cells that do not pass QC.

[6]:
sc.pp.calculate_qc_metrics(atac, percent_top=None, log1p=False, inplace=True)
[7]:
sc.pl.violin(atac, ['total_counts', 'n_genes_by_counts'], jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_18_1.png

Filter peaks which expression is not detected:

[8]:
mu.pp.filter_var(atac, 'n_cells_by_counts', lambda x: x >= 10)
# This is analogous to
#   sc.pp.filter_genes(rna, min_cells=10)
# but does in-place filtering and avoids copying the object

Filter cells:

[9]:
mu.pp.filter_obs(atac, 'n_genes_by_counts', lambda x: (x >= 2000) & (x <= 15000))
# This is analogous to
#   sc.pp.filter_cells(atac, max_genes=15000)
#   sc.pp.filter_cells(atac, min_genes=2000)
# but does in-place filtering avoiding copying the object

mu.pp.filter_obs(atac, 'total_counts', lambda x: (x >= 4000) & (x <= 40000))

Let’s see how the data looks after filtering:

[10]:
sc.pl.violin(atac, ['n_genes_by_counts', 'total_counts'], jitter=0.4, multi_panel=True)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_24_1.png

Or on histograms:

[11]:
mu.pl.histogram(atac, ['n_genes_by_counts', 'total_counts'])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_26_0.png

ATAC-specific QC#

There are a few expectations about how ATAC-seq data looks like as noted in the hitchhiker’s guide to ATAC-seq data analysis for instance.

Nucleosome signal#

Fragment size distribution typically reflects nucleosome binding pattern showing enrichment around values corresponding to fragments bound to a single nucleosome (between 147 bp and 294 bp) as well as nucleosome-free fragments (shorter than 147 bp).

[12]:
atac.obs['NS']=1
[13]:
ac.pl.fragment_histogram(atac, region='chr1:1-2000000')
Fetching Regions...: 100%|██████████| 1/1 [00:02<00:00,  2.67s/it]
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_32_1.png

The ratio of mono-nucleosome cut fragments to nucleosome-free fragments can be called nucleosome signal, and it can be estimated using a subset of fragments.

[14]:
ac.tl.nucleosome_signal(atac, n=1e6)
Reading Fragments: 100%|██████████| 1000000/1000000 [00:05<00:00, 185853.88it/s]
[15]:
mu.pl.histogram(atac, "nucleosome_signal", kde=False)
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_35_0.png
TSS enrichment#

We can expect chromatin accessibility enriched around transcription start sites (TSS) compared to accessibility of flanking regions. Thus this measure averaged across multiple genes can serve as one more quality control metric.

The positions of transcription start sites can be obtained from the interval field of the gene annotation in the rna modality:

[16]:
ac.tl.get_gene_annotation_from_rna(mdata['rna']).head(3)  # accepts MuData with 'rna' modality or mdata['rna'] AnnData directly
[16]:
Chromosome Start End gene_id gene_name
AL627309.1 chr1 120931 133723 ENSG00000238009 AL627309.1
AL627309.5 chr1 149706 173862 ENSG00000241860 AL627309.5
AL627309.4 chr1 160445 160446 ENSG00000241599 AL627309.4

TSS enrichment function will return an AnnData object with cells x bases dimensions where bases correspond to positions around TSS and are defined by extend_upstream and extend_downstream parameters, each of them being 1000 bp by default. It will also record tss_score in the original object.

[17]:
tss = ac.tl.tss_enrichment(mdata, n_tss=1000)  # by default, features=ac.tl.get_gene_annotation_from_rna(mdata)
Fetching Regions...: 100%|██████████| 1000/1000 [00:30<00:00, 32.72it/s]
[18]:
tss
[18]:
AnnData object with n_obs × n_vars = 10069 × 2001
    obs: 'n_genes_by_counts', 'total_counts', 'NS', 'nucleosome_signal', 'tss_score'
    var: 'TSS_position'
[19]:
ac.pl.tss_enrichment(tss)
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_43_0.png

Normalisation#

[20]:
# Save original counts
atac.layers["counts"] = atac.X

There can be multiple options for ATAC-seq data normalisation.

One is latent semantic indexing that is frequently used for processing ATAC-seq datasets. First, it constructs term-document matrix from the original count matrix. Then the singular value decomposition (SVD) — the same technique that convential principal component analysis uses — is used to generate LSI components. Note that there are different flavours of computing TF-IDF, e.g. see this blog post about that.

TF-IDF normalisation is implemented in the muon’s ATAC module:

ac.pp.tfidf(atac, scale_factor=1e4)

Here we will use the same log-normalisation and PCA that we are used to from scRNA-seq analysis. We notice on this data it yields PC & UMAP spaces similar to the one generated on scRNA-seq counts.

[21]:
sc.pp.normalize_per_cell(atac, counts_per_cell_after=1e4)
sc.pp.log1p(atac)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

Feature selection#

We will label highly variable peaks that we’ll use for downstream analysis.

[22]:
sc.pp.highly_variable_genes(atac, min_mean=0.05, max_mean=1.5, min_disp=.5)
[23]:
sc.pl.highly_variable_genes(atac)
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_53_0.png
[24]:
np.sum(atac.var.highly_variable)
[24]:
14896

Scaling#

For uniformity, and for consequent visualisation, we’ll save log-transformed counts in a .raw slot:

[25]:
atac.raw = atac

Analysis#

After filtering out low-quality cells, normalising the counts matrix, and selecting highly varianbe peaks, we can already use this data for multimodal integration.

However, as in the case of gene expression, we will study this data individually first and will run PCA on the scaled matrix, compute cell neighbourhood graph, and perform clustering to define cell types. This might be useful later to compare cell type definition between modalities.

LSI#

When working on TF-IDF counts, sc.tl.pca or ac.tl.lsi can be used to get latent components, e.g.:

ac.tl.lsi(atac)

We find the first component is typically associated with number of peaks or counts per cell so it is reasonable to remove it:

atac.obsm['X_lsi'] = atac.obsm['X_lsi'][:,1:]
atac.varm["LSI"] = atac.varm["LSI"][:,1:]
atac.uns["lsi"]["stdev"] = atac.uns["lsi"]["stdev"][1:]

The respective neighbourhood graph can be generated with sc.tl.neighbors:

sc.pp.neighbors(atac, use_rep="X_lsi", n_neighbors=10, n_pcs=30)

PCA#

For this notebook, we are using PCA on the log-normalised counts in atac.X as described above.

[26]:
sc.pp.scale(atac)
sc.tl.pca(atac)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):

We can only colour our plots by cut counts in individual peaks with scanpy:

[27]:
sc.pl.pca(atac, color=["n_genes_by_counts", "n_counts"])
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_68_1.png

With muon’s ATAC module, we can plot average values for cut counts in peaks of different types (promoter/distal) that are assigned to respective genes — just by providing gene names.

For that to work, we need the peak annotation table with gene -> peak correspondence. The peak_annotation.tsv file was detected and loaded automatically when we loaded the original data. Here is how the processed peak annotation table looks like:

[28]:
atac.uns['atac']['peak_annotation'].tail()

# Alternatively add peak annotation from a TSV file
# ac.tl.add_peak_annotation(atac, annotation="data/pbmc10k/atac_peak_annotation.tsv")
[28]:
gene peak distance peak_type
gene_name
AC024558.2 ENSG00000288436 chr3:126607163-126607753 802 distal
AL138899.3 ENSG00000288460 chr1:158113208-158113983 14318 distal
AL138899.3 ENSG00000288460 chr1:158115189-158115461 12840 distal
AL138899.3 ENSG00000288460 chr1:158118041-158118134 10167 distal
AL138899.3 ENSG00000288460 chr1:158124927-158125417 2884 distal

Now we can plot average cut values in peaks corresponding to genes just by providing a gene name. By default, values in atac.raw are used for plotting.

[29]:
ac.pl.pca(atac, color=["BCL11B", "CCR6", "KLF4"], average="total")
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_72_0.png

We can also average peaks of each type separately:

[30]:
ac.pl.pca(atac, color="BCL11B", average="peak_type")
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_74_0.png

We see how this component space here resembles the one based on gene expression from the previous notebook. Looking at top loadings of first two components, we see how peaks linked to BCL11B (ENSG00000127152) and KLF4 (ENSG00000136826) demarcate lympohoid / myeloid axis while peaks linked to CCR6 (ENSG00000112486) define B cell axis.

Now we will compute a neighbourhood graph for cells that we’ll use for clustering later on.

[31]:
sc.pp.neighbors(atac, n_neighbors=10, n_pcs=30)

Non-linear dimensionality reduction and clustering#

To stay comparable to the gene expression notebook, we will use leiden to cluster cells.

[32]:
sc.tl.leiden(atac, resolution=.5)

We’ll use UMAP latent space for visualisation below.

[33]:
sc.tl.umap(atac, spread=1.5, min_dist=.5, random_state=20)
[34]:
sc.pl.umap(atac, color=["leiden", "n_genes_by_counts"], legend_loc="on data")
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_83_1.png

Again, we can use the functionality of the ATAC module in muon to color plots by cut values in peaks correspoonding to a certain gene:

[35]:
ac.pl.umap(atac, color=["KLF4"], average="peak_type")
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_85_0.png

Marker genes and celltypes#

We will now define cell types based on chromatin accessibility.

[36]:
ac.tl.rank_peaks_groups(atac, 'leiden', method='t-test')
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
[37]:
result = atac.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.set_option("max_columns", 50)
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'genes', 'pvals']}).head(10)
[37]:
0_n 0_g 0_p 1_n 1_g 1_p 2_n 2_g 2_p 3_n 3_g 3_p 4_n 4_g 4_p 5_n 5_g 5_p 6_n 6_g 6_p 7_n 7_g 7_p 8_n ... 8_p 9_n 9_g 9_p 10_n 10_g 10_p 11_n 11_g 11_p 12_n 12_g 12_p 13_n 13_g 13_p 14_n 14_g 14_p 15_n 15_g 15_p 16_n 16_g 16_p
0 chr9:107480158-107492721 KLF4 0.000000e+00 chr14:22536559-22563070 TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ... 1.314566e-244 chr2:86783559-86792275 CD8A 7.444581e-320 chr14:99255246-99275454 BCL11B, AL109767.1 1.069735e-230 chr9:107480158-107492721 KLF4 7.999720e-201 chr1:24909406-24919504 RUNX3 6.664347e-143 chr22:41917087-41929835 TNFRSF13C 1.873048e-153 chr19:18167172-18177541 PIK3R2, IFI30 4.831525e-87 chr17:83076201-83103570 ... 2.896804e-139 chr22:41917087-41929835 TNFRSF13C 9.642899e-92 chr5:150385442-150415310 CD74, TCOF1 8.302140e-09 chr9:107480158-107492721 KLF4 6.012456e-43 chr16:81519063-81525049 CMIP 1.356473e-24 chr17:81425658-81431769 BAHCC1 8.694046e-34 chr16:88448143-88480965 ZFPM1 2.133287e-12 chr10:133265906-133281093 TUBGCP2, ADAM8 4.538238e-08 chr22:22926399-22936461 IGLC7 2.602094e-08
1 chr11:61953652-61974246 BEST1, FTH1 0.000000e+00 chr14:99255246-99275454 BCL11B, AL109767.1 2.026255e-163 chr14:99255246-99275454 BCL11B, AL109767.1 1.085833e-272 chr14:99223600-99254668 BCL11B, AL109767.1 2.092935e-183 chr10:128045032-128071717 PTPRE 2.538496e-176 chr4:6198055-6202103 JAKMIP1, C4orf50 4.098742e-117 chr22:41931503-41942227 CENPM 2.004780e-141 chr9:134369462-134387253 RXRA 1.081515e-80 chr1:24909406-24919504 ... 1.340255e-90 chr22:41931503-41942227 CENPM 2.474419e-89 chr9:107480158-107492721 KLF4 1.689408e-07 chr3:4975862-4990757 BHLHE40, BHLHE40-AS1 8.784248e-39 chr15:39623205-39625650 FSIP1, AC037198.2 2.395729e-23 chr17:3910107-3919628 P2RX1 1.222270e-29 chr14:101807949-101830976 PPP2R5C, AL137779.2 1.364485e-08 chr19:21593423-21595308 AC123912.2 1.443596e-07 chr10:110353286-110359160 SMNDC1 5.102884e-07
2 chr1:212604203-212626574 FAM71A, ATF3, AL590648.2 0.000000e+00 chr10:8041366-8062418 GATA3, GATA3-AS1, AL390294.1 5.332028e-161 chr11:66311352-66319301 CD248, AP001107.3 2.221809e-250 chr14:99181080-99219442 BCL11B, AL162151.1 3.031413e-157 chr20:50269694-50277398 SMIM25 6.313702e-159 chr2:241762278-241764383 D2HGDH, AC114730.2 3.053371e-116 chr2:231669797-231676530 PTMA 1.188464e-119 chr5:1476663-1483241 SLC6A3, LPCAT1 7.206631e-77 chr2:28388837-28416648 ... 5.215648e-87 chr5:150385442-150415310 CD74, TCOF1 9.352640e-89 chr3:93470156-93470998 3.688178e-07 chr5:150385442-150415310 CD74, TCOF1 1.520758e-37 chr11:114072228-114076352 ZBTB16 5.202906e-23 chr12:108627138-108639115 SELPLG, AC007569.1 5.350308e-28 chr5:35850992-35860227 IL7R 1.421720e-08 chr17:82211956-82220400 CCDC57, AC132872.2 4.006489e-06 chr1:117653309-117657455 TENT5C 1.223868e-06
3 chr7:106256272-106286624 NAMPT, AC007032.1 0.000000e+00 chr7:142782798-142813716 TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR... 2.245007e-147 chr12:10552886-10555668 LINC02446 5.458067e-216 chr14:22536559-22563070 TRAJ7, TRAJ6, TRAJ5, TRAJ4, TRAJ3, TRAJ2, TRAJ... 6.400878e-155 chr11:61953652-61974246 BEST1, FTH1 4.827014e-161 chr20:24948563-24956577 CST7 4.800075e-117 chr19:5125450-5140568 KDM4B, AC022517.1 5.836709e-111 chr20:40686526-40691350 MAFB 3.732520e-75 chr1:184386243-184389335 ... 3.798579e-78 chr19:5125450-5140568 KDM4B, AC022517.1 9.052628e-83 chr19:56476473-56478541 ZNF667-AS1, ZNF667 1.903205e-06 chr1:220876295-220883526 HLX, HLX-AS1 1.466923e-36 chr22:44709028-44711972 PRR5 3.427431e-22 chr22:50281096-50284890 PLXNB2 3.843163e-27 chr5:134110288-134135061 TCF7 1.117150e-07 chr17:75771404-75787641 H3F3B, UNK 4.607566e-06 chr2:181303834-181310073 LINC01934 4.285367e-06
4 chr19:13824929-13854962 ZSWIM4, AC020916.1 0.000000e+00 chr20:59157931-59168100 ZNF831 1.054954e-130 chr2:136122469-136138482 CXCR4 2.905790e-222 chr7:142782798-142813716 TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR... 1.374339e-152 chr2:218361880-218373051 CATIP, CATIP-AS1 1.946499e-150 chr2:144507361-144525092 ZEB2, LINC01412, ZEB2-AS1, AC009951.6 4.220997e-114 chr6:167111604-167115345 CCR6, AL121935.1 1.224408e-96 chr2:16653069-16660704 FAM49A 2.346287e-73 chr20:24948563-24956577 ... 4.163668e-76 chr6:150598919-150601318 AL450344.3 8.325901e-69 chr22:50298976-50310027 PLXNB2, DENND6B 3.217514e-06 chr18:9707302-9713342 RAB31 2.943967e-29 chr3:4975862-4990757 BHLHE40, BHLHE40-AS1 4.286493e-22 chr6:11730442-11733107 ADTRP, AL022724.3 8.037828e-27 chr19:16363226-16378669 EPS15L1, AC020917.3 6.283947e-07 chr9:91419015-91426704 NFIL3, AL353764.1 6.142867e-06 chr20:47414623-47417478 LINC01754 6.489840e-06
5 chr9:134369462-134387253 RXRA 0.000000e+00 chr14:99223600-99254668 BCL11B, AL109767.1 4.600452e-130 chr14:99181080-99219442 BCL11B, AL162151.1 8.071691e-211 chr17:82125073-82129615 CCDC57 2.041006e-140 chr1:220876295-220883526 HLX, HLX-AS1 1.038407e-142 chr2:86783559-86792275 CD8A 4.062172e-112 chr19:50414004-50420358 POLD1, SPIB 1.830272e-96 chr18:13562096-13569511 LDLRAD4, AP005131.3 2.775838e-71 chr22:37161134-37168690 ... 5.636845e-72 chr2:231669797-231676530 PTMA 8.058508e-68 chr22:49942843-49973919 PIM3 1.418807e-05 chr17:81044486-81052618 BAIAP2 1.611984e-28 chr17:40536232-40543738 CCR7 1.324495e-21 chr17:16986865-16990016 LINC02090 2.373697e-26 chr22:39087900-39092994 APOBEC3H 1.207371e-06 chr6:146543038-146547054 RAB32 6.715263e-06 chr19:16116910-16120475 RAB8A 1.125913e-05
6 chr3:72092464-72103763 LINC00877 0.000000e+00 chr5:35850992-35860227 IL7R 4.620867e-122 chr14:99223600-99254668 BCL11B, AL109767.1 6.169497e-207 chr17:40601555-40611036 SMARCE1 2.660882e-133 chr20:1943201-1947850 PDYN-AS1 4.491744e-139 chr6:111757055-111766675 FYN 4.490952e-109 chr6:150598919-150601318 AL450344.3 1.517345e-93 chr9:107480158-107492721 KLF4 2.792462e-71 chr19:10513768-10519594 ... 1.203014e-71 chr6:167111604-167115345 CCR6, AL121935.1 1.020474e-64 chr6:113854439-113860529 MARCKS 1.726017e-05 chr2:47067863-47077814 TTC7A, AC073283.1 2.769793e-28 chr20:35250675-35252861 MMP24 1.660974e-21 chr20:43681305-43682881 MYBL2 2.832015e-26 chr16:85609841-85617514 GSE1 1.210676e-06 chr21:34907551-34910714 RUNX1 8.304409e-06 chr11:102473481-102475121 AP001830.2 1.142937e-05
7 chr5:150385442-150415310 CD74, TCOF1 0.000000e+00 chr10:33135632-33141841 IATPR 1.398564e-120 chr1:24500773-24509089 RCAN3, RCAN3AS 1.984912e-184 chr19:16363226-16378669 EPS15L1, AC020917.3 1.234663e-123 chr1:182143071-182150314 LINC01344, AL390856.1 9.402171e-135 chr4:1199596-1209141 SPON2 1.072746e-108 chr10:48671056-48673240 AC068898.1 8.200235e-91 chr8:55878674-55886699 LYN 2.976013e-70 chr4:1199596-1209141 ... 2.637658e-72 chr6:14093828-14096232 CD83 1.679937e-62 chr2:218361880-218373051 CATIP, CATIP-AS1 1.838801e-05 chr10:128045032-128071717 PTPRE 2.444100e-28 chr15:90854201-90857369 FURIN 2.832322e-21 chr19:50414004-50420358 POLD1, SPIB 1.255745e-25 chr20:63733192-63743479 SLC2A4RG, ZGPAT, LIME1 1.598500e-06 chr17:8068680-8069932 AC129492.1 1.317718e-05 chr16:81808702-81811338 PLCG2 1.249782e-05
8 chr6:41268623-41279829 TREM1 1.154384e-315 chr14:91240967-91256390 GPR68, AL135818.1 1.221951e-120 chr17:82125073-82129615 CCDC57 3.170907e-181 chr14:61319750-61346673 PRKCH, AL359220.1 4.704178e-117 chr6:44057321-44060655 AL109615.2, AL109615.3 7.512762e-133 chr16:81519063-81525049 CMIP 3.692239e-98 chr5:150385442-150415310 CD74, TCOF1 9.510751e-96 chr7:100364953-100375457 PILRA, PILRB 8.570418e-70 chr10:70600535-70604482 ... 4.668276e-71 chr1:30737200-30766455 LAPTM5 1.446541e-63 chr11:1777309-1784076 AC068580.2 2.950714e-05 chr18:13562096-13569511 LDLRAD4, AP005131.3 1.177026e-26 chr11:114065160-114066911 ZBTB16 6.424121e-21 chr7:98641522-98642532 NPTX2 2.160392e-25 chr14:105856766-105860740 IGHM 2.075807e-06 chr21:44965117-44970844 FAM207A, AP001505.1 1.640535e-05 chr16:81816510-81818969 PLCG2 1.814464e-05
9 chr22:38950570-38958424 APOBEC3A 2.957899e-309 chr6:37512323-37518673 LINC02520 1.330774e-118 chr7:142782798-142813716 TRBC1, TRBJ2-1, TRBJ2-2, TRBJ2-2P, TRBJ2-3, TR... 1.603307e-178 chr2:136122469-136138482 CXCR4 5.492263e-117 chr1:212484685-212489524 LINC02771 3.852391e-129 chr1:184386243-184389335 C1orf21, AL445228.2 1.305389e-97 chr10:12262528-12266420 AL512770.1 7.056151e-87 chr22:21936035-21941162 PPM1F, AC245452.1 9.957822e-69 chr2:144507361-144525092 ... 8.482316e-67 chr19:13092541-13106133 NFIX, LYL1, TRMT1 5.471164e-63 chr8:144308853-144327923 DGAT1, HSF1, AC233992.1 4.156898e-05 chr11:76626256-76631044 LINC02757, AP001189.5 2.186765e-26 chr20:33369063-33370940 CDK5RAP1 1.480746e-20 chr6:4281598-4282429 AL159166.1 2.905881e-25 chr17:78107894-78136274 TNRC6C, TMC6, TMC8, TNRC6C-AS1 2.410035e-06 chr12:121780861-121808228 SETD1B, RHOF, TMEM120B, LINC01089, AC084018.1,... 1.846176e-05 chr17:16986865-16990016 LINC02090 2.062376e-05

10 rows × 51 columns

Having studied markers of individual clusters, we will filter some cells out before assigning cell types names to clusters: likely doublets in clusters 10 and 14, proliferating cells in 15, stressed cells in 16.

[38]:
mu.pp.filter_obs(atac, "leiden", lambda x: ~x.isin(["10", "14", "15", "16"]))
# Analogous to
#   atac = atac[~atac.obs.leiden.isin(["10", "14", "15", "16"])]
# but doesn't copy the object
[39]:
new_cluster_names = {
    "1": "CD4+ memory T", "2": "CD8+ naïve T", "3": "CD4+ naïve T",
    "5": "CD8+ activated T", "8": "NK", "12": "MAIT",
    "9": "naïve B", "6": "memory B",
    "0": "intermediate mono", "4": "CD14 mono", "7": "CD16 mono",
    "11": "mDC", "13": "pDC",
}

atac.obs['celltype'] = atac.obs.leiden.astype("str").values
atac.obs.celltype = atac.obs.celltype.astype("category").cat.rename_categories(new_cluster_names)

We will also re-order categories for the next plots:

[40]:
atac.obs.celltype.cat.reorder_categories([
    'CD4+ naïve T', 'CD4+ memory T', 'MAIT',
    'CD8+ naïve T', 'CD8+ activated T', 'NK',
    'naïve B', 'memory B',
    'CD14 mono', 'intermediate mono', 'CD16 mono',
    'mDC', 'pDC'], inplace=True)

… and take colours from a palette:

[41]:
import matplotlib
import matplotlib.pyplot as plt

cmap = plt.get_cmap('rainbow')
colors = cmap(np.linspace(0, 1, len(atac.obs.celltype.cat.categories)))

atac.uns["celltype_colors"] = list(map(matplotlib.colors.to_hex, colors))
[42]:
sc.pl.umap(atac, color="celltype", legend_loc="on data", frameon=False)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_97_1.png

Finally, we’ll visualise some marker genes across cell types.

[43]:
marker_genes = ['IL7R', 'TRAC',
                'GATA3',
                'SLC4A10',
                'CD8A', 'CD8B', 'CD248', 'CCL5',
                'GNLY', 'NKG7',
                'CD79A', 'MS4A1', 'IGHD', 'IGHM', 'TNFRSF13C',
                'IL4R', 'KLF4', 'LYZ', 'S100A8', 'ITGAM', 'CD14',
                'FCGR3A', 'MS4A7', 'CST3',
                'CLEC10A', 'IRF8', 'TCF4']
[44]:
ac.pl.dotplot(atac, marker_genes, groupby='celltype')
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if not is_categorical(df_full[k]):
../../_images/single-cell-rna-atac_pbmc10k_2-Chromatin-Accessibility-Processing_100_1.png

Saving progress on disk#

In this chapter, we have been working on the ATAC modality only. We can save our progress into the .h5mu file. That will only update the ATAC modality inside the file.

[45]:
mu.write("data/pbmc10k.h5mu/atac", atac)
/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
... storing 'feature_types' as categorical
... storing 'interval' as categorical

Next, we’ll look into multimodal omics data integration.