Analysing CLL data#

This notebooks presents primary blood cancer encyclopedia data, in particular the chronic lymphocytic leukemia (CLL) dataset, analysis using muon.

Multi-omics integration of different modalities in this dataset has been originally described in an R notebook here. For more details please see a detailed analysis in the original publication.

[1]:
import scanpy as sc
import muon as mu
import numpy as np
import pandas as pd
import mofax as mofa

Load data#

Count matrices and metadata are publicly available as part of this R package or, originally, here in another R package. For the purpose of this notebook, those matrices and the metadata table were saved in individual .tsv files.

[2]:
data_dir = "data/CLL"

Metadata:

[3]:
obs = pd.read_csv(f"{data_dir}/metadata.tsv", sep="\t", index_col='sample')

Matrices with data:

[4]:
mods = {i:sc.AnnData(pd.read_csv(f"{data_dir}/{i}.csv", index_col=0).T) for i in ("drugs", "methylation", "rna", "mutations")}

A MuData object can be created directly from a dictionary of modalities:

[5]:
mdata = mu.MuData(mods)
mdata.obs = mdata.obs.join(obs)
[6]:
mdata
[6]:
MuData object with n_obs × n_vars = 200 × 9627
  obs:  'Gender', 'age', 'TTT', 'TTD', 'treatedAfter', 'died', 'IGHV_status', 'trisomy12'
  4 modalities
    drugs:      184 x 310
    methylation:        196 x 4248
    rna:        136 x 5000
    mutations:  200 x 69

We can notice that not all samples (patients) have been profiled in each modality. Missing values like that are quite common for biomedical datasets.

Integrate data#

Multi-omics factor analysis (MOFA) integration can be run on a MuData object with a single command:

mu.tl.mofa()

In case some samples are missing from some of the modalities, as they are in this CLL dataset, muon can be instructed to handle it. One of two modes has to be chosen for that:

  1. use_obs='union', which will populate values for missing samples in certain modalities with NAs, or

  2. use_obs='intersection', which will use a subset of samples that are present in all modalities.

We will use the first option here.

[100]:
mu.tl.mofa(mdata, use_obs='union',
           n_factors=15, convergence_mode='medium',
           outfile="models/CLL_union.hdf5")

        #########################################################
        ###           __  __  ____  ______                    ###
        ###          |  \/  |/ __ \|  ____/\    _             ###
        ###          | \  / | |  | | |__ /  \ _| |_           ###
        ###          | |\/| | |  | |  __/ /\ \_   _|          ###
        ###          | |  | | |__| | | / ____ \|_|            ###
        ###          |_|  |_|\____/|_|/_/    \_\              ###
        ###                                                   ###
        #########################################################



Loaded view='drugs' group='group1' with N=200 samples and D=310 features...
Loaded view='methylation' group='group1' with N=200 samples and D=4248 features...
Loaded view='rna' group='group1' with N=200 samples and D=5000 features...
Loaded view='mutations' group='group1' with N=200 samples and D=69 features...


Model options:
- Automatic Relevance Determination prior on the factors: True
- Automatic Relevance Determination prior on the weights: True
- Spike-and-slab prior on the factors: False
- Spike-and-slab prior on the weights: True

Likelihoods:
- View 0 (drugs): gaussian
- View 1 (methylation): gaussian
- View 2 (rna): gaussian
- View 3 (mutations): bernoulli


######################################
## Training the model with seed 1 ##
######################################



Converged!



#######################
## Training finished ##
#######################


Saving model in models/CLL_union.hdf5...
Saved MOFA embeddings in .obsm['X_mofa'] slot and their loadings in .varm['LFs'].

We can see that the model has been trained successfully and has managed to learn some meaningful structure in the data:

[9]:
mdata.obs.IGHV_status = mdata.obs.IGHV_status.astype(str).astype("category").cat.rename_categories({"1.0": "mutated", "0.0": "unmutated"})
mu.pl.mofa(mdata, color="IGHV_status")
... storing 'Gender' as categorical
... storing 'treatedAfter' as categorical
... storing 'died' as categorical
WARNING:matplotlib.axes._axes:*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
WARNING:matplotlib.axes._axes:*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
WARNING:matplotlib.axes._axes:*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
_images/CLL_18_1.png

Interpret intergrated data#

[10]:
# # Can be used to plot figures with higher resolution:
# from matplotlib import rcParams
# rcParams["figure.dpi"] = 200
[103]:
model = mofa.mofa_model("models/CLL_union.hdf5")
model
[103]:
MOFA+ model: CLL union
Samples (cells): 200
Features: 9627
Groups: group1 (200)
Views: drugs (310), methylation (4248), mutations (69), rna (5000)
Factors: 15
Expectations: W, Z
[107]:
mofa.plot_r2(model, x='View', vmax=15)
[107]:
_images/CLL_22_0.png

We notice e.g. that Factor 1 captures variability across all modalities, Factor 2 captures variability across different modalities incl. drug response-specific one, Factor 4 is quite specific to the RNA modality.

Factor 1#

[108]:
mofa.plot_weights(model, views=['rna', 'mutations'], factors=0, zero_line=True, ncols=2, label_size=10)
[108]:
<AxesSubplot:title={'center':'mutations'}, xlabel='Feature weight'>
_images/CLL_25_1.png

We can see that top weights from the RNA modality make sense and are likely to represent the same latent process as IGHV status:

[134]:
model.metadata.IGHV_status = (
    model.metadata.IGHV_status.astype(str).
        replace({'1.0': 'mutated', '0.0': 'unmutated'}).
        astype('category').cat.reorder_categories(["mutated", "unmutated", "nan"])
)

mofa.plot_factors_violin(model, factors=0, dots=True, violins=False, color="IGHV_status", palette="Set2")
[134]:
<AxesSubplot:ylabel='Factor value'>
_images/CLL_27_1.png

Focusing on the drug response, we notice PF477736 (D_078), AZD7762 (D_020), AT13387 (D_017), and dasatinib (D_050) have the major association with this factor. Those are the exact top associations with the IGHV status described in the original paper:

[163]:
mofa.plot_weights(model, views=['drugs'], factors=0, label_size=10)
[163]:
<AxesSubplot:title={'center':'drugs'}, xlabel='Feature weight'>
_images/CLL_29_1.png

Factor 3#

[164]:
mofa.plot_weights(model, views=['mutations'], factors=2, label_size=10)
[164]:
<AxesSubplot:title={'center':'mutations'}, xlabel='Feature weight'>
_images/CLL_31_1.png

We can see how different the values are for this factor depending on the trisomy12:

[154]:
model.metadata.trisomy12 = (
    model.metadata.trisomy12.astype(str).
        replace({'1.0': 'trisomy12', '0.0': 'no trisomy12', 'nan': 'NA'}).
        astype('category').cat.reorder_categories(['trisomy12', 'no trisomy12', 'NA'])
)
mofa.plot_factors_violin(model, factors=2, violins=True, violins_alpha=.3, dots=True, color="trisomy12", palette="Set2")
[154]:
<AxesSubplot:ylabel='Factor value'>
_images/CLL_33_1.png

Factor 4#

[178]:
mofa.plot_weights(model, views=['rna'], factors="Factor4", label_size=10)
[178]:
<AxesSubplot:title={'center':'rna'}, xlabel='Feature weight'>
_images/CLL_35_1.png

We can see that top weights from the RNA modality might be related to myeloid leukemia:

We, however, do not have clinical covariates related to such sample status in order to corroborate this observation.

[181]:
mofa.plot_factors(model, y="Factor4", color=["ENSG00000163220", "ENSG00000085265"], size=20)
[181]:
<AxesSubplot:title={'center':'ENSG00000085265'}, xlabel='Factor1 value', ylabel='Factor4 value'>
_images/CLL_38_1.png

Combinations of factors#

A couple of major factors associated with differences in the drug response assay seem to capture molecular profiles of patients so that they could be put into groups according to their somatic mutations status (IGHV and trisomy12).

[162]:
mofa.plot_factors(model, x=0, y=2, color=["IGHV_status", "trisomy12"], size=20, zero_line=True, palette="Set2")
[162]:
<AxesSubplot:title={'center':'trisomy12'}, xlabel='Factor1 value', ylabel='Factor3 value'>
_images/CLL_41_1.png