In short, BioTIP is an R-package designated for the characterization of biological tipping points. A more detailed overview - (1) what are tipping points, (2) how BioTIP improves on existing methods, (3) what datasets can BioTIP be applied to, and (4) how to install BioTIP - can be found at BioTIP.
The general workflow of BioTIP is detailed in the following chart:
Examples of BioTIP applied to various datasets can be found in the folder Rcode.
We here present two cases in the following sections, one with a single-cell RNA dataset and the other with a bulk RNA dataset.
The following figure shows BioTIP's analytic pipeline of single-cell RNA-seq analysis, as an alternative to differential expression analysis - instead of discovering marker genes from the differential-expression analysis between stable states, BioTIP introduces the tipping-point analysis that focuses on unstable transition states. The orange arrows show that BioTIP is applied to cell state ensembles (clusters) without the pseudo-order information. The dashed orange arrow shows that BioTIP may use pseudo-order information to focus cell states on the trajectory of interests.
We will perform downstream analysis of the existing the BloodNet dataset, the hematopoietic stem and progenitor cells (HSPCs) from ten female mice. The detail of the experiment can be found in details on our Vignette.
Note that, as mentioned in 'Why BioTIP?', multiple tipping points may coexist during an observed progression and multiple CTSs may coexist in the same critical transition state. We set a parameter n
in our function getTopMCI
to incorporate situations where multiple tipping points coexist; we also set a parameter n
in our function getMaxMCImember
to incorporate situations where multiple CTSs coexist. An example of how this function is used is in the Identifying Dynamic Network Biomodule (MCI) section of the Vignette.
As indicated in the BioTIP workflow chart, the analysis can be divided into the following two sections:
- Data preprocessing before running BioTIP,
- Standard Identification in 6 steps, detailed in the BioTIP workflow chart.
An existing dataset, GSE6136, is used to demonstrate how our functions are applied. Samples were collected from transgenic mouse lymphomas and divided into five groups based on clinical presentation, pathology, and flow cytometry (Lenburg 2007), thus belonging to cross-sectional profiles. Noticing these five group represent a control stage similar to non-transgenic B cells and four major periods of B-cell lymphomagenesis, Dr. Chen and coauthors used the DNB method to identify the pre-disease state exists around the normal activated period (P2), i.e., the system transitions to the disease state around the transitional lymphoma period (Figure S12 in publication (Chen 2012)). Start by installing the package 'BioTIP' and other dependent packages such as stringr, psych, and igraph if not. Below are some examples.
# load package
library(BioTIP)
# first row of the GEO-downloaded matrix should be column-name.
data(GSE6136_matrix)
GSE6136 = GSE6136_matrix[,-1]
rownames(GSE6136) <- GSE6136_matrix[,1]
dim(GSE6136) #[1] 22690 rows and 26 columns
rm(GSE6136_matrix)
The summary of GEO-downloaded clinical information GSE6136_cli is shown below.
#requires library(stringr)
library(BioTIP)
data(GSE6136_cli)
#dim(GSE6136_cli) #check dimension
cli = t(GSE6136_cli)
library(stringr)
colnames(cli) = str_split_fixed(cli[1,],'_',2)[,2]
cli = cli[-1,]
cli = data.frame(cli)
cli[,"cell-type:ch1"] = str_split_fixed(cli$characteristics_ch1.1,": ",2)[,2]
cli[,"Ig clonality:ch1"] = str_split_fixed(cli$characteristics_ch1.3,": ",2)[,2]
colnames(cli)[colnames(cli) == "cell-type:ch1"] = "group"
cli$Row.names = cli[,1]
head(cli[,1:3])
We normalized the expression of genes using log2() scale. This normalization will ensure a more accurate comparison of the variance between the expression groups (clusters).
df <- log2(GSE6136+1)
df[1:3,1:5]
# GSM142398 GSM142399 GSM142400 GSM142401 GSM142402
#1415670_at 10.21917 10.29048 9.987548 10.07682 9.827343
#1415671_at 10.90358 11.15993 10.776186 10.93000 11.468268
#1415672_at 11.11530 10.89209 11.091303 11.04029 11.109504
The sample assembles are previously published.
tmp <- names(table(cli$group))
cli$group = factor(cli$group, levels=c('resting','activated','lymphoma_marginal','lymphoma_transitional','lymphoma_aggressive'))
samplesL <- split(cli[,1],f = cli$group)
names(samplesL)
#[1] "resting" "activated" "lymphoma_marginal"
#[4] "lymphoma_transitional" "lymphoma_aggressive"
The first step is to calculate an Index of Critical transition (Ic score) of the dataset. One can use the function 'getIc' to calculate the Ic score based on randomly selected genes (20-1000). 'GetIc' has a key parameter fun which gives the options to call the existing Ic score or the proposed new Ic* score.
RandomG <- sample(rownames(df), 100)
IC.new <- getIc(df, samplesL, RandomG, fun='BioTIP')
plotIc(IC.new, las = 2)
IC.old <- getIc(df, samplesL, RandomG, fun='cor')
plotIc(IC.old, las = 2)
To estimate the distribution of random Ic scores, one can call the function simulation_Ic.
simuIC <- simulation_Ic(length(RandomG),samplesL,df, fun='BioTIP')
par(mar = c(10,5,0,2))
plot_Ic_Simulation(IC,simuIC,las = 2)
Once pre-processed, we can analyze the clinical-stage ensembles. Here, we demonstrate a pre-selection of 226 (1%) genes for each state.
test <- sd_selection(df, samplesL,0.01)
class(test) #[1] "list"
lapply(test, dim)
#$resting
#[1] 226 5
#
#$activated
#[1] 226 3
#
#$lymphoma_marginal
#[1] 226 6
#
#$lymphoma_transitional
#[1] 226 5
#
#$lymphoma_aggressive
#[1] 226 7
A graphical presentation of genes of interest can be achieved using the following functions. The getNetwork
function will obtain an igraph object based on Pearson Correlation of test
. This igraphL
object is then run using the getCluster_methods
function classify nodes.
library(igraph)
library(psych)
igraphL <- getNetwork(test, fdr = 1)
cluster <- getCluster_methods(igraphL)
names(cluster)
class(cluster) #[1] "list"
class(cluster[[1]]) #[1] "communities"
S4: Identifying Dynamic Network Biomodule
Here, ‘module’ refers to a cluster of network nodes (e.g. transcripts) highly linked (e.g. by correlation). “Biomodule” refers to the module resenting significantly highest score called “Module-Criticality Index (MCI)” per state.
The following step shows a graph of classified clustered samples for five different states. MCI score is calculated for each module using the getMCI
function. The getMaxMCImember
function will obtain a list of modules with the highest
MCI at each stage. Use "head(maxCIms)"
to view the MCI scores calculated. Use
plotMaxMCI
function to view the line linking the highest MCI score at each state.
membersL_noweight <- getMCI(cluster,test,adjust.size = FALSE)
plotBar_MCI(membersL_noweight,ylim = c(0,6))
maxMCIms <- getMaxMCImember(membersL_noweight[[1]],membersL_noweight[[2]],min =10)
names(maxMCIms)
names(maxMCIms[[1]])
names(maxMCIms[[2]])
head(maxMCIms[['idx']])
head(maxMCIms[['members']][['lymphoma_aggressive']])
To get the selected statistics of biomodule (i.e., the gene module that presents the highest MCI score) of each state, please run the following
biomodules = getMaxStats(membersL_noweight[['members']],maxMCIms[[1]])
maxMCI = getMaxStats(membersL_noweight[['MCI']],maxMCIms[[1]])
head(maxMCI)
maxSD = getMaxStats(membersL_noweight[['sd']],maxMCIms[[1]])
head(maxSD)
To get the biomodule with the highest MCI score among all states, as we call it CTS (Critical Transition Signals), please run the following.
CTS = getCTS(maxMCI, maxMCIms[[2]])
Run the following to visualize the trendence of every state represented by the cluster with the highest MCI scores.
par(mar = c(10,5,0,2))
plotMaxMCI(maxMCIms,membersL_noweight[[2]],states = names(samplesL),las = 2)
We then perform simulation for MCI scores based on identified signature size
(length(CTS) ) using the simulationMCI
function.Use plot_MCI_simulation
function to visualize the result. This step usually takes 20-30mins, so here to save time, we picked a small number 3 as the length of the CTS.
simuMCI <- simulationMCI(3,samplesL,df)
plot_MCI_Simulation(maxMCI,simuMCI)
The next step is to calculate an Index of Critical transition (Ic score) of the dataset. First, use the getIc function to calculate the Ic score based on the biomodule previously identified. We use the function plotIc to draw a line plot of the Ic score.
IC <- getIc(df,samplesL,CTS)
par(mar = c(10,5,0,2))
plotIc(IC,las = 2)
Then use the two functions to evaluate two types of empirical significance, respectively. The first function simulation_Ic calculates random Ic-scores by shuffling features (transcripts). Showing in the plot is Ic scores of the identification (red) against its corresponding size-controlled random scores (grey).
simuIC <- simulation_Ic(length(CTS),samplesL,df)
par(mar = c(10,5,0,2))
plot_Ic_Simulation(IC, simuIC,las = 2)
Importantly, the function plot_simulation_sample calculates random Ic-scores by shuffling samples and visualizes the results. Showing in the plot is observed Ic-score (red vertical line) comparing to the density of random scores (grey), at the tipping point identified.
RandomIc = simulation_Ic_sample(df, sampleNo=length(samplesL[['lymphoma_aggressive']]),
Ic=IC[['lymphoma_aggressive']], genes=CTS, B=100)
plot_Ic_Simulation(IC, RandomIc, las = 2)
The final step is to calculate the Delta Score, which evaluates the significance of our identified CTS.
plot_SS_Simulation(Ic, Ic_simuG,
main = paste("Delta of Ic*", length(CTS),"genes"),
ylab='BioTIP 35 genes') # [1] 0
plot_Ic_Simulation(Ic, Ic_simuG, las = 0, ylim = c(0,1.5),
order = NULL, main = 'BioTIP 35 genes',
ylab = "Ic*", fun = 'boxplot',
which2point = NULL)
plot_SS_Simulation(Ic, Ic_simuG,
main = paste("Delta of Ic*", length(CTS),"genes"),
ylab='BioTIP 35 genes')