RADs stands for Robust and Accurate Deconvolution with Single-cell data. It is an improved method based on our previous publication and takes advantage of single-cell RNA-seq information to infer the cell type profiles in bulk tumor sample(s).
The high-level idea of combining bulk and single-cell RNA-seq data is shown below:
Mathematically: given a non-negative bulk RNA expression matrix B \in R_+^{m x n}, where each row i is a gene and each column j is a tumor sample, our goal is to infer an expression profile matrix C \in R_+^{m x k}, where each column l is a cell community, and a fraction matrix F \in R_+^{k x n}, such that:
B ~= C F.
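To make the shapes concrete, here is a minimal sketch that builds a toy C and F and forms B = C F. The sizes, the random values and the assumption that each column of F sums to 1 (since F holds fractions) are illustrative choices, not taken from the RADs code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): genes, tumor samples, cell communities.
m, n, k = 6, 4, 3

# C: m x k non-negative expression profiles, one column per cell community.
C = rng.random((m, k))

# F: k x n non-negative fractions; each column is assumed to sum to 1.
F = rng.dirichlet(np.ones(k), size=n).T

# B: m x n bulk expression, one column per tumor sample.
B = C @ F

print(B.shape)
```

Deconvolution runs in the opposite direction: only B (plus the single-cell reference) is observed, and C and F are the unknowns.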
Meanwhile, we use single-cell data from metastases as a reference and allow unknown cell type(s) that exist only in the primary tumor. The overall problem is shown as:
Here C_1 holds the known cell types in the metastatic tumor, C_2 holds the unknown cell type(s) found only in the primary tumor, and (C_1|C_2) denotes the horizontal stacking of these two gene expression matrices.
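The horizontal stack (C_1|C_2) can be illustrated with NumPy. The sketch below uses made-up toy sizes and random values; it only shows that stacking the columns of C_1 and C_2 is equivalent to splitting the rows of F into a known part and an unknown part.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumed): genes, known cell types, unknown cell types, samples.
m, k1, k2, n = 5, 3, 1, 4

C1 = rng.random((m, k1))   # known cell types (from the reference)
C2 = rng.random((m, k2))   # unknown cell type(s), primary tumor only
C = np.hstack([C1, C2])    # (C_1|C_2), shape m x (k1 + k2)

F = rng.random((k1 + k2, n))
F1, F2 = F[:k1], F[k1:]    # rows of F matching C1 and C2

# (C_1|C_2) F equals C_1 F_1 + C_2 F_2.
same = np.allclose(C @ F, C1 @ F1 + C2 @ F2)
print(C.shape, same)
```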
Compared to other methods, RADs has at least two advantages:
- it can work on a small number of tumor samples, which is usually difficult or impossible for other methods
- it can infer additional information about the primary tumor, while other reference-based methods can only infer information (e.g. cell types) present in the reference used
The main solver is contained in sRAD_v3.py, which has five main functions:
- _quad_prog_BCmu2F: solve F by fixing C
- _quad_prog_BFmu2C1: solve C_1 by fixing F, \mu and C_2 (if there is any)
- _quad_prog_BFmu2C2 (if necessary): solve C_2 by fixing C_1, F and \mu
- _linear_reg_mu: solve \mu using least squares
- _rna_coordescent: combine the previous functions in a coordinate descent algorithm.
For details, please refer to Section 2.2 of the paper.
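The alternating structure of the functions above can be sketched as follows. This is not the code in sRAD_v3.py: the quadratic programs are replaced by clipped least-squares updates, there is no C_2, \mu is assumed to be a single scalar, the columns of F are assumed to sum to 1, and the objective is assumed to be ||B - C_1 F||^2 + lambda ||C_1 - \mu S||^2. The function name `coordinate_descent_sketch` and all variable names are hypothetical.

```python
import numpy as np

def coordinate_descent_sketch(B, S, lam=0.1, n_iter=20):
    """Alternate F, C1 and mu updates (clipped least squares, not a QP)."""
    k = S.shape[1]
    C1 = S.copy()   # start C1 from the single-cell reference
    mu = 1.0        # assumed scalar scaling of the reference
    F = None
    for _ in range(n_iter):
        # F step: least squares for C1 F ~= B, then project to fractions.
        F = np.linalg.lstsq(C1, B, rcond=None)[0]
        F = np.clip(F, 1e-12, None)
        F /= F.sum(axis=0, keepdims=True)
        # C1 step: stationary point of the assumed objective,
        # C1 (F F^T + lam I) = B F^T + lam * mu * S, then clip at zero.
        A = F @ F.T + lam * np.eye(k)
        R = B @ F.T + lam * mu * S
        C1 = np.clip(np.linalg.solve(A, R.T).T, 0.0, None)
        # mu step: scalar least squares for C1 ~= mu * S.
        mu = float(np.sum(C1 * S) / np.sum(S * S))
    return C1, F, mu

# Toy data (assumed): the bulk is an exact mixture of the reference profiles.
rng = np.random.default_rng(0)
S_demo = rng.random((30, 3))
F_true = rng.dirichlet(np.ones(3), size=5).T
B_demo = S_demo @ F_true
C1_hat, F_hat, mu_hat = coordinate_descent_sketch(B_demo, S_demo)
print(np.linalg.norm(B_demo - C1_hat @ F_hat))
```

Clipping after an unconstrained solve is only a rough stand-in for the constrained quadratic programs; it keeps the sketch short and dependency-free but does not guarantee the same solution as a QP solver.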
- Users should clone the git repository, possibly by typing:
  git clone https://github.com/CMUSchwartzLab/RADs.git
  Instructions in this README assume a GNU/Linux command line or a Macintosh terminal. The git command above will create a subdirectory named RADs. The instructions assume that the user's current directory is RADs/code.
- The setup assumes that the user will put the data in another subdirectory under RADs (e.g., RADs/simulated_data). The code and documentation assume that the three subdirectories {code, simulated_data, results} are parallel, at the same level. The file structure is as follows:
  /some/path/to/RADs
      /simulated_data (or data): where the test data are stored
      /code: the main code to solve the deconvolution problem
      /results: where the results are saved
- The scripts that end in .py must be run using python3, not python2. The problem is formulated as a quadratic program and solved using CVXOPT in Python; you may install the package first by using
  conda install -c conda-forge cvxopt
  or
  pip install cvxopt
- Run the solver: The main code to solve the deconvolution problem is contained in run_exp_v3.py. We use this file as the main function to test our model on the simulated data. It takes four main arguments: b_noise, s_noise, sample_nums, lambda. b_noise and s_noise are two parameters that simulate noise in the data at two different levels (please refer to Section 2.3 for details), while sample_nums and lambda refer to the number of bulk samples and the regularization weight for ||C_1-\mu S||_{Fr}^2 in the objective, respectively. Please note that there is a pre-defined parameter called date around Line 18, which we used to mark different tests of the data. Users can use any value for date, as long as the data are put into a folder named after that value under RADs/simulated_data. A sample command to run the code would be:
  run_exp_v3.py 0.0 0.0 1 0.1
  where 0.0 0.0 are the levels of b_noise and s_noise, 1 is the number of bulk samples, and 0.1 is the value of lambda. Once the problem is solved, a single pickle file is generated and saved in the results directory:
  RADs/results/result_0.0_0.0_1_0.1.pickle
  This pickle file has a dictionary-like structure that stores all the inferred results, such as the inferred C, F and mu, as well as the ground truth from the simulated data. Please note that if date has been defined and used, the result path would be:
  RADs/results/{date}/result_0.0_0.0_1_0.1.pickle
- Test on real data: run_exp_v3.py is mainly for testing the performance of the model on simulated data by varying different parameters. Once the optimal parameter (e.g., lambda) has been determined for a dataset, we provide another script, testreal.py, for the real data. The command is as simple as:
  testreal.py /path/to/real_data
  The result is then stored in /RADs/results/real/real_data.pickle. Users can also modify the main function to accommodate their needs for reading and saving the data.
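The result files described in the "Run the solver" step can be located and opened with a few lines of standard-library Python. The helpers below are a hypothetical sketch: `result_path` and `load_result` are not part of the RADs code, and the keys stored in the dictionary are not listed in this README, so the usage example only prints whatever keys are present.

```python
import pickle
from pathlib import Path

def result_path(results_dir, b_noise, s_noise, sample_nums, lam, date=None):
    """Build the result file path from the naming pattern in this README."""
    base = Path(results_dir) / date if date else Path(results_dir)
    return base / f"result_{b_noise}_{s_noise}_{sample_nums}_{lam}.pickle"

def load_result(path):
    """Load one result file; the README describes it as dictionary-like."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Usage (assumes the sample command has been run from RADs/code):
# result = load_result(result_path("../results", 0.0, 0.0, 1, 0.1))
# print(sorted(result.keys()))
```

Only unpickle files you generated yourself or obtained from a trusted source, since loading a pickle can execute arbitrary code.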
While the real data cannot be shared at this moment, we provide simulated data based on the real data, as well as a Jupyter notebook, tutorial.ipynb, in the code directory to help users better understand the tool.