Read-level methylome simulator using a beta-binomial distribution.
It currently supports only two cell-type simulations (tumour and normal).
Pseudo-bulk samples with random cell-type compositions can be also simulated with the read-level methylomes.
methylseq_simulation requires Python version > 3.6 (So far, it has been tested under Python version 3.7, 3.8 and 3.9).
The dependencies are clarified in requirements.txt
.
If your environment (e.g. conda, Python venv and so on) is already activated, you can simply install the dependencies by pip install -r requirements.txt
Conda is a package and environment manager. It helps you with managing software dependencies independently from your local system. If you want to start conda, please find their installation guidance here.
- Create a new conda environment with a specific python version
conda creat -n $your_env_name python=$python_version
- Install the dependencies by
pip install -r requirements.txt
Python also supports a virtual environment.
- Create a Python virtual environment by
python -m venv $your_directory_path
- Activate the virtual environment by
source $your_directory_path/bin/activate
- Install dependencies by
pip install -r requirements.txt
Docker also provides resources to manage environment called container. We provide Dockerfile
to run the simulator with the required dependencies.
- Build a docker container by
sudo docker build -t methylseq_simulation .
- Run the container by
sudo docker run -i -t methylseq_simulation /bin/bash
- You can use the command line as described in Quick start
If you get a version-related error message as below:
Could not find a version that satisfies the requirement numpy==1.21.4
(from versions: 1.14.5, 1.14.6, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4,
1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6, 1.17.0, 1.17.1,
1.17.2, 1.17.3, 1.17.4, 1.17.5, 1.18.0, 1.18.1, 1.18.2, 1.18.3, 1.18.4,
1.18.5, 1.19.0, 1.19.1, 1.19.2, 1.19.3, 1.19.4, 1.19.5, 1.20.0, 1.20.1,
1.20.2, 1.20.3, 1.21.0, 1.21.1, 1.21.2, 1.21.3)
You may want to upgrade your pip
by pip install --upgrade pip
The simulator requires a reference genome FASTA file (such as hg19.fa) to simulate the reads. You can download various genome sequences from UCSC as a gzipped FASTA file.
If you have a specific set of regions you want to simulate reads, you can give the file name with -d
option.
The file must be tab-separated and include chromosome, start and end information with chr, start and end for column names.
You can simulate read-level methylomes and pseudo-bulk samples (with --bulk
option) by running main.py
as below:
python main.py --help
usage: main.py [-h] -r F_REF [-d F_REGION] [-o OUTPUT_DIR] [--save_img]
[-ng N_REGIONS] [-a A] [-b B] [-nr N_READS] [-k K_MER]
[-l LEN_READ] [--seed SEED] [--bulk] [-nb N_BULKS] [-s STD]
optional arguments:
-h, --help show this help message and exit
-r F_REF, --f_ref F_REF
.fasta file for the reference genome
-d F_REGION, --f_region F_REGION
tab-separated .csv file, DMRs should be given with
mean methylation level of each cell type, chr, start
and end. n_regions will be selected from hg19 CpG
islands if the file is not given
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
a directory where all generated results will be saved
(default: ./)
--save_img Save simulated methylation patterns as a .png file
(default: False)
-ng N_REGIONS, --n_regions N_REGIONS
Number of regions to select from CGIs when the region
file is not given (default: 100)
-a A, --a A alpha parameter of the beta-binomial distribution to
simulate read-level methylomes (default: 1e-6)
-b B, --b B beta parameter of the beta-binomial distribution to
simulate read-level methylomes (default: 5)
-nr N_READS, --n_reads N_READS
Read coverage to simulate in each DMR (default: 120)
-k K_MER, --k_mer K_MER
K to process the simulated read sequences into K-mer
sequence (default: 1)
-l LEN_READ, --len_read LEN_READ
Read length to simulate (default: 150)
--seed SEED seed number (default: 950410)
--bulk Whether you want to generate pseudo-bulks or the
entire dataset (default: False)
-nb N_BULKS, --n_bulks N_BULKS
Number of bulks, Applicable only when --bulk is True
(default: 1)
-s STD, --std STD Standard deviation to sample local proportions. The
larger value is given, the more varying local
proportions are sampled from a Gaussian distribution
centred at the global proportion (default: 0.0)
If --save_img
option is on, you can get the summary of results as figures in the designated output_dir
.