This project has been replaced by ft which is a binary that replaces this pipeline in a single command!
This is a Snakemake workflow for making fully annotated fiberseq bam files from PacBio HiFi data. This pipeline creates and extends the standard fiberseq bam available through fibertools-rs by adding nucleosome calls, MSP (methylation sensitive patch) calls, options to distribute compute across a cluster, and options to make auxiliary bed files, in addition to the standard m6A calls.
This code is in an alpha state and will be changing without notice or versioning.
See docs/INSTALL.md for installation instructions.
The input data is a ccs
bam file(s) with HiFi kinetics. You can generate this file from PacBio subreads using pbccs
. The input data can be either an aligned or unaligned bam file.
If the data is multiplexed you must first process it with lima
and pass in demultiplexed bam file(s).
You can pass subread bam(s) to fiberseq-smk
by adding input_type=subreads
to your config options. This will run pbccs
on the subreads and then run the rest of the pipeline. This feature is deprecated and will be removed in the future.
You can run data using the following command, read the comments to learn more about the config options:
snakemake \
--profile profile/local `# sets up where and how jobs are submitted` \
--config \
env="fiberseq-smk" `# sets the conda env for the jobs, always the same` \
test=.test/ccs.bam `# path to the ccs reads with HiFi kinetics, and the key sets the sample name` \
ref=.test/ref.fa `# reference to align results to`
You can configure fiberseq-smk
to run on a distributed resource manager (e.g. SLURM, PBS, SGE, etc.) by creating a snakemake profile. An example configuration for a SLURM cluster is included in the profile/compute directory. To use this profile you can add the following to your command:
--profile profile/compute
Several additional profiles are made available under profile/ which can be used as a starting point for your own cluster configuration.
Before running your own data please run this small test case included in the repository shown in the Usage
section. In addition, please clear this test case and then try again with the distributed resources (e.g. profile/compute/
).
You can find examples of running the pipeline in docs/EXAMPLES.md.
Example output files if your sample name is test
:
# Aligned fiberseq bam file
results/test/test.fiberseq.bam
# Plain text fiberseq output with m6A calls, nucleosome calls, MSP calls, and 5mC calls
results/test/test.fiberseq.all.tbl.gz
# Quality control directory with figures and html reports
results/test/qc/
- MSP: methylation sensitive patch, defined as being any stretch of sequence between nucleosomes that has methylation.
- as: A bam tag with an array of MSP start sites
- al: A bam tag with an array of MSP lengths
- ns: A bam tag with an array of nucleosome start sites
- nl: A bam tag with an array of nucleosome lengths
- MM/ML: Bam tags for sorting m6a and 5mC methylation information. See the SAM spec for details.
See docs/bed.md for instructions on how to make bed files from your fiberseq bam(s).
fiberseq-smk
aims to distribute the process of making fiberseq data into many small jobs that complete quickly (<30 minutes), and in general fiberseq-smk
will create one job for every GB of input HiFi data.
Each job will use at most 16 threads and 32 GB of memory, but on average uses much less.
- Add a pipeline version to the bam header (git commit).
- Add env version to the output.