Skip to content

Latest commit

 

History

History
84 lines (60 loc) · 3.52 KB

README.md

File metadata and controls

84 lines (60 loc) · 3.52 KB

fastocc

Rapid and robust estimation of nucleosome-free regions and nucleosome positioning.

Quickstart

Most of the core functionality in fastocc is accessible through the CLI with a small number of commands.

The bulk of processing is done by the call command, which takes a .bam file containing paired ATAC-seq reads, processes it into fragments, and then counts those fragments into nucleosomal (longer) and nucleosome-free (shorter) bins along the genome (default is 10bp).

fastocc can be run in tiles over the whole genome, but the algorithm will generally be uninformative outside of regions of high ATAC-seq signal (ie peaks), specified with the --bed argument.

fastocc call --bam my_atac.bam --bed peaks.bed --output fragments.h5wig

.h5wig is a novel format (built on HDF5), which stores a set of regions (e.g. peaks), and acro each region tracks multiple signals in consecutive bins across the whole region. This concept of binned data in ranges is used internally in many places in fastocc, and is a significant reason behind its efficiency.

We can then call nucleosome-free regions using the nfr.

fastocc nfr --fragments fragments.h5wig --output nfrs.bed

If we want to visualise the predicted occupancy in a genome browser (or the counts of NF for nucleosomal reads), we can export these from the fragments file as a .bigWig:

fastocc export --fragments fragments.h5wig --track occ --output occ.bw

Lastly, we can export everything to an extended .bedGraph format. This isn't recommended for the whole dataset due to the resulting file size, so a set of regions can be provided:

fastocc dump --fragments fragments.h5wig --bed regions_of_interest.bed --output fragments.bg

If you do need programmitic random access to the fragment-level data, functions are provided in Python (exported in the package) and R (in extra/). Implementing reading should be straightforward in any other language with a mature HDF5 library.

What happens next?

More information about the package can be found in the documentation and source code, and the Jupyter notebooks included in the repository.

Some things you might be interested in:

  • Merge NFRs across cell lines to generate precises consensus cis-regulatory elements.
  • Scan NFRs for enriched motifs.
  • Predict TF-binding based on ChIP-seq and NFRs
  • Use the internal API to analyse scATAC data directly from .arrow files.

If you do any of this, please let me know! Github DMs are the best way to reach me, see here. If you have any issues or feature requests, I will try to watch issues here but may require a nudge. I am no longer working on this full time, so you may be asked to contribute, please don't be scared to do this!

Acknowledgements

The development of fastocc took place within the Ensembl Regulation team at EMBL-EBI. The project received funding from the European Union’s Horizon 2020 programme under the AQUA-FAANG and GENE-SWitCH projects.

The tool itself was based on ideas originally implemented in the NucleoATAC package, if you're interested in the biological reasoning behind fastocc you sohuld read their paper in Genome Research and check out NucleoATAC on GitHub.

fastocc is available under the Apache v2.0 license, included in the repository.