add sctools
nikellepetrillo committed Aug 26, 2024
1 parent 42956ea commit 2070107
Showing 201 changed files with 35,336 additions and 0 deletions.
36 changes: 36 additions & 0 deletions tools/scripts/sctools/Dockerfile
@@ -0,0 +1,36 @@
FROM python:3.7.7

LABEL maintainer="Farzaneh Khajouei <[email protected]>" \
software="sctools v.1.0.0" \
description="A collection of tools for single cell data. Splitting fastq files based on cellbarcodes and other tools to compute metrics on single cell data using barcodes and UMIs."


RUN apt-get update && apt-get upgrade -y && apt-get install -y patch libhdf5-dev vim apt-utils
RUN mkdir /sctools/

COPY . /sctools

ARG htslib_version="1.13"

RUN cd /sctools/fastqpreprocessing &&\
wget https://github.com/khajoue2/libStatGen/archive/refs/tags/v1.0.15.broad.tar.gz &&\
wget https://github.com/samtools/htslib/releases/download/${htslib_version}/htslib-${htslib_version}.tar.bz2 &&\
tar -zxvf v1.0.15.broad.tar.gz &&\
tar -jxvf htslib-${htslib_version}.tar.bz2 &&\
mv libStatGen-1.0.15.broad libStatGen

RUN cd /sctools/fastqpreprocessing &&\
wget http://www.cs.unc.edu/Research/compgeom/gzstream/gzstream.tgz &&\
tar -xvf gzstream.tgz

RUN cd /sctools/fastqpreprocessing &&\
make -C libStatGen

RUN cd /sctools/fastqpreprocessing && make -C htslib-${htslib_version}/ && make -C gzstream

RUN cd /sctools/fastqpreprocessing && mkdir bin obj && make install

RUN cp /sctools/fastqpreprocessing/bin/* /usr/local/bin/

WORKDIR /usr/local/bin/sctools

27 changes: 27 additions & 0 deletions tools/scripts/sctools/LICENSE
@@ -0,0 +1,27 @@
Copyright (c) 2017 Human Cell Atlas Authors, https://humancellatlas.org
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name Broad Institute, Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
3 changes: 3 additions & 0 deletions tools/scripts/sctools/MANIFEST.in
@@ -0,0 +1,3 @@
include src/sctools/test/data/*
include README.rst
include LICENSE
157 changes: 157 additions & 0 deletions tools/scripts/sctools/README.rst
@@ -0,0 +1,157 @@
Single Cell Tools
#################

.. image:: https://img.shields.io/circleci/project/github/HumanCellAtlas/sctools.svg?label=Unit%20Test%20on%20Circle%20CI%20&style=flat-square&logo=circleci
   :target: https://circleci.com/gh/HumanCellAtlas/sctools/tree/master
   :alt: Unit Test Status

.. image:: https://img.shields.io/codecov/c/github/HumanCellAtlas/sctools/master.svg?label=Test%20Coverage&logo=codecov&style=flat-square
   :target: https://codecov.io/gh/HumanCellAtlas/sctools
   :alt: Test Coverage on Codecov

.. image:: https://img.shields.io/readthedocs/sctools/latest.svg?label=ReadtheDocs%3A%20Latest&logo=Read%20the%20Docs&style=flat-square
   :target: http://sctools.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/snyk/vulnerabilities/github/HumanCellAtlas/sctools/requirements.txt.svg?label=Snyk%20Vulnerabilities&logo=Snyk
   :target: https://snyk.io/test/github/HumanCellAtlas/sctools/?targetFile=requirements.txt
   :alt: Snyk Vulnerabilities for GitHub Repo (Specific Manifest)

.. image:: https://img.shields.io/github/release/HumanCellAtlas/sctools.svg?label=Latest%20Release&style=flat-square&colorB=green
   :target: https://github.com/HumanCellAtlas/sctools/releases
   :alt: Latest Release

.. image:: https://img.shields.io/github/license/HumanCellAtlas/sctools.svg?style=flat-square
   :target: https://img.shields.io/github/license/HumanCellAtlas/sctools.svg?style=flat-square
   :alt: License

.. image:: https://img.shields.io/badge/python-3.6-green.svg?style=flat-square&logo=python&colorB=blue
   :target: https://img.shields.io/badge/python-3.6-green.svg?style=flat-square&logo=python&colorB=blue
   :alt: Language

.. image:: https://img.shields.io/badge/Code%20Style-black-000000.svg?style=flat-square
   :target: https://github.com/ambv/black
   :alt: Code Style

Single Cell Tools provides utilities for manipulating sequence data formats suitable for use in
distributed systems analyzing large biological datasets.

Download and Installation
=========================

.. code:: bash

    git clone https://github.com/humancellatlas/sctools.git
    cd sctools
    pip3 install .
    pytest  # verify installation; run tests

sctools Package
===============

The sctools package provides both command line utilities and classes designed for use in python
programs.

Command Line Utilities
======================

1. Attach10XBarcodes: Attach barcodes stored in fastq files to reads in an unaligned bam file
2. SplitBam: Split a bam file into chunks, guaranteeing that each cell is contained in a single chunk
3. CalculateGeneMetrics: Calculate information about genes in an experiment or chunk
4. CalculateCellMetrics: Calculate information about cells in an experiment or chunk
5. MergeGeneMetrics: Merge gene metrics calculated from different chunks of an experiment
6. MergeCellMetrics: Merge cell metrics calculated from different chunks of an experiment
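
The chunking guarantee described for SplitBam can be illustrated with a small sketch. This is not the sctools implementation: the function names ``assign_chunk`` and ``split_by_cell`` are hypothetical, and records are assumed to be plain ``(cell_barcode, read)`` tuples rather than bam records. The idea shown is only that a deterministic function of the cell barcode sends every read of a cell to the same chunk.

```python
import zlib
from collections import defaultdict


def assign_chunk(cell_barcode, n_chunks):
    """Map a cell barcode to a chunk index (hypothetical helper)."""
    # crc32 is stable across processes, unlike the built-in hash() for str.
    return zlib.crc32(cell_barcode.encode("ascii")) % n_chunks


def split_by_cell(records, n_chunks):
    """Group (cell_barcode, read) tuples so each cell lands in one chunk."""
    chunks = defaultdict(list)
    for cell_barcode, read in records:
        chunks[assign_chunk(cell_barcode, n_chunks)].append((cell_barcode, read))
    return chunks


if __name__ == "__main__":
    records = [("AAAC", "read1"), ("GGTT", "read2"), ("AAAC", "read3")]
    for index, chunk in sorted(split_by_cell(records, n_chunks=4).items()):
        print(index, chunk)
```

Because the chunk index depends only on the barcode, per-chunk cell metrics can be computed independently and merged afterwards without double-counting a cell.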

Main Package Classes
====================

1. **Platform**: an abstract class that defines a common data structure for different 3' sequencing
   formats. All algorithms and methods in this package that are designed to work on 3' sequencing data
   speak to this common data structure. Currently 10X_v2 is defined.

2. **Reader**: a general iterator over arbitrarily zipped file(s) that is extended to work with common
   sequence formats like fastq (fastq.Reader) and gtf (gtf.Reader). We recommend using the pysam
   package for reading sam and bam files.

3. **TwoBit & ThreeBit**: DNA encoders that store DNA in 2- and 3-bit form. 2-bit is smaller but
   randomizes "N" nucleotides. Both classes support fast operations over common sequence tasks such
   as the calculation of GC content.

4. **ObservedBarcodeSet & PriorBarcodeSet**: classes for analysis and comparison of sets of barcodes
   such as the cell barcodes used by 10X genomics. Supports operations like summarizing hamming
   distances and comparing observed sequence diversity to expected (normally uniform) diversity.

5. **gtf.Reader & gtf.Record**: GTF iterator and GTF record class that exposes the gtf
   fields as a lightweight, lazy-parsed python object.

6. **fastq.Reader & fastq.Record**: fastq reader and fastq record class that exposes the fastq fields
   as a lightweight, lazy-parsed python object.

7. **Metrics**: calculate information about the genes and cells of an experiment.

8. **Bam**: split bam files into chunks and attach barcodes as tags.
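
To make the trade-off behind the 2-bit encoding concrete, here is a minimal sketch. It does not use the sctools ``TwoBit`` class or its API; the functions ``encode_2bit``, ``decode_2bit`` and ``gc_content`` and the bit assignment (A=00, C=01, G=10, T=11) are assumptions chosen for illustration. With only four codes available there is no spare value for "N", so ambiguous bases are replaced by a random nucleotide.

```python
import random

# Assumed bit assignment for illustration; sctools may use a different one.
_ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
_DECODE = "ACGT"


def encode_2bit(sequence, rng=random):
    """Pack a DNA string into an int, two bits per base."""
    value = 0
    for base in sequence.upper():
        code = _ENCODE.get(base)
        if code is None:
            # No spare 2-bit code exists for "N", so pick a random base.
            code = rng.randrange(4)
        value = (value << 2) | code
    return value


def decode_2bit(value, length):
    """Unpack `length` bases; the length is needed because leading A's are zeros."""
    bases = []
    for _ in range(length):
        bases.append(_DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def gc_content(value, length):
    """Fraction of G/C bases, computed directly on the encoded integer."""
    gc = 0
    for _ in range(length):
        if (value & 0b11) in (0b01, 0b10):
            gc += 1
        value >>= 2
    return gc / length


if __name__ == "__main__":
    encoded = encode_2bit("ACGT")
    print(decode_2bit(encoded, 4), gc_content(encoded, 4))
```

A 3-bit scheme spends one extra bit per base to reserve a dedicated code for "N", which is why it can round-trip ambiguous bases while the 2-bit form cannot.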


Viewing Test Results and Coverage
=================================

To calculate and view test coverage, cd to the ``sctools`` directory and
run the following two commands to generate the report and open it in your web browser:

.. code:: bash

    pytest --cov-report html:cov_html --cov=sctools
    open cov_html/index.html

Definitions
===========

Several definitions are helpful to understand how sequence data is analyzed.

1. **Cell**: An individual cell, the target of single-cell RNA-seq experiments and the entity that we
   wish to characterize.

2. **Capture Primer**: A DNA oligonucleotide containing amplification machinery, a fixed cell barcode,
   a random molecule barcode, and an oligo-dT tail to capture poly-adenylated RNA.

3. **Molecule**: A single mRNA molecule that is captured by an oligo-dT capture
   primer in a single-cell sequencing experiment.

4. **Molecule Barcode**: A molecule barcode (alias: UMI, RMT) is a short, random DNA barcode attached
   to the capture primer that has adequate length to be probabilistically unique across the experiment.
   Therefore, when multiple molecules of the same gene are captured in the same cell, they can be
   differentiated through having different molecule barcodes. The proposed GA4GH standard tag for a
   molecule barcode is UB, and the tag for molecule barcode qualities is UY.

5. **Cell Barcode**: A short DNA barcode that is typically selected from a whitelist of barcodes that
   will be used in an experiment. All capture primers for a given cell will contain the same cell
   barcode. The proposed GA4GH standard tag for a cell barcode is CB, and the tag for cell barcode
   qualities is CY.

6. **Fragment**: During library construction, mRNA molecules captured on capture primers are amplified,
   and the resulting amplified oligonucleotides are fragmented. In 3' experiments, only the fragment
   that contains the 3' end is retained, but the break point will be random, which means fragments
   often have different lengths. Once sequenced, different fragments can be identified as unique
   combinations of cell barcode, molecule barcode, the chromosome the sequence aligns to, and the
   position it aligns to on that chromosome, after correcting for clipping that the aligner may add.

7. **Bam/Sam file**: The GA4GH standard file type for the storage of aligned sequencing reads.
   Unless specified, our Single Cell Tools will operate over bam files containing either aligned or
   unaligned reads.
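
The molecule and fragment definitions above translate into simple set keys. The sketch below is illustrative only and is not taken from sctools: the ``Alignment`` tuple, its field names and the function ``count_fragments_and_molecules`` are hypothetical, the ``gene`` field is an assumed annotation, and ``position`` is assumed to be already corrected for clipping.

```python
from collections import namedtuple

# Hypothetical record type; real data would come from CB/UB tags in a bam file.
Alignment = namedtuple(
    "Alignment", "cell_barcode molecule_barcode gene chromosome position"
)


def count_fragments_and_molecules(alignments):
    """Count unique fragments and unique molecules among aligned reads."""
    fragments = set()
    molecules = set()
    for a in alignments:
        # A fragment is a unique (cell, molecule barcode, chromosome, position).
        fragments.add((a.cell_barcode, a.molecule_barcode, a.chromosome, a.position))
        # Molecules of one gene in one cell differ by their molecule barcode.
        molecules.add((a.cell_barcode, a.gene, a.molecule_barcode))
    return len(fragments), len(molecules)


if __name__ == "__main__":
    reads = [
        Alignment("AAAC", "TTG", "G1", "chr1", 100),
        Alignment("AAAC", "TTG", "G1", "chr1", 100),  # duplicate read, same fragment
        Alignment("AAAC", "TTG", "G1", "chr1", 140),  # new fragment, same molecule
        Alignment("AAAC", "CCA", "G1", "chr1", 100),  # different molecule
    ]
    print(count_fragments_and_molecules(reads))
```

Several fragments can share one molecule, so the fragment count is always at least the molecule count for the same set of reads.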

Development
===========

Code Style
----------
The sctools code base follows PEP 8 and uses `Black <https://github.com/ambv/black>`_ to
format code. This avoids "nitpicky" comments during code review, so that reviewers can spend more time
discussing the logic rather than code style.

To enable auto-formatting during development, spend a few seconds setting
up ``pre-commit`` the first time you clone the repo:

1. Install ``pre-commit`` by running: ``pip install pre-commit`` (or simply run ``pip install -r requirements.txt``).
2. Run ``pre-commit install`` to install the git hook.

Once the ``pre-commit`` hook is installed in this repo, the Black linter/formatter is triggered automatically on each commit. Please make sure you have followed the above steps; otherwise your commits might fail the linting test!

To trigger the linter and formatter manually, make sure ``Black`` and ``flake8`` are installed in your Python environment, then run ``flake8 DIR1 DIR2`` and ``black DIR1 DIR2 --skip-string-normalization`` respectively.
19 changes: 19 additions & 0 deletions tools/scripts/sctools/build/lib/sctools/__init__.py
@@ -0,0 +1,19 @@
# flake8: noqa
from . import bam
from . import encodings
from . import barcode
from . import fastq
from . import gtf
from . import stats
from . import reader
from . import metrics
from . import platform
from . import consts
from . import groups
from pkg_resources import get_distribution, DistributionNotFound


try:
    __version__ = get_distribution(__name__).version
except DistributionNotFound:
    pass