about and 4 updated
FabianAndradeLozano committed Sep 9, 2024
1 parent 8f1835d commit 1d8268e
Showing 3 changed files with 31 additions and 20 deletions.
3 changes: 0 additions & 3 deletions docs/1- Library preparation.rst
Expand Up @@ -35,9 +35,6 @@ Template Preparation
Preparation of genomic samples for WGBS is commonly performed through the post-bisulfite treatment of DNA and de-tagging before index adaptor ligation for NGS sequencing. ChIP-Seq allows for genome-wide mapping of DNA-binding proteins and histone modifications at base-pair resolution. To prepare samples for ChIP-Seq, formaldehyde-fixed or native chromatin is fragmented by micrococcal nuclease (MNase) or sonication, and is then immunoprecipitated with a target-specific antibody conjugated to magnetic beads. DNA isolated from the precipitated protein-DNA complexes is used to generate libraries.





Library preparation
========================

3 changes: 1 addition & 2 deletions docs/3- Quality Control and Preprocessing.rst
Expand Up @@ -163,8 +163,7 @@ Typical tools used for pre-processing are:

Fastp is an all-in-one tool that performs the following corrections:

- Adapter removal: in paired-end data, fastp seeks the overlap of each pair and considers the bases that fall out of the
overlapped regions as adapter contents. Not need to specify the adapter sequence.
- Adapter removal: in paired-end data, fastp finds the overlap between the two reads of each pair and treats the bases that fall outside the overlapped region as adapter content. There is no need to specify the adapter sequence.
- Base correction: in good-quality overlapped regions, a mismatch between the two reads is corrected in favour of the base with the higher quality score.
Typically, base quality decreases towards the 3' end of the read; poor-quality tails are removed to leave only high-quality reads for alignment.
A sliding window method is used to drop the low-quality bases at each read’s head and tail. The window can slide from either 5′ to 3′ or from 3′ to 5′, and the average quality score within the window is evaluated.
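
The sliding window trimming described above can be sketched in a few lines of Python. This is a simplified illustration and not fastp's actual implementation: the function name ``trim_low_quality_ends`` and the default values ``window=4`` and ``min_mean=20`` are assumptions chosen for the example, and reads with no window above the threshold are dropped entirely.

```python
def trim_low_quality_ends(qualities, window=4, min_mean=20):
    """Return (start, end) so that qualities[start:end] is the part to keep.

    qualities is a list of per-base Phred scores for one read.
    Hypothetical helper: a simplified sliding-window trimmer, not fastp's code.
    """
    n = len(qualities)

    # Slide from the 5' end: drop one base at a time while the
    # mean quality of the current window is below the threshold.
    start = 0
    while start + window <= n and sum(qualities[start:start + window]) / window < min_mean:
        start += 1

    # No window reached the threshold (or the read is shorter than
    # the window): report an empty read.
    if start + window > n:
        return n, n

    # Slide from the 3' end in the same way.
    end = n
    while end - window >= start and sum(qualities[end - window:end]) / window < min_mean:
        end -= 1

    return start, end


read_quals = [2, 2, 30, 30, 30, 30, 30, 30, 5, 5]
start, end = trim_low_quality_ends(read_quals)
print(start, end, read_quals[start:end])
```

Because the decision is taken on the window average, an isolated low-quality base can survive at the edge of the kept region; smaller windows or a higher threshold make the trimming stricter.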
45 changes: 30 additions & 15 deletions docs/4- Quality of the mapping.rst
Expand Up @@ -4,38 +4,50 @@
4 Quality of the Mapping
***********************************

Introduction to Mapping and tools
==================================

Once our reads are clean and of good quality, most analyses require aligning these reads to a reference genome.
Depending on the origin of our sequencing data (WGS, WES, RNA-seq, ChIP-seq, ...) and the downstream analysis, several aligners are available to suit the needs of our analysis.

**Basic aligment**: Based on the Smith-Waterman algorithm, needs the creation of an index of the reference genome, used as a dictionary to query the reads and accelerate the search reducing the memory footprint.
Both can be used for WGS or WES data:
Before aligning the reads, a reference is needed. For DNA-seq, a reference genome in FASTA format is required; typical sources are UCSC, Ensembl or GENCODE.
For RNA-seq alignment, a reference transcriptome is needed, typically a FASTA file with the transcript sequences of the organism of interest.

If references are not available, a de novo assembly of the reads can be performed to generate a reference genome or transcriptome.

**Basic alignment**: Based on the Smith-Waterman algorithm. It requires building an index of the reference genome, which is used as a dictionary to query the reads, accelerating the search and reducing the memory footprint.
The following tools can be used for either WGS or WES, but they have some differences:

- BWA-MEM: by default perform local aligment, high accuracy and efficiency in align reads to the entire genome. Because its very efficent for finding aligment with gaps, very important for variant detection <https://bio-bwa.sourceforge.net/bwa.shtml>.
- BWA-MEM: performs local alignment by default, with high accuracy and efficiency when aligning reads to the entire genome. Because it is very efficient at finding alignments with gaps, it is commonly used for variant detection `<https://bio-bwa.sourceforge.net/bwa.shtml>`_.

- bowtie2: by default perform global aligment, is faster than BWA but less sensitive. recommended for large-scale sequencing and frequently used for ChiP-seq due to its speed to align shorter reads and identified enriched regions (peak detection) <https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml>.
- Bowtie2: performs global (end-to-end) alignment by default; it is faster than BWA but less sensitive. Recommended for large-scale sequencing samples and for ChIP-seq, because of its speed in aligning shorter reads to identify enriched regions (peak detection) `<https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml>`_.
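
The idea of indexing the reference so it can be queried like a dictionary can be illustrated with a toy k-mer hash index. This is only a sketch of the concept: BWA and Bowtie2 use a compressed FM-index based on the Burrows-Wheeler transform rather than a plain hash table, and ``build_kmer_index`` and ``candidate_positions`` are hypothetical helpers written for this example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Use the first k bases of the read as a seed and look it up in the index.

    Returns the reference positions where a full alignment could then be attempted.
    """
    return index.get(read[:k], [])


reference = "ACGTACGTTG"          # made-up reference sequence
index = build_kmer_index(reference, k=4)
print(candidate_positions("ACGTTG", index, k=4))
```

The index is built once per reference and reused for every read, so each seed lookup avoids scanning the whole genome. A real aligner would then extend each candidate position with a dynamic-programming alignment such as Smith-Waterman.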

**RNA-seq splice-aware aligner**: Specialized in the mapping of RNA-seq reads, that can be spliced and map to different exons of the same gene:

- STAR: Most popular aligner for RNA-seq data, very effcient and accurate identifying splice junctions <https://github.com/alexdobin/STAR>.
- STAR: the most popular aligner for RNA-seq data, very efficient and accurate at identifying splice junctions `<https://github.com/alexdobin/STAR>`_.

- TopHat2: first aligners designed for RNA-seq data, but now is deprecated and replaced by STAR <https://ccb.jhu.edu/software/tophat/index.shtml>.
- TopHat2: one of the first aligners designed for RNA-seq data; it is now deprecated and has largely been replaced by newer aligners such as HISAT2 and STAR `<https://ccb.jhu.edu/software/tophat/index.shtml>`_.

- HISAT2: built on the Bowtiw2 aligment algorithm, but optimized for RNA-seq data <https://daehwankimlab.github.io/hisat2/>.
- HISAT2: built on the Bowtie2 alignment algorithm, but optimized for RNA-seq data `<https://daehwankimlab.github.io/hisat2/>`_.

**Pseudo-Aligner - Quasi-mapping**: very fast; reads are mapped to the transcriptome and quantified directly. Novel transcripts cannot be found. When the goal is to quantify gene expression levels, this is the best option:

- Kallisto: use pseudo-aligment approach, efficiently determines the compatibility of the transcript without full sequence aligment, very fast and memory-efficient, better option for large-scale projects <https://github.com/pachterlab/kallisto>.
- Salmon: use quasi-mapping approach, similar to pseudo-aligment but includes information about the location of the read within the transcript, and perform bias correction steps, slower than kallisto but more accurate quantifications. better option for complex transcriptomes <https://combine-lab.github.io/salmon/getting_started/>.
- Kallisto: uses a pseudo-alignment approach that efficiently determines which transcripts a read is compatible with, without full sequence alignment. Very fast and memory-efficient; a better option for large-scale projects `<https://github.com/pachterlab/kallisto>`_.

- Salmon: uses a quasi-mapping approach, similar to pseudo-alignment but including information about the location of the read within the transcript, and performs bias correction steps. Slower than Kallisto but gives more accurate quantification. A better option for complex transcriptomes `<https://combine-lab.github.io/salmon/getting_started/>`_.

Previous aligment of the reads, a reference genome in fasta format is needed, Typical sources to look up are UCSC, Ensembl or Gencode. An indexing of the reference genome is perfomed to create a dictionary database of the redundant sequences of the genome and facilitate and accelerate the query of the reads respect this regions, thus, minimizing the the memory footprint.

SAM format
----------

The output of the alignment is a SAM file.
This section gives an overview of the structure of the SAM (Sequence Alignment/Map) format, a tab-delimited text file divided into two sections: the header (optional) and the alignment records.

- Header lines start with @
- Alignment records start with the read name and contain 11 mandatory fields.
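
As an illustration of this structure, the following sketch splits one SAM line into its 11 mandatory fields, named as in the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The ``SamRecord`` class and ``parse_sam_line`` function are hypothetical names used only here, and the example record is made up. For real data a dedicated library such as pysam is a better choice.

```python
from typing import NamedTuple, Optional, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one line of a SAM file; header lines (starting with @) return None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Made-up alignment record for illustration only.
example = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.qname, record.rname, record.pos, record.cigar)
```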


.. seealso::
   Check the `SAM format`_ specification for a detailed explanation of the structure of SAM and BAM files.

.. _SAM format: https://samtools.github.io/hts-specs/SAMv1.pdf



BAM QC
Expand Down Expand Up @@ -111,4 +123,7 @@ RNAseqMetrics is a tool from Picard Tools that provides a comprehensive set of m
It can be used to assess the quality of the alignment of reads to a reference genome, the coverage of the genome,
the distribution of reads across the genome and helps to detect biases.

**RSeQC**: Quality control of post-alignment RNA-seq data; it inspects sequence quality, nucleotide composition, PCR duplication and GC bias.

.. seealso::
   Check the `Picard Tools`_ website for more information about usage.

.. _Picard Tools: https://gatk.broadinstitute.org/hc/en-us/articles/360037057492-CollectRnaSeqMetrics-Picard
