Commit

3 updated and 4 created
FabianAndradeLozano committed Aug 30, 2024
1 parent a47ddc4 commit e0ab682
Showing 4 changed files with 116 additions and 44 deletions.
19 changes: 11 additions & 8 deletions docs/1- Library_preparation.rst
@@ -93,26 +93,29 @@ DNA Library preparation bias
The following steps of DNA library preparation have been implicated in introducing bias:

- Fragmentation
  Chromatin sonication for ChIP-seq has been shown to be non-random, with euchromatin being sheared more efficiently than heterochromatin.

.. tip::
   To address this, the double-fragmentation ChIP-seq protocol has been developed.

- Size Selection
  Melting agarose gel slices by heating them to 50 ºC in chaotropic salt buffer has been shown to decrease the representation of AT-rich sequences.

.. tip::
   A simple solution is to melt the gel slices in the supplied buffer at room temperature (18–22 ºC), which considerably reduces GC bias.

- PCR
  PCR introduces bias in sample composition, because not all fragments in the mixture are amplified with the same efficiency.
  GC-neutral fragments are amplified more efficiently than GC-rich or AT-rich fragments; as a result, fragments with high AT or GC content may become underrepresented or be lost completely during library preparation.

.. tip::
   - Adapters that contain all the elements necessary for bridge amplification on Illumina flowcells are preferred, because ligating them eliminates the need for PCR to add these sequences afterwards. However, this approach requires relatively large quantities (>1 µg) of input material.
   - In the extreme case of a small input amount, the single cell, multiple displacement amplification (MDA) may be the preferred amplification method. MDA is an extremely powerful amplification method that allows microgram quantities of DNA to be obtained from femtograms of starting material. For this reason, MDA has become the method of choice for whole genome amplification (WGA) from single cells.
   - PCR additives such as betaine or tetramethylammonium chloride (TMAC) have also been reported to reduce bias, and may help to further improve coverage of extremely GC-rich or AT-rich regions.
   - The best overall performing polymerase appears to be Kapa HiFi.

.. seealso::
   For more information, see the publication `Library preparation methods for next generation sequencing: Tone down the bias <http://dx.doi.org/10.1016/j.yexcr.2014.01.008>`_.


RNA library bias
22 changes: 12 additions & 10 deletions docs/2- Sequencing_technologies.rst
@@ -10,22 +10,24 @@ Short Reads sequencing (Illumina)
Illumina sequencing is based on polymerase-mediated sequencing by synthesis (SBS), which couples each of the four DNA bases to a fluorescent marker together with a terminator group that pauses DNA synthesis.
While DNA is being synthesized, each fluorescent marker is read optically before the tag and terminator are removed, and the next base in the sequence is then recorded.

.. image:: images/illumina_Lu_et_al_2016.png
   :width: 400
   :align: center
   :alt: *Source: https://www.researchgate.net/publication/357946568_New_approaches_and_concepts_to_study_complex_microbial_communities*

*Source: https://www.researchgate.net/publication/357946568_New_approaches_and_concepts_to_study_complex_microbial_communities*
1. Cluster generation

   The adapter attached to each DNA fragment hybridises to the flow cell. Subsequent PCR amplification (bridge amplification) generates a cluster of copies of the same fragment, which amplifies the signal emitted when each nucleotide is incorporated.
   The result is a large number of clusters across the flow cell.

2. Sequencing

   In each cycle, one nucleotide is incorporated into the template, so the number of cycles corresponds to the read length (100 cycles give a 100 bp read).
   After each incorporation, the flow cell is imaged to determine which of the four colours was incorporated in each cluster, and the terminator is then removed so that the next cycle can start.

During library preparation, adapters are added to the DNA fragments. They are used to hybridise the DNA to the flow cell and also act as barcodes that identify the sample when multiple samples are pooled in the same run.
Depending on the type of sequencing, the fragment is read from one end (single end) or from both ends (paired end).

.. image:: images/single_vs_pair_end.png
   :width: 400
@@ -35,8 +37,8 @@ Single end
----------

Single-end sequencing corresponds to the basic form of SBS: the nucleotides added to the template are read from one end of the fragment only.
It is simpler and more efficient, and it requires fewer steps during library preparation.
However, base quality decreases as the sequencing run progresses, so the ends of the reads tend to have lower quality scores.

Paired end
----------
69 changes: 43 additions & 26 deletions docs/3- Quality Control and Preprocessing.rst
@@ -10,15 +10,15 @@ Illumina
Quality Control
---------------

After sequencing, the quality of the reads contained in the FASTQ files needs to be checked to determine
whether the reads can be used for further analysis. The main tools used for QC of Illumina reads are FASTQC and FASTQ-Screen.


FASTQC
~~~~~~

FASTQC is a tool developed to detect problems in the reads introduced either by the sequencing machine or during library preparation.
The supported input formats are:

- FASTQ
- Casava FASTQ files
- Colorspace fastq
@@ -29,45 +29,35 @@ the extensions supported are:

The HTML report generated for each file is divided into the following modules:

1. **Basic Statistics**: displays information about the file, the number and length of the sequences, and the overall %GC.
2. **Per base sequence quality**: shows how the quality score (y axis) varies along the reads (x axis). A box-and-whisker plot is drawn for each position; the red line represents the median and the blue line the mean. The quality score commonly decreases towards the end of the reads, because the polymerase tends to make more mistakes as the read progresses.
   If the median for any base is less than 25, a warning is raised.
3. **Per tile sequence quality**: shows the quality score distribution for each tile in the flowcell.
4. **Per sequence quality score**: shows the distribution of the average quality scores of all the reads in the file. If a large subset of reads has a poor average quality, this could indicate a systematic problem.
5. **Per base sequence content**: shows the proportion of each of the four nucleotides at each base position. A strong bias in nucleotide composition could indicate a problem in the library preparation.
6. **Per sequence GC content**: shows the GC content distribution of all the reads in the file, compared with a modelled normal distribution of GC content.

.. danger::
   If the GC content distribution is not close to the normal distribution, this could indicate contamination or a problem in the library preparation.
   GC content also varies between organisms, so it is important to know the GC content of the organism of interest (and to avoid relying on the comparison with the reference curve).

7. **Per Base N content**: if the sequencer is unable to determine the base at a position, it is reported as an 'N'. This section shows the distribution of Ns along the reads.
8. **Sequence Length Distribution**: shows the distribution of read lengths. When the read length is fixed (by the number of cycles), a single peak at one length is observed.
9. **Duplicate Sequences**: shows the number of duplicated sequences in the file. A high level of duplication could indicate an enrichment bias (e.g. PCR amplification). A low level of duplication may indicate a very high level of coverage of the target sequence.
10. **Overrepresented sequences**: shows whether any single sequence is strongly overrepresented in the file. This could indicate contamination or a problem in the library preparation.
11. **Adapter content**: shows the presence of adapter sequences in the reads. If adapters are present, the reads should be trimmed before further analysis.

.. seealso::
   .. _FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

   For more information about interpreting the FASTQC modules, visit the FASTQC_ website.
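
To make the quantities behind the *Per base sequence quality* and *Per sequence GC content* modules more concrete, the following minimal Python sketch computes the mean Phred quality and the GC percentage of a single read. It assumes Phred+33 quality encoding; the function names and the example read are hypothetical and are not part of FASTQC, which computes these values over all reads in the file.

.. code-block:: python

   def phred_scores(quality_string, offset=33):
       """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
       return [ord(char) - offset for char in quality_string]


   def gc_percent(sequence):
       """Return the percentage of G and C bases in a read (case-insensitive)."""
       if not sequence:
           return 0.0
       sequence = sequence.upper()
       gc_count = sequence.count("G") + sequence.count("C")
       return 100.0 * gc_count / len(sequence)


   # Hypothetical read: the sequence line and quality line of one FASTQ record.
   read_sequence = "GATTACAGGC"
   read_quality = "IIIIIIII##"

   scores = phred_scores(read_quality)
   mean_quality = sum(scores) / len(scores)
   print(f"GC content: {gc_percent(read_sequence):.1f}%")
   print(f"Mean Phred quality: {mean_quality:.1f}")

In this example the last two bases carry a very low score, which pulls the mean quality of the read down, the same pattern that appears as a drop at the right-hand side of the per base quality plot.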

hands on:
*********

Let's download the following data xxxxxxx and run FASTQC to check the quality of the reads.

.. code-block:: bash

   mkdir -p ~/NGS-QC-Course/data/raw_data
   wget

FASTQ-Screen
~~~~~~~~~~~~

FASTQ-Screen is a tool that checks whether the reads originate from the genome of the organism of interest,
by quantifying the proportion of reads that map to a set of reference genomes and to a set of contaminants.
For human sequencing data, the standard reference genomes to check are:

- Human
@@ -87,7 +77,34 @@ In human sequencing data the standard reference genomes to check are:

Example of a FASTQ-Screen report:


When working with several samples, the individual reports can be aggregated into a single report using MULTIQC (https://multiqc.info/).

Pre-processing
---------------

After quality control, if adapter content or low-quality bases are detected,
the reads need to be pre-processed to remove them and improve the quality of the reads for further analysis.

Typical tools used for pre-processing are:

- Trimmomatic
- Cutadapt: used here only to remove adapters (so it needs to be combined with Sickle); it requires the adapter sequence to be known.
- Sickle: removes low-quality bases from the tails of the reads.
- fastp

fastp source: *https://academic.oup.com/bioinformatics/article/34/17/i884/5093234*

fastp performs all of the following corrections in a single pass:

- Adapter removal: in paired-end data, fastp finds the overlap of each read pair and treats the bases that fall outside the overlapping region as adapter sequence. The adapter sequence does not need to be specified.
- Base correction: in well-overlapped read pairs, a mismatched base is corrected when one of the two bases has a higher quality score.
- Quality trimming: base quality typically decreases towards the 3' end of the read, so poor-quality tails are removed to leave only high-quality bases for alignment.
  fastp uses a sliding window to drop the low-quality bases at the head and tail of each read. The window can slide from either 5′ to 3′ or from 3′ to 5′, and the average quality score within the window is evaluated.
  If the average quality is lower than a given threshold, the bases in the window are marked as discarded.
- Length filtering: reads that fall below a certain length are also removed.
- Poly-G tail trimming: poly-G tails are recognised and removed. They are an artefact at the end of reads produced by Illumina instruments that use two colours to detect the four bases, such as the NovaSeq.
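
The sliding-window idea described above can be sketched in a few lines of Python. This is a simplified illustration and not fastp's actual implementation: it only slides from 5′ to 3′ and cuts the read at the first window whose mean quality falls below the threshold. The function name, the window size of 4 and the threshold of 20 are assumptions chosen for the example.

.. code-block:: python

   def trim_tail_sliding_window(sequence, qualities, window_size=4, threshold=20):
       """Trim the 3' end of a read with a simplified sliding-window rule.

       The window moves from 5' to 3'. At the first window whose mean quality
       is below the threshold, that window and everything after it is dropped.
       Reads shorter than the window are returned unchanged.
       """
       last_start = len(qualities) - window_size
       for start in range(0, last_start + 1):
           window = qualities[start:start + window_size]
           if sum(window) / window_size < threshold:
               return sequence[:start], qualities[:start]
       return sequence, qualities


   # Hypothetical read whose quality drops towards the 3' end.
   read_sequence = "ACGTACGTAC"
   read_qualities = [38, 38, 37, 36, 35, 30, 12, 8, 5, 2]

   trimmed_sequence, trimmed_qualities = trim_tail_sliding_window(
       read_sequence, read_qualities
   )
   print(trimmed_sequence, trimmed_qualities)

A length filter would then simply discard any trimmed read shorter than a chosen minimum length.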

After pre-processing the reads, it is important to check their quality again. fastp generates both an HTML and a JSON report to assess the quality of the reads.
The JSON reports can be aggregated with MULTIQC.
50 changes: 50 additions & 0 deletions docs/4- Quality of the mapping.rst
@@ -0,0 +1,50 @@
.. _Sequencing_technologies-page:

***********************************
4 Quality of the Mapping
***********************************

Introduction to Mapping and tools
==================================

Once the reads are clean and of good quality, most analyses require aligning them to a reference genome.
Depending on the origin of the sequencing data (WGS, WES, RNA-seq, ChIP-seq, ...) and on the downstream analysis, several aligners are available to suit the needs of the analysis:

- BWA-MEM
- bowtie
- STAR
-

Before aligning the reads, a reference genome in FASTA format is needed; typical sources are UCSC, Ensembl or GENCODE. The reference genome is then indexed, which builds a lookup structure over the sequences of the genome. The index makes it easier and faster to query the reads against these regions and reduces the memory footprint.
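
The idea of indexing can be illustrated with a toy Python sketch that stores the positions of every k-mer of a reference in a dictionary and uses the first k bases of a read as a seed. This is only an assumption-laden illustration of the lookup principle: real aligners such as BWA and Bowtie use a compressed FM-index, and STAR uses a suffix array. The function names and the sequences below are hypothetical.

.. code-block:: python

   from collections import defaultdict


   def build_kmer_index(reference, k):
       """Map every k-mer of the reference to the list of positions where it starts."""
       index = defaultdict(list)
       for position in range(len(reference) - k + 1):
           index[reference[position:position + k]].append(position)
       return index


   def candidate_positions(read, index, k):
       """Look up the first k bases of the read (the seed) in the index."""
       return index.get(read[:k], [])


   def exact_matches(read, reference, index, k):
       """Keep only the candidate positions where the whole read matches exactly."""
       return [
           position
           for position in candidate_positions(read, index, k)
           if reference[position:position + len(read)] == read
       ]


   # Hypothetical reference and read.
   reference = "ACGTACGTTAGC"
   kmer_index = build_kmer_index(reference, k=4)
   read = "ACGTT"

   print(candidate_positions(read, kmer_index, k=4))
   print(exact_matches(read, reference, kmer_index, k=4))

The index is built once per reference and reused for every read, which is why indexing is done as a separate step before alignment.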

SAM format
----------



BAM QC
===========================

Even if the quality control of the reads was satisfactory, some problems with the reads only become visible after mapping (low coverage, homopolymer biases, experimental artifacts, etc.).
Most of the tools that assess mapping quality rely on the MAPQ value, which is the Phred-scaled probability that the alignment is wrong.


.. math::

   \mathrm{MAPQ} = -10 \log_{10}(P)

where :math:`P` is the probability that the alignment is wrong.
For example, for a MAPQ value of 20 the probability that the alignment is wrong is 1 in 100 (0.01):

.. math::

   \mathrm{MAPQ} = -10 \log_{10}(0.01) = 20

The higher the MAPQ value, the higher the confidence in the alignment.
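
The relation between MAPQ and the error probability can be checked with a short Python sketch. The function names are hypothetical; the code only applies the formula above and its inverse, :math:`P = 10^{-\mathrm{MAPQ}/10}`.

.. code-block:: python

   import math


   def mapq_to_error_probability(mapq):
       """Probability that the alignment is wrong for a given MAPQ value."""
       return 10 ** (-mapq / 10)


   def error_probability_to_mapq(probability):
       """Phred-scaled MAPQ for a given probability that the alignment is wrong."""
       return -10 * math.log10(probability)


   # Print the error probability for a few illustrative MAPQ values.
   for mapq in (10, 20, 30, 60):
       probability = mapq_to_error_probability(mapq)
       print(f"MAPQ {mapq}: P(wrong alignment) = {probability:.6f}")

Each increase of 10 in MAPQ corresponds to a tenfold decrease in the probability that the alignment is wrong.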

The main tools to assess the quality of the mapping are:

- **SAMStat**: a CLI tool that reports statistics for SAM/BAM files on unmapped, poorly mapped and accurately mapped reads.

.. seealso::
   .. _SAMStat: https://github.com/TimoLassmann/samstat


BAM format. Note that the BAM file has to be sorted by chromosomal coordinates. Sorting can be performed with ``samtools sort``.
