---
title: "DESeq2 Normalization and Steps"
author: "Payal Banerjee"
date: "12/16/2020"
output: html_document
---
# DESeq2 Normalization and Steps
## Normalization
Normalization needs to account for two factors:

* Differences in library size (sequencing depth)
* RNA composition bias

Since tools for differential expression analysis compare the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.

To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user end there is only one step, but on the back end there are multiple steps involved, as described below.

**In Short:**

* For each gene, take the geometric mean of its counts across all samples
* Divide the gene's count in each sample by that geometric mean
* For each sample, take the median of these ratios over all genes. This is the sample's normalization factor, by which its gene counts are divided
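
The same three steps can be written compactly. The notation here is introduced for illustration and is not part of the original notes: $k_{ij}$ is the raw count of gene $i$ in sample $j$, $m$ is the number of samples, and $s_j$ is the normalization factor of sample $j$. The median is taken over genes whose geometric mean is nonzero.

$$
s_j = \operatorname{median}_{i} \frac{k_{ij}}{\left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}},
\qquad
k_{ij}^{\text{norm}} = \frac{k_{ij}}{s_j}
$$
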

**In Detail:**

| Gene   | Sample 1 | Sample 2 | Sample 3 |
| ------ | -------: | -------: | -------: |
| Gene 1 |        0 |       10 |        4 |
| Gene 2 |        2 |        6 |       12 |
| Gene 3 |       33 |       55 |      200 |

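_Step 1_:
Take the natural log (base e) of each raw count. A count of 0 gives -Inf.

_Step 2_:
Average the log values for each gene across all samples. This row-wise average is the log of the gene's geometric mean.

_Step 3_:
Filter out genes with a zero count in any sample, since their average log value is -Inf. In the table above, Gene 1 is dropped.

_Step 4_:
For each gene, subtract its average log value from its log count in every sample: log(raw count) - average log = log(raw count / geometric mean). This is the log of the ratio between each sample's count and the gene's geometric mean across samples.

_Step 5_:
Calculate the median of these log ratios for each sample, taken over all remaining genes. Using the median keeps genes with extreme expression from dominating the scaling factor, so the estimate is driven by moderately expressed genes such as housekeeping genes.

_Step 6_:
Exponentiate each sample's median to return to the count scale. The result is that sample's scaling factor: scaling factor = e^median.

_Step 7_:
Divide each sample's original read counts by that sample's scaling factor.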
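
The seven steps can be traced on the example table with a few lines of code. This is a minimal sketch in plain Python, written for illustration only; it is not DESeq2's own implementation, and the names `counts`, `size_factors`, and `normalized` are made up for this example.

```python
import math
from statistics import median

# Raw counts from the example table: one row per gene, one column per sample.
counts = {
    "Gene 1": [0, 10, 4],
    "Gene 2": [2, 6, 12],
    "Gene 3": [33, 55, 200],
}


def size_factors(counts):
    """Median-of-ratios scaling factor for each sample (illustrative sketch)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Step 3: a zero count has log -inf, so the gene is skipped entirely.
        if any(c == 0 for c in row):
            continue
        logs = [math.log(c) for c in row]       # Step 1: natural log
        mean_log = sum(logs) / len(logs)        # Step 2: average log per gene
        for j, value in enumerate(logs):
            log_ratios[j].append(value - mean_log)  # Step 4: log ratio
    # Steps 5 and 6: median log ratio per sample, then back to the count scale.
    return [math.exp(median(ratios)) for ratios in log_ratios]


factors = size_factors(counts)

# Step 7: divide each sample's raw counts by that sample's scaling factor.
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in counts.items()
}

print([round(f, 2) for f in factors])  # roughly [0.42, 0.94, 2.53]
```

Sample 3 has the largest counts and therefore the largest scaling factor, so its counts are scaled down the most. Gene 1 is excluded from the factor calculation but is still normalized in the last step.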
## Dispersion
When comparing gene expression levels between groups, it is important to account for within-group variability.

Within-group variability is difficult to estimate for a single gene when there are only a few replicates. The solution is to pool information across genes that are expressed at similar levels, on the assumption that genes of similar average expression strength have similar dispersion.

DESeq2 does this in three steps:

* **Maximum likelihood** dispersion estimates are computed for each gene
* A **curve** is fitted to capture the dependence of these estimates on the average expression strength
* The **gene-wise values are shrunk towards the curve** using an empirical Bayes approach
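
To make the quantity being estimated concrete, here is a rough sketch of a per-gene dispersion estimate. It uses a simple method-of-moments estimate, which is not what DESeq2 does (DESeq2 uses maximum likelihood followed by shrinkage, as listed above). It relies on the negative binomial relation variance = mean + dispersion × mean². The function name and the replicate counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(replicate_counts):
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0.

    Illustration only; DESeq2 itself uses maximum likelihood plus shrinkage.
    """
    m = mean(replicate_counts)
    v = variance(replicate_counts)  # sample variance (n - 1 denominator)
    return max((v - m) / m ** 2, 0.0)


# Hypothetical normalized counts for one gene in three replicates of one group.
replicates = [90, 110, 130]
alpha = moment_dispersion(replicates)
print(round(alpha, 4))  # 0.024
```

With so few replicates an estimate like this is very noisy, which is the motivation for pooling information across genes.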
## Generalized Linear Model
DESeq2 fits a generalized linear model to each gene, with the counts assumed to follow a negative binomial distribution.
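
The model can be written out as follows. The symbols are introduced here for illustration: $K_{ij}$ is the count of gene $i$ in sample $j$, $\mu_{ij}$ its mean, $\alpha_i$ the gene's dispersion, $s_j$ the sample's scaling factor from the normalization step, $x_{jr}$ the design matrix entries, and $\beta_{ir}$ the log2 fold change coefficients.

$$
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i),
\qquad
\mu_{ij} = s_j \, q_{ij},
\qquad
\log_2 q_{ij} = \sum_{r} x_{jr} \, \beta_{ir}
$$

$$
\operatorname{Var}(K_{ij}) = \mu_{ij} + \alpha_i \, \mu_{ij}^2
$$
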
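### Why the negative binomial distribution for analysing RNA-seq data
RNA-seq counts are typically overdispersed: their variance across replicates exceeds their mean, which a Poisson model cannot capture. This is explained quite nicely [here](http://bridgeslab.sph.umich.edu/posts/why-do-we-use-the-negative-binomial-distribution-for-rnaseq).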
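### Statistical Significance and Multiple Testing Correction
The Wald test is used to assess the significance of each gene's estimated log fold change.

The resulting p-values are adjusted for multiple testing with the Benjamini-Hochberg procedure.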
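
The Benjamini-Hochberg adjustment itself is short enough to sketch. The version below is a plain Python illustration of the standard procedure, not code taken from DESeq2; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical Wald test p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
print([round(p, 4) for p in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

Each p-value is multiplied by the number of tests and divided by its rank, and the running minimum keeps a smaller p-value from ending up with a larger adjusted value than the one ranked after it.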
## References
1. [StatQuest: DESeq2, part 1, Library Normalization](https://youtu.be/UFB993xufUU)
2. [Differential expression analysis](https://youtu.be/5tGCBW3_0IA)
3. [HBC training](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html)