-
Notifications
You must be signed in to change notification settings - Fork 16
/
Genavi.Rmd
139 lines (93 loc) · 23.6 KB
/
Genavi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: "GENAVi"
output:
html_document:
theme: spacelab
highlight: tango
---
# Table of Contents
1. [Introduction](#intro)
2. [Uploading User Data- gene list and featurecounts matrix (Tab 1)/Querying Displayed Data](#data)
3. [Transformation and Normalization of Gene Expression Data (Tab 1)](#transform)
4. [Visualization (Tab 2)](#vis)
5. [Differential Expression Analysis](#dea)
6. [Gene Set Enrichment Analysis](#gsea)
# 1. Introduction <a name="intro"></a>
In order to improve access to normalization, visualization and analysis of RNA-Seq data we have combined a number of R packages using the Shiny Web App format to create GENAVi: Gene Expression Normalization Analysis and Visualization. This application allows users to browse a provided dataset; a panel of 20 cell lines commonly used in breast and ovarian cancer research, or upload their own dataset as a featureCounts matrix (PMID: 24227677). featureCounts uses read counting for summarization of all reads that overlap any exon for each gene, followed by read annotation with RefSeq or Ensembl. GENAVi applies a filtering step to both the provided data set as well as a user-uploaded featureCounts matrix that removes all genes/features with zero reads mapped in every sample. This is done to increase the speed of the transformation calculations and visualization as well as reduce noise in differential expression analysis. This filtered version of the data table is used for all functions within GENAVi.
The application is separated into three tabs:
1. Gene Expression: Users can search or browse a table of provided data or upload their own data, and apply different normalization methods. Genes can be selected in the table using the single field search bar and then selected within the table, or a list of gene symbols can be uploaded and then will be selected within the table. Selected genes will be used for the plotting functions on the next tab. The genes and normalization method selected on this tab will be retained for plotting and cluster analysis. It is important to note than the differential expression analysis functions within GENAVi act on the “raw counts” version of the displayed data only and not any of the transformed versions.
2. Visualization: Users can select between plotting a histogram of a single gene (Expression plot), or heatmaps (Cluster plots) plotting the Euclidean distance between samples based on either the selected genes from the data table or all genes (which can be toggled between from the Cluster by what genes menu.
3. Differential Expression Analysis: Users can upload a metadata file providing information on sample groups to be compared, run differential analysis based on the DESeq2 package from R, and plot the resulting P values in a customizable volcano plot.
## 2. Uploading User Data/Querying Displayed Data <a name="data"></a>
### 2.1 Functions
On the Data Expression tab of the app users can upload a featureCounts table as a .csv file via the Data upload option. Large matrices can take several minutes to upload and be processed by the normalization functions, and progress of this upload and normalization can be tracked by the blue progress bar on the bottom right of the app. Once uploaded the users’ data is displayed replacing the provided data table. Desired genes can be selected by searching in the search field at the top right of the data table and clicking on individual genes within the displayed table. Additionally, the user can also upload a list of gene symbols as a .txt file. Selected genes are automatically highlighted and moved to the top of the data table. The user can also download a data table containing only the selected genes by clicking the “Download” option in the top left of the data table. The displayed data table can be sorted by any of the columns using the bi-directional arrows next to the column title.
### 2.2 Format of input gene list
In addition to being able to search for and select individual genes within the displayed data table, the user can select multiple genes by uploading a list of gene symbols or Ensembl ID’s as a .txt file. This text file should be formatted such that each line in the file is a gene symbol or Ensemble ID. Once the input gene list is uploaded, the corresponding genes within the displayed data table are automatically selected and can be used in the visualization tab of the application.
### 2.3 Format of input count matrix
In addition to being able to query the default data set- a panel of 20 breast and ovarian cancer cell lines, the user can also upload their own gene expression data set in the form of count a matrix produced by the featurecounts function in the subread package. The addgeneinfo() function reads in count matrix in csv format. it checks if the first column is an ENSEMBLE ID or gene name. If the first column of the inputted count matrix is the proper format, addgeneinfo() adds the genes metadata from gencode (by default uses hg38 build). As stated earlier, GENAVi filters out genes/features with zero reads mapped across all samples. This is applied to both the default data table and the user-uploaded featureCounts matrix. This filtering step increases the speed of calculation for each transformation method described in the next section.
## 3 Transformation and Normalization of Gene Expression Data (Tab 1) <a name="transform"></a>
### 3.1 Background
i. Variability in RNA-seq count data usually increases with the mean, indicating that genes that are more highly expressed can often also have a greater variation; this type of data is called heteroskedastic. Expression data in a raw count format causes cluster analysis such as PCA, hierarchical clustering, or k-means analysis to be driven by genes or features with the highest expression in the form of high raw count values. This is because in its raw count format, these highly expressed genes/features are also the most variable. To improve cluster analysis we need to convert this expression data to a homoskedastic form wherein genes/features have the same variance even for a higher range of the mean. One method to impose homoskedasticity is the log2 transformation (log2(count + pseudocount)). While this method does prevent the most highly expressed genes/features from overpowering cluster analysis it can also give undue weight to the genes/features with the lowest counts depending on the choice of pseudocount. To understand this, consider this example; with the choice of 1 as the pseudocount in the case of log2 transforming a count value of 2, the resulting expression value is 1.584. Whereas with the same choice of pseudocount in the case of log2 transforming a count value of 200, the resulting expression value is 7.651. In this comparison, a difference in raw count expression that is two orders of magnitude is compressed to a difference of 6.067 under the log2 transform.
ii. Another problem that arises with RNA-seq count data is overdispersion- the phenomenon wherein the observed variance within a dataset is greater than the expected variance under a chosen statistical model. Previously, the Poisson distribution has been used to model RNA-seq count data. However, a property of the Poisson distribution is that the mean and variance are equal which prevents the model from accurately modeling RNA-seq count data which is often heteroskedastic. A more appropriate statistical model is the negative binomial model used in the edgeR package as the DESeq2 package. This is an extension of the Poisson distribution that has different parameters allowing for separate modeling of mean and variance. Both edgeR and DESeq2 assume that the expected value of counts for a gene in a given sample can be modeled using this distribution.
iii. We have provided four options for data normalization, which are described below. The user can toggle between transformation options by choosing from the “Select Transform” dropdown menu on left of the screen. Any choice of transformation will affect only the visualization and clustering of the displayed data. Differential expression analysis within GENAVi is performed on raw count data only as specified in the DESeq2 package.
### 3.2 raw counts
i. The default version of expression data displayed in GENAVi is raw count data. This is the measure of expression produced by featurecounts. These raw count measures are integer values counting the number of reads mapped to a feature in the reference genome to which fastq files were aligned in a specific sample. Raw count data must be normalized to account for differences in sequencing depth between samples so that the user can make meaningful comparisons of gene expression levels across different samples.
### 3.3 Row normalized
i. The “row normalized” transformation option is applied to raw count data similar to a Z-score normalization. For each gene in the displayed dataset, the mean expression and standard deviation across all samples/columns are calculated. The raw count expression values for each gene/feature are then transformed by subtracting the mean and scaling by the standard deviation across all samples. A custom function was implemented to perform row normalization on each gene of the displayed data table.
ii. When opening GENAVi and/or uploading your own count matrix, the application acquires metadata matching the featurecounts table (default or uploaded) and calculates the row normalized transformation and saves it as an object to avoid calculating it each time the “row normalized” option is selected.
iii. This row normalization rescales the displayed data so that the gene expression is reported in units of standard deviation away from the mean expression.
### 3.4 Log-counts-per-million: logCPM
i. The logCPM transformation implemented in GENAVi through the cpm() function comes from the edgeR package.
ii. This transformation is calculated in two steps. First, counts per million reads are calculated from raw counts by scaling the number of reads overlapping a feature in a specific sample by the total number of reads in that sample (library size). This is performed for every feature in every sample. These CPM values are then log2 transformed with the log=TRUE argument in the cpm() function.
iii. When opening GENAVi and/or uploading your own count matrix, the application acquires metadata matching the featurecounts table (default or uploaded) and calculates the logCPM transform and saves it as an object to avoid calculating it each time the option is selected (implemented in the cpm() function in edgeR).
iv. The logCPM transformation allows the comparison of gene expression across separate samples by accounting for differences in sequencing depth which allows for meaningful comparisons of gene expression across samples by accounting for between-sample variability.
### 3.5 Variance stabilizing transformation- vst
i. The variance stabilizing transformation, implemented in the DESeq2 package, attempts to address the issue of overdispersion of RNA-seq count data.
ii. The VST transformation produces transformed expression measures similar to those produced by the log2 transform when used on genes with high counts.
iii. Similar to the edgeR package, the VST transformation also assumes a negative binomial model to allow for a change in variance based on the dynamic range of the mean.
iv. For entire dataset, VST transformed counts approaches homoskedasticity and can be used directly for clustering.
v. VST is much faster to compute than rlog and is not as sensitive to counts with high values unlike the rlog function. So it is more appropriate to use VST to transform larger datasets (up to hundreds of samples/columns) instead of the rlog transformation for example.
vi. When opening GENAVi and/or uploading your own count matrix, the application acquires metadata matching the featurecounts table (default or uploaded) and calculates the vst transform and saves it as an object to avoid calculating it each time the option is selected (implemented in the rlog function from DESeq2)
### 3.6 Regularized-logarithm transformation- rlog (add warning about number of samples because compute time)
i. To address this issue, GENAVi uses the regularized logarithm (rlog) transform implemented in DESeq2.
ii. The rlog transform performs similar to the log2 transform when applied to genes/features with high raw count expression. When rlog is applied to lower count values, the resulting transformed values are more shrunken towards the mean value of a gene’s expression across all samples in data. Rlog transformed values are calculated by converting the raw count data to the log2 scale and then defining a size factor for each sample which accounts for differences in sequencing depth between samples. This approach is referred to as shrinkage.
iii. This shrinkage approach can affect raw count expression of genes/features to a varying degree. Therefore the rank or order of genes based on expression may not be conserved.
iv. Rlog transformed counts approach homoskedasticity and can be used directly for clustering.
v. Because the rlog function calculates a shrinkage term for each sample it takes the longest to compute (in GENAVi), so it is best used for small data sets (fewer samples/columns).
vi. When opening GENAVi and/or uploading your own count matrix, the application acquires metadata matching the featurecounts table (default or uploaded) and calculates the rlog transform and saves it as an object to avoid calculating it each time the option is selected (implemented in the rlog function from DESeq2).
## 4. Visualization (tab 2) <a name="vis"></a>
### 4.1 Barplot <a name="barplot"></a>
i. If the user wants to view the expression of a single gene/feature across all samples, GENAVi displays a simple barplot. The user can search the displayed data set under any transformation for a particular gene of interest and select it to be displayed in the barplot in the Visualization tab.
ii. If more than one gene is selected from the data table, the simple barplot in the Visualization tab is disabled and an interactive heatmap is displayed instead.
### 4.2 Expression heatmap <a name="expheatmap"></a>
GENAVi uses the R package iheatmapr to generate interactive heatmaps to visualize gene expression information. All capabilities of manipulating these heatmaps that iheatmapr provides are available in GENAVi. The user can toggle over cells in the displayed heat map to see row and column identification information, subset the heatmap to zoom in on cells of interest, and download the heatmap by clicking on the camera icon in the top right corner.
i. When more than one gene/feature from the displayed data is selected, an interactive heatmap is displayed in place of the barplot to visualize, cluster, and compare gene/ expression across samples.
ii. To generate the gene expression heatmap, the user must first select a version of the data table by selecting from the “Select Transform” dropdown menu on the left side of the application. See the above section on properties and advantages of specific transformations. It is worth noting that visualizing the “raw counts” version of an RNA-seq data set may make the heatmap color scale difficult to interpret because of the magnitude of the range of gene expression in raw counts. Next, the user can select genes/features (either by searching for and clicking on individual genes or uploading a gene list test file). Once the desired genes/features are selected in the displayed data table, the user can navigate to the Visualization tab and select the “Expression plots” subtab to see the resulting expression heatmap.
iii. The expression heatmap displays the expression matrix of the user-selected genes across all samples of the displayed dataset under the selected version of the data table.
iv. dendrogram/hierarchical clustering of samples
1. The user may see that the order of the samples (columns) and selected genes/features in the expression heatmap is not the same as the order in the displayed data table or the order of selection.
2. This is due to the `add_col_dendro()` and `add_row_dendro()` options in the iheatmapr package. These options order the columns and rows of the expression heatmap by similarity of expression of the selected genes/rows. This similarity is measured by calculating the Euclidian distance between the samples of the gene expression matrix by applying the dist() function directly to the gene expression information.
### 4.3 Correlation heatmap (Clustering heatmap) <a name="clusheatmap"></a>
i. In addition to viewing the expression matrix of selected genes, the user can also view how the expression of selected genes affects the similarity between samples measured through correlation. By selecting “Clustering plots” subtab, the user can view a heatmap representing the the Pearson correlation matrix calculated from the gene expression matrix. The correlation matrix is calculated using the cor() function with the option method=”pearson” option which produces a matrix containing the pairwise pearson correlations between the columns of the displayed data table. When viewing the correlation heatmap, the user can select to either cluster samples based on the entire gene expression data set by selecting “All genes” from the dropdown menu in the top left of the tab. The user can also choose “Selected genes” to cluster samples using the gene expression information from only the selected genes/features.
ii. dendrogram/hierarchical clustering of samples
1. You will notice that the order of the samples (columns) in the clustering heatmap is not the same as the order in the displayed data table.
2. This is due to the `add_col_dendro()` and `add_row_dendro()` options in the iheatmapr package. Unlike the expression heatmap where these options order columns and rows of the data by similarity of expression of the selected genes/rows by calculating the Euclidian distance between the samples. based on the Euclidian distance calculated directly from the gene expression matrix, the correlation heatmap orders columns and rows directly from the pearson correlation matrix by converting it into dist object.
## 5. Differential Expression Analysis <a name="dea"></a>
i. Background
1. A natural use/application of RNA-seq data is differential expression analysis (DEA): the use of statistical methods to quantify the effects of experimental conditions on gene expression. To do this, GENAVi utilizes the R package DESeq2. The differential expression analysis functions within DESeq2 use negative binomial generalized linear models as well as empirically driven prior distributions to model gene count distributions, estimate dispersion, and calculate log fold changes. GENAVi performs these DEA functions only on raw count data, so any choice of transformation does not affect the functions in this tab. The user can define experimental conditions of their uploaded data, perform differential expression analysis, define statistical thresholds, and visualize results within GENAVi.
ii. Uploading Metadata and Defining Gene Model for DEA
1. Before performing differential expression analysis, the user must first upload a metadata file, in csv format, containing the experimental and batching conditions of each sample in the displayed data. This is done by clicking the `"Browse..."` button in the “Metadata upload” section of the tab. The first column of the csv file should be labeled "sample" and each row of that column should be an exact column name of the uploaded counts matrix. The following columns can be labeled in whatever manner that reflects the experimental design and batching used to generate the samples. We have included an example metadata csv file that describes the grouping of the samples in our cell line panel as well as a "batch" column that reflects how the samples were prepared for sequencing.
2. After the metadata table has been uploaded, the user can then select the variables that used for differential expression analysis. The user can select the experimental condition of interest from the `"Select condition column for DEA"` drop down menu that contains the information from the metadata file. Next the user can select a covariate to correct for under the "Select covariates for DEA" drop down menu. We recommend that this covariate account for batch effect between samples. Lastly, the user must select the reference condition of the selected experimental condition. This is done by selecting from the `"Select reference level for DEA"` dropdown menu which will again contain information from the uploaded metadata table. The gene expression information within this reference group will be used as the baseline for comparisons will all other samples. For example, in the case where the selected experimental condition reflected drug treatment, the reference level would be the untreated group.
3. Once all parts of the model are defined, the user can click `"Perform DEA"`, and then GENAVi will produce the results of differential expression analysis by implementing the `DESeq()` function. This analysis may take some time and a progress bar explaining the current stage of analysis will be displayed in the bottom right of the screen.
iii. DEA results tables
1. After performing differential expression analysis, a result table for each comparison between the experimental condition to the reference condition will be available. The results available can be selected using the `"DEA - select results"` dropdown menu. The results of the DEA will be shown as a table giving base means across samples, log2 fold changes, standard errors, test statistics, p-values and adjusted p-values (Benjamini-Hochberg multiple hypothesis correction);
2. The results table can also be visualized as a volcano plot (y-axis: - log10(p-value adjusted), x-axis: log2FoldChange) using the volcano plot menu in which the cut-offs for both axis can be specified. As results, an interactive plot is created highlighting the condition of each gene (Downregulated in experimental condition, Upregulated in experimental condition, or differential expression not significant) which can be further explored through zooming, hovering, clicking actions.
## 6. Gene Set Enrichment Analysis <a name="gsea"></a>
i. Background
1. One important step after identifying the differentially expressed genes is to retrieve a functional profile of those genes in order to better understand the underlying biological processes. Several databases provides annotated gene sets that can be used for the enrichment analysis (i.e. Molecular Signatures Database - MSigDB, WikiPathways, Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO)).
ii. Uploading DEA results
1. Before performing gene enrichment analysis, the user must first upload a DEA results file, in csv format, in the same format in the Differential Expression Analysis section. This is done by clicking the `"Browse..."` button in the “DEA results upload” section of the tab.
2. After the DEA results csv has been uploaded, the user can then select the the subset of genes that will be used for an Over Representation Analysis (ORA) (Boyle et al. 2004) (i.e. the significant upregulated or downregulated genes) and the ranking method used for a gene set enrichment analysis (GSEA) (Subramanian et al. 2005).
3. Using the subset of genes for ORA and the ranking method for GSEA, the user can then choose between both analysis using the `"Select the type of analysis"` dropdown menu and which databases information it will use for the enrichment analysis using the `"Select the type of analysis"` dropdown menu.
4. The GSEA results can be visualized through several types of plots which can be selected using the `"Plot type"` dropdown menu.
5. For more information the users can read the documentation available at https://guangchuangyu.github.io/pathway-analysis-workshop/.