forked from fhdsl/reproducibility-sandbox
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-heatmap.Rmd
215 lines (166 loc) · 6.55 KB
/
01-heatmap.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Acute Myeloid Leukemia Heatmap"
author: "CCDL for ALSF - Adapted for this repository by Candace Savonen"
date: "October 2021"
output:
html_notebook:
toc: true
toc_float: true
number_sections: true
---
_This analysis has been adapted from this [refine.bio-examples notebook](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/clustering_rnaseq_01_heatmap.html)_
# Purpose of the analysis
In this analysis, we will use this [acute myeloid leukemia sample dataset](https://www.refine.bio/experiments/SRP070849) from [Shih et al., 2017](https://pubmed.ncbi.nlm.nih.gov/28193779/) and pre-processed by [refinebio](https://www.refine.bio/).
The data that we downloaded from [refine.bio](https://www.refine.bio/) for this analysis has 19 samples (obtained from 19 acute myeloid leukemia (AML) model mice), containing RNA-sequencing results for types of AML under controlled treatment conditions.
This [dataset can be downloaded from this page on refine.bio](https://www.refine.bio/experiments/SRP070849).
We will download it already [processed and quantile normalized](http://docs.refine.bio/en/latest/main_text.html#refine-bio-processed-refinebio-processedibadge).
## Set up analysis folders
```{r}
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
plots_dir <- "plots"
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
results_dir <- "results"
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
```
```{r}
# Define the file path to the data directory
data_dir <- file.path("data", "SRP070849")
# Declare the file path to the gene expression matrix file
data_file <- file.path(data_dir, "SRP070849.tsv")
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")
```
# Clustering Heatmap - RNA-seq
## Install libraries
We will use []`pheatmap` by Slowikowski et al., 2017](https://www.rdocumentation.org/packages/pheatmap/versions/1.0.12/topics/pheatmap) for clustering and creating a heatmap.
```{r}
if (!("pheatmap" %in% installed.packages())) {
# Install pheatmap
install.packages("pheatmap", update = FALSE)
}
```
Attach the `pheatmap` and `magrittr` libraries:
```{r message=FALSE}
# Attach the `pheatmap` library
library(pheatmap)
# We will need this so we can use the pipe: %>%
library(magrittr)
# Set the seed so our results are reproducible:
set.seed(12345)
```
## Import and set up data
This chunk of code will read in both TSV files and add them as data frames to your environment.
```{r}
# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
# Here we are going to store the gene IDs as row names so that
# we can have only numeric values to perform calculations on later
tibble::column_to_rownames("Gene")
```
Let's take a look at the metadata object that we read into the R environment.
```{r}
head(metadata)
```
Now let's ensure that the metadata and data are in the same sample order.
```{r}
# Make the data in the order of the metadata
expression_df <- expression_df %>%
dplyr::select(metadata$refinebio_accession_code)
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```
Now we are going to use the `pheatmap` package to look at how are samples and genes are clustering.
## Choose genes of interest
For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criterion by which you may want to sort your genes, <i>e.g.</i> fold change, t-statistic, membership in a particular gene ontology, so on.
```{r}
# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)
# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)
# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
dplyr::filter(variances > upper_var)
```
Let's save these to our results folder as a TSV file.
```{r}
readr::write_tsv(df_by_var, file.path(results_dir, "top_90_var_genes.tsv"))
```
## Prepare metadata for annotation
From the accompanying [paper](https://pubmed.ncbi.nlm.nih.gov/28193779/), we know that the mice with `IDH2` mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with `TET2` mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent) [Shih et al., 2017](https://pubmed.ncbi.nlm.nih.gov/28193779/).
```{r}
# Let's prepare the annotation for the uncollapsed `DESeqData` set object
# which will be used to annotate the heatmap
annotation_df <- metadata %>%
# Create a variable to store the cancer type information
dplyr::mutate(
mutation = dplyr::case_when(
startsWith(refinebio_title, "TET2") ~ "TET2",
startsWith(refinebio_title, "IDH2") ~ "IDH2",
startsWith(refinebio_title, "WT") ~ "WT",
# If none of the above criteria are satisfied,
# we mark the `mutation` variable as "unknown"
TRUE ~ "unknown"
)
) %>%
# select only the columns we need for annotation
dplyr::select(
refinebio_accession_code,
mutation,
refinebio_treatment
) %>%
# The `pheatmap()` function requires that the row names of our annotation
# data frame match the column names of our `DESeaDataSet` object
tibble::column_to_rownames("refinebio_accession_code")
```
### Create annotated heatmap
```{r}
# Create and store the annotated heatmap object
heatmap_annotated <-
pheatmap(
df_by_var,
cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = FALSE,
annotation_col = annotation_df, # Specify our annotation here
main = "Annotated Heatmap",
colorRampPalette(c(
"deepskyblue",
"black",
"yellow"
))(25
),
scale = "row" # Scale values in the direction of genes (rows)
)
```
### Save annotated heatmap as a PNG
You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
```{r}
# Open a PNG file
png(file.path(
plots_dir,
"aml_heatmap.png" # Replace with a relevant file name
))
# Print the heatmap
heatmap_annotated
# Close the PNG file:
dev.off()
```
# Session info
```{r}
# Print session info
sessioninfo::session_info()
```