-
Notifications
You must be signed in to change notification settings - Fork 2
/
HOWTOXtX.Rmd
219 lines (170 loc) · 7.71 KB
/
HOWTOXtX.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
title: "XtX subpopulation differentiation"
output:
html_document:
fig_caption: yes
highlight: zenburn
theme: cerulean
toc: yes
toc_depth: 4
pdf_document:
fig_caption: yes
highlight: zenburn
toc: yes
toc_depth: 4
word_document: default
date: '`r format(Sys.Date(),"%d/%b/%Y")`'
---
```{r knitr setup, include=FALSE, eval=TRUE, echo=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(eval=TRUE, cache=FALSE, message=FALSE, warning=FALSE,
comment = "", results="markup")
```
This document describes how SNP genotypes from 135 landraces of the Spanish Barley Core Collection (SBCC) grouped into 4 subpopulations, were used to obtain XtX statistics by CP Cantalapiedra.
# Rebuilding data files to obtain subpopulation-based bayenv input files
## Files for covariance matrix
```{r b1, engine='bash'}
cd xtx_subpops
./bayenv_lines_to_subpops.py SBCC_order.subpops.tab ../SBCC_order.txt ../SBCC_nr_SNPs.tsv show \
> SBCC_nr_subpops.tsv 2> SBCC_nr_subpops.err
```
## Files for bayenv XtX analysis
```{r b2, engine='bash'}
cd xtx_subpops
./bayenv_lines_to_subpops.py SBCC_order.subpops.tab ../SBCC_order.txt ../SBCC_9K_SNPs.tsv show \
> SBCC_9K_subpops.tsv 2> SBCC_9K_subpops.err
```
**NOTE**: the "show" option is used to include in the output file those SNPs with missing data in some of the genotypes. The opposite could be obtained using "hide" as option.
# Running bayenv2 XtX analyses
Following the work of [Rusell et al 2016](http://www.nature.com/ng/journal/v48/n9/full/ng.3612.html),
200K MCMC steps are performed.
First, a dummy climate file is required in order to run bayenv:
```{r b3, engine='bash'}
cd xtx_subpops
cat ../SBCC_environfile.tsv | head -1 | awk '{print $1"\t"$2"\t"$3"\t"$4}' > envfile.dummy
```
A covariance matrix is also required. Here it is computed 10 times:
```{r b4, engine='bash', eval=FALSE}
cd xtx_subpops
mkdir -p matrices/raw/
# Create 10 covariance matrices, this will take some time
for i in {1..10}; do
rnd=$(perl -e 'printf("%05d",rand(99999))');
echo $rnd >> bayenv.covar.rnd_seeds
mkdir _job$i; cd _job$i; ln -s ../SBCC_nr_subpops.tsv .; \
../soft/bayenv2/bayenv2 -i SBCC_nr_subpops.tsv -p 4 -k 100000 -r $rnd \
> ../SBCC_nr_subpops_matrix_it100K_$i.out &
cd ..;
done;
# We can now put the resulting files away:
rm -rf _job*
gzip SBCC_nr_subpops_matrix_it100K_*
mv SBCC_nr_subpops_matrix_it100K_* matrices/raw/
# Extract final matrix from each replicate
for i in {1..10}; do
zcat matrices/raw/SBCC_nr_subpops_matrix_it100K_$i.out.gz | \
perl -lne 'if(/ITER = 100000/){$ok=1}elsif($ok){ print }' \
| sed '/^$/d' > matrices/SBCC_nr_subpops_matrix_$i.txt
done
```
A single average matrix can be computed as follows:
```{r average, eval=FALSE}
setwd(xtx_subpops)
# read all final matrices
m1 = as.matrix( read.table(file="matrices/raw/SBCC_nr_subpops_matrix_1.txt", header=F) )
m2 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_2.txt", header=F) )
m3 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_3.txt", header=F) )
m4 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_4.txt", header=F) )
m5 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_5.txt", header=F) )
m6 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_6.txt", header=F) )
m7 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_7.txt", header=F) )
m8 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_8.txt", header=F) )
m9 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_9.txt", header=F) )
m10 = as.matrix( read.table(file="matrices/SBCC_nr_subpops_matrix_10.txt", header=F) )
# make a list of matrices and get mean as explained in:
mat_list = list( m1, m2, m3, m4, m5, m6, m7, m8, m9, m10 )
mean_mat = apply(simplify2array(mat_list), c(1,2), mean)
# write resulting mean cov matrix
write.table(mean_mat,file="matrices/SBCC_nr_subpops_matrix_mean.txt",
sep="\t",row.names=F,col.names=F,quote=F)
```
This should create file *xtx_subpops/matrices/SBCC_nr_subpops_matrix_mean.txt*
Now three replicates of bayenv XTX will be run:
```{r b5, engine='bash', eval=FALSE}
cd xtx_subpops
# 1st replicate run
rnd=$(perl -e 'printf("%05d",rand(99999))');
echo "$rnd" > SBCC_9K_subpops.tsv.xtx.rep1.rnd
../soft/bayenv2/calc_xtx_parallel.pl 5 bayenv2 -X -t -i SBCC_9K_subpops.tsv -p 4 \
-e envfile.dummy -n 1 \
-m matrices/SBCC_nr_subpops_matrix_mean.txt \
-k 200000 -r "$rnd" -o SBCC_9K_subpops.tsv.xtx.rep1 -c \
> SBCC_9K_subpops.tsv.xtx.rep1.stdout 2>&1;
# 2nd run
rnd=$(perl -e 'printf("%05d",rand(99999))');
echo "$rnd" > SBCC_9K_subpops.tsv.xtx.rep2.rnd
../soft/bayenv2/calc_xtx_parallel.pl 5 bayenv2 -X -t -i SBCC_9K_subpops.tsv -p 4 \
-e envfile.dummy -n 1 \
-m matrices/SBCC_nr_subpops_matrix_mean.txt \
-k 200000 -r "$rnd" -o SBCC_9K_subpops.tsv.xtx.rep2 -c \
> SBCC_9K_subpops.tsv.xtx.rep2.stdout 2>&1;
# 3rd run
rnd=$(perl -e 'printf("%05d",rand(99999))');
echo "$rnd" > SBCC_9K_subpops.tsv.xtx.rep3.rnd
../soft/bayenv2/calc_xtx_parallel.pl 5 bayenv2 -X -t -i SBCC_9K_subpops.tsv -p 4 \
-e envfile.dummy -n 1 \
-m matrices/SBCC_nr_subpops_matrix_mean.txt \
-k 200000 -r "$rnd" -o SBCC_9K_subpops.tsv.xtx.rep3 -c \
> SBCC_9K_subpops.tsv.xtx.rep3.stdout 2>&1;
# compress outfiles
gzip SBCC_9K_subpops.tsv.xtx.rep1
gzip SBCC_9K_subpops.tsv.xtx.rep2
gzip SBCC_9K_subpops.tsv.xtx.rep3
```
Let's inspect one of the XtX bayenv files:
```{r, engine='bash'}
cd xtx_subpops
zcat SBCC_9K_subpops.tsv.xtx.rep1.gz | head -6
```
# Annotating and exporting XtX results
```{r, engine='bash'}
cd xtx_subpops
# add SNP name to XtX results
perl xtx2table.pl SBCC_9K_subpops.tsv.xtx.rep1.gz ../SBCC_9K_SNPs.annot.tsv > \
SBCC_9K_subpops.tsv.xtx.rep1.tsv
perl xtx2table.pl SBCC_9K_subpops.tsv.xtx.rep2.gz ../SBCC_9K_SNPs.annot.tsv > \
SBCC_9K_subpops.tsv.xtx.rep2.tsv
perl xtx2table.pl SBCC_9K_subpops.tsv.xtx.rep3.gz ../SBCC_9K_SNPs.annot.tsv > \
SBCC_9K_subpops.tsv.xtx.rep3.tsv
head -6 SBCC_9K_subpops.tsv.xtx.rep1.tsv
gzip -f SBCC_9K_subpops.tsv.xtx.rep1.tsv
gzip -f SBCC_9K_subpops.tsv.xtx.rep2.tsv
gzip -f SBCC_9K_subpops.tsv.xtx.rep3.tsv
# add cM & bp map positions (./obtain_map.R IN OUT MAP)
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep1.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep1.bp.tsv ../raw/9920_SNPs_SBCC_bp_map2017.curated.tsv
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep2.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep2.bp.tsv ../raw/9920_SNPs_SBCC_bp_map2017.curated.tsv
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep3.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep3.bp.tsv ../raw/9920_SNPs_SBCC_bp_map2017.curated.tsv
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep1.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep1.cM.tsv ../raw/9920_SNPs_SBCC_cM_map2017.curated.tsv
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep2.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep2.cM.tsv ../raw/9920_SNPs_SBCC_cM_map2017.curated.tsv
Rscript ./obtain_map.R SBCC_9K_subpops.tsv.xtx.rep3.tsv.gz \
SBCC_9K_subpops.tsv.xtx.rep3.cM.tsv ../raw/9920_SNPs_SBCC_cM_map2017.curated.tsv
# join all results into a single file
join -t $'\t' SBCC_9K_subpops.tsv.xtx.rep1.bp.tsv SBCC_9K_subpops.tsv.xtx.rep2.bp.tsv \
| join -t $'\t' - SBCC_9K_subpops.tsv.xtx.rep3.bp.tsv \
| awk '{print $1"\t"$3"\t"$4"\t"$2"\t"$5"\t"$8}' \
> SBCC_9K_subpops.tsv.xtx.bp.tsv
join -t $'\t' SBCC_9K_subpops.tsv.xtx.rep1.cM.tsv SBCC_9K_subpops.tsv.xtx.rep2.cM.tsv \
| join -t $'\t' - SBCC_9K_subpops.tsv.xtx.rep3.cM.tsv \
| awk '{print $1"\t"$3"\t"$4"\t"$2"\t"$5"\t"$8}' \
> SBCC_9K_subpops.tsv.xtx.cM.tsv
head -6 SBCC_9K_subpops.tsv.xtx.bp.tsv
head -6 SBCC_9K_subpops.tsv.xtx.cM.tsv
```
The final results file of tis protocol are
[SBCC_9K_subpops.tsv.xtx.bp.tsv](xtx_subpops/SBCC_9K_subpops.tsv.xtx.bp.tsv) and
[SBCC_9K_subpops.tsv.xtx.cM.tsv](xtx_subpops/SBCC_9K_subpops.tsv.xtx.cM.tsv)