---
title: "Supplementary Information"
author: "Ava Hoffman, Jenny Cocciardi"
output:
  pdf_document:
    toc: true
    number_sections: true
  md_document:
    variant: gfm
    toc: true
    number_sections: true
urlcolor: blue
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 72), tidy = TRUE)
```
```{=html}
<!-- To nicely format this markdown doc, I recommend using
Sublime Text editor. Highlight the text in question and select Edit >
Wrap > Wrap paragraph at 72 characters. -->
```
<!-- Allow "S" labeling on Figs and Tables -->
\newcommand{\beginsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
}
\beginsupplement
<!-- End -->
# Introduction {#1-introduction}
In this experiment, we used quaddRAD library prep to prepare the sample
DNA. This means each sample carried two unique outer barcodes (standard
Illumina barcodes) *and* two unique inner barcodes (random barcode bases
inside the adapters) - more than 1,700 barcode combinations in total!

The sequencing facility demultiplexes samples based on the outer
barcodes (typically called i5nn and i7nn). Once this is done, each file
still contains a mix of the inner barcodes. We will refer to these files
as "sublibraries" because they are only halfway demultiplexed. We
separate them out bioinformatically later.
## Raw Data File Naming - Sublibraries {#11-raw-data-file-naming---sublibraries}
Here's a bit of information on the file naming convention. A typical raw
file looks like this:

```
AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz
```

- `AMH_macro` : author initials, with "macro" standing for
  "Macrosystems". These appear on every file.
- `1_1` : the first number is the *i5nn* barcode and the second the
  *i7nn* barcode for the given sublibrary. We know all these samples
  have i5nn barcode "1" and i7nn barcode "1", which narrows down what
  they can be.
- `12px` : how many samples are in the sublibrary. "12px" means
  12-plexed, or 12 samples. In other words, we will use the inner
  barcodes to further distinguish 12 unique samples in this sublibrary.
- `S1` : a unique sublibrary name. S1 = i5nn barcode 1 and i7nn
  barcode 1.
- `L001` : this particular file came from lane 1 of the NovaSeq. There
  are four lanes, and all samples should appear across all four lanes.
- `R1` : the first (R1) of two paired-end reads (R1 and R2).
- `001.fastq.gz` : the last part doesn't mean anything; it was just
  added automatically before the file suffix (`fastq.gz`).
## A Note on File Transfers {#12-a-note-on-file-transfers}
There are three main systems at play for file transfer: the local
machine, the sequencing facility's (GRCF) Aspera server, and
[MARCC](https://www.marcc.jhu.edu/). The Aspera server is where the data
were stored immediately after sequencing. MARCC is where we planned to
do preprocessing and analysis. Scripts and text files are easiest to
edit on a local machine, so we used [Globus](https://app.globus.org)
to transfer these small files from the local machine to MARCC.

Midway through these analyses, we transitioned to another cluster, JHU's
Rockfish. The scripts below, with the exception of file transfer from
the Aspera server, should reflect the new filesystem, though you will
have to adjust the file paths accordingly.
![File transfer schematic](figures/file_transfer.jpg)\
# Preprocessing {#2-preprocessing}
## Transfer Files {#21-transfer-files}
Referred to throughout the files as "Step 1". Files can be found in the `01_transfer_files/` directory,
which contains files named in this convention: `01-aspera_transfer_n.txt`.
These are text files containing the *names* of the `fastq.gz` files that
we wanted to transfer from the sequencing facility's Aspera server to
the computing cluster ([MARCC](https://www.marcc.jhu.edu/)). Because
transfers could take a long time, this setup made it easy to move only
certain files over at once; we definitely did this piecemeal. Possible
file names are shown in
[Aspera Transfer File Names](#aspera-transfer-file-names).
There are multiple of these files so that we could parallelize (replace
n with the correct number in the command used below). Each text file
will need to be uploaded to your scratch directory in MARCC.
Files were then transferred using the following commands. Before
starting, make sure you are in a data transfer node. Then, load the
aspera module. Alternatively, you can install the Aspera transfer
software and use that.
```
module load aspera
```
Initiate the transfer from within your scratch directory:
```
ascp -T -l8G -i /software/apps/aspera/3.9.1/etc/asperaweb_id_dsa.openssh \
  --file-list=01-aspera_transfer_n.txt \
  --mode=recv --user=<aspera-user> --host=<aspera-IP> \
  /scratch/users/<me>@jhu.edu
```
## Concatenate Files and Install Stacks {#22-concatenate-files-and-install-stacks}
Referred to throughout the files as "Step 2". Files can be found in the `02_concatenate_and_check/` directory.
### Concatenate Files for each Sublibrary {#221-concatenate-files-for-each-sublibrary}
**Step 2a**. We ran our samples across the whole flow cell of the NovaSeq, so results
came in 8 files for each demultiplexed sublibrary (4 lanes x 2 paired-end
reads). For example, for sublibrary 1_1, we'd see the following 8 files:

```
AMH_macro_1_1_12px_S1_L001_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L001_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L002_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L002_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L003_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L003_R2_001.fastq.gz
AMH_macro_1_1_12px_S1_L004_R1_001.fastq.gz
AMH_macro_1_1_12px_S1_L004_R2_001.fastq.gz
```
The `02_concatenate_and_check/02-concat_files_across4lanes.sh`
script finds all files in the
working directory with the name pattern `*_L001_*.fastq.gz` and then
concatenates across lanes 001, 002, 003, and 004 so they can be managed
further. The "L001" part of the filename is then eliminated. For example,
the 8 files above would become:

```
AMH_macro_1_1_12px_S1_R1.fastq.gz
AMH_macro_1_1_12px_S1_R2.fastq.gz
```
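The core of the concatenation can be sketched as follows (a minimal
equivalent, not the script verbatim; concatenated `gzip` streams are
themselves a valid `gzip` file):

```
for f in *_L001_*.fastq.gz; do
  # derive the lane-free output name, e.g. ..._S1_L001_R1_001.fastq.gz -> ..._S1_R1.fastq.gz
  out=$(echo "$f" | sed 's/_L001//; s/_001\.fastq\.gz/.fastq.gz/')
  # append the matching files from lanes 2-4 after the lane-1 file
  cat "$f" "${f/_L001_/_L002_}" "${f/_L001_/_L003_}" "${f/_L001_/_L004_}" > "$out"
done
```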
Rockfish uses [slurm](https://slurm.schedmd.com/overview.html) to manage
jobs. To run the script, use the `sbatch` command. For example:

```
sbatch ~/code/02-concat_files_across4lanes.sh
```

This command runs the script from within the current directory but pulls
the script from the code directory. It will concatenate all files within
the current directory that match the loop pattern.
### Download and Install Stacks {#222-download-and-install-stacks}
**Step 2b**. On Rockfish, [Stacks](https://catchenlab.life.illinois.edu/stacks/) will
need to be downloaded to each user's code directory. Stacks, and
software in general, should be compiled in an interactive session or loaded via module. For more
information on interactive mode, see `interact --usage`.

```
interact -p debug -g 1 -n 1 -c 1
module load gcc
```
Now download Stacks. We used version 2.60.

```
wget http://catchenlab.life.illinois.edu/stacks/source/stacks-2.60.tar.gz
tar xfvz stacks-2.60.tar.gz
```
Next, go into the `stacks-2.60` directory and run the following commands:

```
./configure --prefix=/home/<your_username>/code4-<PI_username>
make
make install
export PATH=$PATH:/home/<your_username>/code4-<PI_username>/stacks-2.60
```
The filesystem patterns on your cluster might be different, and you
should change these file paths accordingly.
## Remove PCR Clones {#23-remove-pcr-clones}
Referred to throughout the files as "Step 3". Files can be found in the `03_clone_filter/` directory.
### Run PCR Clone Removal Script {#231-run-pcr-clone-removal-script}
**Step 3a**. The `03-clone_filter.sh` script runs `clone_filter` from
[Stacks](https://catchenlab.life.illinois.edu/stacks/). The program was
run with options `--inline_inline --oligo_len_1 4 --oligo_len_2 4`. The
`--oligo_len_x 4` options indicate that a 4-base-pair degenerate sequence
was included on the outside of the barcodes for detecting PCR
duplicates. The script uses the file name prefixes listed for each
single sub-pooled library in `03-clone_filter_file_names.txt` and loops
to run `clone_filter` on all of them. Possible file names are shown in
[`clone_filter` File Names](#clone_filter-file-names).
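A minimal sketch of that loop, with input/output paths that are
assumptions rather than the actual script:

```
while read -r prefix; do
  clone_filter \
    -1 "${prefix}_R1.fastq.gz" -2 "${prefix}_R2.fastq.gz" \
    -i gzfastq -o ./clone_filtered/ \
    --inline_inline --oligo_len_1 4 --oligo_len_2 4
done < 03-clone_filter_file_names.txt
```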
### Parse PCR Clone Removal Results {#232-parse-pcr-clone-removal-results}
**Step 3b**. If you want to extract descriptive statistics from the `clone_filter`
output, you can use the `03.5-parse_clone_filter.py` script to do so. It
can be run on your local terminal after transferring the
`clone_filter.out` logs to your local computer.
```{r PCR_clones, message=FALSE, fig.cap = "PCR clone removal statistics"}
source("03_clone_filter/examine_clones.R")
make_cloneplot()
```
## Step 4 - Demultiplexing and Sample Filtering {#step-4---demultiplexing-and-sample-filtering}
Files can be found in the `04_demux_filter/` directory.
### Step 4a - Demultiplex and Filter {#step-4a---demultiplex-and-filter}
The `04-process_radtags.sh` script runs `process_radtags` from
[Stacks](https://catchenlab.life.illinois.edu/stacks/). The program was
run with options
`-c -q --inline_inline --renz_1 pstI --renz_2 mspI --rescue --disable_rad_check`.
The script uses the same file prefixes as [Step 3 -
`03-clone_filter.sh`](#231-run-pcr-clone-removal-script). Each sub-pooled
library has a forward and reverse read file that was filtered in the
previous step. Like [Step 3a](#231-run-pcr-clone-removal-script),
the script uses the file name prefixes listed for each single sub-pooled
library in `04-process_radtags_file_names.txt` and loops to run
`process_radtags` on all of them. Possible file names are shown in
[`clone_filter` File Names](#clone_filter-file-names).
Each sub-pooled library also has a demultiplexing file (in the `04-demux/`
directory) that contains the sample names and inner barcodes.
For example, for sublibrary 1_1, we'd see the following barcode file:
```
ATCACG  AGTCAA  DS.BA.PIK.U.1
CGATGT  AGTTCC  DS.BA.PIK.U.2
TTAGGC  ATGTCA  DS.BA.PIK.U.3
TGACCA  CCGTCC  DS.BA.PIK.U.4
ACAGTG  GTCCGC  DS.BA.PIK.U.5
GCCAAT  GTGAAA  DS.BA.DHI.U.1
CAGATC  GTGGCC  DS.BA.DHI.U.2
ACTTGA  GTTTCG  DS.BA.DHI.U.3
GATCAG  CGTACG  DS.BA.DHI.U.4
TAGCTT  GAGTGG  DS.BA.DHI.U.5
GGCTAC  ACTGAT  DS.BA.GA.U.1
CTTGTA  ATTCCT  DS.BA.GA.U.2
```
The `process_radtags` command will demultiplex the data by separating
each sublibrary into its individual samples. It will then clean the
data, removing low-quality reads and discarding reads where a barcode
was not found. A sketch of the call is shown below.
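This hypothetical sketch of the per-sublibrary call assumes the
directory layout, input file names, and barcode-file naming; the actual
script may differ:

```
while read -r prefix; do
  mkdir -p ~/scratch/demux/"${prefix}"
  process_radtags \
    -1 ./clone_filtered/"${prefix}_R1.1.fq.gz" \
    -2 ./clone_filtered/"${prefix}_R2.2.fq.gz" \
    -b 04-demux/"${prefix}_barcodes.txt" \
    -o ~/scratch/demux/"${prefix}" \
    -c -q --inline_inline --renz_1 pstI --renz_2 mspI \
    --rescue --disable_rad_check
done < 04-process_radtags_file_names.txt
```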
### Step 4b - Organize files {#step-4b---organize-files}
In a new directory, make sure the files are organized by species. In the
`process_radtags` script, we specified that files be sent to
`~/scratch/demux/*sublibrary_name*` (reasoning for this is in [Step
4c](#step-4c---assess-the-raw-processed-and-cleaned-data)), but files
should manually be organized into species folders (i.e.,
`~/scratch/demux/*SPP*`) after `process_radtags` is performed. For
example, the file "DS.MN.L01-DS.M.1.1.fq.gz" should be sent to the
`~/scratch/demux/DS` directory.
Note: this step is not currently automated, but automating the file
moves (see the sketch below) would keep this step from being forgotten.
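A minimal sketch of what such automation could look like, assuming every
sample file begins with its two-letter species code:

```
for f in ~/scratch/demux/*/*.fq.gz; do
  spp=$(basename "$f" | cut -d. -f1)                        # species code, e.g., "DS"
  [ "$(basename "$(dirname "$f")")" = "$spp" ] && continue  # already sorted
  mkdir -p ~/scratch/demux/"$spp"
  mv "$f" ~/scratch/demux/"$spp"/
done
```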
### Step 4c - Assess the raw, processed, and cleaned data {#step-4c---assess-the-raw-processed-and-cleaned-data}
In the script for [Step 4a](#step-4a---demultiplex-and-filter), we have
specified that a new output folder be created for each sublibrary. The
output folder is where all sample files and the log file will be dumped
for each sublibrary. It is important to specify a different output
folder if you have multiple sublibraries because we will be assessing
the output log for each sublibrary individually (and otherwise, the log
is overwritten when the script loops to a new sublibrary).
The utility `stacks-dist-extract` can be used to extract data from the
log file. First, we examined the library-wide statistics to identify
sublibraries where barcodes may have been misentered or where sequencing
error may have occurred. We used:
```
stacks-dist-extract process_radtags.log total_raw_read_counts
```
to pull out data on the total number of sequences, the number of
low-quality reads, whether barcodes were found or not, and the total
number of retained reads per sublibrary. Look these over to make sure
there are no outliers or sublibraries that need to be checked and rerun.
Next, we used:
```
stacks-dist-extract process_radtags.log per_barcode_raw_read_counts
```
to analyze how well each sample performed. There are three important
statistics to consider for each sample.

1. *The proportion of reads per sample within each sublibrary* indicates
   how much of the overall library each individual represents after
   processing and sequencing. This is important to consider because a
   single sample dominating its sublibrary may indicate contamination.
2. *The number of reads retained for each sample* can be an indicator
   of coverage. It is usually a good idea to remove samples with a
   very low number of reads. Where you place the cutoff for low-coverage
   samples depends on your dataset; for example, a threshold of 1
   million reads is often used, but this is not universal.
3. *The proportion of reads retained for each sample* can also flag
   low-quality samples and gives an idea of the variation in coverage
   across samples.
Output for sublibraries for this step is summarized in
[`process_radtags-library_output.csv`](output/process_radtags-library_output.csv).
Output for individual samples for this step is summarized in
[`process_radtags-sample_output.csv`](output/process_radtags-sample_output.csv).
The script `04c-process_radtags_stats.R` was used to create many plots for
easily assessing each statistic. Output from this step can be found in
`figures/process_radtags/` where figures are organized by species.
```{r radtag_stats, message=FALSE, fig.cap = "RAD tag processing statistics"}
source("04_demux_filter/04c-radtags_filter_summary.R")
make_filterplot()
```
### Step 4d - Identify low-coverage and low-quality samples to remove from downstream analysis {#step-4d---identify-low-coverage-and-low-quality-samples-from}
Using the same output log and the above statistics, we removed
low-coverage and low-quality samples that may skew downstream analyses.
Samples were identified and removed via the following procedure:
1. First, samples that represented less than **1% of the sequenced
   sublibrary** were identified and removed. These correspond to
   low-read, low-coverage samples.
2. Next, a threshold of **1 million retained reads per sample** was
   used to remove any remaining low-read samples (see the sketch
   below). Low-read samples have low coverage and lack enough raw reads
   to contribute to downstream analyses.
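A minimal sketch of the two filters in R, assuming the per-sample table
uses columns named `sample`, `retained_reads`, and `prop_sublibrary`
(column names assumed):

```{r filter_sketch, eval = FALSE}
samples <- readr::read_csv("output/process_radtags-sample_output.csv")
kept <- dplyr::filter(samples,
                      prop_sublibrary >= 0.01,  # at least 1% of the sequenced sublibrary
                      retained_reads >= 1e6)    # at least 1 million retained reads
discarded <- dplyr::anti_join(samples, kept, by = "sample")
```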
Good/kept samples are summarized in
[`process_radtags-kept_samples.csv`](output/process_radtags-kept_samples.csv).
Discarded samples are summarized in
[`process_radtags-discarded_samples.csv`](output/process_radtags-discarded_samples.csv).
```{r radtag_manual_filter, message=FALSE, fig.cap = "RAD tag manual filtering summary"}
source("04_demux_filter/04c-radtags_filter_summary.R")
make_manual_discard_plot()
```
Note: At this point, we started using Stacks 2.62 for its
multi-threading capabilities. Functionality of the previous steps should
be the same, however.
# Generating Stacks Catalogs and Calling SNPs {#generating-stacks-catalogs-and-calling-snps-mag}
## Step 5 - Metapopulation Catalog Building and Parameter Search {#step-5---metapopulation-catalog-building-and-parameter-search}
Files can be found in the `05_ustacks_and_params/` directory.
Going forward, when we use the term **metapopulation**, we are referring
to the collection of all samples within species among all cities where
the species was present.
It is important to conduct preliminary analyses that identify an
optimal set of parameters for the dataset (see [Step
5a](#step-5a---denovo_mapsh)). Following the parameter optimization, the
program `ustacks` can be run to build loci in each sample.
### Step 5a - Run `denovo_map.sh` {#step-5a---denovo_mapsh}
Stack assembly will differ based on several aspects of the
dataset (such as the study species, the RAD-seq method used, and/or the
quality and quantity of DNA used), so it is important to use parameters
that will maximize the amount of biological data obtained from stacks.
There are three main parameters to consider when doing this:

1. *m* = controls the minimum number of raw reads required to form a
   stack (implemented in `ustacks`)
2. *M* = controls the number of mismatches allowed between stacks to
   merge them into a putative locus (implemented in `ustacks`)
3. *n* = controls the number of mismatches allowed between stacks to
   merge them into the catalog (implemented in `cstacks`)

There are two main ways to optimize parameterization:

1. an iterative method where you sequentially change each parameter
   while keeping the other parameters fixed (described in *Paris et al.
   2017*), or
2. an iterative method where you sequentially change the values of *M*
   and *n* (keeping *M* = *n*) while fixing *m* = 3, and then test *m* =
   2 and 4 once the optimal *M* = *n* is determined (described in
   *Rochette and Catchen 2017*, *Catchen 2020*).
We performed the second method and used the `denovo_map.sh` script to
run the `denovo_map.pl` command to perform iterations. This script
requires that we first choose a subset of samples to run the iterations
on. The samples should be representative of the overall dataset; meaning
they should include all populations and have similar read coverage
numbers. Read coverage numbers can be assessed by looking at the
descriptive statistics produced from [Step
4c](#step-4c---assess-the-raw-processed-and-cleaned-data).
Place these samples in a text file (`popmap_test_samples.txt`) with the
name of each sample, and specify that all samples belong to the same
population. For example, `popmap_test_samples.txt` should look like:

```
DS.BA.GA.U.1    A
DS.PX.BUF.M.5   A
DS.B0.HC4.M.1   A
...
```
It is important to have all representative samples treated as one
population because you will assess outputs found across 80% of the
individuals. The script will read this text file from the `--popmap`
argument.
The script also requires that you specify an output directory after
`-o`. This should be unique to the parameter you are testing; for
example, if you are testing *M* = 3, then you could make a subdirectory
labeled `stacks.M3` where all outputs from `denovo_map.sh` will be
placed. Otherwise, the outputs from each iteration will be overwritten
and you will lose the log from the previous iteration. The
`denovo_map.sh` script also requires that you point it to where your
samples are stored, i.e., the directory built in [Step
4b](#step-4b---organize-files). Make sure to include the
`--min-samples-per-pop 0.80` argument.
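A hypothetical call for a single iteration (here *M* = *n* = 3, with *m*
fixed at 3); paths are placeholders, and component-level options are
passed through `-X`:

```
denovo_map.pl \
  --samples ~/scratch/demux/DS \
  --popmap popmap_test_samples.txt \
  -o ./stacks.M3 \
  -M 3 -n 3 \
  -X "ustacks: -m 3" \
  -X "populations: --min-samples-per-pop 0.80"
```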
To decide which parameters to use, examine the following from each
iteration:

1. the average sample coverage: obtained from the summary log in the
   `ustacks` section of `denovo_map.log`. If samples have coverage
   \<10x, you will have to rethink the parameters you use here.
2. the number of assembled loci shared by 80% of samples: found in
   `populations.haplotypes.tsv` by counting the number of loci:
   `cat populations.haplotypes.tsv | grep -v "^#" | wc -l`
3. the number of polymorphic loci shared by 80% of samples: found in
   `populations.sumstats.tsv` or by counting
   `populations.hapstats.tsv`:
   `cat populations.hapstats.tsv | grep -v "^#" | wc -l`
4. the number of SNPs per locus shared by 80% of samples: found in
   `denovo_map.log` or by counting the number of SNPs in
   `populations.sumstats.tsv`:
   `cat populations.sumstats.tsv | grep -v "^#" | wc -l`
The script `05a-param_opt-figures_script.R` was used to create plots for
assessing the change in shared loci across parameter iterations.
Based on this optimization step, we used the following parameters:
```{r param_table, echo = FALSE, message = FALSE}
table_ <-
readr::read_csv("output/parameter_opt-final_params_per_species.csv")[, 2:5]
colnames(table_) <-
c("Species",
"M (locus mismatches)",
"n (catalog mismatches)",
"m (minimum reads)")
knitr::kable(
table_,
format = "simple",
caption = "Final parameter optimization values for the Stacks pipeline."
)
```
### Step 5b - Run `ustacks` {#step-5b---run-ustacks}
`ustacks` builds *de novo* loci in each individual sample. We have
designed the `ustacks` script so that the process requires three files:
- `05-ustacks_n.sh` : the shell script that executes `ustacks`
- `05-ustacks_id_n.txt` : the sample ID number
- `05-ustacks_samples_n.txt` : the sample names that correspond to the
sample IDs
The sample ID should be derived from the `order_id` column (first column)
of the master spreadsheet. It is unique (1-1736) across all of the
samples.
The sample name is the corresponding name for each sample ID in the
spreadsheet. E.g., sample ID "9" corresponds to sample name
"DS.BA.DHI.U.4". Sample naming convention is
species.city.site.management_type.replicate_plant.
`05-ustacks_n.sh` should specify an output directory (`-o` option) that
will be used for all samples (e.g., `stacks/ustacks`). Files can be
processed piecemeal into this directory. There should be three files for
every sample in the output directory:

- `<samplename>.alleles.tsv.gz`
- `<samplename>.snps.tsv.gz`
- `<samplename>.tags.tsv.gz`

Multiple versions of the `05-ustacks_n.sh` script can be run in parallel
(simply replace n in the three files above with the correct number). A
sketch of the core loop is shown below.
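A hypothetical sketch of the core of `05-ustacks_n.sh`; the paths,
parameter values, and thread count are placeholders:

```
ids=($(cat 05-ustacks_id_n.txt))            # sample ID numbers
samples=($(cat 05-ustacks_samples_n.txt))   # corresponding sample names
for i in "${!samples[@]}"; do
  ustacks \
    -f ~/scratch/demux/DS/"${samples[$i]}".1.fq.gz \
    -i "${ids[$i]}" \
    -o ~/scratch/stacks/ustacks \
    -M 2 -m 3 -p 4   # M and m from the Step 5a parameter search
done
```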
A small number of samples (13) were discarded at this stage as the `ustacks` tool was
unable to form any primary stacks corresponding to loci. See
[output/ustacks-discarded_samples.csv](output/ustacks-discarded_samples.csv).
### Step 5c - Correct File Names {#step-5c---correct-file-names}
This step contains a script `05b-fix_filenames.sh` which uses some
simple regex to fix filenames that are output in previous steps. Stacks
adds an extra "1" at some point at the end of the sample name which is
not meaningful. The following files:
- DS.MN.L02-DS.M.3.1.alleles.tsv.gz
- DS.MN.L03-DS.U.2.1.tags.tsv.gz
- DS.MN.L09-DS.U.1.1.snps.tsv.gz
become:
- DS.MN.L02-DS.M.3.alleles.tsv.gz
- DS.MN.L03-DS.U.2.tags.tsv.gz
- DS.MN.L09-DS.U.1.snps.tsv.gz
The script currently gives some strange log output, so it can probably
be optimized/improved; a hedged equivalent is sketched below. The script
should be run from the directory where the changes need to be made.
Files that have already been fixed will not be changed.
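A minimal equivalent of the rename, touching only names that still carry
two trailing numeric fields before the file type (so already-fixed files
are left alone):

```
for f in *.tsv.gz; do
  # strip the extra ".1" between the replicate number and the file type
  new=$(echo "$f" | sed -E 's/^(.+\.[0-9]+)\.1\.(alleles|snps|tags)\.tsv\.gz$/\1.\2.tsv.gz/')
  [ "$f" != "$new" ] && mv -n "$f" "$new"
done
```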
### Step 5d - Choose catalog samples/files {#step-5d---choose-catalog-samplesfiles}
In the next step, we will choose the files we want to go into the
catalog. This involves a few steps:
1. Create a meaningful directory name. This could be the date (e.g.,
`stacks_22_01_25`).
2. Copy the `ustacks` output for all of the files you want to use in
the reference from Step 5b. Remember this includes three files per
sample. So if you have 20 samples you want to include in the
reference catalog, you will transfer 3 x 20 = 60 files into the
meaningful directory name. The three files per sample should follow
this convention:
- `<samplename>.alleles.tsv.gz`
- `<samplename>.snps.tsv.gz`
- `<samplename>.tags.tsv.gz`
3. Remember the meaningful directory name. You will need it in Step 6.
## Step 6 - Metapopulation catalog with `cstacks` {#step-6---metapopulation-catalog-with-cstacks}
Files can be found in the `06_cstacks/` directory.
`cstacks` builds the locus catalog from all the samples specified. The
accompanying script, `cstacks_SPECIES.sh`, is relatively simple since it
points to the directory containing all the sample files. It follows this
format:

```
cstacks -P ~/directory ...
```
Make sure that you use the meaningful directory from [Step
5d](#step-5d---choose-catalog-samplesfiles) and that you have copied all
the relevant files over. Otherwise this causes
[problems
downstream](https://groups.google.com/g/stacks-users/c/q3YYPprmYnU/m/cH5RB5KwBQAJ).
For example, you might edit the code to point to
`~/scratch/stacks/stacks_22_01_25`:

```
cstacks -P ~/scratch/stacks/stacks_22_01_25 ...
```
The tricky part is ensuring enough compute memory to run the entire
process successfully; there is probably room to optimize this. The
`cstacks` method uses a "population map" file, which in this project
is `cstacks_popmap_SPECIES.txt`. This file specifies which samples to
build the catalog from and categorizes them into your 'populations', in
this case cities, using two tab-delimited columns, e.g.:
```
DS.BA.GA.U.1    Baltimore
DS.BA.GA.U.2    Baltimore
DS.BA.GA.U.3    Baltimore
DS.BA.GA.U.4    Baltimore
DS.BA.GA.U.5    Baltimore
...
```
Make sure the samples in this file correspond to the input files located
in, e.g., `~/scratch/stacks/stacks_22_01_25`. A hypothetical full
invocation is sketched below.
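This sketch assumes the `...` above elides the population map and thread
count; the `-n` value would come from the Step 5a parameter search:

```
cstacks -P ~/scratch/stacks/stacks_22_01_25 \
        -M cstacks_popmap_DS.txt \
        -n 3 -p 8
```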
`cstacks` builds three files for use with all of your samples (in this
pipeline run), mirroring the sample files output by
[`ustacks`](#step-5b---run-ustacks):
- `catalog.alleles.tsv.gz`
- `catalog.snps.tsv.gz`
- `catalog.tags.tsv.gz`
```{r catalog_samples_table, echo = FALSE, message = FALSE}
table_ <-
readr::read_csv("output/cstacks-metapop-catalog_samples-included.csv")[, 1:3]
colnames(table_) <- c("Sample", "Species", "City")
knitr::kable(table_,
format = "simple",
caption = "Subset of samples used in SNP catalog creation.")
```
## Step 7 - Metapopulation locus matching with `sstacks` {#step-7---metapopulation-locus-matching-with-sstacks}
Files can be found in the `07_sstacks/` directory.
All samples in the population (or all samples you want to include in the
analysis) are matched against the catalog produced by
[`cstacks`](#step-6---metapopulation-catalog-with-cstacks) using
`sstacks`, run via the scripts `stacks_SPECIES.sh`
and `stacks_SPECIES_additional.sh`. It runs on the samples present in
the output directory *and* listed in
`sstacks_samples_SPECIES.txt` and
`sstacks_samples_SPECIES_additional.txt` (respectively), so make sure
all your files (sample and catalog, etc.) are there and match.
`sstacks_samples_SPECIES.txt` takes the form:
```
DS.BA.GA.U.1
DS.BA.GA.U.2
DS.BA.GA.U.3
DS.BA.GA.U.4
DS.BA.GA.U.5
...
```
There should be a new file produced at this step for every sample in the
output directory:
- `<samplename>.matches.tsv.gz`
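A hypothetical sketch of the per-sample matching loop (paths are
placeholders):

```
while read -r sample; do
  sstacks -c ~/scratch/stacks/stacks_22_01_25 \
          -s ~/scratch/stacks/stacks_22_01_25/"$sample" \
          -o ~/scratch/stacks/stacks_22_01_25 \
          -p 8
done < sstacks_samples_DS.txt
```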
A small number of samples generated very few matches to the catalog (e.g., only 4 matching loci, far too few to draw any conclusions from) and therefore weren't used in the next step. See [output/sstacks-discarded_samples.csv](output/sstacks-discarded_samples.csv).
## Step 8 - Genotype probabilities with `polyRAD` {#step-8---polyrad}
Files can be found in the `08_polyRAD/` directory.
### Make `RADdata` object
We used the [polyRAD package](https://github.com/lvclark/polyRAD) to call genotypes because many of our species are polyploid or have historical genome duplication. polyRAD takes the catalog output (`catalog.alleles.tsv.gz`) and the accompanying matches to the catalog (e.g., `CD.BA.AA.U.1.matches.tsv.gz`) to create genotype likelihoods for species with diploidy and/or polyploidy. We used the catalog and match files to create a `RADdata` object in R for each species. We ran this on the Rockfish compute cluster, with the `make_polyRAD_<spp>.R` script doing the brunt of the work. The R script was wrapped by `polyrad_make_<spp>.sh` to submit it to the SLURM scheduler.
_Relevant Parameters:_
- `min.ind.with.reads` was set to 20% of samples. This means we discarded any loci not found in at least 20% of samples for each species.
- `min.ind.with.minor.allele` was set to `2`. This means a locus must have at least this many samples with reads for the minor allele in order to be retained.
_Requires:_
- `popmap_<spp>_polyrad.txt`, a list of samples and population
- output from sstacks
_Outputs:_
- `<spp>_polyRADdata.rds`, RDS object (the RADdata object)
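A minimal sketch of the `RADdata` construction, following the
`readStacks()` documentation; the file locations, popmap format, and
argument values here are assumptions, and the project script may differ:

```{r raddata_sketch, eval = FALSE}
library(polyRAD)
popmap <- read.table("popmap_DS_polyrad.txt", col.names = c("sample", "pop"))
rad <- readStacks(
  allelesFile = "catalog.alleles.tsv.gz",
  stacksFiles = popmap$sample,                       # sstacks output prefixes
  min.ind.with.reads = ceiling(0.2 * nrow(popmap)),  # locus present in >= 20% of samples
  min.ind.with.minor.allele = 2
)
saveRDS(rad, "DS_polyRADdata.rds")
```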
### Calculate overdispersion
Next, we calculated overdispersion using the `polyRAD_overdispersion_<spp>.R` script, wrapped by `polyrad_overd_<spp>.sh` to submit the script to the SLURM scheduler.
_Requires:_

- `popmap_<spp>_polyrad.txt`, a list of samples and population
- `<spp>_polyRADdata.rds`, RDS object (the RADdata object) output from the previous step

_Outputs:_

- `<spp>_overdispersion.rds`, RDS object (the overdispersion test output)
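A minimal sketch, assuming `polyRAD::TestOverdispersion()` and a
candidate range (the range tested is an assumption):

```{r overdispersion_sketch, eval = FALSE}
library(polyRAD)
rad <- readRDS("DS_polyRADdata.rds")
od <- TestOverdispersion(rad, to_test = 6:14)  # candidate overdispersion values
saveRDS(od, "DS_overdispersion.rds")
```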
### Estimate genotypes
Next, we filtered loci based on the expected Hind/He statistic and estimated population structure and genotypes using the `polyRAD_filter_<spp>.R` script, wrapped by `polyrad_filt_<spp>.sh` to submit it to the SLURM scheduler.

We used the table in [this tutorial](https://lvclark.r-universe.dev/articles/polyRAD/polyRADtutorial.html), which estimates inbreeding from the ploidy, the optimal overdispersion value, and the mean Hind/He. These values are hardcoded in `polyRAD_filter_<spp>.R`.
_Requires:_

- `popmap_<spp>_polyrad.txt`, a list of samples and population
- `<spp>_polyRADdata.rds`, RDS object (the RADdata object) output from the previous step
- `<spp>_overdispersion.rds`, RDS object (the overdispersion test output) output from the previous step

_Outputs:_
- `<spp>_filtered_RADdata.rds`, RDS object (RADdata object filtered for appropriate Hind/He)
- `<spp>_IteratePopStructPCA.csv`, data output from the genotype estimate PCA, suitable for plotting
- `<spp>_estimatedgeno_RADdata.rds`, RDS object (RADdata object with genotype estimates)
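A minimal sketch of the filtering and genotype estimation; the Hind/He
cutoff and the overdispersion value are assumptions (the project
hardcodes species-specific values):

```{r filter_geno_sketch, eval = FALSE}
library(polyRAD)
rad <- readRDS("DS_polyRADdata.rds")
hh <- HindHe(rad)
keep <- which(colMeans(hh, na.rm = TRUE) < 0.5)  # drop loci with excess Hind/He
rad_filt <- SubsetByLocus(rad, keep)
saveRDS(rad_filt, "DS_filtered_RADdata.rds")
geno <- IteratePopStruct(rad_filt, overdispersion = 9)  # value from the overdispersion test
saveRDS(geno, "DS_estimatedgeno_RADdata.rds")
```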
### Final filter and file cleanup
The output `<spp>_estimatedgeno_RADdata.rds` needs to be converted to genind and structure format for further analysis and steps. There is a little cleanup involved so the population information is retained. For example, Structure needs the population identity to be an integer, not a string. This set of functions can be run on a laptop.
At this stage, we also visually assessed the $H_{ind}/H_e$ statistic versus the locus depth (see `check_coverage` inside the `convert_genomics.R` script). We removed the following samples from further analysis:
```{r polyrad_filtered_table, echo = FALSE, message = FALSE}
table_ <-
readr::read_csv("output/HindHe/polyrad_discarded.csv")
knitr::kable(table_,
format = "simple",
caption = "Subset of samples discarded after genotype estimation using polyRAD.")
```
```{r convert_1, message=FALSE}
source("08_polyRAD/convert_genomics.R")
```
```{r convert_2, eval = FALSE, message=FALSE}
convert_all()
```
## Step 9 - Populations with `Structure`
Files, including model parameters, can be found in the `09_structure/` directory.
Structure documentation can be found [here](https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf).
### Running Structure
`polyRAD` outputs genotype probabilites in a format suitable for Structure.
These files were named as:

```
CD_estimatedgeno.structure
DS_estimatedgeno.structure
EC_estimatedgeno.structure
LS_estimatedgeno.structure
PA_estimatedgeno.structure
TO_estimatedgeno.structure
```
We ran all species using a naive approach (not using prior information) with
$K={1,2,3,4,5}$ (`MAXPOPS` argument). To search for the most appropriate K, we
ran Structure through 5 replicate runs for each combination of species and K,
with 10,000 iterations discarded as burn-in and 20,000 iterations retained.
These runs created files that look like:

```
structure_out_CD1_naive_f       // K = 1, rep 1
structure_out_CD1_naive_rep2_f  // K = 1, rep 2
structure_out_CD1_naive_rep3_f  // K = 1, rep 3
structure_out_CD1_naive_rep4_f  // K = 1, rep 4
structure_out_CD1_naive_rep5_f  // K = 1, rep 5
structure_out_CD2_naive_f       // K = 2, rep 1
structure_out_CD2_naive_rep2_f  // K = 2, rep 2
structure_out_CD2_naive_rep3_f  // K = 2, rep 3
...
```
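Runs like these can be scripted; a hypothetical sketch (flag names follow
the Structure command-line documentation, the burn-in and MCMC lengths
live in `mainparams`/`extraparams`, and the output naming is simplified
relative to the listing above):

```
for K in 1 2 3 4 5; do
  for rep in 1 2 3 4 5; do
    structure -m mainparams -e extraparams \
      -K "$K" -i CD_estimatedgeno.structure \
      -o structure_out_CD"${K}"_naive_rep"${rep}"  # Structure appends "_f"
  done
done
```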
Within each species, we compressed the result files for all K and reps and
submitted them to [Structure Harvester](https://taylor0.biology.ucla.edu/structureHarvester/)
to choose the optimal K using the Delta-K method
(see https://link.springer.com/article/10.1007/s12686-011-9548-7). Once the
optimal K was selected per species, we re-ran Structure using a greater number
of iterations (100,000) for final output and plotting.
# Analysis
## NLCD Data
From the USGS ([NLCD urban imperviousness](https://www.mrlc.gov/data/type/urban-imperviousness)):

> The U.S. Geological Survey (USGS), in partnership with several federal agencies,
has developed and released four National Land Cover Database (NLCD) products over
the past two decades: NLCD 1992, 2001, 2006, and 2011.

We used the NLCD 2016 release, which describes urban imperviousness:

> NLCD imperviousness products represent urban impervious surfaces as a percentage of
developed surface over every 30-meter pixel in the United States. NLCD 2016 updates all
previously released versions of impervious products for CONUS (NLCD 2001, NLCD 2006,
NLCD 2011) along with a new date of impervious surface for 2016. New for NLCD 2016 is
an impervious surface descriptor layer. This descriptor layer identifies types of roads,
core urban areas, and energy production sites for each impervious pixel to allow deeper
analysis of developed features.
First, we trimmed the large NLCD rasters. This makes a smaller `.rds` file for each city.
```{r trim_spatial_1, message=FALSE}
source("R/trim_NLCD_spatial_data.R")
```
```{r trim_spatial_2, eval = FALSE, message=FALSE}
create_spatial_rds_files()
```
## Make maps of sampling locations
Next, we made plots for each city's sampling locations. Note that these only include sites that had viable SNPs.
```{r plot_map_1, message=FALSE}
source("R/plot_map_of_samples.R")
```
```{r plot_map_2, eval = FALSE, message=FALSE}
make_all_urban_site_plots()
```
## Continental population structure: population statistics by species
We used `polyRAD::calcPopDiff()` to calculate continental population statistics for each species.
```{r continental_stats_libs, message=FALSE}
source("R/calc_continental_stats.R")
```
```{r continental_stats_code, eval = FALSE, message=FALSE}
do_all_continental_stats()
```
```{r continental_stats_table_readlines, message=FALSE}
# CD as an example
read.csv("output/population_stats/CD_continental_stats.csv")
```
## Continental population structure: Structure software results
As described in [Step 9](#step-9---populations-with-structure), within each
species we compressed the result files for all K and reps and submitted them to
[Structure Harvester](https://taylor0.biology.ucla.edu/structureHarvester/)
to choose the optimal K using the Delta-K method
(see https://link.springer.com/article/10.1007/s12686-011-9548-7).
The results were:

```
CD: K=3
DS: K=3
EC: K=2
LS: K=3
PA: K=4
TO: K=3
```
```{r struct_k_2, message=FALSE, eval=FALSE}
# This file contains output for various K values from Structure.
readr::read_csv("output/structure/structure_k_Pr.csv")
```
## Continental population structure: Structure plots
The code below generates plots for Structure results.
```{r plot_struct_1, message=FALSE}
source("R/plot_structure.R")
```
```{r plot_struct_2, eval = FALSE, message=FALSE}
make_structure_multi_plot()
```
## Validation of Structure results with sNMF
We ran sNMF as an alternative to Structure to validate the results. We coerced all polyploid data to diploid to make the file types compatible with the `snmf()` function in R (LEA package). The `snmf()` function computes an entropy criterion that evaluates the quality of fit of the statistical model to the data using a cross-validation technique. We plotted the cross-entropy criterion for K=[2:10] for all species. Using the best K, we then selected the best of 10 runs within that K using the `which.min()` function.
```{r snmf_1, message=FALSE}
source("R/sNMF.R")
```
The following runs sNMF and generates the figure. Note that in the pdf version of this document, the figure might appear on the next page.
```{r snmf_2, message=FALSE, fig.cap = "Ancestry coefficients obtained using `snmf()`. As with the Structure analysis, horseweed and prickly lettuce appear to have the most population structure. Phoenix crabgrass, horseweed, and prickly lettuce appear unique. In general, sNMF produced larger K for most species, which will create more sensitivity to admixture.", fig.height=7}
do_all_sNMF()
```
## AMOVA
We performed hierarchical analysis of molecular variance (AMOVA) using GenoDive v. 3.0.6, based on Rho statistics, which rely on a ploidy-independent infinite allele model. AMOVA is found under the "Analysis" menu.
## Local: $F_{IS}$ - Homozygosity within population
We used [GenoDive v. 3.0.6](https://doi.org/10.1111/1755-0998.13145) to calculate $F_{IS}$. This gives a good estimate of whether there are more homozygotes than expected (positive number) or more heterozygotes than expected (negative number). Notably, GenoDive accommodates polyploids and reduces the bias on $F_{IS}$ by performing a permutation test. By default, there are 999 permutations.
This can be run in GenoDive by selecting Analysis > Hardy-Weinberg > Heterozygosity-based (Nei) method.
```{r Fis_table_readlines, message=FALSE}
head(read.csv("output/population_stats/genodive_output_Fis.csv"))
```
## Local: $\rho$ - Pairwise comparison
We used [GenoDive v. 3.0.6](https://doi.org/10.1111/1755-0998.13145) to calculate pairwise $\rho$ (rho) among cities within species. Note that there is a p-value correction for testing multiple cities (species are treated as independent, however).
This can be run in GenoDive by selecting Pairwise Differentiation from the Analysis menu and selecting the "rho" statistic from the dropdown.
We used the following script to clean up the results.
```{r rho_code, eval = FALSE, message=FALSE}
source("R/rho.R")
compile_rho_table()
```
```{r rho_table_display, echo = FALSE, message = FALSE}
table_ <-
readr::read_csv("output/population_stats/rho_all.csv")
knitr::kable(table_,
format = "simple",
caption = "Rho statistics for pairwise comparison between cities.")
```
## Local: $\bar{r}_d$ - Linkage disequilibrium
We used `poppr::ia()` to calculate the standardized index of association of loci in the dataset ($\bar{r}_d$ or `rbarD`). We use the standardized index of association to avoid the influence of different sample sizes, as described by [Agapow and Burt 2001](https://doi.org/10.1046/j.1471-8278.2000.00014.x).
When `p.rD` is small (<0.05) and `rbarD` is relatively high, that is a sign that the population could be in linkage disequilibrium.
An interesting note from the documentation:
> It has been widely used as a tool to detect clonal reproduction within populations. Populations whose members are undergoing sexual reproduction, whether it be selfing or out-crossing, will produce gametes via meiosis, and thus have a chance to shuffle alleles in the next generation. Populations whose members are undergoing clonal reproduction, however, generally do so via mitosis. This means that the most likely mechanism for a change in genotype is via mutation. The rate of mutation varies from species to species, but it is rarely sufficiently high to approximate a random shuffling of alleles. The index of association is a calculation based on the ratio of the variance of the raw number of differences between individuals and the sum of those variances over each locus. You can also think of it as the observed variance over the expected variance.
There is a nice description [here](https://grunwaldlab.github.io/Population_Genetics_in_R/Linkage_disequilibrium.html).
```{r rbarD_code, eval = FALSE, message=FALSE}
source("R/rbarD.R")
calc_rbarD()
```
```{r rbarD_table_readlines, message=FALSE}
head(read.csv("output/population_stats/rbarD.csv"))
```
## Isolation by distance
We assessed isolation by distance by comparing genetic distance to geographic distance. Specifically, we took the traditional approach of comparing a geographic dissimilarity matrix (based on latitude and longitude) to a genetic dissimilarity matrix. We calculated the genetic dissimilarity matrix with the `dist.genpop` function in the adegenet package, using the [Cavalli-Sforza](https://doi.org/10.2307/2406616) distance metric (the `method = 2` argument to `dist.genpop`).
Note that for this analysis, we treated each *sampling site* as a distinct location; there would not be enough power to build a distance matrix among only 3-5 cities. Code for generating stats and figures from the Mantel test can be found in the source code below.
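The core of the test can be sketched as below, assuming a `genpop`
object `gp` and a two-column matrix of site coordinates `xy` (both
object names assumed):

```{r ibd_sketch, eval = FALSE}
library(adegenet)  # dist.genpop
library(ade4)      # mantel.randtest
Dgen <- dist.genpop(gp, method = 2)  # Cavalli-Sforza chord distance
Dgeo <- dist(xy)                     # Euclidean distance on coordinates (a simplification)
ibd <- mantel.randtest(Dgen, Dgeo, nrepet = 9999)
ibd$pvalue
```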
```{r ibd_1, message=FALSE}
source("R/isolation_by_distance.R")
```
```{r ibd_2, eval = FALSE, message=FALSE}
extract_ibd_stats_and_plots()
```
Below are the results of the mantel test. Note that there is a p-value correction for testing multiple cities (species are treated as independent, however).
```{r ibd_3, echo = FALSE, message=FALSE}
table_ <- readr::read_csv("output/IBD/isolation-by-distance-mantel-test.csv") %>% dplyr::select(-c(Hypothesis, Reps))
knitr::kable(
table_,
format = "simple",
caption = "Statistics from running 9999 permutations ('Reps') via mantel test, limited to genomic versus distance comparisons. Hypothesis for all tests is 'greater'."
)
```
We also repeated this within city. Note that there is a p-value correction for testing multiple environmental variables and cities (species are treated as independent, however).
\tiny
```{r ibd_4, echo = FALSE, message=FALSE}
table_ <- readr::read_csv("output/IBD/isolation-by-distance-within-city-mantel-test.csv") %>% dplyr::select(-c(Hypothesis, Reps))
knitr::kable(
table_,
format = "simple",
caption = "Statistics from running 9999 permutations ('Reps') via mantel test, limited to within city for genomic versus distance comparisons. Hypothesis for all tests is 'greater'."
)
```
\normalsize
## Isolation by environment
### Environmental data
Environmental variables include the monthly averages in the middle of the day for: