Original author: Ryan Waples ([email protected]); edited by Fernando Racimo
- Download pink salmon data, apply filters. (
Plink
) - Visualize the effect of data filters with a PCA. (
Plink
andR
) - Explore patterns of population structure pink salmon. (
Plink
andR
) - Estimate pairwise linkage disequilibrium (LD) between all SNPs in each population. (
Plink
) - Use the LD estimates to estimate effective population size (Ne) in each population. (
R
) - Compare estimates of the census (Nc) and effective (Ne) population sizes.
- Get confortable with using PLINK for common filtering procedures.
- Get confortable with using R for data analysis and plotting.
- Consider the impact that filtering and data quality has on different analyses.
- Understand the relationship between LD and Ne.
- Understand the how Ne and Nc vary in natural populations.
- Wright-Fisher Model: p. 22-27
- Effective population size: p. 43-46
- Linkage Disequilibrium: p. 108-112, including boxes 6.1-6.3
Pink salmon in the Pacific have an obligate 2 year life-cycle; they live to be 2 years old, reproduce, then die. This results in two reproductively isolated lineages, in the odd and even years. "Pink salmon, a highly abundant and widely ranging salmonid, provide a naturally occurring opportunity to study the effects of similar environments on divergent genetic backgrounds due to a strict two-year semelparous life history. The species is composed of two reproductively isolated lineages with overlapping ranges that share the same spawning and rearing environments in alternate years." (Seeb et al 2014)
We have samples from adult fish from six pink salmon populations at three different sites. At each site we have samples from both the odd- and even-year lineage.
- Nome River, Norten Sound, Alaska, USA
- Nome, Alaska is the end of the Iditarod dog sled race
- Koppen Creek, Prince William Sound, Alaska, USA
- in southeast Alaska
- Snohomish River, Puget Slound, Washington state, USA
- Near Seattle, WA
Rough estimates of the census population sizes (Nc).
Lineage | Population | Nc | Ne | Ratio |
---|---|---|---|---|
Odd-year | Nome R. | ~300K (source) | ? | ? |
Odd-year | Koppen Cr. | 200K (metapopulation) (source) | ? | ? |
Odd-year | Puget S. | ~1.4M (source) | ? | ? |
Even-year | Nome R. | ~10K (source) | ? | ? |
Even-year | Koppen Cr. | 200K (metapopulation) (source) | ? | ? |
Even-year | Puget S. | 4K (source) | ? | ? |
You are reading README.md, a markdown document that decribes the exercise.
About the *.ipynb files. These are Jupyter notebook files that help organize and communicate the analyses in this exercise. You can view these (non-interactively) on Github.
- ./data - raw data, this will be provided
- ./scripts - analysis files
- *.sh files contain code meant to be run in the terminal
- *.r files contain code meant to be run in R
- ./work - intermediate data files and results
- ./plots - figures and plots
How to run this exercise. Navigate to a desired base directory and then you can execute all the analyses in this exercise with this series of commands:
We will go over each of these scripts in turn.
- Clone or download this repository (to be run in terminal from ~/exercises or a similar directory)
mkdir popgen-pink-salmon
cd popgen-pink-salmon
wget https://api.github.com/repos/FerRacimo/popgen-pink-salmon/tarball/master -O - | tar xz --strip=1
or
git clone https://github.com/FerRacimo/popgen-pink-salmon.git
or
- go to the repository on Github and click Clone or download and then Download ZIP. Download and unzip the repository in the appropriate directory. Notice the name of the directory might have a 'master' suffix.
Before we get started, take a look at the files in the data folder. Can you read what is inside them?
Now go back to the main directory (maybe ~/exercises/popgen-pink-salmon or somewhere else depending on where you downloaded the folder) and begin running the analysis scripts detailed below.
bash ./scripts/1_clean_data.sh
see: ./scripts/1_clean_data.sh to see the commands that are executed, or see ./1_clean_data.md for an anotated version that describes each line.
We'll perform a principal component analysis (PCA) on our genetic data. PCAs are a way of summarizing large multi-dimensional datasets into a few axes of variation (we'll cover them in more detail next week). In this case, we have many dimensions (thousands of loci) and the frequencies of particular alleles at each of these loci all carry some information about the average genetic relationships between the individuals we are studying. We'll try to summarize this large quantity of information into the two strongest axes of variation (principal component 1 and principal component 2), which we will then plot, so as to understand how our individuals are related to each other. Importantly, we need to make sure that the sites we use to compute our PCA are actually polymorphisms (variable sites) that do not have large amounts of sequencing errors (so that we can be confident about whether an allele is absent or present in an individual).
bash ./scripts/2_do_PCA.sh
see: ./scripts/2_do_PCA.sh to see just the commands that are executed, or see ./2_do_PCA.md for an anotated version that describes each line.
Rscript ./scripts/3_plot_PCA.r
see: ./scripts/3_plot_PCA.r to see just the commands that are executed, or see ./3_plot_PCA.ipynb for an anotated version that describes each line.
This will create two PCA plots. These plots show the first two PC axes from the inital and post-filtering data sets of the six populations.
- ./plots/PCA.pink_salmon.clean.png
- ./plots/PCA.pink_salmon.inital.png
To see the first file, for example, you can type:
display ./plots/PCA.pink_salmon.clean.png
bash ./scripts/4_calculate_LD.sh
see: ./scripts/4_calculate_LD.sh to see the commands that are executed, or see ./4_calculate_LD.md for an anotated version that describes each line.
Rscript ./scripts/5_estimate_Ne.r
see: ./scripts/5_estimate_Ne.r to see just the commands that are executed, or see ./5_estimate_Ne.ipynb for an anotated version that describes each line.
Becuase it can be very computationally intensive, I have omitted calculating confidence intervals for the Ne estimates, with a bootstrap or jackknife procedure. There is still research into the best was to provide accurate confidence intervals with the LD method of estimating Ne. See this paper and also this paper for some discussion of this issue.
Rscript ./scripts/6_plot_Ne_Nc.r
see: ./scripts/6_plot_Ne_Nc.r to see just the commands that are executed, or see ./6_plot_Ne_Nc.ipynb for an anotated version that describes each line.
This will create four plots looking at Ne and Nc in the six pink salmon populations.
Each plot can be viewed with
display [path_to_image]
- ./plots/Ne_estimates.png
You can see how large some populations are in absolute number.
- ./plots/Ne_and_Nc_estimates.png
- ./plots/Ne_and_Nc_estimates_log-scaled.png
- ./plots/Ne-Nc_ratios.png
In these plots yellow is low LD and orange is high LD. You can see the raw r2 values in your ./work/Puget_EVEN.ld and similar files.
- ./plots/LD_Koppen_EVEN.png
- ./plots/LD_Koppen_ODD.png
- ./plots/LD_Nome_EVEN.png
- ./plots/LD_Nome_ODD.png
- ./plots/LD_Puget_EVEN.png
- ./plots/LD_Puget_ODD.png
You can run ./do_everything run the entire analysis:
bash ./do_everything
-
Check the text produced by running plink on your files. What are the sample sizes and number of loci used in the analysis of each population, why do they differ?
-
Check the text produced by running plink on your files. What are the reasons that some variants are removed in this filtering step? Why do you think it’s important to remove them? See here for an explanation of all the filters implemented in plink: https://www.cog-genomics.org/plink/1.9/filter
-
Why is it important to separate each population before calculating LD?
-
What is shown in the first few axes of the PCA projection? What does each dot represent?
-
Describe the differences between the two PCAs (before and after filtering).
- How are they different?
- How are they similar?
-
Why do you think there is a Puget_EVEN individual that is projected near the the Koppen_EVEN individuals?
- Give a possible biological explanation
- Give a possible laboratory explanation
-
Here we analyzed all six populations together. Would it have been useful to perform PCA on the data from each population separately? What would that reveal?
-
What does the r2 statistic measure? How is r2 related to D?
-
How would our estimates of LD have changed if we did not exclude locus pairs on the same chromosome?
-
How is our estimate of LD affected by sample size?
-
Which lineage of pink salmon has higher Ne in the north, south, and middle of the range?
-
Based on your estimates of effective population size, which population do you expect to have be most affected by genetic drift? Which ones do you expect to be the least affected?
-
What more would you want to research about the pink salmon populations in order to understand the Ne/Nc ratios?
-
What is the difference between a population's Ne and Nc. Why are both important when seeking to underestand population dynamics?
-
Can you calculate the relative signal (due to Ne) and the noise (due to sample size?) in the mean r2 value?
-
How would recent migration into a population affect estimates of LD and Ne?
-
Given time and money how would you improve this analysis - more samples? more loci? more populations?