Estimation of recent effective population size in pink salmon

Population Genetics 2019, University of Copenhagen

Original author: Ryan Waples ([email protected]); edited by Fernando Racimo

Program

Download pink salmon data, apply filters. (Plink)
Visualize the effect of data filters with a PCA. (Plink and R)
Explore patterns of population structure pink salmon. (Plink and R)
Estimate pairwise linkage disequilibrium (LD) between all SNPs in each population. (Plink)
Use the LD estimates to estimate effective population size (N_e) in each population. (R)
Compare estimates of the census (N_c) and effective (N_e) population sizes.

Learning outcomes

Get confortable with using PLINK for common filtering procedures.
Get confortable with using R for data analysis and plotting.
Consider the impact that filtering and data quality has on different analyses.
Understand the relationship between LD and N_e.
Understand the how N_e and N_c vary in natural populations.

Background reading (Nielsen and Slatkin)

Wright-Fisher Model: p. 22-27
Effective population size: p. 43-46
Linkage Disequilibrium: p. 108-112, including boxes 6.1-6.3

Pink salmon

Pink salmon in the Pacific have an obligate 2 year life-cycle; they live to be 2 years old, reproduce, then die. This results in two reproductively isolated lineages, in the odd and even years. "Pink salmon, a highly abundant and widely ranging salmonid, provide a naturally occurring opportunity to study the effects of similar environments on divergent genetic backgrounds due to a strict two-year semelparous life history. The species is composed of two reproductively isolated lineages with overlapping ranges that share the same spawning and rearing environments in alternate years." (Seeb et al 2014)

We have samples from adult fish from six pink salmon populations at three different sites. At each site we have samples from both the odd- and even-year lineage.

Collection sites, north to south

Nome River, Norten Sound, Alaska, USA
- Nome, Alaska is the end of the Iditarod dog sled race
Koppen Creek, Prince William Sound, Alaska, USA
- in southeast Alaska
Snohomish River, Puget Slound, Washington state, USA
- Near Seattle, WA

Rough estimates of the census population sizes (N_c).

Lineage	Population	N_c	N_e	Ratio
Odd-year	Nome R.	~300K (source)	?	?
Odd-year	Koppen Cr.	200K (metapopulation) (source)	?	?
Odd-year	Puget S.	~1.4M (source)	?	?

Even-year	Nome R.	~10K (source)	?	?
Even-year	Koppen Cr.	200K (metapopulation) (source)	?	?
Even-year	Puget S.	4K (source)	?	?

How to use this document.

You are reading README.md, a markdown document that decribes the exercise.

About the *.ipynb files. These are Jupyter notebook files that help organize and communicate the analyses in this exercise. You can view these (non-interactively) on Github.

Sub-directories

./data - raw data, this will be provided
./scripts - analysis files
- *.sh files contain code meant to be run in the terminal
- *.r files contain code meant to be run in R
./work - intermediate data files and results
./plots - figures and plots

Exercise

How to run this exercise. Navigate to a desired base directory and then you can execute all the analyses in this exercise with this series of commands:

We will go over each of these scripts in turn.

Getting started

Clone or download this repository (to be run in terminal from ~/exercises or a similar directory)

mkdir popgen-pink-salmon
cd popgen-pink-salmon
wget https://api.github.com/repos/FerRacimo/popgen-pink-salmon/tarball/master -O - | tar xz --strip=1

or

git clone https://github.com/FerRacimo/popgen-pink-salmon.git

or

go to the repository on Github and click Clone or download and then Download ZIP. Download and unzip the repository in the appropriate directory. Notice the name of the directory might have a 'master' suffix.

Running the analyses

Before we get started, take a look at the files in the data folder. Can you read what is inside them?

Now go back to the main directory (maybe ~/exercises/popgen-pink-salmon or somewhere else depending on where you downloaded the folder) and begin running the analysis scripts detailed below.

Filter the data

bash ./scripts/1_clean_data.sh

see: ./scripts/1_clean_data.sh to see the commands that are executed, or see ./1_clean_data.md for an anotated version that describes each line.

Perform PCA on the data before and after filtering

We'll perform a principal component analysis (PCA) on our genetic data. PCAs are a way of summarizing large multi-dimensional datasets into a few axes of variation (we'll cover them in more detail next week). In this case, we have many dimensions (thousands of loci) and the frequencies of particular alleles at each of these loci all carry some information about the average genetic relationships between the individuals we are studying. We'll try to summarize this large quantity of information into the two strongest axes of variation (principal component 1 and principal component 2), which we will then plot, so as to understand how our individuals are related to each other. Importantly, we need to make sure that the sites we use to compute our PCA are actually polymorphisms (variable sites) that do not have large amounts of sequencing errors (so that we can be confident about whether an allele is absent or present in an individual).

bash ./scripts/2_do_PCA.sh

see: ./scripts/2_do_PCA.sh to see just the commands that are executed, or see ./2_do_PCA.md for an anotated version that describes each line.

Rscript ./scripts/3_plot_PCA.r

see: ./scripts/3_plot_PCA.r to see just the commands that are executed, or see ./3_plot_PCA.ipynb for an anotated version that describes each line.

This will create two PCA plots. These plots show the first two PC axes from the inital and post-filtering data sets of the six populations.

./plots/PCA.pink_salmon.clean.png
./plots/PCA.pink_salmon.inital.png

To see the first file, for example, you can type:

display ./plots/PCA.pink_salmon.clean.png

Calculate LD

bash ./scripts/4_calculate_LD.sh

see: ./scripts/4_calculate_LD.sh to see the commands that are executed, or see ./4_calculate_LD.md for an anotated version that describes each line.

Estimate Ne

Rscript ./scripts/5_estimate_Ne.r

see: ./scripts/5_estimate_Ne.r to see just the commands that are executed, or see ./5_estimate_Ne.ipynb for an anotated version that describes each line.

Becuase it can be very computationally intensive, I have omitted calculating confidence intervals for the Ne estimates, with a bootstrap or jackknife procedure. There is still research into the best was to provide accurate confidence intervals with the LD method of estimating Ne. See this paper and also this paper for some discussion of this issue.

Plot the Ne and Nc estimates

Rscript ./scripts/6_plot_Ne_Nc.r

see: ./scripts/6_plot_Ne_Nc.r to see just the commands that are executed, or see ./6_plot_Ne_Nc.ipynb for an anotated version that describes each line.

This will create four plots looking at Ne and Nc in the six pink salmon populations.

Each plot can be viewed with

display [path_to_image]

Barplot of the population-specific effective population size estimates

./plots/Ne_estimates.png

Barplot of the population-specific effective and census population size estimates

You can see how large some populations are in absolute number.

./plots/Ne_and_Nc_estimates.png

A log-scaled version of the above plot.

./plots/Ne_and_Nc_estimates_log-scaled.png

Looking at the Ne/Nc ratios

./plots/Ne-Nc_ratios.png

And also a heatmap of the r^2 matrix used in the Ne estimate

In these plots yellow is low LD and orange is high LD. You can see the raw r2 values in your ./work/Puget_EVEN.ld and similar files.

./plots/LD_Koppen_EVEN.png
./plots/LD_Koppen_ODD.png
./plots/LD_Nome_EVEN.png
./plots/LD_Nome_ODD.png
./plots/LD_Puget_EVEN.png
./plots/LD_Puget_ODD.png

All together

You can run ./do_everything run the entire analysis:

bash ./do_everything

Questions

1_clean_data

Check the text produced by running plink on your files. What are the sample sizes and number of loci used in the analysis of each population, why do they differ?
Check the text produced by running plink on your files. What are the reasons that some variants are removed in this filtering step? Why do you think it’s important to remove them? See here for an explanation of all the filters implemented in plink: https://www.cog-genomics.org/plink/1.9/filter
Why is it important to separate each population before calculating LD?

2_do_PCA & 3_plot_PCA

What is shown in the first few axes of the PCA projection? What does each dot represent?
Describe the differences between the two PCAs (before and after filtering).
- How are they different?
- How are they similar?
Why do you think there is a Puget_EVEN individual that is projected near the the Koppen_EVEN individuals?
- Give a possible biological explanation
- Give a possible laboratory explanation
Here we analyzed all six populations together. Would it have been useful to perform PCA on the data from each population separately? What would that reveal?

4_calculate_r2

What does the r² statistic measure? How is r² related to D?
How would our estimates of LD have changed if we did not exclude locus pairs on the same chromosome?
How is our estimate of LD affected by sample size?

5_estimate_Ne & 6_plot_Ne

Which lineage of pink salmon has higher N_e in the north, south, and middle of the range?
Based on your estimates of effective population size, which population do you expect to have be most affected by genetic drift? Which ones do you expect to be the least affected?
What more would you want to research about the pink salmon populations in order to understand the Ne/Nc ratios?

Perspectives

What is the difference between a population's N_e and N_c. Why are both important when seeking to underestand population dynamics?
Can you calculate the relative signal (due to Ne) and the noise (due to sample size?) in the mean r2 value?
How would recent migration into a population affect estimates of LD and N_e?
Given time and money how would you improve this analysis - more samples? more loci? more populations?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Estimation of recent effective population size in pink salmon

Population Genetics 2019, University of Copenhagen

Original author: Ryan Waples ([email protected]); edited by Fernando Racimo

Program

Learning outcomes

Background reading (Nielsen and Slatkin)

Pink salmon

Collection sites, north to south

How to use this document.

Sub-directories

Exercise

Getting started

Running the analyses

Filter the data

Perform PCA on the data before and after filtering

Calculate LD

Estimate Ne

Plot the Ne and Nc estimates

Barplot of the population-specific effective population size estimates

Barplot of the population-specific effective and census population size estimates

A log-scaled version of the above plot.

Looking at the Ne/Nc ratios

And also a heatmap of the r^2 matrix used in the Ne estimate

All together

Questions

1_clean_data

2_do_PCA & 3_plot_PCA

4_calculate_r2

5_estimate_Ne & 6_plot_Ne

Perspectives

Data sources

Further reading

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
data		data
plots		plots
scripts		scripts
work		work
.gitignore		.gitignore
0_get_data.md		0_get_data.md
1_clean_data.md		1_clean_data.md
2_do_PCA.md		2_do_PCA.md
3_plot_PCA.ipynb		3_plot_PCA.ipynb
4_calculate_LD.md		4_calculate_LD.md
5_estimate_Ne.ipynb		5_estimate_Ne.ipynb
6_plot_Ne_Nc.ipynb		6_plot_Ne_Nc.ipynb
README.md		README.md
do_everything		do_everything

YuejiaoHuang/popgen-pink-salmon

Folders and files

Latest commit

History

Repository files navigation

Estimation of recent effective population size in pink salmon

Population Genetics 2019, University of Copenhagen

Original author: Ryan Waples ([email protected]); edited by Fernando Racimo

Program

Learning outcomes

Background reading (Nielsen and Slatkin)

Pink salmon

Collection sites, north to south

How to use this document.

Sub-directories

Exercise

Getting started

Running the analyses

Filter the data

Perform PCA on the data before and after filtering

Calculate LD

Estimate Ne

Plot the Ne and Nc estimates

Barplot of the population-specific effective population size estimates

Barplot of the population-specific effective and census population size estimates

A log-scaled version of the above plot.

Looking at the Ne/Nc ratios

And also a heatmap of the r^2 matrix used in the Ne estimate

All together

Questions

1_clean_data

2_do_PCA & 3_plot_PCA

4_calculate_r2

5_estimate_Ne & 6_plot_Ne

Perspectives

Data sources

Further reading

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages