Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalise simulateAlleleFreq() function to any ploidy #6

Open
tavareshugo opened this issue May 26, 2018 · 4 comments
Open

generalise simulateAlleleFreq() function to any ploidy #6

tavareshugo opened this issue May 26, 2018 · 4 comments

Comments

@tavareshugo
Copy link

In #5 I asked about the bulkSize option in runQTLseqAnalysis() if:

is it the case that this parameter should be the number of genome copies pooled, which for a diploid it would be 2x number of individuals pooled?

The answer is no, bulkSize should be the number of diploid individuals. I see this is right, because of the way the null expectation is being simulated.

I guess there are two levels to the simulation, because there are two levels of sampling:

  • First we sample individuals from the population and calculate what the alternative allele frequency is (simulateAlleleFreq() function).
  • Then we sample some reads in the locus (simulateSNPindex() function).

I hand't noticed but indeed the first level of the simulation is assuming individuals are diploid, because it samples diploid individual genotypes (c(0, 0.5, 1) with probabilities relating to the expected segregation ratios in an F2 (c(1, 2, 1)/4).

But what if one is working with higher ploidy? Then the above simulation would not work.

However, the way it is implement at the moment, I think is equivalent to sampling from a binomial with probability of the event (picking an alternative allele) being 0.5 and number of trials being equal to the number of alleles sampled (2 x number of individuals).

To illustrate with code:

# Define parameters of simulation
nind <- 100  # number of sampled individuals
ploidy <- 2  # assuming diploid
replications <- 1e6  # make a lot of replications also to show how second method is faster

# Current implementation - uses number of sampled individuals
sim1 <- replicate(replications, 
                  mean(sample(c(0, 0.5, 1), size = nind, 
                              prob = c(1, 2, 1)/4, replace = TRUE)))

# Other way - sample alternative alleles with probability 0.5 from n trials given by 
## the number of alleles, which is ploidy * nind
sim2 <- rbinom(replications, ploidy*nind, 0.5)/(ploidy*nind)

# Check they are equivalent
ggplot(data.frame(sim1, sim2)) +
  geom_density(aes(sim1)) +
  geom_density(aes(sim2), colour = "pink", linetype = "dashed")

I guess the advantage is that this is general, regardless of the ploidy (besides the bonus of being faster).

I think the RIL implementation is already general as it is, because in that case we assume the individuals are fully homozygous, in which case they are equivalent to "haploid" organisms. In any case, I think the implementation can be also be made faster, by sampling from a binomial:

# Current implementation
ril1 <- replicate(replications, 
                  mean(sample(
                    x = c(0, 1),
                    size = nind,
                    prob = c(1, 1),
                    replace = TRUE
                  )))


# Other way - null expectation is also allele frequency of 0.5, except now the number of trials
## is the number of individuals (because we can look at them as equivalent to haploid organisms)
ril2 <- rbinom(replications, nind, 0.5)/(nind)

# Check they are equivalent
ggplot(data.frame(ril1, ril2)) +
  geom_density(aes(ril1)) +
  geom_density(aes(ril2), colour = "pink", linetype = "dashed")

@bmansfeld please do check all of this, as I might be making some wrong assumption somewhere (I should also probably go and read the Tagaki paper in more detail! 😄)

@bmansfeld
Copy link
Owner

Hi Hugo,
Thanks for the detailed posts. It's been fun to collaborate and think on this with you.
Regardless of the ploidy levels. The rbinom solution is certainly faster and I used it in the SNP index simulations. I guess it eluded me for some reason in the allele freq sim. The way I went was kind of derived from the original slower code in the Tagaki et al scripts, which randomly sampled from a uniform dist and then asked if it was higher or lower than 0.5 to decide the allele.
I will probably incorporate the rbinom solution alongside the changes suggested in #5, just for the sake of speed.
In regard to polyploidy, and though I am not a polyploid expert the way you have it set up makes sense to me as is. However, I am aware that there are cases in autopolyploids that can behave in different ways such as multivalent pairing etc.
I will be in touch with Dr. Pat Edger in our dept who is our resident polyploidy expert in the coming weeks and see if there is a way to integrate a more accurate representation of the allele frequencies.
Maybe this is overkill and the method you suggested is perfect for the kinds of studies QTLseqr is for, I am just not confident enough about the polyploidy mess...
Let me know your thoughts.
Ben

@tavareshugo
Copy link
Author

Hi Ben,

I'm no polyploidy expert either, so it seems sensible to discuss with someone more familiar with polyploid genetics! I suppose the way I put it made two assumptions:

  • parents are fully homozygous
  • segregation of the n copies in the F2 is random and independent

If that's not true, then the model would be wrong, I guess. As you say, a more "realistic" model might be hard, because it probably depends on homology between chromosome copies, which will probably vary between chromosomes, individuals, varieties, species, etc...

I suppose if the function was made general, this could be explicitly mentioned in the documentation and it would be up to the user to decide. Also, if the default ploidy = 2, then a substantial number of users don't even have to worry about this. 😄

@cfljam
Copy link

cfljam commented Nov 7, 2018

Im glad you are thinking about this. We have been trying QTLSeqR in some pooled data of hybrid backcross autotetraploids segregating for a major gene where the minor allele frequency of interest is 0.25. Results look sensible (similar to Popoolation CMH but sharper!) but Im interested how default expectations might not fit our system.

@WallyL
Copy link

WallyL commented Jan 6, 2023

Has anyone ever followed up on this? I would be very interested in learning what code modifications might be employed to facilitate better default modeling when running QTLSeqrR with a tetraploid species.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants