Unit tests! #12

Open
ArtPoon opened this issue Oct 19, 2021 · 6 comments
Labels
help wanted Extra attention is needed

Comments

@ArtPoon
Contributor

ArtPoon commented Oct 19, 2021

We need 'em

@DBecker7
Collaborator

Some unit tests/synthetic data started in 9eeaf51

@ArtPoon
Contributor Author

ArtPoon commented Oct 20, 2021

Eventually I think we'll need to simulate data for a comparison of pipelines against some ground truth

@DBecker7
Collaborator

Proposal:

  1. Select from sequences.fasta according to current estimates of relative frequencies (frequency and total count are known).
    • I.e. sample names from the metadata, then use seqtk subseq to extract them from sequences.fasta (this doesn't work with sequences.fasta.xz, which must be decompressed first).
    • I've already created sequences_pangolin.fasta.xz, which contains only sequences with known pangolin lineages (data/get-sequences-with-pangolineage.sh; it takes over an hour to run on Rei because of the decompression and recompression).
  2. Extract amplicon regions from sampled fastas, output to a fastq with simulated Phred scores (or just fasta if the scores are not used).
    • Simulate coverage from our data.
    • At this step, we intentionally lose information about linkage across amplicons.
  3. Randomly sample from this file (without replacement) to simulate degradation / incomplete sampling.
    • The assumed level of degradation will probably be arbitrary, but it can help us demonstrate the effect of degradation on case count estimation.
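
The three steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the project's code. The function names, the toy lineage frequencies, the amplicon coordinates and the per-amplicon depths are all hypothetical. Genomes are held in a plain dict rather than extracted with seqtk, and coverage is reduced to a fixed number of read copies per amplicon.

```python
import random


def sample_names(lineage_freqs, names_by_lineage, n, rng):
    """Step 1: draw n sequence names, picking each lineage by its relative frequency."""
    lineages = list(lineage_freqs)
    weights = [lineage_freqs[lin] for lin in lineages]
    picks = rng.choices(lineages, weights=weights, k=n)
    return [rng.choice(names_by_lineage[lin]) for lin in picks]


def extract_amplicons(genome, amplicons, depths):
    """Step 2: cut each amplicon out of a genome, repeated `depth` times.

    The fragments carry no record of which genome they came from,
    so linkage across amplicons is lost on purpose.
    """
    reads = []
    for (start, end), depth in zip(amplicons, depths):
        reads.extend([genome[start:end]] * depth)
    return reads


def degrade(reads, keep_fraction, rng):
    """Step 3: keep a random subset of reads, sampled without replacement."""
    k = round(len(reads) * keep_fraction)
    return rng.sample(reads, k)


# Toy inputs (all hypothetical, for illustration only).
rng = random.Random(1)
freqs = {"lineageX": 0.7, "lineageY": 0.3}
names = {"lineageX": ["seqA", "seqB"], "lineageY": ["seqC"]}
genomes = {"seqA": "ACGTACGTAC", "seqB": "ACGTTCGTAC", "seqC": "ACGAACGTAC"}
amplicons = [(0, 4), (3, 8), (6, 10)]  # half-open (start, end) coordinates
depths = [2, 1, 3]                     # simulated read copies per amplicon

sampled = sample_names(freqs, names, 5, rng)
reads = [r for name in sampled for r in extract_amplicons(genomes[name], amplicons, depths)]
kept = degrade(reads, 0.5, rng)
```

Because the lineage mix that goes in is known, the output of any pipeline run on `kept` can be compared against `freqs` as ground truth.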

A computationally faster method might be to calculate all mutations from sequences_pangolin (encode_diffs) ahead of time, sample amplicon regions, simulate coverage, then reconstruct the sequence within coverage regions from the reference.
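
A minimal sketch of that idea, assuming substitutions only and sequences already aligned to the reference at equal length. The storage format here (a dict of position to base) is an assumption for illustration and is not claimed to match what encode_diffs produces. Both function names are hypothetical.

```python
def diffs_from_reference(reference, genome):
    """Precompute substitutions as {position: base}.

    Assumes `genome` is aligned to `reference` and has the same length;
    insertions and deletions are ignored in this sketch.
    """
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def reconstruct_region(reference, diffs, start, end):
    """Rebuild genome[start:end] from the reference plus the stored substitutions."""
    bases = list(reference[start:end])
    for pos, base in diffs.items():
        if start <= pos < end:
            bases[pos - start] = base
    return "".join(bases)


# Toy data (hypothetical).
reference = "ACGTACGTAC"
genome = "ACGTTCGAAC"
diffs = diffs_from_reference(reference, genome)

# Only the covered region is rebuilt; the full genome is never materialized.
fragment = reconstruct_region(reference, diffs, 3, 8)
```

The diffs are computed once per sequence, so repeated sampling only touches the reference and a small dict instead of re-reading the full FASTA.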

@ArtPoon ArtPoon added the help wanted Extra attention is needed label Nov 19, 2021
@GopiGugan
Collaborator

Unit test needed for #46

@ArtPoon
Contributor Author

ArtPoon commented Jul 25, 2023

Please focus on minimap2.py and estimate-freqs.R @SandeepThokala thanks

@GopiGugan
Collaborator

@SandeepThokala to post coverage of unit tests


4 participants