
Analysis Variable #4

Open
vsoch opened this issue Mar 23, 2018 · 7 comments


vsoch commented Mar 23, 2018

hey @fbartusch, another question for you! I've created some cloud builders that can be launched to run snakemake on Google Cloud (compute), and I tested the valgrind (memory) analysis across about 16 different instance types. Since the workflow is tiny (so far), the memory doesn't seem to make a difference. What I think I'd want to do (which would be useful for HPC) is to vary some variable set by the scientist and then assess how the results are influenced. Is snakemake a bad contender for that? If so, what other things could we vary that would be useful / interesting?


vsoch commented Mar 23, 2018

Here is more detail on what I've done so far (I'm parsing the results from this now): https://github.com/sci-f/snakemake.scif/tree/add/races/results/cloud


fbartusch commented Mar 26, 2018

Hey @vsoch, you can use Snakemake to vary variables in the workflow. Actually, I don't think Snakemake is any worse than other software for that purpose. Maybe you've read this page already?
Regarding the example workflow, I think the best steps for trying different variables are the bwa_map and bcftools_call steps.
Options for bwa mem are listed here. I think the following options could have a big influence on the result:

  • -k INT | Minimum seed length [19]
  • -B INT | Mismatch penalty [4]
  • -O INT | Gap open penalty [6]

For bcftools_call, the relevant options are:

  • -c, --consensus-caller | the original samtools/bcftools calling method (conflicts with -m)
  • -p, --pval-threshold FLOAT | with -c, accept a variant if P(ref|D) < FLOAT

You could add some of these variables to the Snakemake workflow and create config files with different variable settings. Then you can specify which values to use when running Snakemake with the --configfile FILE option.
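As a rough sketch of what I mean (the key names bwa_k, bwa_B, bwa_O, and bcftools_p are made up here, and the Snakefile would have to read them, e.g. k=config["bwa_k"] in the bwa_map rule's params):

```python
import subprocess
import yaml  # requires pyyaml

# Hypothetical parameter names; the workflow has to look them up via config[...]
params = {"bwa_k": 25, "bwa_B": 4, "bwa_O": 6, "bcftools_p": 0.5}

with open("config_k25.yaml", "w") as fh:
    yaml.safe_dump(params, fh)

# Values passed via --configfile override the defaults from config.yaml
subprocess.run(["snakemake", "--configfile", "config_k25.yaml"], check=True)
```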

Since the data in this repo is just for testing purposes, I don't know if you'll see big changes in the result if you try other variables.


vsoch commented Mar 26, 2018

Okay, so reading the docs I think we want to take the following approach:

  • choose the set of variables to vary (you did this above)
  • define defaults in the config.yaml file (and I see you already have the samples here)
  • create a grid of variables and values to run (see the sketch after this list)
  • run across the same machine type
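
For the grid step, something like this minimal sketch is what I have in mind (the parameter names and value ranges are placeholders, and the config keys assume the workflow reads them as in your example above):

```python
import itertools
import yaml  # requires pyyaml

# Placeholder grid; real ranges would come from the researcher
grid = {
    "bwa_k": [15, 19, 25],     # bwa mem minimum seed length (default 19)
    "bwa_B": [2, 4, 8],        # bwa mem mismatch penalty (default 4)
    "bcftools_p": [0.1, 0.5],  # bcftools call -p threshold (with -c)
}

keys = sorted(grid)
for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
    combo = dict(zip(keys, values))
    with open(f"config_run{i:03d}.yaml", "w") as fh:
        yaml.safe_dump(combo, fh)
    # each file would then get its own `snakemake --configfile config_runNNN.yaml` run
```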

Then I assume we would want to look at the all.vcf file? Or are we still interested in memory and time? Given that we find some difference in a result or runtime metric, is our evaluation then that "the fastest" or "least memory required" is really associated with best? In other words, if we were running this grid of metrics for a researcher, what kind of advice would we give them after doing it?

Since the data in this repo is just for testing purposes, I don't know if you'll see big changes in the result if you try other variables.

Do you mean to say that you don't think doing the variation will have much influence? I think Snakemake definitely fits the bill for running the kind of comparison we want to do; the much harder part (for me at least) is deciding, in advance, what should vary and how we evaluate its goodness.


vsoch commented Mar 26, 2018

The other interesting approach (when talking about variables) would be to show how a single library or piece of software changes over time (calling the same function), or doesn't.


vsoch commented Mar 26, 2018

There are also easy ways to do this with continuous integration, e.g., using a build matrix in Travis (see this example: https://github.com/pydicom/pydicom/blob/master/.travis.yml), but there it's harder to have control over the results.


vsoch commented Mar 26, 2018

Ah, and here is an example of Travis-style matrix builds for CircleCI! https://github.com/michaelcontento/circleci-matrix

fbartusch commented

Given that we find some difference in a result or runtime metric, is our evaluation then that "the fastest" or "least memory required" is really associated with best?

No. You want to get meaningful results for your scientific problem; runtime and memory consumption are secondary. The choice of parameters is very situation-dependent and up to the researcher. Time and memory consumption are interesting if you compare two algorithms with comparable input parameters.

Do you mean to say that you don't think doing the variation will have much influence?

I think it will influence the number of variants found. I just don't know how to interpret the changes, since I'm not an expert in this domain.
I tried the vcf-stats tool. It creates simple statistics for the .vcf file, like 'indel_count' and 'snp_count'. The parameters I mentioned above will influence the specificity, and thus the number of variants will change.
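
If it's useful, here is a crude pure-Python stand-in for that kind of count (the calls/all.vcf path is an assumption about where the workflow writes its output, and it only handles an uncompressed VCF):

```python
# Counts SNPs vs. everything else (indels/MNPs) from the REF/ALT columns,
# roughly what vcf-stats reports as snp_count and indel_count.
def count_variants(path):
    snps = others = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            ref, alts = fields[3], fields[4].split(",")
            if all(len(ref) == 1 and len(a) == 1 for a in alts):
                snps += 1
            else:
                others += 1
    return snps, others

print(count_variants("calls/all.vcf"))
```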

The other interesting approach (when talking about variables) would be to show how a single library or piece of software changes over time (calling the same function), or doesn't.

That is really an interesting idea. I don't know if there are good studies about that for popular software.

There are also easy ways to do this with continuous integration

I've never used continuous integration, but I'll keep the CircleCI thing in mind. It looks very convenient.
