Diminish pipeline complexity: always pass samples with YAML #35
Hi there :) Is there any specific reason why you'd like to use YAML instead of CSV/TSV for capturing sample information?
Hi! Well, the only reason was that YAML made it easier for me to implement optional parameters that may or may not be passed, without affecting the file structure. For instance, suppose a CSV expected the following columns:
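(The original example was lost in extraction; a hypothetical header in the same spirit, with `resfinder` as an optional column, might look like this:)

```csv
sample,fwd_reads,rev_reads,resfinder
sample_1,sample_1_R1.fq.gz,sample_1_R2.fq.gz,Escherichia coli
```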
If users had a sample without resfinder, they would have to set:
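(Again a reconstructed, hypothetical example: the row must keep every comma so the column count matches, even when the field is empty.)

```csv
sample_2,sample_2_R1.fq.gz,sample_2_R2.fq.gz,
```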
Making sure to respect the expected columns, while with the YAML that would not be necessary. That was the main reason. If you have an idea of how a TSV/CSV could be seamlessly allowed without being too restrictive, we could try to implement the possibility of giving either a YAML or a CSV/TSV.
Hey @fmalmeida, The reason I recommend CSV/TSV is simply the overall user experience. YAML looks clean for sure; however, formatting problems in a malformatted YAML file are really hard to capture and warn the user about, and from what I've seen, the ideal end users of the pipeline (biologists) aren't really aware of this format and its indentation-based pitfalls. On the other hand, CSV is pretty much the standard for managing samplesheets. Plus, there is a native `splitCsv` operator in Nextflow. For the case of a missing value for a field, perhaps we can introduce a placeholder value.
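A minimal sketch of that native operator on a CSV samplesheet (the `params.input` name and the column names are assumptions for illustration, carried over from the hypothetical header above):

```nextflow
// Sketch: parse a CSV samplesheet with Nextflow's native splitCsv operator.
// With header: true, each row arrives as a map keyed by the column names.
// An empty cell arrives as an empty string, which we normalise to null here.
Channel
    .fromPath(params.input)
    .splitCsv(header: true)
    .map { row -> tuple(row.sample, row.resfinder ?: null) }
    .view()
```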
Hi @abhi18av, This is a very feasible idea. In fact, the first step of the pipeline is to convert the YAML into a CSV so that it can be parsed with the `.splitCsv` function. Therefore, I guess it would not be too complicated... And actually, based on what you said, it'd be nice to accept both formats. So I think we could have something like this. Today, the first steps of the pipeline are:

1. Parse the YAML samplesheet and convert it into a CSV.
2. Load that CSV with the `.splitCsv` function.

Thus, I believe we could have a parameter that loads the CSV directly in the second step, adjusting the parsing function so that it understands both the CSV produced from the parsed YAML and a CSV given by the user, which may or may not contain some of the optional fields. What do you think?
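A rough sketch of that branching, under stated assumptions (`parse_yaml_to_csv` is a hypothetical name for the existing YAML-to-CSV step, and `params.input` is the proposed single parameter):

```nextflow
// Hypothetical sketch: accept either YAML or CSV through one parameter.
def samplesheet = file(params.input)

// YAML goes through the existing first step; a user-supplied CSV
// skips straight to the second step.
def csv_file = samplesheet.extension in ['yml', 'yaml']
    ? parse_yaml_to_csv(samplesheet)   // assumed name for the existing converter
    : samplesheet

Channel
    .fromPath(csv_file)
    .splitCsv(header: true)
```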
Agreed! This way we reuse the existing code almost entirely. Though, considering the UX, perhaps we can just have one `--input` parameter that accepts both formats. My personal preference is to really keep the pipeline parameters to a minimum. What do you think?
Sure, I agree. However, for the time being, let's discuss all the (design/tower) changes in this pipeline first, test them, and then port them over in a batch fashion.
Yes, this is ideal. I discussed it a little bit in the issue opening: I want to diminish the complexity and the number of parameters. To that end, I will actually remove the possibility of passing these "file-related" parameters (the ones captured in the samplesheet) on the command line, and let the pipeline accept them only through the samplesheet, via a single input parameter.
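To illustrate the idea (the keys here are hypothetical, not the pipeline's actual schema): the file-related settings live per sample inside the YAML, so an optional setting is dropped by simply omitting its key:

```yaml
samplesheet:
  - id: sample_1
    fwd_reads: sample_1_R1.fq.gz
    rev_reads: sample_1_R2.fq.gz
    resfinder: Escherichia coli    # optional field present for this sample
  - id: sample_2
    fwd_reads: sample_2_R1.fq.gz
    rev_reads: sample_2_R2.fq.gz
    # resfinder key simply omitted for this sample
```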
Yes, of course. 😉
This issue is somewhat related to what is discussed in other open issues and may affect them as well.
What is the idea of this issue? It is to remove complexity from the pipeline. Instead of having one workflow for single samples with the `--genome` parameter and another for multiple samples with `--in_yaml`, I want to remove the parameters related to "single sample" analysis and make the inputs always be given through the YAML samplesheet, whether for 1 or 1000 samples.

Therefore, since our focus is users with less familiarity with bioinformatics, it would also make sense to rename the parameters to make them more intuitive, as illustrated after this list:

- `--in_yaml` to `--input`
- `--outdir` to `--output`
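A hypothetical before/after invocation (the pipeline name is left as a placeholder; the flags are the ones discussed above):

```bash
# Before: two entry points depending on the number of samples
nextflow run <pipeline> --genome sample_1.fasta --outdir results/
nextflow run <pipeline> --in_yaml samplesheet.yml --outdir results/

# After: one samplesheet-driven entry point with renamed flags
nextflow run <pipeline> --input samplesheet.yml --output results/
```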