annotation using list of gene IDs #31

fmalmeida · 2021-11-01T21:13:29Z

Add a new module which interprets a configuration provided by the user in order to annotate the input, using a list of desired genes from a reference. Ideally it would:

Download the sequences selected by user
Format these sequences to be used by blastp and generate reports properly
Use this FASTA subset to annotate the sample and create various reports, such as:
- A table of the annotation: genes found, % identity, % coverage, alignment length, coordinates, etc.
- Does this annotation intersect with any gene detected by Prokka? Do the two annotations differ? Create a table comparing them.
- Generate the final HTML report for this custom annotation
Integrate this option with the already implemented custom database analysis that uses the user's pre-formatted FASTAs (--custom_db), so that the module either downloads sequences from NCBI and formats them for custom annotation or uses the user's pre-formatted database (in FASTA).
- Does it automatically distinguish between prot and nucl?
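
The annotation table described in the list above could be derived from BLAST tabular output. The following is only a minimal sketch, assuming the search was run with `-outfmt "6 qseqid sseqid pident length qstart qend sstart send slen"` and that coverage is measured against the reference gene (the subject). The column choice, the function name `parse_hits` and the identifiers in the example row are illustrative assumptions, not the pipeline's actual code.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string named in the lead-in.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "qstart", "qend", "sstart", "send", "slen"]


def parse_hits(handle):
    """Yield one dict per BLAST hit, adding subject coverage as a percentage."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in ("length", "qstart", "qend", "sstart", "send", "slen"):
            hit[key] = int(hit[key])
        hit["pident"] = float(hit["pident"])
        # Coverage of the reference gene: aligned subject span over its length.
        span = abs(hit["send"] - hit["sstart"]) + 1
        hit["coverage"] = round(100.0 * span / hit["slen"], 2)
        yield hit


# Made-up example row: query contig, reference gene, then the numeric columns.
example = "contig_1\trefgene_A\t98.50\t200\t10\t209\t1\t200\t250\n"
hits = list(parse_hits(io.StringIO(example)))
```

Each resulting dict already holds the fields listed for the table (gene, % identity, % coverage, alignment length, coordinates), so it could be written out directly as one row.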

Anything else?

fmalmeida · 2022-01-26T16:08:30Z

Already created a script that downloads the GenBank records of genes from the NCBI Protein database given a list of IDs and then converts this gbk into a well-formatted FASTA database to be used by the pipeline as a custom protein database.
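
The script itself is not shown in this thread. As a rough illustration of the idea only, the sketch below builds an NCBI E-utilities `efetch` URL for a list of protein accessions and converts GenPept flat-file text into FASTA. The function names, the header layout and the toy record are assumptions made for this example; a library such as Biopython could replace the hand-written parser.

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL requesting GenPept flat files for the accessions."""
    query = urllib.parse.urlencode({
        "db": "protein", "id": ",".join(accessions),
        "rettype": "gp", "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def download_genpept(accessions):
    """Fetch the records from NCBI (needs network access; not called here)."""
    with urllib.request.urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


def genpept_to_fasta(text):
    """Convert GenPept text into FASTA with '>version definition' headers."""
    records = []
    accession, definition, seq, section = "", [], [], None
    for line in text.splitlines():
        if line.startswith("//"):  # end of one record
            if accession and seq:
                header = f">{accession} {' '.join(definition)}".rstrip()
                records.append(header + "\n" + "".join(seq).upper())
            accession, definition, seq, section = "", [], [], None
        elif line.startswith("DEFINITION"):
            section = "definition"
            definition.append(line[12:].strip())
        elif line.startswith("VERSION"):
            section = None
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else ""
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif section == "origin":
            # Sequence lines look like '        1 mktayiakqr qi'; drop the counter.
            seq.extend(part for part in line.split() if part.isalpha())
        elif section == "definition" and line.startswith(" "):
            definition.append(line.strip())  # wrapped DEFINITION line
        else:
            section = None
    return "\n".join(records) + ("\n" if records else "")


# Toy record with a made-up accession, used only to exercise the parser.
sample = """LOCUS       TEST_0001                 12 aa            linear   BCT 01-JAN-2000
DEFINITION  hypothetical example protein
            [Example organism].
ACCESSION   TEST_0001
VERSION     TEST_0001.1
ORIGIN
        1 mktayiakqr qi
//
"""
fasta = genpept_to_fasta(sample)
```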

fmalmeida · 2022-01-27T11:02:55Z

This issue will be handled in branch issue-31.

fmalmeida · 2022-02-02T11:04:13Z

To decide:

Should these custom annotations, whether from the already implemented --custom_db or from the current implementation using NCBI Protein accessions (--ncbi_proteins), be included in the final GFF?
- If yes, how should they be added? Additional_database={filename,NCBI_Proteins};{DB_NAME}_product=.....;{DB_NAME}_description=.....?
Or should they only be given as an additional result of the main pipeline, instead of participating as a main tool and appearing inside the main result?
- If yes, the final report of this custom annotation would need:
  - A table containing its intersection with the main annotation. This table should show how the gene was annotated by Prokka and how it was annotated (and what the alignment looks like) using the custom database.
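
The intersection table mentioned above comes down to finding coordinate overlaps between Prokka features and custom-database hits on the same contig. A minimal sketch, assuming 1-based inclusive coordinates; the `Feature` type, the function names and the example features are hypothetical, and an existing tool such as `bedtools intersect` could do the same job on GFF files.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    label: str  # gene ID / product, or custom database entry name


def overlap_bp(a, b):
    """Number of bases shared by two features (0 if on different contigs)."""
    if a.contig != b.contig:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def intersect(prokka, custom, min_bp=1):
    """Return (contig, prokka_label, custom_label, overlap_bp) rows."""
    rows = []
    for hit in custom:
        for gene in prokka:
            bp = overlap_bp(gene, hit)
            if bp >= min_bp:
                rows.append((hit.contig, gene.label, hit.label, bp))
    return rows


# Made-up features, for illustration only.
prokka = [
    Feature("contig_1", 100, 400, "PROKKA_00001 hypothetical protein"),
    Feature("contig_1", 900, 1200, "PROKKA_00002 beta-lactamase"),
]
custom = [Feature("contig_1", 350, 600, "refgene_A")]
table = intersect(prokka, custom)
```

The nested loop is quadratic, which is acceptable for a sketch; a real implementation over whole genomes would more likely sort features per contig or use an interval index.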

I believe I am leaning towards option 2, to keep things more standardized and easier to maintain and track.

However, one thing has been observed: the parameter --ncbi_proteins loads a protein FASTA while --custom_db expects a nucleotide FASTA. This may cause confusion. To make things cleaner, it would be best if --custom_db accepted either a protein or a nucleotide FASTA, automatically detecting the input type and selecting between BLASTn and BLASTp.
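
One possible way to detect the input type is a composition heuristic: if nearly all residues are nucleotide letters, treat the FASTA as nucleotide, otherwise as protein. This is a sketch under that assumption, not the pipeline's implementation; the function names, the 0.9 threshold and the residue cap are arbitrary illustrative choices.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_seq_type(fasta_text, threshold=0.9, max_residues=10000):
    """Return 'nucl' or 'prot' based on the fraction of nucleotide letters."""
    counted = nucl = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header lines carry no sequence
        for ch in line.strip().upper():
            if ch.isalpha():
                counted += 1
                nucl += ch in NUCLEOTIDE_LETTERS
        if counted >= max_residues:
            break  # a sample of the file is enough for the guess
    if counted == 0:
        raise ValueError("no sequence residues found")
    return "nucl" if nucl / counted >= threshold else "prot"


def pick_blast(seq_type):
    """Map the detected database type to the BLAST program named in the issue."""
    return {"nucl": "blastn", "prot": "blastp"}[seq_type]


# Toy sequences, for illustration only.
dna = ">gene1\nATGGCGTACGTTAGC\n"
protein = ">prot1\nMKTAYIAKQRQISFVKSHFSRQ\n"
```

Here `pick_blast(guess_seq_type(dna))` gives `blastn` and `pick_blast(guess_seq_type(protein))` gives `blastp`. The heuristic can misclassify very short or low-complexity protein sequences, so an explicit override parameter would still be useful.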

The main tasks that must be accomplished before the issue is done will always be kept in the first comment (which will be updated when required).

fmalmeida · 2022-02-08T10:48:54Z

Almost ready!

Now I must work on the report file and the generation of the intersection table.

fmalmeida · 2022-02-14T15:00:48Z

While working on a good way to bring the modules together into a single module that automatically understands its inputs, we saw that it would require small changes in how the databases are formatted and downloaded, which would also require changes in the Docker images.

Therefore, instead of continuing with this in its separate branch and trying to make it available as soon as possible, this feature will be implemented together with issue #36 in branch remodeling. This allows everything to be customized in a single pass, avoiding the creation of a new Docker image that would suffer drastic changes between two releases.

Thus, the branch issue-31 is now kept only as a backup from which to copy the code that has already been developed for this issue; its development will go on in #44.

fmalmeida assigned gpappasunb Nov 1, 2021

fmalmeida added the enhancement New feature or request label Nov 1, 2021

This was referenced Nov 4, 2021

possibility to set a global resfinder species value for samples in YAML #33

Closed

diminish pipeline complexity: pass sample always with YAML #35

Closed

fmalmeida linked a pull request Jan 27, 2022 that will close this issue

addition of ncbi_protein annotation module #49

Closed

fmalmeida pinned this issue Feb 3, 2022

fmalmeida added the priority label Feb 3, 2022

fmalmeida removed the priority label Feb 14, 2022

fmalmeida removed a link to a pull request Feb 14, 2022

addition of ncbi_protein annotation module #49

Closed

fmalmeida linked a pull request Feb 14, 2022 that will close this issue

Pipeline remodelling. Issues 24, 31 and 36 #44

Merged

fmalmeida removed a link to a pull request Mar 28, 2022

Pipeline remodelling. Issues 24, 31 and 36 #44

Merged

fmalmeida linked a pull request Mar 28, 2022 that will close this issue

Pipeline remodelling. Issues 24, 31 and 36 (#44) #51

Merged

fmalmeida closed this as completed in #51 Mar 30, 2022
