annotation using list of gene IDs #31

Closed
6 of 8 tasks
fmalmeida opened this issue Nov 1, 2021 · 5 comments · Fixed by #51
Labels
enhancement New feature or request

Comments

@fmalmeida
Owner

fmalmeida commented Nov 1, 2021

Add a new module which interprets a configuration provided by the user in order to annotate the input, using a list of desired genes from a reference. Ideally it would:

  • Download the sequences selected by the user
  • Format these sequences to be used by blastp and generate reports properly
  • Use this FASTA subset to annotate the sample and create various reports, such as:
    • A table of the annotation: genes found, % ID, % coverage, alignment length, coordinates, etc.
    • Does this annotation intersect with any gene detected by Prokka? Do the two annotations differ? Create a table comparing them.
    • Generate the final HTML report for this custom annotation
  • Integrate this option with the already implemented custom database analysis that uses the user's pre-formatted FASTAs (--custom_db), so that the module either downloads sequences from NCBI and formats them for custom annotation or uses the user's pre-formatted database (in FASTA).
    • Does it automatically detect whether the input is protein or nucleotide?
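
As a rough illustration of the annotation table described above, here is a minimal sketch that turns BLAST tabular output into rows with % identity, % coverage, alignment length and coordinates. The column order, the `Hit` class and the default thresholds are assumptions made for this example (they presume a custom `-outfmt "6 qseqid sseqid pident length qstart qend sstart send slen"`), not what the pipeline actually uses.

```python
from dataclasses import dataclass

# Assumed column order; it must match the -outfmt string used when running BLAST.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "qstart", "qend", "sstart", "send", "slen"]


@dataclass
class Hit:
    query: str               # sequence from the sample (assumption)
    subject: str             # gene from the custom database (assumption)
    pct_identity: float
    aln_length: int
    qstart: int
    qend: int
    subject_coverage: float  # % of the database gene covered by the alignment


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated BLAST line into a Hit."""
    fields = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    sstart, send = int(fields["sstart"]), int(fields["send"])
    covered = abs(send - sstart) + 1  # abs() tolerates reverse-strand hits
    return Hit(
        query=fields["qseqid"],
        subject=fields["sseqid"],
        pct_identity=float(fields["pident"]),
        aln_length=int(fields["length"]),
        qstart=int(fields["qstart"]),
        qend=int(fields["qend"]),
        subject_coverage=100.0 * covered / int(fields["slen"]),
    )


def filter_hits(lines, min_identity=80.0, min_coverage=80.0):
    """Keep hits that pass both (hypothetical) thresholds."""
    hits = (parse_hit(l) for l in lines if l.strip() and not l.startswith("#"))
    return [h for h in hits
            if h.pct_identity >= min_identity
            and h.subject_coverage >= min_coverage]
```

Coverage is computed against the subject length here, so a short partial hit to a long reference gene is filtered out even when its identity is high.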

Anything else?


@fmalmeida
Owner Author

I have already created a script that downloads the GenBank records of genes from the NCBI Protein database, given a list of IDs, and then converts them to a well-formatted FASTA database to be used by the pipeline as a custom protein database.
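
For readers who want a feel for the download step, the sketch below builds NCBI E-utilities `efetch` requests for a list of protein accessions. It is not the script mentioned above: the function names, the batch size and the choice of `rettype="gp"` (GenPept flat file) are assumptions for illustration, and the conversion to FASTA is left out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_ids(text):
    """Return unique accessions, one per line, skipping blanks and '#' comments."""
    seen = {}
    for raw in text.splitlines():
        acc = raw.strip()
        if acc and not acc.startswith("#"):
            seen.setdefault(acc, None)  # dict keeps first-seen order
    return list(seen)


def batches(ids, size=100):
    """Yield the ID list in chunks so each request stays reasonably small."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def efetch_url(ids, rettype="gp"):
    """Build an efetch URL for the NCBI Protein database."""
    params = {"db": "protein", "id": ",".join(ids),
              "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def download(ids, rettype="gp"):
    """Fetch all records as text (requires network access)."""
    chunks = []
    for batch in batches(ids):
        with urlopen(efetch_url(batch, rettype)) as resp:
            chunks.append(resp.read().decode("utf-8"))
    return "".join(chunks)
```

Passing `rettype="fasta"` instead would return FASTA directly, at the cost of losing the extra annotation fields carried by the flat-file records.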

@fmalmeida
Owner Author

This issue will be handled in branch issue-31.

@fmalmeida fmalmeida linked a pull request Jan 27, 2022 that will close this issue
@fmalmeida
Owner Author

fmalmeida commented Feb 2, 2022

To decide:

  1. Should these custom annotations, whether produced with the already implemented --custom_db or with the current implementation using NCBI Protein accessions (--ncbi_proteins), be included in the final GFF?
    • If yes, how should they be added? Additional_database={filename,NCBI_Proteins};{DB_NAME}_product=.....;{DB_NAME}_description=.....?
  2. Or should they only be given as an additional result alongside the main pipeline, instead of acting as a main tool inside the main result?
    • If yes, the final report of this custom annotation would need:
      • A table containing its intersection with the main annotation. This table should show how each gene was annotated by Prokka and how it was annotated (and what the alignment looks like) using the custom database.

I am leaning towards option 2, to keep things more standardized and easier to maintain and track.
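
The intersection table from option 2 could be built along these lines. This is only a sketch: the `Feature` tuple, the 1-based inclusive coordinates and the `min_fraction` cutoff are assumptions for the example, and a real implementation might rely on an interval tool instead of a nested loop.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive (as in GFF)
    end: int
    label: str   # product name or database gene name


def overlap(a: Feature, b: Feature) -> int:
    """Number of bases shared by two features on the same contig."""
    if a.contig != b.contig:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def intersect(prokka, custom, min_fraction=0.5):
    """Pair each custom hit with the Prokka genes that cover enough of it.

    Returns rows of (contig, prokka_label, custom_label, shared_bases).
    """
    rows = []
    for hit in custom:
        hit_len = hit.end - hit.start + 1
        for gene in prokka:
            shared = overlap(gene, hit)
            if shared and shared / hit_len >= min_fraction:
                rows.append((hit.contig, gene.label, hit.label, shared))
    return rows
```

Each row puts the Prokka annotation next to the custom one, which makes it easy to spot genes where the two disagree.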

However, one thing has been observed: the parameter --ncbi_proteins loads a protein FASTA, while --custom_db expects a nucleotide FASTA. This may cause confusion. To make things cleaner, it would be best if --custom_db accepted either a protein or a nucleotide FASTA, automatically detecting the input type and selecting between BLASTn and BLASTp.
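
One simple way to approach this automatic detection is a residue-composition heuristic, sketched below. The 90% threshold, the sample size and the function names are assumptions for illustration; the heuristic can misclassify very short or unusual sequences, so it is not a definitive method.

```python
# Letters treated as nucleotide codes (assumption: U and N are included).
NUCLEOTIDE = set("ACGTUN")


def guess_seq_type(fasta_text, threshold=0.9, sample=10_000):
    """Return 'nucl' if most sampled residues are nucleotide codes, else 'prot'."""
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")  # skip headers and blank lines
    ).upper()[:sample]
    letters = [c for c in residues if c.isalpha()]
    if not letters:
        raise ValueError("no sequence data found")
    fraction = sum(c in NUCLEOTIDE for c in letters) / len(letters)
    return "nucl" if fraction >= threshold else "prot"


def choose_blast(fasta_text):
    """Pick the BLAST program matching the detected database type."""
    return "blastn" if guess_seq_type(fasta_text) == "nucl" else "blastp"
```

For example, `choose_blast(">g\nATGCGTACGTTAGC\n")` selects `blastn`, while a header followed by `MKTLLVAGGSE` selects `blastp`.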


The main tasks that must be accomplished before the issue is done will always be kept in the first comment (and updated there when needed).

@fmalmeida fmalmeida pinned this issue Feb 3, 2022
@fmalmeida
Owner Author

Almost ready!

Now I must work on the report file and the generation of the intersection table.

@fmalmeida
Owner Author

While working on a good way to bring the modules together into a single module that automatically understands its inputs, we saw that it would require small changes in how the databases are formatted and downloaded, which would in turn require changes in the Docker images.

Therefore, instead of continuing in a separate branch and trying to make this available as soon as possible, this feature will be implemented together with issue #36 in branch remodeling. This allows everything to be customized in a single pass and avoids creating a new Docker image that would change drastically between two releases.

Thus, the branch issue-31 is now kept only as a backup from which to copy the code already developed for this issue; its development will continue in #44.

@fmalmeida fmalmeida linked a pull request Feb 14, 2022 that will close this issue
@fmalmeida fmalmeida linked a pull request Mar 28, 2022 that will close this issue