Short Introns flagged with <pseudo> not accepted by ENA anymore #151

Closed
schellt opened this issue Jul 26, 2021 · 8 comments
Labels
Info FYI

Comments

@schellt

schellt commented Jul 26, 2021

Dear @Juke34 and all,
first of all: thank you very much for this tool kit. This is very helpful!

To submit a de novo assembly including annotation to ENA I ran certain filtering steps with agat 0.5.1 before running EMBLmyGFF3.

In detail, I executed:
agat_sp_keep_longest_isoform.pl -f <gff> -o <agat1.gff>
agat_convert_sp_gxf2gxf.pl -g <agat1.gff> -o <agat2.gff>
agat_sp_fix_features_locations_duplicated.pl -f <agat2.gff> -o <agat3.gff>
agat_sp_flag_short_introns.pl --gff <agat3.gff> --out <agat4.gff>

Subsequently, I ran EMBLmyGFF3, validated and submitted with webin-cli-3.5.0.jar

ENA accepted the submission and released the fasta sequences of the genome assembly but not the annotation. I wrote back and forth with the help desk (Sam Holt), but he couldn't find any issues on their end. Only recently, half a year after submission, did he come back to me with this answer:

The problem with this submission is that there are several introns of too short a length to be allowed through. Introns of less than 10bp aren't allowed, and this ought to have been blocked at submission but was allowed through due to a bug.

Of course, I replied:

Actually, I ran agat_sp_flag_short_introns.pl and all introns <10 should be flagged with the attribute <pseudo>. That should actually "avoid ERROR when submitting the data to EBI. (Typical EBI error message: ********ERROR: Intron usually expected to be at least 10 nt long. Please check the accuracy)"

Is there any chance to fix this on your side?
Would you recommend flagging introns <10bp as <pseudo> in the future, or is there any other best practice?

Then Sam replied to this again:

This will not be fixable from our end, it will be necessary to arrange resubmission. In future, please simply avoid including any introns shorter than 10bp. These are almost always an artefact of an automated annotation system, rather than genuine features.

As @Juke34 wrote here:
NBISweden/EMBLmyGFF3#31
agat_sp_flag_short_introns.pl should do the job.

My questions are now:

  • Am I missing something? I didn't add introns, but all genes and linked features do have the attribute pseudo=8.
  • I could now just remove the 271/17710 genes containing small introns, but do you see any alternative?
  • Since the pseudo flagged genes were accepted by Webin-CLI validate, I don't think that removing pseudo flagged features is the intention of agat_sp_flag_short_introns.pl, correct?

I see that this is not really an issue with agat itself but I just wanted to share my experience here.

Thanks and best regards,
Tilman
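
For readers who want to locate the offending introns before submitting, the 10 bp rule quoted by the help desk can be checked directly from exon coordinates. The following is a minimal awk sketch, not part of AGAT: the function name short_introns is hypothetical, and it assumes a tab-separated GFF3 in which every exon carries exactly one Parent=<mRNA id> attribute.

```shell
# Hypothetical helper (not an AGAT script): list introns shorter than a
# minimum length, derived from the gaps between consecutive exons.
# Assumption: each exon line has a single Parent=<mRNA id> attribute.
# Usage: short_introns 10 < annotation.gff
short_introns() {
  min_len="${1:-10}"
  # Step 1: reduce exon lines to "parent <TAB> start <TAB> end".
  awk -F '\t' '
    $3 == "exon" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^Parent=/)
          print substr(attrs[i], 8) "\t" $4 "\t" $5
    }' \
  | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n \
  | awk -F '\t' -v min="$min_len" '
      # Step 2: within one parent, the gap between the previous exon end
      # and the current exon start is the intron length.
      # Overlapping exons give a negative length and are reported too.
      $1 == prev_parent {
        len = $2 - prev_end - 1
        if (len < min)
          print $1 "\t" (prev_end + 1) "\t" ($2 - 1) "\t" len
      }
      { prev_parent = $1; prev_end = $3 }'
}
```

Each output line holds the parent ID, the intron start, the intron end and the intron length, so the affected transcripts can be inspected before deciding whether to remove them.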

@Juke34
Collaborator

Juke34 commented Jul 26, 2021

Hi, thank you for using our tools and for your feedback with so much valuable information.
In the past (until recently) we succeeded in submitting annotations with introns shorter than 10bp when the gene was flagged as a pseudogene. I didn't know there was a maximum number of them allowed. And I find it really strange that Webin-CLI does not report it. I hope they will update Webin-CLI to fix that.

agat_sp_flag_short_introns.pl just adds the attribute pseudo to those genes with short introns.
What you could do is add an extra step to remove those gene models by using agat_sp_filter_feature_by_attribute_presence.pl (remove features with the pseudo attribute).

@Juke34 Juke34 added the Info FYI label Jul 26, 2021
@schellt
Author

schellt commented Jul 26, 2021

Hi,
thanks a lot for the prompt reply.

I didn't know there was a maximum number of them allowed.

Probably you misunderstood "271/17710" (edit: 196/17710). I don't think that there is a maximum number of short introns. I just wanted to point out that, in my case, it wouldn't do much harm to remove all 196 genes containing small introns from the whole annotated gene set (17710 genes).

I will now filter out all pseudo flagged entries and keep you updated.

@Juke34
Collaborator

Juke34 commented Jul 26, 2021

Sorry, I read Sam's reply too fast. So there is no threshold, but they don't allow it anymore.

@schellt
Author

schellt commented Jul 27, 2021

Hi,
here a short update.
I couldn't filter the gff file with agat_sp_filter_feature_by_attribute_presence.pl because I included some information on functional annotation as note; e.g.:
Note=Similar to Rpusd2: RNA pseudouridylate synthase domain-containing protein 2 (Mus musculus)

If I run agat_sp_filter_feature_by_attribute_presence.pl -a pseudo, all entries that match "pseudo" anywhere in the attributes are filtered out. I tried to run agat_sp_filter_feature_by_attribute_presence.pl -a pseudo=, since I observed these flags from agat_sp_flag_short_introns.pl:

pseudo=5
pseudo=6
pseudo=7
pseudo=8
pseudo=9

Do the numbers actually have any meaning?

In the end agat_sp_filter_feature_by_attribute_presence.pl -a pseudo= didn't exclude any entries. Is this a bug or a feature ;) ?

Due to this mess, I counted the number of genes incorrectly. It should be 196.

Finally, I excluded the entries now with:
awk -F '\t' '$9!~/pseudo=/' <agat4.gff> > <agat5.gff>
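
Because the pattern /pseudo=/ is a substring test on the whole attribute column, it would also match a value that happens to contain the text "pseudo=". A stricter variant can split column 9 into key=value pairs and compare the key exactly. This is only a sketch under the assumption of standard GFF3 attributes separated by ";"; the function name drop_pseudo_lines is hypothetical.

```shell
# Hypothetical helper: drop only lines that have an attribute whose KEY is
# exactly "pseudo". A value such as "Note=... pseudouridylate ..." is kept.
# Usage: drop_pseudo_lines < in.gff > out.gff
drop_pseudo_lines() {
  awk -F '\t' '
    /^#/ || NF < 9 { print; next }   # keep comments and directives as they are
    {
      flagged = 0
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        split(attrs[i], kv, "=")     # kv[1] is the attribute key
        if (kv[1] == "pseudo") flagged = 1
      }
      if (!flagged) print
    }'
}
```

Like the one-liner, this works line by line and does not follow Parent relationships, so it is only safe when every feature of a flagged record carries the attribute itself.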

It would probably be easiest to include an option in agat_sp_flag_short_introns.pl to exclude features with short introns directly.

The flat file created with EMBLmyGFF3 was successfully validated using webin-cli 3.7.0 (but that doesn't mean a lot, since validation was successful with the short introns flagged as pseudo too). Right now, I am waiting for instructions regarding re-upload.

@Juke34
Collaborator

Juke34 commented Jul 27, 2021

attribute syntax: key = value
So it is normal that the = is not taken into account. agat_sp_filter_feature_by_attribute_presence.pl will remove features that have an attribute with the key specified with the -a option. It will also remove all features related to a removed feature (e.g. remove the exon and CDS when the mRNA feature has been removed).
I think the awk command is not recommended here because the pseudo attribute is only attached to the mRNA or gene feature (not sure which one). So you would still have exons and CDS lying around that should not be submitted.
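
To illustrate the concern about leftover children, here is a rough two-pass awk sketch that removes a flagged feature together with everything below it in the ID/Parent hierarchy. It is not how AGAT is implemented; the function name drop_pseudo_records is hypothetical, and it assumes a single Parent per feature and a non-empty value on the flag (e.g. pseudo=8). It only propagates downward, so a flagged mRNA would still leave its unflagged gene behind.

```shell
# Hypothetical helper: remove features with a "pseudo" attribute and all of
# their descendants. Reads the file twice, so it needs a path, not stdin.
# Usage: drop_pseudo_records in.gff > out.gff
drop_pseudo_records() {
  awk -F '\t' '
    # Return the value of attribute "key" from column 9, or "" if absent.
    function get_attr(col9, key,    n, i, parts, eq) {
      n = split(col9, parts, ";")
      for (i = 1; i <= n; i++) {
        eq = index(parts[i], "=")
        if (eq > 0 && substr(parts[i], 1, eq - 1) == key)
          return substr(parts[i], eq + 1)
      }
      return ""
    }
    # Pass 1: record the ID -> Parent links and seed the removal set.
    NR == FNR {
      if ($0 ~ /^#/ || NF < 9) next
      id = get_attr($9, "ID"); parent = get_attr($9, "Parent")
      if (id != "" && parent != "") parent_of[id] = parent
      if (id != "" && get_attr($9, "pseudo") != "") removed[id] = 1
      next
    }
    # Start of pass 2: spread removal from parents to children until stable.
    FNR == 1 {
      do {
        changed = 0
        for (id in parent_of)
          if ((parent_of[id] in removed) && !(id in removed)) {
            removed[id] = 1; changed = 1
          }
      } while (changed)
    }
    /^#/ || NF < 9 { print; next }
    {
      id = get_attr($9, "ID"); parent = get_attr($9, "Parent")
      if (get_attr($9, "pseudo") != "") next
      if (id != "" && (id in removed)) next
      if (parent != "" && (parent in removed)) next
      print
    }' "$1" "$1"
}
```

For a real submission, agat_sp_filter_feature_by_attribute_presence.pl remains the safer choice, since it handles the relationships between features itself.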

Are you sure you did not mix up with agat_sp_filter_feature_by_attribute_value.pl that will remove according to the value of a specified attribute?

Otherwise, I guess the features with a Note containing pseudo were removed because they either have the pseudo attribute themselves or are linked to a feature that was removed because it carried the pseudo attribute.

@schellt
Author

schellt commented Jul 27, 2021

Thanks for the reply!

Do the numbers actually have any meaning?

Does the value of the pseudo key have any meaning?

agat_sp_filter_feature_by_attribute_presence.pl works as expected. Sorry, I was just confused by the total number of removed features. I mixed it up with level 1 features removed... sorry - stupid me.

agat_sp_flag_short_introns.pl (from release 0.5.1) did flag all linked features:

$ awk -F '\t' '$9~/pseudo=/{print $3}' <agat4.gff> | sort | uniq -c
   1764 CDS
   1859 exon
     90 five_prime_UTR
    196 gene
    196 mRNA
    112 three_prime_UTR

This is identical to

$ agat_sp_filter_feature_by_attribute_presence.pl --gff <agat4.gff> -a pseudo -o <agat5.gff>
[...]
4217 features removed:
196 features level1 (e.g. gene) removed
196 features level2 (e.g. mRNA) removed
3825 features level3 (e.g. exon) removed

Therefore, the awk command should have worked. But to be sure I will use the output from agat.
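
The agreement between the two tallies can be verified with plain shell arithmetic, using only the numbers printed above: the level 3 count should equal the sum of the CDS, exon and UTR lines, and the total should add the gene and mRNA lines. Treating CDS, exon and UTR features as "level 3" follows the wording of the AGAT output.

```shell
# Cross-check the awk tally against the summary printed by AGAT.
level3=$((1764 + 1859 + 90 + 112))   # CDS + exon + five_prime_UTR + three_prime_UTR
total=$((196 + 196 + level3))        # gene + mRNA + level 3 features
echo "level3=$level3 total=$total"   # prints: level3=3825 total=4217
```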

@Juke34
Collaborator

Juke34 commented Jul 27, 2021

Good, I was not sure that agat_sp_flag_short_introns.pl was flagging all the features of an affected record. So you were right: using awk was apparently fine.

Does the value of the pseudo key have any meaning?

No it is just a counter.

@Juke34 Juke34 changed the title from "Short Introns flagged with <pseudo> not accepted" to "Short Introns flagged with <pseudo> not accepted by ENA anymore" Aug 5, 2021
@Juke34
Collaborator

Juke34 commented Apr 4, 2022

It is now possible to use agat_sp_add_attribute_shortest_intron_size.pl to add the size of the shortest intron per gene and mRNA, and then use agat_sp_filter_feature_by_attribute_value.pl and choose the minimum intron size below which mRNAs are discarded.

@Juke34 Juke34 closed this as completed Apr 4, 2022