Record type inference #268
Comments
This is a good idea, and I think we should have this feature, with a few caveats.
First, IMO, it belongs in BioSequences, not FASTX (hence, I transferred the issue to this repository). The reason is one of principle: the biological sequence type is not a feature of the FASTA format, in which the sequences really are just text that can contain anything. Indeed, they often contain non-standard symbols. This is also the motivation why …
The second caveat is that autodetection of sequence type is bound to be both flaky and inefficient, no matter how we do it. That's actually why we removed autodetection for v2 (see BioJulia/FASTX.jl#59). As one of the broader goals of BioJulia is to allow people to use robust software, we should be wary of adding flaky functions that users might accidentally rely on and, as a result, produce unreliable software. That doesn't mean we can't have it, but it should just be named something like …
We might also want to remove the method for kmers you linked to, before Kmers.jl is released, for the same reasons.
A few ideas for implementing this:
This function is a quick-and-dirty parser function from `AbstractString` to `LongSequence`, with autodetection of the alphabet. It's meant to be used in ephemeral REPL work, and very clearly documented to be unstable and subject to change. See BioJulia#268
Is there anything available for inferring the FASTA record type from the sequence?
In earlier versions of FASTX, I think this was done by default, and all of the records read in by `Readers` were returned as variants of `LongSequence` rather than strings. Now the same functionality is available optionally if you specify the return type when calling `FASTX.sequence({desired_return_type}, record)`.
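For readers unfamiliar with the explicit form mentioned above, here is a minimal sketch of it. It assumes FASTX v2 and BioSequences v3 are installed. The record header `seq1` and the sequence text are made up for illustration.

```julia
using FASTX          # provides FASTA.Record and FASTX.sequence
using BioSequences   # provides LongDNA and the dna"..." literal

# An illustrative in-memory record; real records usually come from a FASTA reader.
record = FASTA.Record("seq1", "ACGTTGCA")

# The caller names the desired return type explicitly.
as_text = FASTX.sequence(String, record)      # plain String copy of the sequence
as_dna  = FASTX.sequence(LongDNA{4}, record)  # DNA sequence with a 4-bit alphabet
```

The caller has to know the alphabet in advance here, which is the step the request in this issue would like to automate.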
What I'm looking for is something along the lines of
Expected Behavior
Ambiguous interpretations lead to errors.
Unambiguous interpretations lead to auto-inferred sequence types.
Current Behavior
Can't use a generic LongSequence for any record
Context
In addition to checking whether a FASTA file is valid (https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta), it would be useful to have functionality to auto-infer the type of the records in the FASTA.
I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like
The AA alphabet (letter codes, not molecules) seems to be a superset of the DNA/RNA alphabets, and I think the T/U difference is enough to differentiate between DNA and RNA.
Link to codes: https://www.ddbj.nig.ac.jp/ddbj/code-e.html
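One possible reading of that check order, sketched in plain Julia with no package dependencies. Everything here is an assumption made for illustration: the helper name `infer_alphabet` is hypothetical and not part of FASTX.jl or BioSequences.jl, and the symbol sets are deliberately simplified (unambiguous nucleotides plus `N`, and the 20 standard amino acids plus `X` and `*`). Nucleotide alphabets are tried before amino acids, because every nucleotide letter is also a valid amino-acid code.

```julia
# Hypothetical helper for illustration only; not an existing BioJulia API.
# Simplified symbol sets: the full IUPAC ambiguity codes are left out on purpose.
const DNA_CHARS = Set("ACGTN")
const RNA_CHARS = Set("ACGUN")
const AA_CHARS  = Set("ACDEFGHIKLMNPQRSTVWYX*")

"""
    infer_alphabet(seq) -> Symbol

Return `:DNA`, `:RNA`, or `:AA`. Throw an `ArgumentError` when the text fits
no alphabet, or fits DNA and RNA equally well (no T and no U present).
"""
function infer_alphabet(seq::AbstractString)
    isempty(seq) && throw(ArgumentError("empty sequence: alphabet is ambiguous"))
    chars = Set(uppercase(seq))          # distinct symbols, case-insensitive
    fits_dna = issubset(chars, DNA_CHARS)
    fits_rna = issubset(chars, RNA_CHARS)
    if fits_dna && fits_rna
        # Only A, C, G, N present: T/U is the sole distinguishing symbol.
        throw(ArgumentError("no T or U present: cannot tell DNA from RNA"))
    elseif fits_dna
        return :DNA
    elseif fits_rna
        return :RNA
    elseif issubset(chars, AA_CHARS)
        return :AA
    else
        throw(ArgumentError("sequence matches no known alphabet"))
    end
end
```

With these sets, `infer_alphabet("ACGTTGCA")` gives `:DNA` and `infer_alphabet("MKVLAAGIW")` gives `:AA`, while `infer_alphabet("ACG")` throws because it fits DNA and RNA equally well. A short peptide made only of A, C, G and T would be classified as DNA, which is the kind of flakiness raised in the comment above.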