Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record type inference #268

Closed
cjprybol opened this issue Feb 25, 2023 · 3 comments · Fixed by #292 · May be fixed by #269
Closed

Record type inference #268

cjprybol opened this issue Feb 25, 2023 · 3 comments · Fixed by #292 · May be fixed by #269
Labels
feature request A request for some desired or missing functionality

Comments

@cjprybol
Copy link
Contributor

Is there anything available for inferring the FASTA record type from the sequence?

In earlier versions of FASTX I think this was done by default, and all of the records read in by Readers were returned as variants of LongSequence rather than strings. Now the same functionality is available optionally if you specify the return type when calling FASTX.sequence({desired_return_type}, record)

What I'm looking for is something along the lines of

FASTX.sequence(BioSequences.LongSequence, record) -> Union{LongDNA, LongRNA, LongAA}

#or BioSequences would be a better home for the inference
BioSequences.LongSequence(FASTX.sequence(record)) -> Union{LongDNA, LongRNA, LongAA}

Expected Behavior

Ambiguous interpretations lead to errors

julia> x = FASTX.FASTA.Record("identifier" , "AAAA")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAA"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
4nt DNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
4nt RNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongAA, x)
4aa Amino Acid Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongSequence, x)
Error -> ambiguous sequence type

unambiguous interpretations lead to auto-inferred sequence types

julia> x = FASTX.FASTA.Record("identifier" , "AAAAJ")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAAJ"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.DNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[21]:1

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.RNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.RNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[22]:1

julia> FASTX.sequence(BioSequences.LongAA, x)
5aa Amino Acid Sequence:
AAAAJ

julia> FASTX.sequence(BioSequences.LongSequence, x) #can only be an AA seq, per the available alphabets
5aa Amino Acid Sequence:
AAAAJ

Current Behavior

Can't use a generic LongSequence for any record

julia> FASTX.sequence(BioSequences.LongSequence, x)
ERROR: MethodError: no method matching BioSequences.LongSequence(::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] sequence(::Type{BioSequences.LongSequence}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [2] top-level scope
   @ REPL[19]:1

julia> BioSequences.LongSequence(FASTX.sequence(x))
ERROR: MethodError: no method matching BioSequences.LongSequence(::StringViews.StringView{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[26]:1

Context

In addition to validating whether a FASTA is valid https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta it would be useful to have functionality to auto-infer the type of records in the FASTA

I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like

DNA{2} -> RNA{2} -> DNA{4} -> RNA{4} -> AA

The AA alphabet (letter codes, not molecules) seems to be a superset of DNA/RNA alphabet, and the T/U difference I think is enough to differentiate between DNA/RNA

link to codes https://www.ddbj.nig.ac.jp/ddbj/code-e.html

@cjprybol
Copy link
Contributor Author

@jakobnissen jakobnissen transferred this issue from BioJulia/FASTX.jl Feb 27, 2023
@jakobnissen
Copy link
Member

This is a good idea, and I think we should have this feature, with a few caveats.

First, IMO, it belongs in BioSequences, not FASTX (hence, I transferred the issue to this repository). The reason is principal: The biological sequence type is not a feature of the FASTA format, in which the sequences really are just text that can contain anything - indeed, they often contain non-standard symbols. This is also the motivation why FASTA.sequence returns a string in FASTX v2, where it returned a BioSequence in v1 - all records really contain are strings, and interpreting them to a specific biological alphabet is a distinct process, which is done by BioSequences.

The second caveat is that autodetection of sequence type is bound to be both flaky and inefficient, no matter how we do it. That's actually why we removed autodetection for v2 (see BioJulia/FASTX.jl#59). As one of the goals of BioJulia more broadly is to allow people to use robust software, we should be wary of adding flaky functions that users might accidentally rely on, and as a result, produce unreliable software.

That doesn't mean we can't have it, but it should just be named something like guess_parse or guess_seq, which makes it clear that that's all it's doing. I think it's worth having, for convenient REPL work where it's fine if it's a little unreliable and slow. I agree in that case, we should just check each of the predefined alphabets in order, and error if it's ambiguous between RNA and DNA.

We might also want to remove the method for kmers you linked to, before Kmers.jl is released, for the same reasons.

@jakobnissen
Copy link
Member

A few ideas for implementing this:

  • Using tryparse (see Add tryparse(::BioSequence, s::AbstractString) or similar #224), simply trying each alphabet in order
  • Make a length NTuple{256, UInt8} lookup table (can also fit in a length-128 table) where each byte encodes four bits: If it contains valid DNA, valid RNA, IUPAC ambiguous nucleotides, or valid AA. Then we simply check each byte and AND the bitmask together before making the guess. If no bits are set, we error. If only AA bit is set, we pick amino acid alphabet. If both RNA and DNA alphabets is set, we error. Else we pick DNA/RNA, and 2/4 bits based on the ambiguous nucleotide bit.

jakobnissen added a commit to jakobnissen/BioSequences.jl that referenced this issue Feb 27, 2023
This function is a quick-and-dirty parser function from `AbstractString` to
`LongSequence`, with autodetection of the alphabet.
It's meant to be used in ephemeral REPL work, and very clearly documented to
be unstable and subject to change.

See BioJulia#268
@jakobnissen jakobnissen linked a pull request Feb 27, 2023 that will close this issue
5 tasks
@jakobnissen jakobnissen added the feature request A request for some desired or missing functionality label Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request A request for some desired or missing functionality
Projects
None yet
2 participants