CenteredSequence{A} type for comparing short reads padding with AA_Gap #247
Comments
Dear @gszep Interesting idea. I'm sure I don't like shooting down reasonable and well thought out proposals from users, so let me justify why I think The issue is twofold. Most importantly, it violates the very first point of the
When we wrote this interface, we did think about whether it was a reasonable assumption. In particular, we thought of circular sequences like plasmids. However, we ended up accepting this limitation, because assuming fixed linear indices makes everything much simpler.

More fundamentally, when implementing any kind of abstract type, there must be a concept of an interface, which makes a set of assumptions that are not violated. Julia has historically been very bad at defining and enforcing these interfaces, which is part of the reason the language is so unstable. We need to do better than what we do already. You mention OffsetArrays, which has recently gained infamy due to its buggy behaviour - see https://yuri.is/not-julia/, precisely because most

The second reason is, even if 1-based indexing was not part of the BioSequence interface, it's assumptions like that that enable generic code in the first place. Right now, we can implement most methods on

Now, sometimes, generic code can be re-introduced by introducing new abstractions (similar to

And I think it would be particularly hard to come up with good abstractions for a BioSequence type which also includes

Again, I think your proposal is eminently reasonable, and I get its advantage when working with peptides presented by MHC and such. I just don't think we can actually implement it in a fashion that doesn't cause a lot of grief and pain and bugs down the road. I'll keep this open in case some other devs want to chime in.
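As a rough illustration of the point about fixed indices enabling generic code, here is a minimal Base-only sketch. The names `ShiftedSeq`, `count_gc_1based`, and `count_gc_generic` are hypothetical and invented for this example; none of them belong to BioSequences or OffsetArrays, and `Char` stands in for a biological symbol type.

```julia
# Hypothetical toy type: a sequence whose first element lives at index `start`.
# Deliberately not an AbstractVector, to keep the sketch small.
struct ShiftedSeq
    data::Vector{Char}
    start::Int
end

Base.length(s::ShiftedSeq) = length(s.data)
Base.firstindex(s::ShiftedSeq) = s.start
Base.lastindex(s::ShiftedSeq) = s.start + length(s.data) - 1
Base.getindex(s::ShiftedSeq, i::Int) = s.data[i - s.start + 1]

# Written under the assumption that valid indices are 1:length(s).
count_gc_1based(s) = count(i -> s[i] in ('G', 'C'), 1:length(s))

# Makes no assumption about where indexing starts.
count_gc_generic(s) = count(i -> s[i] in ('G', 'C'), firstindex(s):lastindex(s))

s = ShiftedSeq(collect("GATC"), -1)   # valid indices are -1:2

generic_result = count_gc_generic(s)

# The 1-based version walks past the end of the underlying data.
onebased_failed = try
    count_gc_1based(s)
    false
catch err
    err isa BoundsError
end
```

In this toy case the 1-based function fails loudly with a `BoundsError`. Code that skips bounds checks could instead read the wrong elements silently, which is the kind of bug the linked post describes.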
By the way, the unexpected behaviour of
Thank you for your thorough answer! You've pointed me to issues on correctness in the Julia ecosystem that I did not know existed 😨 I completely agree with you about the can of worms
Ah, a fellow immunologist! Hello 👋 (I don't actually do much immunology anymore, but it's close to my heart). One thing that @jakobnissen did not mention is that one of the things that's great about julia is that you can easily make a

I think the correctness issues are not really as bad as that post makes out for the ecosystem as a whole (see this thoughtful but novel-length discussion), though when you're pushing the boundaries of functionality, it's important to keep in mind. Extensive testing is key!
I have a current need for this functionality. @jakobnissen, @kescobo, and @SabrinaJaye, I wonder whether it is more reasonable to flip the responsibility of @gszep's proposal and simply allow something like the following to work?

using BioSequences
using OffsetArrays
offset = -4000:4000
seq = randdnaseq(length(offset))
oa_seq = OffsetVector(seq, offset)

Currently, with BioSequences v3.1, the code errors with the following.
So, for this functionality to work,
#173 is where that other discussion is. @jakobnissen was the primary detractor, though it seems like @SabrinaJaye concurred. I don't have a good sense of the maintenance burden, but I still don't see the benefit over making an

And there's probably a better way that doesn't do type piracy but can freely convert back and forth. Anyway, I say I'm concerned about maintenance burden, but I'm not the primary maintainer here, so please hold my objection lightly.
Right, so I still stand by my arguments in #173 - subtyping To implement
There are some design issues that you might want to consider. For example, should
Oh darn, I misspoke. I don't need the gap padding, just the offset - so no violation of assumptions as suggested. As an aside, with the responsibility inverted, the

using BioSequences
using BioSymbols
using OffsetArrays

"Unoptimised PaddedSequence"
struct PaddedSequence
    seq
end

"Unoptimised getindex method for PaddedSequence: in-bounds indices return the stored symbol, all others return a gap"
function Base.getindex(padded::PaddedSequence, i::Int)
    if firstindex(padded.seq) ≤ i ≤ lastindex(padded.seq)
        return getindex(padded.seq, i)
    end
    return gap(eltype(padded.seq))
end

"Unoptimised range getindex method for PaddedSequence"
function Base.getindex(padded::PaddedSequence, r::UnitRange{Int})
    f = Base.Fix1(getindex, padded)
    return f.(r)
end

Checking the padded sequence.

julia> seq = dna"GATC" |> collect
4-element Vector{DNA}:
DNA_G
DNA_A
DNA_T
DNA_C
julia> pseq = PaddedSequence(seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])
julia> pseq[-1:6]
8-element Vector{DNA}:
DNA_Gap
DNA_Gap
DNA_G
DNA_A
DNA_T
DNA_C
DNA_Gap
DNA_Gap

Now using Vector{DNA} as a stand-in for

seq = collect(seq)
4-element Vector{DNA}:
DNA_G
DNA_A
DNA_T
DNA_C
julia> offset = -1:2
-1:2
julia> oa_seq = OffsetVector(seq, offset)
4-element OffsetArray(::Vector{DNA}, -1:2) with eltype DNA with indices -1:2:
DNA_G
DNA_A
DNA_T
DNA_C
julia> oa_pseq = PaddedSequence(oa_seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])
julia> oa_pseq[-4:4]
9-element Vector{DNA}:
DNA_Gap
DNA_Gap
DNA_Gap
DNA_G
DNA_A
DNA_T
DNA_C
DNA_Gap
DNA_Gap

I don't think
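For comparison, a concretely typed variant of the same padding idea might look roughly like the sketch below. It is Base-only so it runs without BioSequences: `Char` stands in for `DNA` and `'-'` for `DNA_Gap`, and `PaddedView` is a hypothetical name, not a type from any package. Storing the pad symbol in the struct is an assumption made here to avoid depending on `gap`.

```julia
# Hypothetical typed wrapper: out-of-bounds reads return `pad` instead of throwing.
struct PaddedView{T, V<:AbstractVector{T}}
    seq::V
    pad::T
end

# Scalar indexing: ask the wrapped vector whether `i` is valid, using its own axes,
# so an offset-indexed vector would be handled the same way as a 1-based one.
function Base.getindex(p::PaddedView, i::Integer)
    return checkbounds(Bool, p.seq, i) ? p.seq[i] : p.pad
end

# Range indexing: collect the padded elements into a plain Vector.
function Base.getindex(p::PaddedView, r::AbstractUnitRange{<:Integer})
    return [p[i] for i in r]
end

pseq = PaddedView(collect("GATC"), '-')
window = String(pseq[-1:6])
```

Because both fields have concrete types, the compiler can infer the element type of `pseq[-1:6]`, which the untyped `seq` field in the version above does not allow.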
Sometimes it makes sense to begin indexing a sequence from its center. For example with hyper-variable loop regions for T cell / B cell receptors or peptides presented by antigen-presenting cells. These are short amino acid sequences (lengths < 20) whose start and end sequences are structured and center sequences are highly variable. Therefore I propose a CenteredSequence{A} type

Expected Behavior

AA_Gap instead of out of bounds error, making it easier to align sequences of variable lengths

Possible Implementation

However there seem to be issues with this implementation

Edit: fixed in BioSequence v3.0.2
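To make the proposed behaviour concrete, here is one possible reading of it as a Base-only sketch. This is not the implementation from the issue, which did not survive on this page. `CenteredView` is a hypothetical name, `Char` stands in for an amino acid type and `'-'` for `AA_Gap`, and placing the center at the left-middle element for even lengths is an assumption made here.

```julia
# Hypothetical wrapper: index 0 refers to the center of the wrapped sequence,
# and any index that falls outside the sequence yields `gap`.
struct CenteredView{T, V<:AbstractVector{T}}
    seq::V
    gap::T
end

# Position of the center in the wrapped sequence's own indices.
# Assumption: for even lengths, the left of the two middle elements is the center.
center(v::CenteredView) = fld(firstindex(v.seq) + lastindex(v.seq), 2)

function Base.getindex(v::CenteredView, k::Integer)
    j = center(v) + k
    return checkbounds(Bool, v.seq, j) ? v.seq[j] : v.gap
end

Base.getindex(v::CenteredView, r::AbstractUnitRange{<:Integer}) = [v[k] for k in r]

a = CenteredView(collect("CASSL"), '-')
b = CenteredView(collect("GQY"), '-')

# Reading the same centered window from both sequences lines up their centers.
row_a = String(a[-3:3])
row_b = String(b[-3:3])
```

Under these assumptions the two windows have equal length regardless of the input lengths, which is the alignment convenience the proposal describes.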