Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we make BioSequences <: AbstractVector? #173

Closed
TransGirlCodes opened this issue Jul 21, 2021 · 11 comments
Closed

Should we make BioSequences <: AbstractVector? #173

TransGirlCodes opened this issue Jul 21, 2021 · 11 comments

Comments

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jul 21, 2021

So I'm just making an actual issue to discuss - and document said discussion of a prospect that was raised on Slack.

Because as I'm going through broadcasting, I'm seeing that there are some arguments in the pro column for doing this.

@TransGirlCodes
Copy link
Member Author

Linking the PR documenting the concerns with getting broadcasting to work with sequences, as some of the concerns there are relevant. #171

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jul 21, 2021

One potential issue with inheriting from AbstractVector is I don't believe we can simply do BioSequence{A<:Alphabet} <: AbstractVector{eltype(A)}.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jul 21, 2021

One alternative, is we could redefine the interface a bit. So as you have BioSequence{ElType, Alphabet} <: AbstractVector{ElType}, where the Alphabet does not longer determine the element type, but still controls what is acceptable. E.g BioSequence{DNA, UnambiguousIUPAC}, BioSequence{RNA, UnambiguousIUPAC}. Encoding and decoding and such is then controlled by dispatch on ElType, Alphabet, and perhaps even the sequence.

@TransGirlCodes
Copy link
Member Author

Let's say we did this, and take the example of IUPAC alphabets. Ok, we could have:

abstract type IUPACAlphabet <: Alphabet
struct UnambiguousIUPAC <: IUPACAlphabet
struct AmbiguousIUPAC <: IUPACAlphabet

Well some traits and methods might change a bit, but not totally, e.g. bits_per_element(::DNAAlphabet{2}), and bits_per_element(::RNAAlphabet{2})say becomes bits_per_element(::Type{T}, ::UnambiguousIUPAC) where {T<:Union{DNA,RNA} = 2

@CiaranOMara
Copy link
Member

CiaranOMara commented Jul 22, 2021

I did see BioSequence <: AbstractVector on slack and thought BioSequence is a bit like a stridden BitVector where the encoding is between strides. The BitVector would already have all of the broadcasting machinery required to splice/pack or assign the encodings into the data vector. So subtyping from AbstractVector does make sense to me. I think there is undoubtedly a one-to-one correspondence between BioSequence and AbstractVector if there were UInt2 and UInt4 types to hold encodings. I think the string stuff is a valuable nicety to present and input human-readable symbols that so we don't need to type out a comma-separated vector of symbols or, worse, their bit encodings. I agree that when it comes to processing symbols, we're more interested in array-like objects.

Maybe another way of conceptualising the BioSequence element type is that of spliced encodings -- we are sort of already there with the idea of bit packing. Instead of BioSequence{DNA, UnambiguousIUPAC}, what about
BioSequence{Encoded{DNA, UnambiguousIUPAC}}, where Encoded is a parametric primitive type? Maybe Encoded is the BioSymbol -- I've lost track of where the BioSymbols package is at, and as far as I'm aware, 8 bits is still the smallest sized type available to hold a single symbol encoding.

I'm not keen on keeping track and supplying two types to bits_per_element; I'd much prefer they be somehow wrapped, for example, bits_per_element(::Encoded). If the information is on the element or the element is aware of its internals, conversion between encodings is easily convert(Encoded{RNA, UnambiguousIUPAC}, nuc::Encoded{RNA, AmbiguousIUPAC}).

I think this idea of placing information on the element would move your Alphabets suggestion, Encoded (which may be the BioSymbol), and the bits_per_element method to the BioSymbols package, if they're not there already.

I like the idea of expressing the packed encodings as BioSequence{<:Encoded} and the unpacked encodings as Vector{<:Encoded}.

Besides nice syntax for assignments, does broadcasting across packed encoding offer any computational shortcuts, or is it still necessary to unpack to UInt8 and splice in bits from a new UInt8? If so, the reality of some broadcasting may be akin to a lazy BioSequence(map(func, seq)).

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jul 22, 2021

I'm not sure about putting the encoding information on the element. BitVector for example does not do that, the array user doesn't have to worry, ok this is a Bool but it's an Encoded{Bool, OneHotBoolAlph} or anything like that. However, I do see pros and cons... I'll keep thinking about it today and write any thoughts that occur!

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jul 22, 2021

So thought about this a bit more today, and I'm more open to the Encoded{} option.

But you got me thinking is it simpler to just have the symbols or elements dictate their own encoding and allowed values.

Let's say you have DNA & RNA type symbols, which basically just allow A, T, C, G, their encoding can be 2 bits. And IUPAC_[DNA/RNA] types which are basically the 1-hot encoded DNA and RNA types that BioSymbols currently have.

We define conversion and promotion between them so they work seamlessly for operations, comparison, and predicates.

The BioSequence types then only need to know one thing - how many bits do these symbol types require? And all the sequence types do is pack or unpack them from their own internal data structure. Then people that want to extend or take advantage of the package only have to define the symbol type e.g. say a "Genotype" for sequences of 0/0, 0/1 for say SNP calls and so on. The author of the symbols defines the encoding with the symbol type itself, and we just have the sequence types pack-em and unpack-em.

What do you think of this @jakobnissen?

@jakobnissen
Copy link
Member

jakobnissen commented Jul 23, 2021

So I originally suggested this in Slack. Just like you, @benjward, it came to me after trying to implement broadcasting for BioSequence. But I now think it's not a good idea, and let me explain why. First though, let's list the pros:

  • Broadcasting will be easier to implement
  • We kind of conform to the AbstractVector interface already
  • We might be able to use BioSequence in functions that take AbstractVector that we have not thought about yet.

However, I think using subtyping to solve these kind of interface issues is ultimately a bad solution, generally. It's the one that Julia provides by default, but it's bad nonetheless. The issue is that:

  • What people mean by AbstractVector may be different: People may not respect that it just means an interface and instead think it means vector in the mathematical sense. This can, and does lead to confusion, about what one can expect vectors to be.
  • More importantly, subtyping is an all-or-nothing thing. By subtyping AbstractVector, you get all its functionality, even if you don't want it. So while we may unlock great functionality we didn't anticipate, we also unlock footguns we didn't anticipate and now can't change. Remember, removing a subtype is much more breaking than adding one.
  • Subtyping is a one-shot gun. Now, because we have worked obn broadcasting, we want BioSequence to be AbstractVector. It's not too unreasonable to think that in one year, we suddenly find a great use case for why it should be AbstractString. Too bad, can't be done.

Like you said, @benjward, I think the broader Julia community is beginning to move towards a consensus that one should use ducktyping and traits for interfaces where possible, instead of relying on subtyping. It's more important that BioSequence is vector-like than it is that it is an AbstractVector. And luckily, Julia does provide functionality for ducktyping. As I said in Slack, if an interface is worth having, it's probably also worth implementing for your own type.

It might suck right now, because broadcasting in particular is hard to implement for non-array types, but this is a problem with the broadcasting machinery, not with Biosequence, and we should not share in its sin.

Finally, I also think that the fact that the element type of AbstractVector{T} is T - this is documented, by the way, not a convention - is a good hint that we probably shouldn't do this.

However, I agree that it's worth thinking about havign BioSequence{Encoding{T, C}} where T is the element type and C a struct that controls encoding. But it's a distinct thing from AbstractVector - we could have this even if we didn't subtype AbstractVector. It would probably simplify some things, but I'm afraid we would still need to have lots of methods added to sequences of certain element types. For example, 2-bit DNA and RNA can be converted without converting the data - and this is neither a property of T, nor C, but of all three things - BioSequence{Encoding{T, C}}.

Right now, however, I think it's not worth it. The current system works pretty well.

@TransGirlCodes
Copy link
Member Author

TransGirlCodes commented Jul 23, 2021

@jakobnissen

Ok, so I didn't go into the cons of this too much because I was still mulling over, but you have hit on the downsides I was considering, and a few more in addition.

I looked deeply into broadcasting this week by @which, @less and just interactively drilling my way down into insight, as to how this works with normal vectors, and how it goes for our sequences.

I believe our sequences really are not that far away from the entire interface (they quack almost exactly like a vector when required!), and that making use of the standard DefaultArrayStyle{1} (which is what the built in broadcasted method returns for our sequences based on it satisfying the interface!), will work. As do most of the methods defined in the documentation - the copy(::Broadcasted), copyto!(::Broadcasted) etc.

As far as I can tell, if we simply decide "hey, if you are using BioSequences, and you do broadcasting, and the end result of the loop's operations is DNA,RNA etc, then you're getting a sequence, not an array.". All we need to do is overload similar as the docs suggest, and then a few other bits that are slightly quirky and miiight need us to import and overload two or threeBase.Broadcast.xyz methods e.g. to make sure axes are extruded corectly so it doesn't default to Cartesian. I don't like that some of those methods are in Base.Broadcast when some of the Base.Broadcast methods in the docs that they tell you to overload are rexported from Base, as it makes you wonder, "crap is this gonna be a changing internal", but I guess as you say julia is moving towards array-like and as it does so, a few more of these methods only defined for AbstractArray will be expected to be overloaded for array-likes in time.

If we decide instead "Hey, if your broadcast involves biosequence arguments & the output is DNA,RNA,etc, then you get a sequence out, otherwise you get an array", then we wiiiill need our own style, but instead of entirely re-implementing broadcasting we can have the style set up to output a sequence, and then basically just dispatch right back to the DefaultArrayStyle{1} methods as soon as that's done.

@CiaranOMara
Copy link
Member

The conversation has moved in another direction, but let me offer another analogy for the Encoded{} option.

It's like describing UInt8 as Int8{Unsigned}, and then adding a constraint Accept1though10 on the values it allows. The result is Int8{Accept1though10, Unsigned}, which is an Int8 that can say, "I only take integer values 1 through 10 and encode them unsigned".

In terms of BioSymbols and BioSequences, the structure would be primitive type BioSymbol{Alphabet, Encoding} 8 end.

This is likely relevant to BioJulia/BioSymbols.jl#42.

@TransGirlCodes
Copy link
Member Author

Gonna close this as I believe the answer to this question is "No" to the AbstractVector.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants