-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should lengths and names be required properties in every sequence collection ? #72
Comments
we've discussed this at length in the past; names can always just become 1-N integers, which is basically what you'd do in a computer if you didn't have names. Similarly, lengths can be easily computed from sequences. And if you don't have sequences and also no lengths, then you don't have a sequence collection. So, there is no situation where names/lengths cannot be computed. Keep in mind: we're not forcing anyone to use this structure for use case X -- this is just what you would make to compute a digest. In the case of a CRAM file where you have a digest for each sequence, and you want to compute a seqcol digest you could:
So, it is possible. The thing is, you probably wouldn't want to do this, because you'd really want to just use a collection that had an identifier computed with the names from some provider -- then you'd just ignore them. Making names option does not change that you still wouldn't want to do this. I recall weeks of discussion on this topic where we eventually concluded It is a bigger problem to make names optional. I think the above, to me, reinforces that it makes a lot of sense to have names be mandatory -- because it also guides the user away from doing something that wouldn't be a great decision from an interoperability perspective. |
@raskoleinonen I do think it would be useful to get a bit more details on the use case. The way I understood it, the context makes computation of the lengths difficult. Could you explain exactly why? Perhaps one way of doing this could be to look up the level-1 digest of the sequences in some seqcol service to receive the corresponding length array (and possibly names), of course given a point in the future when seqcol is more widely adopted. @nsheff I have never really liked the idea of just naming the sequences with its index. I don't understand why one would want to do that, as it basically reduces interoperability. You can always just add an index-based names array on the user's side if there is a need for this. There will be no extra information content in the The So to me, requiring the |
I suggest that there is two levels of the specification: 'core' and 'community'. The 'core' level describes the scope and format of the sequence collections, the terminology, how digests are computed, and has a list of standard fields with definitions. The decision of which fields are required and which are inherent would be in the 'community' part of the specification. This would allow the 'core' to be approved without discussing which fields are required and/or inherent in different contexts. My intuition tells me that having any standard fields in the 'core' as required and/or inherent is a significant distraction from approving the 'core' as a GA4GH standard. For example, my opinion is that lengths is neither inherent nor required. I suggest that opinions about field usage should not have a impact on the 'core' specification. |
@raskoleinonen I believe the overall idea in your suggestion is very much in line with the consensus of the group. I think the main point where we differ is that we have defined the |
I can also add that the discussion has mostly been about the EDIT: Even though I am not fully happy with requiring the |
Just another thought here: If you have neither names nor lengths, and you just have sequences, then what you want is just the level 1 sequences array digest. That's going to be part of the main digests. I guess the question is this: do we want to make it so that you can define a top-level digest with that same content? (just the sequences array?) If we do this, then there will be 2 digests that refer to that : the top-level and the level 1 digest, since that's basically a sequence collection with only a single attribute (sequences). The problem here is just that now, nothing is required. So an implementation can't even rely on the fact that there are unique identifiers (names) for the sequences in the collection. I guess I still argue that for this use case, users should use level 1 digests. I guess there's a pretty major downside to this, though, and that is that these would not be compatible with the APIs in the same way as top-level digests. The major advantage of doing it this way, though, is that those digests will perfectly match the sequence digests of other collections that do have names/lengths, which is nice. But I'm open to more discussion on this topic. |
Also: remember, for a particular use case, someone could just say "I'm not using the canonical minimal schema. I'm using my own schema, and in my schema, names and lengths are not required". That's fine. You'd just lose interoperability with other implementations that do require them, that's the cost. |
In recent discussions, we've moved a bit away from even having a "core schema" -- so now the schema is depicted as an example. I think this should probably solve the issue for all parties, because you can still be "seqcol compatible" even if using a completely different schema -- which means you can define your own set of required/inherent attributes. This comes at the cost of interoperability, though -- but that seems to be what is wanted. So. |
Having re-read this, I believe @nsheff I agree that moving away from a "core schema" is the right thing to do. I would suggest the schema isn't positioned as an example but an exemplar and recommend maintaining the semantic meanings of attributes if someone does diverge from it |
My fear is that we are throwing baby out with the bath water by removing the concept of the core schema. This is basically what we have been spending years to discuss and have landed on and is an important outcome of that work. To me, the core schema IS the main core contribution of the standard, it is the thing that might fix the issues we have been facing for 20 years and we should not take it lightly to take that out of the standard. |
I certainly understand that aspect. However there is broader utility in the seqcol spec beyond sequences where you have a central immutable concept that can be represented as a hash (sequences, container images, annotation releases). It feels like there has to be a way we can describe the core spec whilst not avoiding the major improvement of the schema. Also just something I've noted. The BAM spec states that |
I agree with you @sveinugu -- but basically it seems like people don't want interoperability, they want to just solve the one use case they have in the simplest way possible, which by necessity eliminates the ability to use the same thing in other contexts. In the end, people who value the interoperability can still agree upon a shared schema as we have done, and I'm hopeful that if enough momentum builds around that shared schema it could create a community resource where it's worth the extra effort by people outside that central community to buy into the interoperable network built on that core schema... but at the beginning, lacking such a network existing, it's a harder sell. And even if we do build up such a community around an interoperable schema, really, it doesn't really prevent the standard from being useful in a silo, in addition to that . |
with the recent changes including the attribute endpoint, this changes things for this issue of core schema. @sveinugu what's your position now on what should be required/inherent in the core schema? I guess it seems that taking lengths to no longer required/inherent makes more sense now. |
Like @nsheff, I agree with @sveinugu that a core schema does seem like the central property for the work and would be a shame to drop. I also like @andrewyatz's idea of an exemplar, it seems like a good approach that doesn't exclude other use cases down the road. More on this later. Just adding my $0.02 here, but I wonder if the core issue that is being exposed is that there exists exactly one top-level digest property, that inherently supports one view of how collections should be digested. And we are seeing that various participants in this community are exposing different ideas about what attributes are relevant to sequence collections in different contexts. A similar way of thinking might be what motivated the PRC comments @tcezard brought up at the top of this issue. While addressing this by adding the A few ways I think this could be addressed are:
|
There's been some more discussion about this in meetings over the past months, and having just re-read this thread, I want to summarize things here briefly: The past few months we've made major changes toward making the level 1 digests more useful independently. Most importantly, through the To quote from myself above:
If all you have is sequences, then now there's no reason you can't just use level1 sequence digests, with the Furthermore, if what you have is names and lengths (no sequences), you again can accomplish everything using the /attribute endpoint and the level 1 digests for names/lengths (or name_length_pairs, more likely). This makes it more likely for us to preserve the integrity/interoperability of the top-level digests by mandating more in the core schema. Our core schema now says:
This establishes a minimal set of attributes that define the top-level schema, and which ones are inherent. This is probably the most widely useful representation, and also what can be retrieved from a FASTA file. This has been added to the ADR here: #87 Use cases that need only some of these now can operate easily within the spec using attribute endpoints, which will then be guaranteed to be interoperable across servers that use 'names/sequences' as the exclusive list of inherent properties. I think this kind of falls under a 4th option from what @ahwagner suggested above, that sort of partway satisfies all of the 3 suggestions:
|
I wanted to summarise one of the outcome of today's call and clarify a comment made in the PRC feedback document:
In ADR from 2023-07-12, we decided that the only mandatory properties in a sequence collections would be
lengths
andnames
.The argument made today was that by requiring lengths and names to be present, we're potentially forcing these attributes in use cases where they are not relevant or in some case not available. The example given was that of a CRAM file that contains a digest for each sequence but does not contains the length.
The argument in favour of having required field is one of interoperability. Guaranteeing the presence of the two fields helps making different services compatible by always having common grounds.
Reading back the ADR from 2023-07-12, the rational does not feel about how
lengths
andnames
should be made mandatory but howsequences
should not be made mandatory because it would have prevented the use case of coordinate space to be implemented. I think similar argument can be made about other use-cases we might not have envisioned.@raskoleinonen, please correct me if I misrepresented your point
@nsheff @sveinugu @andrewyatz please chime in as any change would have to be made relatively soon.
The text was updated successfully, but these errors were encountered: