Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment: inherent property #84

Open
ahwagner opened this issue Aug 21, 2024 · 14 comments
Open

Alignment: inherent property #84

ahwagner opened this issue Aug 21, 2024 · 14 comments
Labels
schema-term Proposals for terms in the core schema

Comments

@ahwagner
Copy link
Member

The VRS 2.0 specification uses a property (ga4ghDigest.keys) to indicate keys that are used in creating VRS computed identifiers. This has the same functionality as the SeqCol inherent attribute.

This has been implemented across 16 VRS classes. The ga4ghDigest object is also used for other digest-related keywords, including ga4ghDigest.prefix for VRS objects that are identifiable.

Here are some examples:

This is an opportunity to align these terms before either standard is finalized. Is it feasible to reuse the VRS ga4ghDigest structure for Sequence Collections in lieu of inherent?

@ahwagner ahwagner added the schema-term Proposals for terms in the core schema label Aug 21, 2024
@sveinugu
Copy link
Collaborator

Or, if I may be so bold: the opposite way around?

What I like about inherent is that is fits very well in conversation. "Should xyz be inherent?" just feels better to say than "should xyz be in keys?"

@ahwagner
Copy link
Member Author

ahwagner commented Aug 21, 2024

We read it as "should xyz be in GA4GH Digest keys?", and "what is the GA4GH Digest prefix?".

What is useful for our purposes is tying together all of the properties associated with the GA4GH digest into one object (see above examples). That said, if keys as a variable name is unpalatable, maybe there are other aligned solutions we could target, e.g. ga4ghDigest.inherent. Would that be preferable? Or perhaps ga4ghDigest.inherentProperties, ga4ghDigest.keyProperties, or something else?

@sveinugu
Copy link
Collaborator

Standardization is obviously what GA4GH is all about, so that is way more important than my personal preferences. So one way or another, I think your suggestion is a good one!

But just to add another argument for inherent (or a variant of that): The word indicates something essential to think about when one considers whether a property should be part of the list. Since a digest is typically used to identify an object, one should consider whether the property in question is an integral part of the identity or definition of said objects, or to put it in another way, is an inherent part of the object.

I am in any case only speaking for myself. Let's hear what others have to say!

@ahwagner
Copy link
Member Author

ahwagner commented Oct 1, 2024

@andrewyatz this is a key GKS / LSG alignment point that would be good to have you weigh in on; commenting here to bump this in your inbox.

@andrewyatz
Copy link
Collaborator

Sorry we did speak about this at the latest meeting for refget. At first I didn't quite get the concept being described here but now I can see it. Also I didn't twig it was vrs 2.0.

I'm personally on the side for more alignment here since the two emerging products are both conveying the same concept. I suspect my comment will be similar to what @sveinugu has said here that alignment is important. I think perhaps there is one difference though, which on the whole might not be important and it does also build on @sveinugu that the inherent property tries to flag the inherent attributes. I guess we're reusing this concept to reflect keys. One is talking about inherent identity the other about keys for a hash.

So overall I'll put my hat on the side of alignment in the hope we can align even though semantically they're not 100% the same. They will be confusing for others. Unless we duplicate. That sounds worse

@ahwagner
Copy link
Member Author

ahwagner commented Oct 6, 2024

I think that using inherent instead of keys as an attribute name is fine; the idea is the same, the keys are selected due to their use as inherent attributes for creating the digest. Reiterating my earlier comment, what I would really like to see here is the inherent attribute list nested under a ga4ghDigest (or even just ga4gh) keyword in JSON Schema.

So, I would propose that both specifications use something like:
ga4gh.inherent to indicate inherent properties of a data class, and ga4gh.prefix for a ga4gh identifier, if applicable.

So in JSON Schema a class definition may look something like:

"$defs": {
  "MyDataClass": {
    "ga4gh":{
      "inherent":["inherentPropertyA", "inherentPropertyB"]
    },
    "type": "object",
    "properties": {
      "inherentPropertyA": { ... },
      "inherentPropertyB": { ... },
      "otherPropertyC": { ... }
    }
  }
}

Would this work for Seq Col?

@ahwagner
Copy link
Member Author

@andrewyatz @nsheff @sveinugu @tcezard our next VRS 2.0 PRC meeting is on 10/31. One of our points of feedback was to document the ga4gh keys attribute (ga4gh/vrs#569); we would like to align with you before presenting this to the PRC. Would it be possible to get a decision on the above proposal before then so we may update our documentation accordingly?

@nsheff
Copy link
Member

nsheff commented Nov 1, 2024

Really the only change we would make, then would be:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences"
required:
  - names
  - lengths
ga4gh:
  inherent:
    - lengths
    - names
    - sequences

I can work with this, but I don't like how inherent is no longer parallel to required.

At the end of the day, we can make it work however, but what is the rationale for the added complexity? I guess to collect all the related terms?

Assuming we add transient and passthru modifiers as well (#86), I suppose it makes sense for these to go under the same ga4gh term as well, like so:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences"
...
required:
  - names
  - lengths
ga4gh:
  inherent:
    - lengths
    - names
    - sequences
  transient:
    - sorted_name_length_pairs
    - sorted_sequences
  passthru:
    - alias
    - author

Edit: fix yaml typo to correctly specify ga4gh section is an object, not an array.

@sveinugu
Copy link
Collaborator

sveinugu commented Nov 1, 2024

What about just namespacing the qualifier names, e.g. ga4gh_inherent? Then we can do that with all non-JSON Schema qualifiers, regardless of their position. We have e.g. a collated qualifier that resides under each property, and adding an extra 'ga4gh' there feels cumbersome.

@ahwagner
Copy link
Member Author

ahwagner commented Nov 1, 2024

The rationale for the ga4gh: as a containing term would be that this indicates a shared set of ga4gh extension conventions for representing, e.g., inherent properties, digest prefixes, transient properties. I imagine that these shared conventions are then registered with a GA4GH cross-product alignment body like TASC to help with future keyword alignment.

While I think this would also be achievable using a ga4gh_ prefix ahead of each term, I prefer the grouping under a shared domain, rather than by name prefix. This pattern is a little closer to how other keywords in JSON schema are used; function is determined by keyword data structures, as opposed to keyword substrings.

@nsheff
Copy link
Member

nsheff commented Nov 1, 2024

I wouldn't do it with the local modifiers. But I'm fine with grouping them under ga4gh for the global modifiers, and in fact I prefer that to prefixing them all with ga4gh_.

@sveinugu
Copy link
Collaborator

sveinugu commented Nov 1, 2024

Yeah, not sure I thought it was a good idea myself, just wanted to consider it. I think your suggestion looks good with the global qualifiers. It feels slightly off that local qualifiers are exempt, but that is not a good argument to not organise the global qualifiers in this way.

@ahwagner
Copy link
Member Author

Alright. Just to be very clear here, there is a small difference between what I proposed in this comment and what followed. I am proposing that ga4gh is a JSON Object, with TASC-registered properties including inherent. The subsequent examples use ga4gh as a JSON Array, but I am not sure if that was an intentional difference or just a typo.

@nsheff can you comment? Was ga4gh as an Array intended, or useful in some way over an Object structure?

@nsheff nsheff mentioned this issue Nov 13, 2024
@nsheff
Copy link
Member

nsheff commented Nov 13, 2024

Was ga4gh as an Array intended, or useful in some way over an Object structure?

No, that's just a typo. The way I wrote that is actually not correct yaml, and I'll edit that now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema-term Proposals for terms in the core schema
Projects
None yet
Development

No branches or pull requests

4 participants