Minimal and extended schemas proposal #50

nsheff · 2023-06-28T13:18:12Z

We decided to start with two schemas: a minimal schema, which we would post now as what implementations should support, and an extended schema, which is in the evaluation stage to determine whether its terms should end up in the minimal schema. Here are drafts of both for comment and revision:

Minimal seqcol schema

description: "A collection of biological sequences, defined by the GA4GH Sequence Collections standard."
$id: "/schemas/seqcol_base"
version: 0.1.0
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable identifiers of each sequence (e.g. chromosome names or accessions)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    description: "Digests of sequences computed using the GA4GH digest algorithm (sha512t24u)."
    items:
      type: string
  sorted_name_length_pairs:
    type: array
    description: "Sorted digests of names+lengths pairs, computed following the seqcol specification."
    items:
      type: string
required:
  - lengths
  - names
inherent:
  - lengths
  - names
  - sequences
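
For anyone trying out this draft, here is a minimal Python sketch of how `sorted_name_length_pairs` might be computed. It reflects one reading of the seqcol specification: each name/length pair is serialized as a canonical JSON object, digested with sha512t24u (SHA-512, truncated to 24 bytes, base64url-encoded), and the resulting digests are sorted. The function names are illustrative, and `json.dumps` with sorted keys is only a stand-in for RFC 8785 canonicalization, so treat the specification as the authority on the exact procedure.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def canonical_json(obj) -> bytes:
    # Assumption: sorted keys plus compact separators approximate RFC 8785
    # (JCS) for plain ASCII strings and integers. Non-ASCII names would need
    # a real JCS implementation.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sorted_name_length_pairs(names, lengths):
    """Digest each {length, name} pair, then sort the digests."""
    if len(names) != len(lengths):
        raise ValueError("names and lengths must have the same number of items")
    digests = [
        sha512t24u(canonical_json({"length": length, "name": name}))
        for name, length in zip(names, lengths)
    ]
    return sorted(digests)


# Illustrative values only; the result does not depend on sequence order.
pairs = sorted_name_length_pairs(["chr1", "chr2"], [100, 200])
```

Because the digests are sorted, two collections with the same name/length pairs in a different order produce the same array, which is the point of this attribute.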

Extended seqcol schema

$ref: "/schemas/seqcol_base"
$id: "/schemas/seqcol_extended"
properties:
  masks:
    type: array
    collated: true
    description: "Digests of subsequence masks indicating subsequences to be excluded from an analysis, such as repeats."
    items:
      type: string
  priorities:
    type: array
    collated: true
    description: "Annotation of whether each sequence is a primary or secondary component in the collection."
    items:
      type: boolean
  topologies:
    type: array
    collated: true
    description: "Annotation of whether each sequence represents a linear or other topology."
    items:
      type: string
      enum: ["circular", "linear"]
      default: "linear"
  molecule_types:
    type: array
    collated: true
    description: "Designation of the type of molecule for each sequence, such as RNA, DNA, or protein."
    items:
      type: string
  alphabets:
    type: array
    collated: true
    description: "The set of characters actually present in each sequence."
    items:
      type: string
  alphabet_domains:
    type: array
    collated: true
    description: "The set of characters that could be included in each sequence."
    items:
      type: string
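
To make the structural constraints in the two drafts concrete, here is a rough Python sketch of a checker. It assumes that `collated: true` means the arrays line up element by element (one entry per sequence, so all collated arrays have the same length), and it only covers the `required` list and the `topologies` enum shown above. The name `check_collection` and the example values are hypothetical; this is not a substitute for validating against the schemas with a JSON Schema library.

```python
REQUIRED = ("lengths", "names")
# Attributes marked `collated: true` in the two drafts above.
COLLATED = (
    "lengths", "names", "sequences", "masks", "priorities",
    "topologies", "molecule_types", "alphabets", "alphabet_domains",
)
TOPOLOGIES = {"circular", "linear"}


def check_collection(col):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = [f"missing required attribute: {a}" for a in REQUIRED if a not in col]

    # Assumption: collated arrays must all have one entry per sequence.
    sizes = {a: len(col[a]) for a in COLLATED if a in col}
    if len(set(sizes.values())) > 1:
        problems.append(f"collated arrays differ in length: {sizes}")

    # Enum check taken from the `topologies` definition.
    bad = [t for t in col.get("topologies", []) if t not in TOPOLOGIES]
    if bad:
        problems.append(f"invalid topologies: {bad}")
    return problems


# Illustrative collection, not real data.
example = {
    "names": ["chr1", "chrM"],
    "lengths": [1000, 50],
    "topologies": ["linear", "circular"],
}
problems = check_collection(example)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a submitted collection at once.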

nsheff mentioned this issue Jul 12, 2023

What information is included within the string-to-digest? #8

Closed

tcezard mentioned this issue Aug 18, 2023

Define what the service info will contain #39

Open

nsheff added the schema-term label (Proposals for terms in the core schema) Feb 21, 2024
