Support data schema (column metadata) #20

Open
dougmet opened this issue Mar 28, 2019 · 1 comment

Comments

@dougmet
Contributor

dougmet commented Mar 28, 2019

This is needed to support reading (#9) and writing to (#15) SDMX and CSVW (#19).

The indicator class will store the raw data and the metadata, and we also want it to store the schema for the data. The schema would include things such as:

  • Column type: TimePeriod, Disaggregation, Attribute
  • Data type: Numeric, String, etc.
  • Name: Optional long name that can include spaces etc
  • Description: Optional column description
  • SDMX field: What field is this in SDMX (related to DSD)
  • Translation key: How to translate to other languages (could even optionally store translations here)
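
As a rough illustration of the fields listed above, here is a minimal sketch of how one column's schema could be held in Python. The class and attribute names (`ColumnRole`, `ColumnSchema`, `label`) are hypothetical and are not part of the existing indicator class; the example values are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ColumnRole(Enum):
    """The three column types named in the list above."""
    TIME_PERIOD = "TimePeriod"
    DISAGGREGATION = "Disaggregation"
    ATTRIBUTE = "Attribute"


@dataclass
class ColumnSchema:
    """Hypothetical container for the metadata of a single data column."""
    column: str                            # header exactly as it appears in the data
    role: ColumnRole                       # column type
    data_type: str = "string"              # e.g. "string" or "numeric"
    name: Optional[str] = None             # optional long name, may contain spaces
    description: Optional[str] = None      # optional column description
    sdmx_field: Optional[str] = None       # matching field in the SDMX DSD, if any
    translation_key: Optional[str] = None  # key used to look up translations
    translations: Dict[str, str] = field(default_factory=dict)  # optional inline translations

    def label(self, language=None):
        """Return an inline translation if one exists, else the long name, else the header."""
        if language in self.translations:
            return self.translations[language]
        return self.name or self.column


# Illustrative values only.
age = ColumnSchema(
    column="Age",
    role=ColumnRole.DISAGGREGATION,
    name="Age group",
    translations={"es": "Grupo de edad"},
)
```

Keeping `translations` optional leaves room for both options mentioned in the last bullet: storing only a key, or storing the translated labels directly.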
brockfanning changed the title from "Support data schema (column metadata) in indicator class" to "Support data schema (column metadata)" on Sep 11, 2020
@brockfanning
Contributor

@jwestw This is related to what we discussed today. As mentioned above, this is relevant to the CSVW output. It would also meet a need in Open SDG, where we currently have no way to control the ordering of the columns.

Countries that are inputting their data from SDMX already have a data schema, in their DSD (data structure definition). So I think we can focus on the use-case of countries that are using CSV files, like the UK.

I'll throw out some ideas for approaches below. Personally I kind of lean towards "jsonschema per indicator" along with "auto-generated".

One central jsonschema file

With this approach, there would be a single (very long) jsonschema file in the country's data repository, like "data-schema.json". It would be a full collection of all the columns and values used across all indicators. For example, part of it might look like this:

{
    "Age": {
        "title": "Age",
        "description": "Description of the age column.",
        "type": "string",
        "enum": [
            "Under 15",
            "16 to 24"
        ]
    },
    "Sex": {
        "title": "Sex",
        "description": "Description of the sex column.",
        "type": "string",
        "enum": [
            "Not specified",
            "Female",
            "Male"
        ]
    },
    etc...
}

Pros: centrally located and comprehensive (this is analogous to an SDMX DSD)
Cons: the same file would need to be updated every time a data manager wants to add a new disaggregation column or value
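
To make the central-file idea a bit more concrete, here is a minimal sketch of checking an indicator's rows against such a file using only the standard library. The function name `find_schema_violations` and the sample CSV are made up for illustration. It assumes that columns missing from the schema and blank cells are simply skipped, which may not be the behaviour we want in the end.

```python
import csv
import io

# A cut-down version of the central schema shown above.
DATA_SCHEMA = {
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]},
    "Sex": {"type": "string", "enum": ["Not specified", "Female", "Male"]},
}


def find_schema_violations(rows, schema):
    """Return (row_number, column, value) for each value not listed in the schema enum.

    Assumptions: columns absent from the schema are ignored, and blank
    cells are treated as "no disaggregation" and skipped.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            definition = schema.get(column)
            if definition is None or not value:
                continue
            allowed = definition.get("enum")
            if allowed is not None and value not in allowed:
                problems.append((row_number, column, value))
    return problems


# Hypothetical indicator data; "25 to 34" is not in the Age enum.
sample_csv = "\n".join([
    "Year,Age,Sex,Value",
    "2019,Under 15,Female,10",
    "2019,16 to 24,Male,12",
    "2019,25 to 34,,7",
])
rows = list(csv.DictReader(io.StringIO(sample_csv)))
violations = find_schema_violations(rows, DATA_SCHEMA)
```

A check like this is where the "con" above would show up in practice: any new disaggregation value fails until the central file is updated.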

Jsonschema per indicator

With this approach there would be a separate jsonschema file for each indicator. It would look the same as the above, but would only contain the columns/values that are used in that indicator.

Pros: each indicator can be configured separately
Cons: column definitions may be duplicated across indicators
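
As a sketch of how the two approaches relate, the function below derives a per-indicator schema from a central one by keeping only the columns and enum values that actually occur in that indicator's rows. The name `subset_schema_for_indicator` and the sample rows are hypothetical. It assumes rows are dicts keyed by column name, for example from `csv.DictReader`.

```python
import copy


def subset_schema_for_indicator(schema, rows):
    """Keep only the columns and enum values that are used in the given rows."""
    subset = {}
    for column, definition in schema.items():
        # Collect the non-blank values this indicator uses for the column.
        used = {row[column] for row in rows if row.get(column)}
        if not used:
            continue  # column is not used by this indicator
        entry = copy.deepcopy(definition)  # leave the central schema untouched
        if "enum" in entry:
            # Preserve the order of values from the central schema.
            entry["enum"] = [value for value in entry["enum"] if value in used]
        subset[column] = entry
    return subset


central_schema = {
    "Age": {"title": "Age", "type": "string", "enum": ["Under 15", "16 to 24"]},
    "Sex": {"title": "Sex", "type": "string", "enum": ["Not specified", "Female", "Male"]},
}
indicator_rows = [
    {"Year": "2019", "Sex": "Male", "Value": "4"},
    {"Year": "2019", "Sex": "Female", "Value": "5"},
]
indicator_schema = subset_schema_for_indicator(central_schema, indicator_rows)
```

The duplication mentioned under "Cons" is visible here: the title and type of "Sex" would be repeated in every indicator file that uses that column.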

Auto-generated

This approach could be combined with either of the other two. If a column did not have any jsonschema representation, its jsonschema would be auto-generated, assuming "type": "string" and an enum of all the unique values in the column. (@jwestw I suspect this is partly what that ONS pipeline is doing when it converts to CSVW, so we may be able to re-use that code or use it as a dependency.) Presumably during auto-generation the order of the columns and values would default to alphabetical.
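
A minimal sketch of the auto-generation step described above, assuming rows are dicts keyed by column name. The function name `auto_generate_schema` and the `skip_columns` parameter are hypothetical; skipping "Year" and "Value" in the usage below is an assumption about which columns should not get an enum, and this is not the ONS pipeline's code.

```python
def auto_generate_schema(rows, existing=None, skip_columns=()):
    """Fill in schema entries for columns that have no hand-written definition.

    Hand-written entries in `existing` are kept as-is and come first.
    Generated entries assume "type": "string" and an enum of the unique
    non-blank values, with columns and values in alphabetical order.
    """
    schema = dict(existing or {})
    columns = sorted({column for row in rows for column in row})
    for column in columns:
        if column in schema or column in skip_columns:
            continue
        values = sorted({row[column] for row in rows if row.get(column)})
        schema[column] = {"title": column, "type": "string", "enum": values}
    return schema


# Hypothetical data: "Sex" has a hand-written entry, "Age" does not.
rows = [
    {"Year": "2019", "Sex": "Male", "Age": "Under 15", "Value": "1"},
    {"Year": "2020", "Sex": "Female", "Age": "16 to 24", "Value": "2"},
    {"Year": "2020", "Sex": "", "Age": "", "Value": "3"},
]
existing = {
    "Sex": {"title": "Sex", "type": "string", "enum": ["Not specified", "Female", "Male"]},
}
schema = auto_generate_schema(rows, existing, skip_columns=("Year", "Value"))
```

Note that plain alphabetical sorting puts "16 to 24" before "Under 15", which is one reason a country might want to customize the generated order afterwards.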

Using this same code we could also provide a way for countries to "initialize" a jsonschema file, for the purposes of customizing it. For example, say a country wants to customize their data schema for indicator 1.1.1 - they could run a Python script like python scripts/init-data-schema.py 1.1.1 or something to that effect, which would result in an auto-generated data-schemas/1-1-1.json file.
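
To give a feel for what such an init script might look like, here is a rough sketch. Everything beyond the script's purpose and the data-schemas/1-1-1.json output name is an assumption: the `data/indicator_1-1-1.csv` input location, the `--data-dir` and `--schema-dir` options, and the choice to leave "Year" and "Value" out of the generated file.

```python
import argparse
import csv
import json
from pathlib import Path


def init_data_schema(indicator_id, data_dir="data", schema_dir="data-schemas",
                     skip_columns=("Year", "Value")):
    """Write an auto-generated schema file for one indicator and return its path."""
    slug = indicator_id.replace(".", "-")  # "1.1.1" -> "1-1-1"
    csv_path = Path(data_dir) / f"indicator_{slug}.csv"  # assumed file naming
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    schema = {}
    for column in sorted(columns):  # alphabetical column order by default
        if column in skip_columns:
            continue
        values = sorted({row[column] for row in rows if row[column]})
        schema[column] = {"title": column, "type": "string", "enum": values}

    out_path = Path(schema_dir) / f"{slug}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(schema, indent=4, ensure_ascii=False) + "\n",
                        encoding="utf-8")
    return out_path


def main(argv=None):
    """Command-line entry point, e.g. main(["1.1.1"])."""
    parser = argparse.ArgumentParser(description="Initialize a data schema file.")
    parser.add_argument("indicator_id", help="indicator id such as 1.1.1")
    parser.add_argument("--data-dir", default="data")
    parser.add_argument("--schema-dir", default="data-schemas")
    args = parser.parse_args(argv)
    print(init_data_schema(args.indicator_id, args.data_dir, args.schema_dir))
```

Saved as scripts/init-data-schema.py with a call to `main()` under the usual `if __name__ == "__main__":` guard, this would be run as shown above. The generated file is then only a starting point that the country edits by hand.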

Pros: spares the countries from needing to maintain jsonschema
Cons: none
