Support data schema (column metadata) #20

dougmet · 2019-03-28T11:28:33Z

This is to support reading from (#9) and writing to (#15) SDMX and CSVW (#19).

The indicator class will store the raw data and the metadata, and we also want it to store the schema for the data. The schema includes things such as:

Column type: TimePeriod, Disaggregation, Attribute
Data type: Numeric, String, etc.
Name: Optional long name that can include spaces etc
Description: Optional column description
SDMX field: What field is this in SDMX (related to DSD)
Translation key: How to translate to other languages (could even optionally store translations here)
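As a rough illustration of the fields listed above, here is a minimal sketch of how one column's schema might be held in Python. The class names, field names, and the `label` helper are hypothetical and are not part of the existing indicator class.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ColumnType(Enum):
    """The three column types named in this issue."""
    TIME_PERIOD = "TimePeriod"
    DISAGGREGATION = "Disaggregation"
    ATTRIBUTE = "Attribute"


@dataclass
class ColumnSchema:
    """Hypothetical container for one column's schema."""
    column: str                            # header as it appears in the data
    column_type: ColumnType
    data_type: str                         # e.g. "numeric" or "string"
    name: Optional[str] = None             # optional long name, may contain spaces
    description: Optional[str] = None      # optional column description
    sdmx_field: Optional[str] = None       # matching field in the SDMX DSD, if any
    translation_key: Optional[str] = None  # key used to look up translations
    translations: Dict[str, str] = field(default_factory=dict)  # optional inline translations

    def label(self) -> str:
        """Return the long name if one is set, otherwise the raw header."""
        return self.name or self.column


age = ColumnSchema(
    column="Age",
    column_type=ColumnType.DISAGGREGATION,
    data_type="string",
    name="Age group",
)
year = ColumnSchema(column="Year", column_type=ColumnType.TIME_PERIOD, data_type="numeric")
```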

brockfanning · 2020-09-11T12:25:33Z

@jwestw This is related to what we discussed today. As mentioned above this is relevant to the output of CSVW. In addition, this would meet a need in Open SDG, where we don't have any way to control the ordering of the columns.

Countries that are inputting their data from SDMX already have a data schema, in their DSD (data structure definition). So I think we can focus on the use-case of countries that are using CSV files, like the UK.

I'll throw out some ideas for approaches below. Personally I kind of lean towards "jsonschema per indicator" along with "auto-generated".

One central jsonschema file

With this approach, there would be a single (very long) jsonschema file in the country's data repository, like "data-schema.json". It would be a full collection of all the columns and values used across all indicators. For example, part of it might look like this:

{
    "Age": {
        "title": "Age",
        "description": "Description of the age column.",
        "type": "string",
        "enum": [
            "Under 15",
            "16 to 24"
        ]
    },
    "Sex": {
        "title": "Sex",
        "description": "Description of the sex column.",
        "type": "string",
        "enum": [
            "Not specified",
            "Female",
            "Male"
        ]
    },
    etc...
}

Pros: centrally located and comprehensive (this is analogous to an SDMX DSD)
Cons: The same file would need to be updated every time a data manager wants to add a new disaggregation column or value
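One thing a central schema would make possible is checking data against it. The following is only a sketch using the standard library: the function name is hypothetical, the sample CSV is made up for illustration, and skipping empty cells and columns that are absent from the schema is an assumption rather than agreed behaviour.

```python
import csv
import io

# A fragment shaped like the central schema example above.
schema = {
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]},
    "Sex": {"type": "string", "enum": ["Not specified", "Female", "Male"]},
}


def find_schema_violations(rows, schema):
    """Return (row_number, column, value) for each value missing from its column's enum."""
    violations = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            column_schema = schema.get(column)
            # Assumption: columns not in the schema and empty cells are not checked.
            if column_schema is None or value == "":
                continue
            allowed = column_schema.get("enum")
            if allowed is not None and value not in allowed:
                violations.append((row_number, column, value))
    return violations


# Made-up sample data; the second row uses an age band the schema does not list.
csv_text = "Year,Age,Sex,Value\n2019,Under 15,Female,1.2\n2019,25 to 34,Male,3.4\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
violations = find_schema_violations(rows, schema)
```

A check like this also shows the cost noted above: a new disaggregation value is reported as a violation until someone updates the central file.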

Jsonschema per indicator

With this approach there would be a separate jsonschema file for each indicator. It would look the same as the above, but would only contain the columns/values that are used in that indicator.

Pros: each indicator can be configured separately
Cons: some columns and values may be duplicated across indicator files
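Per-indicator files could also address the column-ordering need mentioned earlier, if the order of keys in the file is treated as the display order. That is an assumption, not something decided in this thread, and `order_columns` is a hypothetical helper. Columns that the schema does not list keep their original relative order at the end.

```python
import json

# A per-indicator schema; the key order here is deliberately not alphabetical.
indicator_schema = json.loads("""
{
    "Sex": {"type": "string", "enum": ["Female", "Male"]},
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]}
}
""")


def order_columns(headers, schema):
    """Order headers by the schema's key order, then append any unlisted headers."""
    in_schema = [column for column in schema if column in headers]
    not_in_schema = [column for column in headers if column not in schema]
    return in_schema + not_in_schema


ordered = order_columns(["Age", "Region", "Sex"], indicator_schema)
```

This relies on `json.loads` returning a `dict` that preserves the key order of the file, which holds in Python 3.7 and later.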

Auto-generated

This approach could be combined with either of the other two. If a column has no jsonschema representation, one would be auto-generated, assuming "type": "string" and an enum of all the unique values in the column. (@jwestw I suspect this is partly what that ONS pipeline does when it converts to CSVW, so we may be able to re-use that code or use it as a dependency.) Presumably, during auto-generation the order of the columns and values would default to alphabetical.
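The fallback described above might look roughly like the sketch below. `complete_schema` is a hypothetical name, the sample data is made up, and placing generated columns after the hand-written ones is an assumption. It is applied to every column it is given, so the caller would decide whether columns such as the year or value are passed in.

```python
import csv
import io


def complete_schema(rows, existing=None):
    """Keep existing column schemas and auto-generate one for every other column."""
    schema = dict(existing or {})
    unique_values = {}
    for row in rows:
        for column, value in row.items():
            if column in schema:
                continue  # already described by hand, leave it alone
            values = unique_values.setdefault(column, set())
            if value:  # empty cells do not become enum values
                values.add(value)
    # Generated columns and their values default to alphabetical order.
    for column in sorted(unique_values):
        schema[column] = {
            "title": column,
            "type": "string",
            "enum": sorted(unique_values[column]),
        }
    return schema


# Made-up sample: "Sex" has a hand-written schema, "Age" does not.
existing = {"Sex": {"title": "Sex", "type": "string", "enum": ["Not specified", "Female", "Male"]}}
csv_text = "Age,Sex\nUnder 15,Male\n16 to 24,Female\nUnder 15,\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
schema = complete_schema(rows, existing)
```

Note that a plain alphabetical sort puts "16 to 24" before "Under 15", which is one reason a data manager might want to customize the generated file.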

Using this same code we could also provide a way for countries to "initialize" a jsonschema file, for the purposes of customizing it. For example, say a country wants to customize their data schema for indicator 1.1.1 - they could run a Python script like python scripts/init-data-schema.py 1.1.1 or something to that effect, which would result in an auto-generated data-schemas/1-1-1.json file.
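A minimal sketch of what such an init script might contain follows. Only the `1.1.1` to `data-schemas/1-1-1.json` mapping comes from the paragraph above. The function names, the `--data-file` argument, and the choice to generate a schema for every column in the CSV are assumptions made for illustration.

```python
import argparse
import csv
import json
from pathlib import Path


def schema_path(indicator_id, schema_dir="data-schemas"):
    """Map an indicator ID such as "1.1.1" to data-schemas/1-1-1.json."""
    return Path(schema_dir) / (indicator_id.replace(".", "-") + ".json")


def init_data_schema(indicator_id, data_file, schema_dir="data-schemas"):
    """Write an auto-generated schema for one indicator's CSV and return its path."""
    with open(data_file, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = sorted(reader.fieldnames or [])
    schema = {}
    for column in columns:
        # Unique non-empty values, in alphabetical order.
        values = sorted({row[column] for row in rows if row[column]})
        schema[column] = {"title": column, "type": "string", "enum": values}
    target = schema_path(indicator_id, schema_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(schema, indent=4) + "\n", encoding="utf-8")
    return target


def main(argv=None):
    """Hypothetical command-line entry point for the script."""
    parser = argparse.ArgumentParser(description="Initialize a data schema file.")
    parser.add_argument("indicator_id", help='Indicator ID, e.g. "1.1.1"')
    parser.add_argument("--data-file", required=True, help="Path to the indicator's CSV")
    args = parser.parse_args(argv)
    print(init_data_schema(args.indicator_id, args.data_file))
```

In a real script, `main()` would be called at the bottom of the file. As written, the sketch overwrites an existing schema file, so a real version would probably want to refuse or ask first.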

Pros: spares the countries from needing to maintain jsonschema
Cons: none

dougmet mentioned this issue Mar 28, 2019

Ignore SDMX-related fields open-sdg/open-sdg#133

Closed

brockfanning changed the title ~~Support data schema (column metadata) in indicator class~~ Support data schema (column metadata) Sep 11, 2020

brockfanning mentioned this issue Mar 15, 2021

Dataflows and metadataflows #214

Open
