The term Open Data is generally understood to mean data made available to the public free of charge, without registration or restrictive licences, for any purpose whatsoever (including commercial purposes), in electronic, machine-readable formats that ensure the data are easy to find, download and use.
Open data initiatives by public institutions, such as governments and intergovernmental organisations, recognise that such data are produced with public funds and so, with few exceptions, should be treated as public goods.
Data reuse, both by data experts and by the public at large, is key to creating new opportunities and benefits from government data. Reuse requires that data meet two basic criteria:
- Data must be legally open, meaning that it is placed in the public domain or under liberal terms of use with minimal restrictions. This ensures that government policies do not create barriers or ambiguities concerning how the data may be used.
- Data must be technically open, meaning that it is published in electronic formats that are machine-readable and non-proprietary. This ensures that ordinary citizens can access and use the data with little or no cost using common software tools.
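To make the idea of a machine-readable format concrete, the short sketch below reads a CSV file using nothing more than Python's standard library. The file name and column names (population.csv, region, year, population) are illustrative assumptions, not course materials.

```python
import csv

# A technically open file is plain text with a single header row, so any
# common tool - a spreadsheet, a database, or a few lines of code - can
# read it without licence fees or proprietary software.
with open("population.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each row arrives as a dictionary keyed by the header names,
        # e.g. {"region": "Abidjan", "year": "2020", "population": "6321017"}
        print(row["region"], row["year"], row["population"])
```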
The purpose of this syllabus in Data Wrangling and Validation for Open Data is to give learners the confidence to deliver technically open data: well-structured, machine-readable data validated against a defined, standard metadata schema.
NOTE: the Python-based version of the course is here.
Lesson 1: Data wrangling

Learning outcomes:
- Understand, and have practical experience with, the structure and design of machine-readable data files.
- Use Excel to investigate and manipulate source data to establish its metadata, shape and robustness, and employ these methods to develop a structural metadata schema.
- Learn and apply a basic set of methods to restructure messy source data into machine-readable CSV files using Microsoft Excel.
Project:
Each participant will be assigned a spreadsheet from the training data and expected to restructure it in Excel or an alternative spreadsheet editor.
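For those following the Python-based version of the course, the sketch below illustrates the same restructuring workflow using pandas. The file names and layout assumptions (three title rows above the real header, years spread across columns) are hypothetical; real training data will differ.

```python
import pandas as pd

# Read the source spreadsheet, skipping decorative title rows above the
# real header (assumed here to be the first three rows).
df = pd.read_excel("source.xlsx", skiprows=3)

# Drop entirely empty rows and columns left over from spreadsheet layout.
df = df.dropna(how="all").dropna(axis=1, how="all")

# Normalise header names: lower-case, no spaces, machine-readable.
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

# Unpivot wide year columns (e.g. "2018", "2019", "2020") into a tidy
# long format with one observation per row.
year_cols = [c for c in df.columns if c.isdigit()]
tidy = df.melt(
    id_vars=[c for c in df.columns if c not in year_cols],
    value_vars=year_cols,
    var_name="year",
    value_name="value",
)

# Save as a machine-readable CSV with a single header row.
tidy.to_csv("tidy.csv", index=False)
```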
Lesson 2: Data validation

Learning outcomes:
- Understand, and employ, standard definitions to write a JSON schema for data validation.
- Perform data validation using Microsoft Excel.
- Learn how to validate machine-readable data in online applications against a defined schema.
- Apply techniques in open data publication to prepare data, schemas and validation outputs for publication.
Project:
Using the machine-readable CSV file created in Lesson 1, develop a JSON schema and validate the data against that schema using CSV Lint.
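CSV Lint validates a CSV file against its schema in the browser; the sketch below performs an equivalent row-by-row check locally with the widely used jsonschema library. The schema and field names are illustrative assumptions - a schema for real data must describe its actual columns.

```python
import csv

from jsonschema import ValidationError, validate

# Illustrative schema for the tidy CSV produced in Lesson 1. csv.DictReader
# yields every value as a string, so types are enforced with patterns.
schema = {
    "type": "object",
    "properties": {
        "region": {"type": "string", "minLength": 1},
        "year": {"type": "string", "pattern": "^[0-9]{4}$"},
        "value": {"type": "string", "pattern": "^[0-9]+(\\.[0-9]+)?$"},
    },
    "required": ["region", "year", "value"],
}

with open("tidy.csv", newline="", encoding="utf-8") as f:
    # Data rows start at line 2; line 1 is the header.
    for line, row in enumerate(csv.DictReader(f), start=2):
        try:
            validate(instance=row, schema=schema)
        except ValidationError as err:
            print(f"Line {line}: {err.message}")
```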
My name is Gavin Chait, and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I was the commercial lead for open data projects at the Open Knowledge Foundation, where I led the open source CKAN development team and the implementation of numerous open data technical and research projects around the world. More recently, I developed openLocal.uk, a comprehensive business intelligence search engine for entrepreneurs, built on open data and Freedom of Information requests.
I have worked with SBC4D since 2016 on a range of projects spanning Ghana, Morocco, Tunisia, Ethiopia, Tanzania and Mauritius. This syllabus was developed for the Des Chiffres et des Jeunes project in Côte d'Ivoire.
I have extensive experience in leading research projects, implementing open source software initiatives, and developing and delivering seminars and workshops. I have taught for 25 years, from undergraduate and adult education to technical and analytical training at all levels.
Course content, materials and approach are copyright Gavin Chait, and released under both the Creative Commons Attribution-ShareAlike 4.0 International and the MIT licences.
The objective is to ensure reuse, and I recommend - but do not require - that any modifications or adaptations of the source material be released under an equivalent licence.