Add more rows to data set; discuss reconciliation, GREL, extensions/packages #29

ErinBecker · 2018-06-29T20:54:14Z

The Social Sciences CAC ([email protected]) met June 15th and 19th to discuss the full Social Sciences curriculum and provide recommendations to the Maintainers about work for these lessons between now and their publication (September 2018). Their specific action items for this lesson are as follows:

Getting more rows of the SAFI data to use in the lessons
Incorporating data reconciliation, packages, and more GREL into the OpenRefine lesson

Please see the meeting minutes for more details.

bencomp · 2022-10-21T09:49:56Z

I understand this is an issue from the previous Curriculum Advisory Committee, when other people maintained this lesson. Regardless, I would still like to thank that CAC for bringing up these action items and see how we can address them. I have referred to these suggestions on multiple occasions, so they haven't been ignored all these years.

TL;DR: regarding the action items, I would say:

yes, more rows of data (and more importantly, more errors)
not sure about reconciliation for this dataset
packages/extensions can go in 'Discussion'
yes, more GREL.

As an instructor of this lesson, I agree that 131 rows of data is not a lot. Instead of using facets or clustering to find outliers, you could fix incorrect values in these rows manually. What would be a good number? 1000, which according to the minutes is half the dataset?
More rows are said not to be critical, and I can see that. I have argued in #108 that the number of fixes the data need is too small; I think that matters more (although I am biased).

Reconciliation is very useful too, but we would need things to reconcile, preferably with example research questions that help learners understand why you would reconcile. Perhaps the easiest example would be to reconcile the names of the villages, districts, province and wards. There may be a possibility to use this to spot that in one row the village is said to be in the incorrect ward or district.
This dataset is probably not the best for demonstrating reconciliation, which is about connecting strings in a dataset to more widely used identifiers. Ethics come into play here, as I start to think about how reconciling items in this dataset could help deanonymise the respondents.

If I understand correctly, packages in the action items refers to extensions and distributions for OpenRefine. I think that could be a topic for the discussion page/section, because support for extensions across OR versions varies a lot and the distributions appear to be niche products. They can be powerful, but I feel they are less suited to people who first discover OR.
Because I have the RDF extension installed, my screen looks a bit different from the learners' screens, which I point out at the beginning of the lesson.

More GREL: yes! GREL is of course the way to transform the data. I wonder if GREL should be introduced with simpler examples than .replace on strings, like incrementing numbers (value + 1) or combining strings to create URLs ("https://example.org/" + value).
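To make the two suggested starter expressions concrete, here is a minimal sketch in Python of what each one computes for a single cell. The function names and sample values are hypothetical illustrations; in OpenRefine itself the GREL would simply be `value + 1` and `"https://example.org/" + value`.

```python
# Hypothetical helpers mirroring two beginner-level GREL expressions.
# In OpenRefine, `value` stands for the content of the current cell.


def increment(value: int) -> int:
    """Python equivalent of the GREL expression: value + 1"""
    return value + 1


def to_url(value: str, base: str = "https://example.org/") -> str:
    """Python equivalent of the GREL expression: "https://example.org/" + value"""
    return base + value


if __name__ == "__main__":
    print(increment(3))    # 4
    print(to_url("abc"))   # https://example.org/abc
```

Neither expression chains functions, which may make them a gentler first step than `.replace` on strings.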

Overall, I would like to check in with the current CAC for their views and suggestions. I suggested several potential improvements for this lesson in #102, #122, #108 that would require or benefit from CAC input. They influence how much time opens up for other learning objectives.
Suggestions from others (not in the CAC) are also very welcome, of course!

bencomp · 2022-11-30T09:02:53Z

I posted a link to this issue in the lesson's Slack channel, with questions that are discussed here. @ostephens responded as follows (copied with permission):

How many rows should the dataset have?

In the Library Carpentry OR lesson we have a dataset of 1001 rows which seems to work as a good size - it's big enough you can appreciate that a tool helps, but small enough that we don't have any performance issues.

What kind of GREL expressions should be added to the lesson?

Assuming the current dataset, the cells with lists in look good for some GREL examples. So for instance the "items_owned" column can be manipulated using GREL to give a count of the most common items that are owned (mobile phones and radios just ahead of ploughs).
The current format of those lists makes the GREL needed to get a clean list slightly complicated. Done correctly, I think a series of steps that goes through the process of 'cleaning' this column could provide a really good set of learning materials. One of the great things about OpenRefine is the ability to get real-time feedback on changes as you work with the data.
OTOH if a more accessible example is needed the data set could be updated to simplify those lists to be just semi-colon separated which would make the process much simpler.
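The clean-then-count idea described above can be sketched outside OpenRefine as follows. This is an illustration in Python, not the lesson's GREL. It assumes (the thread does not confirm this) that a raw cell looks like `['bicycle' ;  'radio']`, and the sample rows are made up.

```python
from collections import Counter
from typing import Iterable, List


def parse_items(cell: str) -> List[str]:
    """Turn one raw cell into a clean list of item names.

    Assumed raw format (hypothetical): "['bicycle' ;  'radio']"
    """
    inner = cell.strip().strip("[]")            # drop the surrounding brackets
    parts = (part.strip().strip("'\"") for part in inner.split(";"))
    return [part for part in parts if part]     # ignore empty fragments


def count_items(cells: Iterable[str]) -> Counter:
    """Count how often each item occurs across all cells."""
    counts: Counter = Counter()
    for cell in cells:
        counts.update(parse_items(cell))
    return counts


if __name__ == "__main__":
    sample = [  # made-up rows, not taken from the SAFI data
        "['bicycle' ;  'radio' ;  'mobile_phone']",
        "['radio' ;  'mobile_phone']",
        "['mobile_phone']",
    ]
    print(count_items(sample).most_common())
    # [('mobile_phone', 3), ('radio', 2), ('bicycle', 1)]
```

Each step in `parse_items` (strip brackets, split on the separator, trim quotes) corresponds to one small transform that learners could apply and inspect separately.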

Another GREL example that would work with the current dataset would be the formatting of the "interview_date" column which is currently in dd-MMM-yyyy (vs the start and end columns which use ISO-8601). So something like:
value.toDate("dd-MMM-yy").toString("yyyy-MM-dd")
could provide a good example.
It would also give an opportunity to talk more generally about date manipulation in OR (I would have guessed that date issues come up commonly in social science datasets, but I may be wrong as this is not my area).
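For comparison, the same reformatting can be sketched in Python. The comment above mentions both a four-digit (`dd-MMM-yyyy`) and a two-digit (`dd-MMM-yy`) year pattern, so this sketch tries both rather than guessing which one the column uses. The sample dates in the usage lines are hypothetical.

```python
from datetime import datetime

# Candidate input patterns: four-digit year first, then two-digit year.
# %b matches abbreviated month names ("Nov") in the default C/English locale.
INPUT_PATTERNS = ("%d-%b-%Y", "%d-%b-%y")


def to_iso_date(value: str) -> str:
    """Reformat a day-month-year string as ISO 8601 (yyyy-MM-dd).

    Raises ValueError if none of the candidate patterns match.
    """
    for pattern in INPUT_PATTERNS:
        try:
            parsed = datetime.strptime(value.strip(), pattern)
        except ValueError:
            continue  # try the next pattern
        return parsed.strftime("%Y-%m-%d")
    raise ValueError(f"unrecognised date: {value!r}")


if __name__ == "__main__":
    print(to_iso_date("17-Nov-2016"))  # 2016-11-17
    print(to_iso_date("17-Nov-16"))    # 2016-11-17
```

Parsing into a real date before formatting it again is the same two-step shape as the chained `toDate(...)` and `toString(...)` calls.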

Is reconciliation useful for this dataset?

The province data, and most of the district data will reconcile nicely against Wikidata which could make a good example and allow the user to bring in data from Wikidata (e.g. the coordinate location for the district - although the data set already has some GPS coordinates so this isn't a strong example here).
Unfortunately, at the moment the ward and village information doesn't have Wikidata entries that match, although of course someone could fix that 🙂

Should we discuss extensions and alternative distributions of OpenRefine?

Unless there is something really specific I'd say these are worth mentioning but not including in detail (this is the approach we take in Library Carpentry).

bencomp · 2023-01-25T17:40:28Z

@datacarpentry/curriculum-advisors-social-science Your input would be very welcome.

ndporter · 2023-01-26T20:30:49Z

One comment from teaching this recently with the list of items column - the lesson uses GREL to facet by subsets of the column but doesn't demonstrate how to change that column to something more usable (such as dummy variables for each category of item once they're cleaned). As a bonus, parsing it to columns also highlights for learners the difference between cell transforms, multi-valued cell splits, and column splits.
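To show what "dummy variables for each category of item" would mean as a target, here is a minimal sketch in Python rather than in OpenRefine. It assumes the list cells look like `['radio' ;  'bicycle']` (an assumption, not confirmed in this thread), and the function names and sample rows are hypothetical.

```python
from typing import Dict, List


def parse_items(cell: str) -> List[str]:
    """Split one raw list cell (assumed format "['a' ;  'b']") into item names."""
    inner = cell.strip().strip("[]")
    parts = (part.strip().strip("'\"") for part in inner.split(";"))
    return [part for part in parts if part]


def to_dummies(cells: List[str]) -> List[Dict[str, int]]:
    """Build one 0/1 indicator per item category for every row."""
    parsed = [parse_items(cell) for cell in cells]
    # The set of categories is only known after looking at all rows.
    categories = sorted({item for items in parsed for item in items})
    return [{cat: int(cat in items) for cat in categories} for items in parsed]


if __name__ == "__main__":
    rows = ["['radio' ;  'bicycle']", "['radio']"]  # made-up rows
    for row in to_dummies(rows):
        print(row)
    # {'bicycle': 1, 'radio': 1}
    # {'bicycle': 0, 'radio': 1}
```

The sketch makes one difficulty visible: the columns to create depend on every row, so the transformation cannot be expressed as a single per-cell transform.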

All of that said, adding more GREL is also tricky when learners don't have programming experience because chaining functions can rapidly become confusing to novice coders.

eirini-zormpa · 2023-02-07T12:09:24Z

Thank you for these points @bencomp and @ndporter ✨ The CAC haven't had a meeting in a while, but we'll make sure to discuss this issue next time we do!

bencomp · 2023-02-16T10:31:35Z

Thanks for your responses, @ndporter and @eirini-zormpa! I look forward to the results of your discussion.

As to your comment, @ndporter: the idea of using OpenRefine to create dummy variables from the items column had not yet crossed my mind. I like it. After trying and going through the manual and StackOverflow for a little bit, I think it is doable, but not in this workshop. It requires exporting the ID and items columns, doing the transformation in a new project and then importing the new columns (crossing them one by one, potentially) into the project. That is madness. Perhaps there are easier ways using column splitting, but I guess the current exercise of splitting to count is good enough. I'm open to other suggestions for introducing more GREL.

bencomp · 2023-09-20T15:56:01Z

As a Maintainer, I would like to be able to close this issue after five years. It has been open for so long because it is a collection of suggestions. Some suggestions can be worked on, but others are probably out of scope for the lesson.

To allow for more targeted discussion and decisions, as well as progress on incorporating them, I updated #108 to also track the expansion of the data set with more rows and I created separate issues for the other suggestions.

#175 (Introduce more GREL expressions) is about GREL expressions
#176 (Introduce reconciliation) is about reconciliation
#177 (Introduce OpenRefine extensions and alternative distributions) is about distributions/packages and extensions

I will copy relevant comments to these other topics, so we can continue the discussions and close this issue.

bencomp mentioned this issue Oct 26, 2021

Consistent use of 'facet' and 'filter'; script exercise missing; overlap in introduction and other resources #95

Closed

bencomp mentioned this issue Aug 18, 2022

Many columns in the lesson #35

Open

bencomp mentioned this issue Oct 4, 2022

Move Getting help section and Other Resources episode into missing Discussion section #122

Closed

bencomp added the labels help wanted (Looking for Contributors), status:refer to cac (Curriculum Advisory Committee input needed), and type:enhancement (Propose enhancement to the lesson) on Oct 21, 2022

bencomp mentioned this issue Jul 18, 2023

Dataset should be messier and larger #108

Open

8 tasks

bencomp changed the title ~~action items from Curriculum Advisors~~ Add more rows to data set; discuss reconciliation, GREL, extensions/packages Jul 19, 2023

This was referenced Sep 20, 2023

Introduce more GREL expressions #175

Open

Introduce reconciliation #176

Closed

Introduce OpenRefine extensions and alternative distributions #177

Open

bencomp closed this as completed Sep 20, 2023
