tfidf matches #9

mshodge · 2022-11-18T13:51:43Z

Hi there -

Really great to see an R package for converting occupation descriptions to ISCO-08 codes.

Could you describe how the tfidf_tokens dataset is built? Does it use only the description field?

Also, I think the matching algorithm can be improved. It currently just takes the sum of the tfidf scores, so it does not penalise a candidate when a query term is missing from its matched tokens.

For example,

'bus driver' (with num_leaves = 10) returns this as the best match:

8332 Heavy truck and lorry drivers

The weightTokens match for this is:

8331 Bus and tram drivers is also in the occupations dataset, but its weightTokens are:

Therefore 8332 Heavy truck and lorry drivers is not being penalised for not having 'bus' in it.

I will see whether the matcher can add a penalty when not all of the query words are in the weightTokens.
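To make the concern concrete, here is a minimal sketch in Python (used only for illustration; it is not the package's R code). The token weights are invented for this example and are not taken from tfidf_tokens, and the names `sum_score`, `coverage_score` and `best_match` are hypothetical. It shows how a plain sum can prefer a candidate that matches only one query term, and how scaling by the share of query terms matched changes that.

```python
# Hypothetical token weights per ISCO group. These numbers are made up
# for illustration and are NOT the package's tfidf_tokens values.
WEIGHTS = {
    "8331": {"bus": 0.9, "tram": 1.1, "driver": 0.4},
    "8332": {"heavy": 0.8, "truck": 1.0, "lorry": 1.2, "driver": 1.5},
}


def sum_score(query_tokens, weights):
    """Plain sum of the weights of matched tokens; missing tokens add 0."""
    return sum(weights.get(t, 0.0) for t in query_tokens)


def coverage_score(query_tokens, weights):
    """Sum of matched weights scaled by the fraction of query tokens matched."""
    matched = [t for t in query_tokens if t in weights]
    coverage = len(matched) / len(query_tokens)
    return coverage * sum(weights[t] for t in matched)


def best_match(query_tokens, scorer):
    """Return the group id with the highest score under the given scorer."""
    return max(WEIGHTS, key=lambda g: scorer(query_tokens, WEIGHTS[g]))


query = ["bus", "driver"]
print(best_match(query, sum_score))       # one strong 'driver' weight wins
print(best_match(query, coverage_score))  # matching both terms is rewarded
```

With these made-up weights the plain sum picks 8332 and the coverage-scaled score picks 8331. Whether the same happens with the real weights depends on the actual tfidf values.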

mshodge · 2022-11-18T16:03:39Z

A quick experiment with including a metric for the number of term matches (e.g. bus + driver = 2, bus = 1, driver = 1) and then multiplying this with the isco_nn value seems to give a more sensible result for bus driver:

   id iscoGroup count isco_nn isco_nn_count
1:  1      8331     2       2             4
2:  1      8332     1       2             2
3:  1      5165     2       1             2
4:  1      4323     1       1             1
5:  1      8322     1       3             3
6:  1      5311     1       1             1

8331 is Bus and tram drivers.
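The experiment can be sketched as follows, again in Python purely for illustration. The rows are the values posted above; the function name `rescore` is hypothetical, and `count` is assumed to be the number of query terms matched, as described in this comment.

```python
# Rows copied from the table above: ISCO group, number of matched query
# terms (count), and the isco_nn value as posted.
ROWS = [
    {"iscoGroup": "8331", "count": 2, "isco_nn": 2},
    {"iscoGroup": "8332", "count": 1, "isco_nn": 2},
    {"iscoGroup": "5165", "count": 2, "isco_nn": 1},
    {"iscoGroup": "4323", "count": 1, "isco_nn": 1},
    {"iscoGroup": "8322", "count": 1, "isco_nn": 3},
    {"iscoGroup": "5311", "count": 1, "isco_nn": 1},
]


def rescore(rows):
    """Add isco_nn_count = count * isco_nn and sort by it, highest first."""
    scored = [dict(r, isco_nn_count=r["count"] * r["isco_nn"]) for r in rows]
    return sorted(scored, key=lambda r: r["isco_nn_count"], reverse=True)


for row in rescore(ROWS):
    print(row["iscoGroup"], row["isco_nn_count"])
```

Multiplying by the match count is only one possible penalty; a ratio of matched to total query terms would behave similarly while staying bounded by the original score.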

Will continue to experiment.

mshodge · 2022-11-18T16:05:48Z

On the README example it also seems to bring back a more sensible result for Junior Architect Engineer:

   id iscoGroup preferredLabel
1:  1       251 Software and applications developers and analysts
2:  2       216 Architects, planners, surveyors and designers
3:  3       523 Cashiers and ticket clerks

AleKoure · 2022-11-23T21:25:13Z

Hello,

The script that reproduces the tfidf table is here.
It uses preferred label and alternative label for all ISCO languages.

The description is omitted for two reasons:

- At the time of writing this package, the description was not provided for all languages.
- CRAN has a size limit.
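For readers unfamiliar with the idea, a token-weight table built from labels alone can be sketched roughly like this. This is a generic tf-idf illustration in Python, not the package's script: the tokenizer, the weighting formula and the function name `tfidf` are assumptions, and only preferred labels quoted in this thread are used (alternative labels would simply be appended to each group's list).

```python
import math
from collections import Counter

# Preferred labels quoted in this thread, keyed by ISCO group.
LABELS = {
    "8331": ["Bus and tram drivers"],
    "8332": ["Heavy truck and lorry drivers"],
    "523": ["Cashiers and ticket clerks"],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf(labels):
    """Treat each group's labels as one document and weight its tokens."""
    docs = {
        group: Counter(tok for label in texts for tok in tokenize(label))
        for group, texts in labels.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many groups each token appears.
    df = Counter(tok for counts in docs.values() for tok in counts)
    weights = {}
    for group, counts in docs.items():
        total = sum(counts.values())
        weights[group] = {
            tok: (c / total) * math.log(n_docs / df[tok])
            for tok, c in counts.items()
        }
    return weights


for token, weight in sorted(tfidf(LABELS)["8331"].items()):
    print(f"{token}: {weight:.3f}")
```

In this toy corpus "and" occurs in every group and so gets weight 0, while "bus" outweighs the more common "drivers".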

This classifier could be improved with various text mining techniques, which depend on the language and the type of application.
Adding an optional parameter that penalizes the results as you described could be useful.

Could you provide an example script or open a PR so that we can discuss this further?
Also, a reference, if one exists, would be useful.
