tfidf matches #9

Open
mshodge opened this issue Nov 18, 2022 · 3 comments
Comments

mshodge commented Nov 18, 2022

Hi there -

Really great to see an R package for converting occupation descriptions to ISCO-08 codes.

Could you describe how the tfidf_tokens dataset was made, though? Does it just use the description field?

Also, I think the matching algorithm could be improved. At the moment it just takes the sum of the tfidf scores, which does not penalise a candidate when a query term is missing from its matched tokens.

For example,

'bus driver' (with num_leaves = 10) returns the following best match:

  • 8332 Heavy truck and lorry drivers

The weightTokens match for this is:

Screenshot 2022-11-18 at 13 48 53

8331 Bus and tram drivers is also in the occupations dataset, but its weightTokens are:

Screenshot 2022-11-18 at 13 48 40

Therefore 8332 Heavy truck and lorry drivers is not being penalised for not having 'bus' in it.

I will see whether the matcher can add a penalty when not all of the words are in the weightTokens.
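The missing-term problem described above can be illustrated with a small sketch. This is Python for illustration only, not the package's R code; the weights in `toy_weights` are hypothetical (the real values are in the screenshots and the package's tfidf table), and the scoring is simplified to a plain sum that ignores the nearest-neighbour step.

```python
# Hypothetical token weights per ISCO group, invented for illustration only.
toy_weights = {
    "8331": {"bus": 0.4, "tram": 0.5, "driver": 0.3},
    "8332": {"heavy": 0.4, "truck": 0.5, "lorry": 0.5, "driver": 0.9},
}


def sum_score(query_tokens, group_weights):
    """Sum the weights of query tokens found in the group.

    A token that is absent from the group simply contributes 0,
    so nothing is subtracted for the missing term.
    """
    return sum(group_weights.get(tok, 0.0) for tok in query_tokens)


query = ["bus", "driver"]
scores = {group: sum_score(query, w) for group, w in toy_weights.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up weights, 8332 wins on the strength of 'driver' alone, even though it never matches 'bus'.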

mshodge commented Nov 18, 2022

A quick experiment with including a metric for the number of term matches (e.g. bus + driver = 2, bus = 1, driver = 1) and then multiplying this with the isco_nn value seems to give a more sensible result for bus driver:

   id iscoGroup count isco_nn isco_nn_count
1:  1      8331     2       2             4
2:  1      8332     1       2             2
3:  1      5165     2       1             2
4:  1      4323     1       1             1
5:  1      8322     1       3             3
6:  1      5311     1       1             1

8331 is Bus and tram drivers.

Will continue to experiment
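The arithmetic behind the experiment can be checked with a short sketch, written in Python for illustration rather than in the package's R. The `count` and `isco_nn` values are copied from the table above; the dictionary layout and the `ranked` and `top` names are just assumptions made for this example.

```python
# count and isco_nn are taken from the table in this comment.
rows = [
    {"iscoGroup": "8331", "count": 2, "isco_nn": 2},
    {"iscoGroup": "8332", "count": 1, "isco_nn": 2},
    {"iscoGroup": "5165", "count": 2, "isco_nn": 1},
    {"iscoGroup": "4323", "count": 1, "isco_nn": 1},
    {"iscoGroup": "8322", "count": 1, "isco_nn": 3},
    {"iscoGroup": "5311", "count": 1, "isco_nn": 1},
]

# Multiply the number of matched query terms by the isco_nn value.
for row in rows:
    row["isco_nn_count"] = row["count"] * row["isco_nn"]

# Rank groups by the combined metric, highest first.
ranked = sorted(rows, key=lambda r: r["isco_nn_count"], reverse=True)
top = ranked[0]["iscoGroup"]
print(top)
```

Ranking by the product puts 8331 first, because it is the only group with both a full term match and an isco_nn of 2.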

mshodge commented Nov 18, 2022

On the README example it also seems to bring back a more sensible result for Junior Architect Engineer:

   id iscoGroup preferredLabel
1:  1       251 Software and applications developers and analysts
2:  2       216 Architects, planners, surveyors and designers
3:  3       523 Cashiers and ticket clerks

AleKoure commented Nov 23, 2022

Hello,

The script that reproduces the tfidf table is here.
It uses the preferred and alternative labels for all ISCO languages.

The description is omitted for two reasons:

  1. At the time of writing this package, the description was not provided for all languages.
  2. CRAN has a size limit.

This classifier may be improved by various text-mining techniques, which depend on the language and the type of application.
Adding an optional parameter that penalizes the results as you described could be useful.
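One possible shape for such an optional parameter is sketched below, in Python for illustration rather than in the package's R. The function name `penalised_score`, the `penalty` argument, the coverage formula, and the toy weights are all assumptions made for this example and are not part of the package.

```python
def penalised_score(query_tokens, group_weights, penalty=0.0):
    """Sum of matched token weights, scaled by query coverage.

    penalty = 0 reproduces the plain sum; larger values penalise
    groups that miss more of the query terms.
    """
    if not query_tokens:
        return 0.0
    matched = [tok for tok in query_tokens if tok in group_weights]
    base = sum(group_weights[tok] for tok in matched)
    coverage = len(matched) / len(query_tokens)
    return base * coverage ** penalty


# Hypothetical weights, invented for illustration only.
toy_weights = {
    "8331": {"bus": 0.4, "tram": 0.5, "driver": 0.3},
    "8332": {"heavy": 0.4, "truck": 0.5, "lorry": 0.5, "driver": 0.9},
}
query = ["bus", "driver"]


def best_group(penalty):
    """Return the group with the highest penalised score for the query."""
    return max(
        toy_weights,
        key=lambda g: penalised_score(query, toy_weights[g], penalty),
    )


print(best_group(0.0), best_group(1.0))
```

Defaulting `penalty` to 0 would keep the current behaviour unchanged, so the penalty stays opt-in.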

Could you provide an example script or make a PR so that we can discuss this further?
Also, a reference, if one exists, would be useful.
