tfidf matches #9
Comments
A quick experiment: adding a metric for the number of term matches (e.g. bus + driver = 2, bus = 1, driver = 1) and multiplying it by the isco_nn value seems to give a more sensible result for bus driver (output columns: id, iscoGroup, count, isco_nn, isco_nn_count). 8331 is Bus and tram drivers. Will continue to experiment.
On the README example it also seems to bring back a more sensible result for Junior Architect Engineer (output columns: id, iscoGroup, preferredLabel).
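The term-match weighting described above can be sketched roughly as follows. This is a standalone Python illustration, not the package's R code: the `WEIGHTS` table, its numbers, and the function name `rank_groups` are all hypothetical, and the summed weight only stands in for the isco_nn value.

```python
# Hypothetical per-group token weights; the numbers are invented for illustration
# and are not taken from the package's tfidf_tokens data.
WEIGHTS = {
    "8331": {"bus": 0.4, "tram": 0.4, "driver": 0.3},
    "8332": {"truck": 0.5, "lorry": 0.5, "driver": 0.9},
}


def rank_groups(query, weights=WEIGHTS):
    """Rank groups by (sum of matched token weights) * (number of matched terms)."""
    terms = set(query.lower().split())
    rows = []
    for group, token_weights in weights.items():
        matched = terms.intersection(token_weights)
        if not matched:
            continue  # the group shares no term with the query
        score_sum = sum(token_weights[t] for t in matched)
        rows.append({
            "id": group,
            "count": len(matched),                   # number of query terms matched
            "sum": score_sum,                        # plain summed weight
            "sum_count": score_sum * len(matched),   # weight multiplied by match count
        })
    return sorted(rows, key=lambda r: r["sum_count"], reverse=True)


for row in rank_groups("bus driver"):
    print(row)
```

With these invented weights the plain sum would favour 8332 (it matches only 'driver', but with a large weight), while the count-weighted score puts 8331 first because it matches both terms.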
Hello, The script that reproduces the tfidf table is here. The description is omitted for two reasons:
This classifier may be improved by various text mining techniques, which depend on the language and the type of application. Could you provide an example script or make a PR so that we can discuss this further?
Hi there -
Really great to see an R package for converting occupation descriptions to ISCO-08 codes.
Could you describe the process of making the tfidf_tokens dataset, though? Does it just use the description field?
Also, I think the matching algorithm can be improved. At the moment it just takes the sum of the tfidf scores, so a candidate is not penalised when one of the search terms is missing from its matched tokens.
For example,
'bus driver' returns (num_leaves = 10) best match as:
The weightTokens match for this is:
Whereas 8331 Bus and tram drivers is in the occupations dataset. But the weightTokens are:
Therefore 8332 Heavy truck and lorry drivers is not being penalised for not having 'bus' in it.
I will see whether the matcher can apply a penalty when not all of the search words are in the weightTokens.
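One possible shape for such a penalty, as a rough sketch only: scale the summed weight by the share of search terms that were actually matched. This is standalone Python rather than the package's R code, and the function name `penalised_score` and the token weights below are hypothetical, invented so that the plain sum favours the wrong group.

```python
def penalised_score(query_terms, token_weights):
    """Sum of matched weights, scaled by the share of query terms that matched."""
    matched = [t for t in query_terms if t in token_weights]
    if not matched:
        return 0.0
    coverage = len(matched) / len(query_terms)  # 1.0 only when every term matched
    return sum(token_weights[t] for t in matched) * coverage


# Invented weights for illustration; not the package's actual weightTokens.
bus_and_tram = {"bus": 0.4, "tram": 0.4, "driver": 0.3}       # 8331
truck_and_lorry = {"truck": 0.5, "lorry": 0.5, "driver": 0.9}  # 8332
query = ["bus", "driver"]

print("8331:", penalised_score(query, bus_and_tram))
print("8332:", penalised_score(query, truck_and_lorry))
```

Here 8332 matches only one of the two search terms, so its summed weight is halved and 8331 comes out ahead. Whether a linear coverage factor is the right penalty is an open question; it is just the simplest option to try.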