[NER] More fine-grained set definition regarding locations #59

Open
proycon opened this issue Aug 21, 2018 · 7 comments

@proycon
Member

proycon commented Aug 21, 2018

Currently the NER module in Frog distinguishes persons, locations, events, products(?) and miscellaneous.

Since the module has been enhanced with gazetteers, I think we can do better than this coarse division. Various named entities are perfectly enumerable: countries, cities, street names, postal codes, rivers, forests, mountains, and so on. Gazetteers serve well here, and it would be a waste to lose this information by subsuming it all under "location". We already have a FoLiA set definition (https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.ttl) from a prior project that allows for a more fine-grained taxonomy regarding locations, and which is compatible with (i.e. a superset of) our current set.
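
To illustrate what "compatible (superset)" could mean in practice, here is a minimal sketch. It assumes fine-grained classes are written as dotted labels such as loc.city or loc.water.sea (the notation used further down in this thread); the helper names are hypothetical and not part of Frog or FoLiA.

```python
def coarse_label(label: str) -> str:
    """Collapse a dotted fine-grained label to its top-level class.

    Assumption: fine-grained classes extend the coarse ones with
    dot-separated suffixes, e.g. 'loc.water.sea' refines 'loc'.
    """
    return label.split(".", 1)[0]


def is_refinement(fine: str, coarse: str) -> bool:
    """True if `fine` equals `coarse` or is a more specific subclass of it."""
    return fine == coarse or fine.startswith(coarse + ".")


if __name__ == "__main__":
    for label in ["loc.city", "loc.water.sea", "per"]:
        print(label, "->", coarse_label(label))
```

With a mapping like this, output in the fine-grained set can still be compared against data annotated with the current coarse set.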

Databases such as Geonames also contain this information, and we currently don't make use of it. I propose we migrate to a more fine-grained set (and include a few more gazetteers where possible). What do you think @kosloot @antalvdb @Irishx ?

Context: this is relevant for our 112 project (@HenkvdHeuvel), where we need to know whether a location is a street, city, etc. I think we can include a lot of these gazetteer-based improvements in the Frog data itself, i.e. the generic Dutch model (as it's not sensitive data).

(technicality: this is more of a frogdata issue than a Frog issue as such, but I guess it's more visible here)

@kosloot
Collaborator

kosloot commented Aug 22, 2018

As far as I can see, the software itself doesn't impose restrictions. So this is indeed a data question.
I did use a small part of Geonames to test, and it is usable. But there are a lot of ugly details to consider.
The data can be polluted and (very) ambiguous.
So using this data might need some investigation, and probably preprocessing.

@proycon
Member Author

proycon commented Aug 22, 2018

Another good (secondary) source for location data is OpenStreetMap; I experimented with that yesterday. It's fairly easy to extract all streets and cities/towns.
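
As a rough illustration of such an extraction (not necessarily how it was done here), the sketch below parses OSM XML with the Python standard library. It assumes the usual OpenStreetMap tagging conventions: place=city or place=town on nodes, and a highway=* tag plus a name on ways. The sample document and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written OSM XML sample (illustrative only, not real export data).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="52.37" lon="4.89">
    <tag k="place" v="city"/>
    <tag k="name" v="Amsterdam"/>
  </node>
  <way id="2">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Waal"/>
  </way>
  <node id="3" lat="52.0" lon="5.0">
    <tag k="amenity" v="bench"/>
  </node>
</osm>"""


def extract_names(osm_xml: str):
    """Return (cities_and_towns, streets) as sorted lists of names."""
    root = ET.fromstring(osm_xml)
    cities, streets = set(), set()
    for element in root:
        # Collect the key/value tags attached to this node or way.
        tags = {t.get("k"): t.get("v") for t in element.findall("tag")}
        name = tags.get("name")
        if name is None:
            continue
        if element.tag == "node" and tags.get("place") in {"city", "town"}:
            cities.add(name)
        elif element.tag == "way" and "highway" in tags:
            streets.add(name)
    return sorted(cities), sorted(streets)


if __name__ == "__main__":
    print(extract_names(SAMPLE_OSM))
```

For a full country extract, a streaming parser would be needed instead of loading the whole file into memory.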

proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 22, 2018
…med some files to make clear it's about organisations rather than locations, added a provinces list
proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 23, 2018
proycon added a commit to LanguageMachines/frogdata that referenced this issue Aug 30, 2018
@proycon
Member Author

proycon commented Nov 15, 2018

This is also relevant for @Irishx (frog evaluation) and @HenkvdHeuvel (112 project), and perhaps @antalvdb:

Okay, things are a bit more complex. We have some ordering problems. The current situation:

[Kobus] if it's ambiguous, then the last gazetteer wins, I think
I think it goes like this:
IF no timbl tag has been assigned
and gazetteer info IS known
then take that

[Kobus] everything gets stuffed into one big hash
the last one counts
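
The behaviour Kobus describes above can be sketched roughly as follows. This is an illustration of the described logic, not Frog's actual code; the function names and the use of None for "no tag assigned" are assumptions.

```python
def load_gazetteers(gazetteers):
    """Build one lookup table from (label, names) pairs.

    Assumption: the pairs arrive in the order the gazetteers are listed.
    All entries go into a single dict, so a later gazetteer silently
    overwrites an earlier one for the same name ("the last one counts").
    """
    lookup = {}
    for label, names in gazetteers:
        for name in names:
            lookup[name] = label
    return lookup


def resolve(entity, context_tag, lookup):
    """Use the gazetteer only when the context-based tagger assigned nothing."""
    if context_tag is not None:
        return context_tag
    return lookup.get(entity)


if __name__ == "__main__":
    lookup = load_gazetteers([
        ("loc.water.sea", ["Noordzee"]),
        ("loc.city", ["Amsterdam", "Noordzee"]),  # loaded last, so it wins
    ])
    print(resolve("Noordzee", None, lookup))
    print(resolve("Noordzee", "loc", lookup))
```

Under these assumptions, a name that occurs in several gazetteers gets whichever label was loaded last, and any tag from the context-based tagger blocks the gazetteer entirely.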

I have a test sentence:

De Maas en de Waal stromen niet door Amsterdam, maar monden wel uit in de Noordzee

This results in four loc detections (from the context-based module), which is correct but doesn't make use of the gazetteers so we don't get any of the fine-grained categories, which was kind of the whole point of this exercise.

If I use a Frog trained on the much more limited model from the 112 project, the gazetteers do kick in and I now get:

  • De Maas: per (error from the context-module, I presume)
  • Waal: loc.street
  • Amsterdam: loc.city
  • de Noordzee: loc.city

There's a street named Waal, which is not surprising as there are streets named after pretty much everything, so streets should get a lower priority. Apparently there's also a village called "Noordzee", which happens to take precedence over loc.water.sea.
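
One way to think about the ordering problem is to keep every candidate label per name and pick one by an explicit ranking, instead of depending on load order. The sketch below is only an illustration of that idea: the ranking and the loc.water.river label are hypothetical, and this is not how Frog currently resolves conflicts.

```python
from collections import defaultdict

# Hypothetical ranking, most trusted first; street names come last because
# streets are named after pretty much everything.
PRIORITY = ["loc.water.sea", "loc.water.river", "loc.city", "loc.street"]


def build_candidates(gazetteers):
    """Map each name to the set of all labels it occurs under."""
    candidates = defaultdict(set)
    for label, names in gazetteers:
        for name in names:
            candidates[name].add(label)
    return candidates


def best_label(name, candidates, priority=PRIORITY):
    """Pick the highest-ranked label for a name, or None if it is unknown."""
    labels = candidates.get(name)
    if not labels:
        return None
    rank = {label: i for i, label in enumerate(priority)}
    # Unranked labels sort after all ranked ones; ties break alphabetically.
    return min(labels, key=lambda l: (rank.get(l, len(priority)), l))


if __name__ == "__main__":
    cands = build_candidates([
        ("loc.street", ["Waal"]),
        ("loc.water.river", ["Waal", "Maas"]),
        ("loc.water.sea", ["Noordzee"]),
        ("loc.city", ["Amsterdam", "Noordzee"]),
    ])
    for name in ["Waal", "Noordzee", "Amsterdam"]:
        print(name, "->", best_label(name, cands))
```

Any fixed ranking will still be wrong for some sentences, since the ambiguity is real; it only makes the choice explicit and independent of file order.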

I'm trying to find the 'optimal' ordering for ners.known, which is tricky enough as there is always ambiguity and you can never get it really right. The bigger problem is that I can't override the context-based NER module here. It would help (feature request) if we had a parameter to give the context-based module the highest priority, give it the lowest priority, or disable it completely (the last option might be interesting if you want to rely on gazetteers only, for instance for speed, which may be an important factor in the 112 project).
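
To make the feature request concrete, this is roughly the behaviour I have in mind, as a sketch. The parameter and its three values are hypothetical; nothing like this exists in Frog at the moment.

```python
from enum import Enum


class ContextPriority(Enum):
    """Hypothetical setting for how the context-based module is weighed."""
    HIGHEST = "highest"    # context tag wins; gazetteer only fills gaps
    LOWEST = "lowest"      # gazetteer wins; context tag only fills gaps
    DISABLED = "disabled"  # context-based module is ignored entirely


def combine(context_tag, gazetteer_tag, mode):
    """Merge the two tag sources for one entity; None means 'no tag'."""
    if mode is ContextPriority.DISABLED:
        return gazetteer_tag
    if mode is ContextPriority.HIGHEST:
        return context_tag if context_tag is not None else gazetteer_tag
    return gazetteer_tag if gazetteer_tag is not None else context_tag


if __name__ == "__main__":
    for mode in ContextPriority:
        print(mode.value, "->", combine("per", "loc.water.river", mode))
```

HIGHEST corresponds to the current situation as described above; LOWEST would let the fine-grained gazetteer categories come through whenever a name is listed.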

Opinions?

Do we want to merge the new gazetteers into frogdata master despite the problems (the new lists are technically superior, i.e. more complete, with more fine-grained categories)? Or do we keep the status quo for now?

@Irishx
Contributor

Irishx commented Nov 15, 2018

I would like to test the NER with and without these gazetteers to see what the effect is.

@proycon
Member Author

proycon commented Nov 15, 2018

That seems like a good idea, yes. In any case, you can disable the gazetteers yourself by editing the Frog configuration file and ners.known.

@kosloot
Collaborator

kosloot commented Mar 7, 2023

@proycon and @Irishx Can we close this as "solved" for now? Or?

@Irishx
Contributor

Irishx commented Mar 7, 2023 via email
