[NER] More fine-grained set definition regarding locations #59
Comments
As far as I can see, the software itself doesn't impose restrictions, so this is indeed a data question.
Another good (secondary) source for location data is OpenStreetMap; I experimented with that yesterday. It's fairly easy to extract all streets and cities/towns.
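For illustration, here is a minimal sketch of what such an extraction could look like. It is not the script that was actually used. It assumes the data is available in OSM XML format and relies on the standard OpenStreetMap `place`, `highway` and `name` tags; the sample data, the function name `extract_gazetteers` and the village name "Voorbeelddorp" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in OSM XML format (an assumption for this sketch;
# a real country extract is far larger and would need streaming parsing).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="52.37" lon="4.89">
    <tag k="place" v="city"/>
    <tag k="name" v="Amsterdam"/>
  </node>
  <node id="2" lat="51.9" lon="5.5">
    <tag k="place" v="village"/>
    <tag k="name" v="Voorbeelddorp"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Waal"/>
  </way>
  <way id="11">
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

# OSM "place" values treated as populated places in this sketch.
PLACE_VALUES = {"city", "town", "village"}


def extract_gazetteers(osm_xml):
    """Return (places, streets) as sorted lists of unique names."""
    root = ET.fromstring(osm_xml)
    places, streets = set(), set()
    for elem in root:
        if elem.tag not in ("node", "way"):
            continue
        # Collect the key/value tags attached to this element.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        name = tags.get("name")
        if not name:
            continue  # unnamed elements are useless for a gazetteer
        if elem.tag == "node" and tags.get("place") in PLACE_VALUES:
            places.add(name)
        elif elem.tag == "way" and "highway" in tags:
            streets.add(name)
    return sorted(places), sorted(streets)


places, streets = extract_gazetteers(SAMPLE_OSM)
```

Note that a street named "Waal" ends up in the street list, so lists extracted this way can overlap with other categories.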
…med some files to make clear it's about organisations rather than locations, added a provinces list
…reldsteden. Using the new tagset (LanguageMachines/frog#59)
This is also relevant for @Irishx (frog evaluation) and @HenkvdHeuvel (112 project), and perhaps @antalvdb: Okay, things are a bit more complex. We have some ordering problems. The current situation:
I have a test sentence: "De Maas en de Waal stromen niet door Amsterdam, maar monden wel uit in de Noordzee" (in English: "The Maas and the Waal do not flow through Amsterdam, but they do flow out into the North Sea"). This results in four If I use a Frog trained on the much more limited model from the 112 project, the gazetteers do kick in and I now get:
There's a street named Waal, which is not surprising as there are streets named after pretty much everything, so this should get a lower priority. There's also a village called "Noordzee" apparently which happens to take precedence over I'm trying to find the 'optimal' ordering for Opinions? Do we want to merge the new gazetteers into frogdata master despite the problems (the new lists are technically superior, i.e. more complete, with more fine-grained categories)? Or do we keep the old status quo for now?
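To make the ordering problem concrete, here is a small sketch of a priority-ordered gazetteer lookup. This is not how Frog implements it; the category labels, the priority order and the `lookup` function are all hypothetical and only illustrate why an ambiguous name such as "Waal" or "Noordzee" gets a different tag depending on which list is consulted first.

```python
# Hypothetical category labels; earlier entries in PRIORITY win on ambiguity.
PRIORITY = ["country", "city", "water", "village", "street"]

# Toy gazetteers built from the names mentioned in this thread.
GAZETTEERS = {
    "water": {"Maas", "Waal", "Noordzee"},
    "city": {"Amsterdam"},
    "village": {"Noordzee"},  # there is apparently a village with this name
    "street": {"Waal"},       # streets are named after pretty much everything
}


def lookup(token, gazetteers=GAZETTEERS, priority=PRIORITY):
    """Return the first category (in priority order) that lists the token."""
    for category in priority:
        if token in gazetteers.get(category, ()):
            return category
    return None  # token is in no gazetteer


sentence = ("De Maas en de Waal stromen niet door Amsterdam , "
            "maar monden wel uit in de Noordzee")
tagged = [(tok, lookup(tok)) for tok in sentence.split()]

# With streets consulted first, "Waal" is tagged as a street instead.
street_first = lookup("Waal", priority=["street", "water"])
```

A single global order like this is simple, but it cannot express that a name is a river in one context and a street in another; that would need the tagger's context features rather than the gazetteer order alone.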
I would like to test the NER with and without these gazetteers to see what the effect is.
Yes, that seems like a good idea to me. In any case, you can disable the gazetteers yourself in the Frog configuration file, and
Hi,
Yes, we can close this.
Regards,
Iris
Iris Hendrickx
On 7 Mar 2023, at 09:20, Ko van der Sloot wrote:
@proycon and @Irishx Can we close this as "solved" for now? Or?
Currently the NER module in Frog distinguishes persons, locations, events, products(?) and miscellaneous.
Since the module has been enhanced with gazetteers, I think we can do better than this coarse division. Various named entities are perfectly enumerable (countries, cities, street names, postal codes, rivers, forests, mountains...) and gazetteers serve well here; it would be a waste to lose this information by subsuming it all under "location". We already have a FoLiA set definition (https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.ttl) from a prior project that allows for a more fine-grained taxonomy regarding locations and is compatible with our current set (i.e. it is a superset).
Databases such as Geonames also contain this information, and we currently don't make use of it. I propose we migrate to a more fine-grained set (and include a few more gazetteers where possible). What do you think @kosloot @antalvdb @Irishx ?
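As a rough illustration of why a superset keeps things compatible: every fine-grained label can be collapsed back to the current coarse one, so existing consumers would not break. The label names below (`loc-street`, `loc-city`, ...) are hypothetical and are not taken from the FoLiA set definition; they only show the idea.

```python
# Hypothetical fine-grained labels: a coarse class plus an optional subtype.
FINE_LABELS = ["loc-country", "loc-city", "loc-street", "loc-river", "per", "misc"]


def to_coarse(label):
    """Collapse a fine-grained label to its coarse class.

    Coarse labels without a subtype pass through unchanged.
    """
    return label.split("-", 1)[0]


coarse = [to_coarse(label) for label in FINE_LABELS]
```

Tools that only know the coarse set can apply such a mapping and keep working, while the 112 project can use the subtypes directly.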
Context: this is relevant for our 112-project (@HenkvdHeuvel), where we need to know whether a location is a street, city, etc. I think we can include a lot of these gazetteer-based improvements in the Frog data itself, i.e. the generic Dutch model (as it's not sensitive data).
(Technicality: this is more of a frogdata issue than a Frog issue as such, but I guess it's more visible here.)