Using Wikidata as a lexicon #9

nciric · 2024-03-12T17:09:52Z

In our first meeting we discussed various lexicon formats and the use cases for them. Wikidata already has a flexible format and a passionate community contributing to it. We could bootstrap our effort by contributing to it, instead of starting a new lexicon under Unicode.

Before we settle on Wikidata we need to answer a couple of questions:

Licensing? Is it compatible with our needs (slicing, using in products, converting to more compact format, adding custom/proprietary words).
Filtering spam/abuse - data quality in general
What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)

grhoten · 2024-03-13T17:49:19Z

As a part of discussing this point, I'd like to hear how the data is structured. If it's a collection of unannotated words without relationships, it's not that helpful. If it has annotations for a given word and all of the grammeme properties for the other surface forms of a given word, that would be helpful.

For example, take the Finnish word numeraali on Wiktionary (https://en.wiktionary.org/wiki/numeraali). It has a nicely formatted declension table that is easy to read. The template for the declension table is simply "{{fi-decl-risti|numeraal|||a}}". That makes it easy to format a table, but the data is hard to parse without infrastructure to execute the code behind the template. The template format makes it hard to generate the other surface forms and to deduce the grammatical properties of each form. Some of the cell entries don't even have page entries, so you have to go by what is in the table.

It's also worth pointing out that Wiktionary tends to put in optional stress markers in the declension tables for several languages, like Russian and Lithuanian. When you go to the actual Wiktionary page for a word, the stress markers are missing. These optional stress markers are helpful for pronunciation, but they're rarely written outside of an elementary school setting.

Clarity around the word relationships and properties in the data would be helpful to understand.

macchiati · 2024-03-13T17:54:30Z

It would be Wikidata, not Wiktionary (which doesn't have the right license). So check out https://www.wikidata.org/wiki/Q63116. For that particular term, they don't seem to have declensions.


grhoten · 2024-03-13T18:47:03Z

Yes, I agree that the license for Wiktionary is not ideal. It is still helpful to reference it to illustrate the problems at hand, and both projects are part of Wikimedia.

Wikidata does seem helpful for finding translations and synonyms of terms. I'm less clear on whether declensions exist at all in Wikidata. If they do exist, I'd like to see an example, and hopefully they're structured in a more parseable way than in Wiktionary.

vrandezo · 2024-03-14T03:33:54Z

Besides the item for numeral (Q63116) as mentioned by @macchiati, there are also 31 lexemes that have this item as a sense: query results for the Lexemes (https://w.wiki/9TUu). We don't have one in Finnish for numeraali, unfortunately, but we have an entry for the Estonian numeraal, L375630 (note, Lexeme identifiers start with L, and item identifiers with Q).

Roughly, items are the ontological things, and lexemes are the words. Each Lexeme is in a specific language, whereas the items are supposed to be language independent. Each lexeme can have 0 or more senses, and the sense can refer to an item. This way we can have a SPARQL query that asks for all lemmas on the lexemes that have a sense pointing to a given item, such as the item for numeral.
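The query described above can be sketched in Python as follows. This is a minimal sketch, not the query behind the linked results: it assumes the Wikidata property P5137 ("item for this sense") links a sense to its item, that the `ontolex:`, `wikibase:`, `wd:` and `wdt:` prefixes are predefined on the public endpoint at https://query.wikidata.org/sparql, and the function names and User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lemma_query(item_id: str) -> str:
    """Build a SPARQL query for lemmas of lexemes with a sense pointing at item_id."""
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not an item identifier: {item_id!r}")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        "  ?lexeme ontolex:sense ?sense ;\n"
        "          wikibase:lemma ?lemma .\n"
        # Assumption: P5137 is the "item for this sense" property.
        f"  ?sense wdt:P5137 wd:{item_id} .\n"
        "}"
    )


def parse_bindings(response: dict) -> list[tuple[str, str, str]]:
    """Turn a SPARQL JSON result into (lexeme id, lemma, language code) tuples."""
    rows = []
    for binding in response["results"]["bindings"]:
        # Entity URIs end in the identifier, e.g. .../entity/L375630.
        lexeme_id = binding["lexeme"]["value"].rsplit("/", 1)[-1]
        lemma = binding["lemma"]["value"]
        language = binding["lemma"].get("xml:lang", "")
        rows.append((lexeme_id, lemma, language))
    return rows


def fetch_lemmas(item_id: str) -> list[tuple[str, str, str]]:
    """Run the query against the live endpoint (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_lemma_query(item_id), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "lexicon-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_lemmas("Q63116")` would then list the lexemes whose senses point to the item for numeral, one tuple per lemma.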

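As you can see on the page for numeraal, L375630 (https://www.wikidata.org/wiki/Lexeme:L375630), this is all structured data. All the data can also be downloaded as JSON (https://www.wikidata.org/wiki/Special:EntityData/L375630.json) or as RDF (https://www.wikidata.org/wiki/Special:EntityData/L375630.rdf). A SPARQL endpoint (https://query.wikidata.org) allows querying the data.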
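To make the structure concrete, here is a rough Python sketch of reading surface forms and their grammatical features out of a lexeme's JSON. It assumes the usual Wikibase lexeme layout, where each entry in `forms` has `representations` keyed by language code and a `grammaticalFeatures` list of item identifiers. The sample record is invented for illustration (its form text and Q identifiers are placeholders, not the real contents of L375630), and `extract_forms` is a hypothetical helper name.

```python
# Invented sample in the assumed lexeme JSON layout; the form text and the
# Q identifiers are placeholders, not real data from Wikidata.
SAMPLE_LEXEME = {
    "type": "lexeme",
    "id": "L0",
    "lemmas": {"et": {"language": "et", "value": "numeraal"}},
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"et": {"language": "et", "value": "numeraal"}},
            "grammaticalFeatures": ["Q111", "Q222"],
        },
        {
            "id": "L0-F2",
            "representations": {"et": {"language": "et", "value": "example-form"}},
            "grammaticalFeatures": ["Q333", "Q222"],
        },
    ],
    "senses": [],
}


def extract_forms(entity: dict) -> list[dict]:
    """Flatten a lexeme into one record per (form, language) representation."""
    records = []
    for form in entity.get("forms", []):
        for language, representation in form.get("representations", {}).items():
            records.append(
                {
                    "form_id": form["id"],
                    "language": language,
                    "text": representation["value"],
                    # Item identifiers for grammemes such as case or number.
                    "features": list(form.get("grammaticalFeatures", [])),
                }
            )
    return records


if __name__ == "__main__":
    for record in extract_forms(SAMPLE_LEXEME):
        print(record["form_id"], record["text"], record["features"])
```

For a real lexeme, the entity would be taken from the `entities` object in the downloaded JSON, keyed by its L identifier. The grammatical features are items themselves, so resolving them to labels needs a second lookup.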

Regarding the questions in the OP:

Licensing? Is it compatible with our needs (slicing, using in products, converting to more compact format, adding custom/proprietary words).

All data in Wikidata is available under CC-0.

Filtering spam/abuse - data quality in general

Wikidata has a healthy community and has so far seen 500,000+ contributors. It is the most edited wiki in the world.

What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)

The data can be downloaded in bulk, queried in a structured way using SPARQL, or fetched per individual Lexeme or at an even finer granularity. Editing is possible on-wiki with the community, or the data can be enriched locally.

Happy to answer any more questions!

macchiati · 2024-03-14T03:37:31Z

Thanks Denny!


grhoten · 2024-03-14T05:16:13Z

we have an entry for the Estonian numeraal, L375630 (note, Lexeme identifiers start with L, and item identifiers with Q).

Ooh! That seems interesting. We might be able to use that. It's really good to know how lexemes are referenced.

The data can be downloaded in bulk

Is this one of those locations? https://dumps.wikimedia.org/wikidatawiki/

FYI, a recent bz2 version of it is 143 GB, but I suspect that we just want the lexeme data and can filter out everything else.

vrandezo · 2024-03-15T15:50:48Z

If you go here

https://dumps.wikimedia.org/wikidatawiki/entities/

you can find the dump of only the Lexemes (the files named latest-lexemes..). That is, depending on the format, between 0.3 and 1.1 GB zipped. The items that the lexemes refer to would not be included, though.
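As a rough illustration of slicing such a dump, the Python sketch below streams entities one at a time and keeps only lexemes of one language. It assumes the JSON dump layout of one large array with a single entity per line, each line ending in a comma, and that a lexeme's `language` field holds the item identifier of its language. The helper names and the Q identifiers in the usage note are hypothetical.

```python
import bz2
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity per line of a Wikidata-style JSON dump."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma.
        line = line.strip().rstrip(",")
        # Skip blank lines and the brackets that open and close the array.
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)


def lexemes_in_language(lines: Iterable[str], language_item: str) -> Iterator[dict]:
    """Keep only lexemes whose language is the given item identifier."""
    for entity in iter_entities(lines):
        if entity.get("type") == "lexeme" and entity.get("language") == language_item:
            yield entity


def read_dump(path: str, language_item: str) -> Iterator[dict]:
    """Stream a bz2-compressed dump from disk without loading it whole."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        yield from lexemes_in_language(dump, language_item)
```

A call such as `read_dump(path_to_dump, "Q111")` would then yield the matching lexemes lazily, where `"Q111"` stands in for the item identifier of the wanted language. Because the function is a generator, memory use stays flat regardless of dump size.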

grhoten · 2024-12-10T18:41:51Z

I'd like to nominate @vrandezo to work on this issue. 😀

nciric added the discuss Discussion item label Mar 12, 2024

nciric added this to Inflection scope dashboard Mar 15, 2024

nciric moved this to In Progress in Inflection scope dashboard Mar 15, 2024
