Using Wikidata as a lexicon #9

Open
nciric opened this issue Mar 12, 2024 · 8 comments
Labels
discuss Discussion item

Comments

@nciric
Contributor

nciric commented Mar 12, 2024

In our first meeting we discussed various lexicon formats and their use cases. Wikidata already has a flexible format and a passionate community contributing to it. We could bootstrap our effort by contributing to Wikidata instead of starting a new lexicon under Unicode.

Before we settle on Wikidata we need to answer a couple of questions:

  1. Licensing? Is it compatible with our needs (slicing, using in products, converting to more compact format, adding custom/proprietary words).
  2. Filtering spam/abuse - data quality in general
  3. What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)
@nciric nciric added the discuss Discussion item label Mar 12, 2024
@grhoten
Member

grhoten commented Mar 13, 2024

As a part of discussing this point, I'd like to hear how the data is structured. If it's a collection of unannotated words without relationships, it's not that helpful. If it has annotations for a given word and all of the grammeme properties for the other surface forms of a given word, that would be helpful.

For example, take the Wiktionary entry for the Finnish word numeraali. It has a nicely formatted declension table that is easy to read. The template for the declension table is simply "{{fi-decl-risti|numeraal|||a}}". That makes it easy to format a table, but the data is hard to parse without the infrastructure to execute the code behind the template. As a result, it is hard to generate the other surface forms and to deduce the grammatical properties of each form. Some of the cell entries don't even have page entries, so you have to go by what is in the table.

It's also worth pointing out that Wiktionary tends to put in optional stress markers in the declension tables for several languages, like Russian and Lithuanian. When you go to the actual Wiktionary page for a word, the stress markers are missing. These optional stress markers are helpful for pronunciation, but they're rarely written outside of an elementary school setting.

It would be helpful to have clarity on the word relationships and properties in the data.

@macchiati
Member

macchiati commented Mar 13, 2024 via email

@grhoten
Member

grhoten commented Mar 13, 2024

Yes, I agree that the license for Wiktionary is not ideal. It is helpful to reference for illustrative purposes for problems at hand, and they're both a part of Wikimedia.

Wikidata does seem helpful for finding translations and synonyms of terms. I'm less clear on whether declensions exist at all in Wikidata. If they do exist, I'd like to see an example, and hopefully it's structured in a more parseable way than Wiktionary.

@vrandezo

Besides the item for numeral (Q63116), as mentioned by @macchiati, there are also 31 lexemes that have this item as a sense: query results for the Lexemes. Unfortunately, we don't have one in Finnish for numeraali, but we do have an entry for the Estonian numeraal, L375630 (note that Lexeme identifiers start with L, and item identifiers with Q).

Roughly, items are the ontological things, and lexemes are the words. Each Lexeme is in a specific language, whereas the items are supposed to be language independent. Each lexeme can have 0 or more senses, and the sense can refer to an item. This way we can have a SPARQL query that asks for all lemmas on the lexemes that have a sense pointing to a given item, such as the item for numeral.
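The query described above could be sketched roughly as follows. This is only an illustration, not the exact query behind the linked results: it assumes the public Wikidata Query Service endpoint at `https://query.wikidata.org/sparql`, the property P5137 ("item for this sense"), and the prefixes that the query service predefines. The function names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lemma_query(item_id: str) -> str:
    """Build a SPARQL query for lemmas of lexemes with a sense pointing at item_id."""
    # Item identifiers start with Q; lexeme identifiers (L...) are rejected here.
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"expected an item identifier like Q63116, got {item_id!r}")
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme ontolex:sense ?sense ;
          wikibase:lemma ?lemma .
  ?sense wdt:P5137 wd:{item_id} .
}}
""".strip()


def fetch_lemmas(item_id: str):
    """Run the query over the network; returns (lexeme URI, lemma, language tag) tuples."""
    params = urllib.parse.urlencode(
        {"query": build_lemma_query(item_id), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        # Hypothetical User-Agent; replace it with one that identifies your tool.
        headers={"User-Agent": "lexicon-example/0.1 (illustrative sketch)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (row["lexeme"]["value"], row["lemma"]["value"], row["lemma"].get("xml:lang"))
        for row in data["results"]["bindings"]
    ]
```

Calling `fetch_lemmas("Q63116")` should list the lemmas for the numeral item, one row per lexeme and lemma; the number of rows will change as the community edits the data.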

As you can see on the page for numeraal, L375630, this is all structured data. All the data can also be downloaded as JSON or as RDF, and a SPARQL endpoint allows querying the data.
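To give a feel for how forms and their grammatical features look in the JSON, here is a minimal sketch. It assumes the usual entity JSON layout (an `entities` map whose lexeme entries carry `forms`, each with `representations` and `grammaticalFeatures`). The sample below is hand-written for illustration, uses a placeholder lexeme ID `L0`, and is not copied from Wikidata; `list_forms` is a hypothetical helper.

```python
def list_forms(entity_data: dict, lexeme_id: str):
    """Return (form id, language tag, surface form, feature item IDs) per representation."""
    lexeme = entity_data["entities"][lexeme_id]
    rows = []
    for form in lexeme.get("forms", []):
        # Grammatical features are item IDs (Q...), e.g. for case or number.
        features = form.get("grammaticalFeatures", [])
        for representation in form.get("representations", {}).values():
            rows.append(
                (form["id"], representation["language"], representation["value"], features)
            )
    return rows


# Hand-written sample in the assumed shape; IDs and values are illustrative only.
SAMPLE = {
    "entities": {
        "L0": {
            "type": "lexeme",
            "lemmas": {"en": {"language": "en", "value": "example"}},
            "forms": [
                {
                    "id": "L0-F1",
                    "representations": {"en": {"language": "en", "value": "example"}},
                    "grammaticalFeatures": ["Q110786"],  # assumed: singular
                },
                {
                    "id": "L0-F2",
                    "representations": {"en": {"language": "en", "value": "examples"}},
                    "grammaticalFeatures": ["Q146786"],  # assumed: plural
                },
            ],
        }
    }
}

for row in list_forms(SAMPLE, "L0"):
    print(row)
```

The same function should work on the JSON downloaded for a real lexeme such as L375630, provided the layout matches the assumption above.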

Regarding the questions in the OP:

  1. Licensing? Is it compatible with our needs (slicing, using in products, converting to more compact format, adding custom/proprietary words).

All data in Wikidata is available under CC-0.

  2. Filtering spam/abuse - data quality in general

Wikidata has a healthy community and has so far seen 500,000+ contributors. It is the most edited wiki in the world.

  3. What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)

The data can be downloaded in bulk, queried in a structured way using SPARQL, or retrieved per individual Lexeme or at an even finer granularity. Editing is possible on-wiki with the community, or the data can be enriched locally.

Happy to answer any more questions!

@macchiati
Member

macchiati commented Mar 14, 2024 via email

@grhoten
Member

grhoten commented Mar 14, 2024

> we have an entry for the Estonian numeraal, L375630 (note, Lexeme identifiers start with L, and item identifiers with Q).

Ooh! That seems interesting. We might be able to use that. It's really good to know how lexemes are referenced.

> The data can be downloaded in bulk

Is this one of those locations? https://dumps.wikimedia.org/wikidatawiki/

For reference, a recent bz2 version of it is 143 GB, but I suspect that we just want to filter out the non-lexeme stuff.

@vrandezo

If you go here

https://dumps.wikimedia.org/wikidatawiki/entities/

you can find the dump of only the Lexemes (the files named latest-lexemes..). Depending on the format, that is between 0.3 and 1.1 GB zipped. The Wikidata items that the Lexemes refer to would not be included, though.
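For slicing, a lexeme dump could be streamed without loading it all into memory. The sketch below assumes a bz2-compressed JSON dump in the usual Wikidata layout (one JSON array, one entity per line, lines ending with a comma) and assumes each lexeme stores its language as an item ID in a `language` field. The function names are made up, and the path is whatever file you downloaded.

```python
import bz2
import json


def iter_dump_entities(path: str):
    """Yield one parsed entity per line from a bz2-compressed JSON dump."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Drop the trailing comma that separates array elements.
            line = line.strip().rstrip(",")
            # Skip the lines that only open or close the JSON array.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def lexemes_in_language(path: str, language_item: str):
    """Yield only the lexemes whose language is the given item ID (Q...)."""
    for entity in iter_dump_entities(path):
        if entity.get("type") == "lexeme" and entity.get("language") == language_item:
            yield entity
```

Because both functions are generators, a slice for one language can be written out entity by entity, which keeps memory use flat even for the larger dump formats.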

@grhoten
Member

grhoten commented Dec 10, 2024

I'd like to nominate @vrandezo to work on this issue. 😀

Projects
Status: In Progress
4 participants