Using Wikidata as a lexicon #9
As a part of discussing this point, I'd like to hear how the data is structured. If it's a collection of unannotated words without relationships, it's not that helpful. If it has annotations for a given word and all of the grammeme properties for the other surface forms of a given word, that would be helpful.

For example, take the Finnish word numeraali (https://en.wiktionary.org/wiki/numeraali). It has a nicely formatted declension table that is easy to read. The template for the declension table is simply "{{fi-decl-risti|numeraal|||a}}". That makes it easy to format a table, but it makes the data hard to parse without infrastructure to execute the code behind the template. That template format makes it hard to generate the other surface forms and to deduce the grammatical properties of each form. Some of the cell entries don't even have page entries, so you have to go by what is in the table.

It's also worth pointing out that Wiktionary tends to include optional stress markers in the declension tables for several languages, like Russian and Lithuanian. When you go to the actual Wiktionary page for a word, the stress markers are missing. These optional stress markers are helpful for pronunciation, but they're rarely written outside of an elementary-school setting.

Clarity around the word relationships and properties in the data would be helpful.
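To illustrate the parsing problem: splitting the template invocation only recovers its raw parameters, not the inflected forms, because the forms are generated by server-side template/Lua code. This is a minimal sketch (not Wiktionary's actual parser):

```python
# Illustrative sketch, not Wiktionary's real parser: splitting a
# {{name|arg|arg|...}} invocation yields only the template name and
# its raw arguments. The surface forms themselves are computed by
# code behind the template, which is why raw wikitext is hard to use
# as a lexicon.
def parse_template(wikitext: str) -> tuple[str, list[str]]:
    """Split a {{name|arg|...}} invocation into (name, args)."""
    inner = wikitext.strip().removeprefix("{{").removesuffix("}}")
    name, *args = inner.split("|")
    return name, args

name, args = parse_template("{{fi-decl-risti|numeraal|||a}}")
print(name)  # fi-decl-risti
print(args)  # ['numeraal', '', '', 'a']
```

Note that the output contains the stem and a few flags, but none of the declined surface forms or their grammatical properties.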
It would be Wikidata, not Wiktionary (which doesn't have the right license). So check out https://www.wikidata.org/wiki/Q63116. For that particular term, they don't seem to have declensions.
Yes, I agree that the license for Wiktionary is not ideal. It is still useful to reference for illustrating the problems at hand, and both projects are part of Wikimedia. Wikidata does seem helpful for finding translations and synonyms of terms. I'm less clear on whether declensions exist at all in Wikidata. If they do exist, I'd like to see an example, and hopefully it's structured in a more parseable way than Wiktionary.
Besides the *item* for numeral (Q63116, https://www.wikidata.org/wiki/Q63116) as mentioned by @macchiati, there are also 31 *lexemes* that have this item as a sense: query results for the Lexemes (https://w.wiki/9TUu). We don't have one in Finnish for *numeraali*, unfortunately, but we have an entry for the Estonian *numeraal*, L375630 (note: Lexeme identifiers start with L, and item identifiers with Q).

Roughly, items are the ontological things, and lexemes are the words. Each lexeme is in a specific language, whereas items are supposed to be language-independent. Each lexeme can have zero or more senses, and a sense can refer to an item. This way we can have a SPARQL query that asks for all lemmas on the lexemes that have a sense pointing to a given item, such as the item for numeral.

As you can see on the page for *numeraal* (https://www.wikidata.org/wiki/Lexeme:L375630), this is all structured data. All the data can also be downloaded as JSON (https://www.wikidata.org/wiki/Special:EntityData/L375630.json) or as RDF (https://www.wikidata.org/wiki/Special:EntityData/L375630.rdf). A SPARQL endpoint (https://query.wikidata.org) allows querying the data.

Regarding the questions in the OP:

1. Licensing? Is it compatible with our needs (slicing, using in products, converting to a more compact format, adding custom/proprietary words)?

   All data in Wikidata is available under CC-0.

2. Filtering spam/abuse - data quality in general?

   Wikidata has a healthy community and has seen 500,000+ contributors so far. It is the most edited wiki in the world.

3. What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements, ...)?

   The data can be downloaded in bulk, queried in a structured way using SPARQL, or fetched per individual Lexeme and at even finer granularity. Editing is possible on-wiki with the community, or the data can be enriched locally.

Happy to answer any more questions!
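The lemmas-for-an-item query described above could be sketched as follows. This is a hedged sketch: the use of `wikibase:lemma`, `ontolex:sense`, and property P5137 ("item for this sense") reflects my understanding of the Wikidata query service vocabulary, so verify the query against https://query.wikidata.org before relying on it.

```python
# Sketch of a SPARQL query for all lemmas of lexemes that have a
# sense pointing to a given item (here Q63116, "numeral"). The
# vocabulary terms (wikibase:lemma, ontolex:sense, wdt:P5137) are my
# assumptions about the Wikidata query service; double-check them at
# https://query.wikidata.org before use.
def lemmas_for_item_query(item_id: str) -> str:
    """Build the SPARQL query text for a given item ID (e.g. 'Q63116')."""
    return f"""
    SELECT ?lexeme ?lemma WHERE {{
      ?lexeme wikibase:lemma ?lemma ;
              ontolex:sense ?sense .
      ?sense wdt:P5137 wd:{item_id} .
    }}
    """

query = lemmas_for_item_query("Q63116")
# To execute, send the query to https://query.wikidata.org/sparql
# with an Accept header of application/sparql-results+json
# (e.g. via urllib.request; network call omitted here).
print(query)
```

The point of the sketch is the shape of the data model: lemma and sense are first-class, queryable properties of a lexeme, rather than text in a rendered table.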
Thanks Denny!
Ooh! That seems interesting. We might be able to use that. It's really good to know how lexemes are referenced.

Is this one of those locations? https://dumps.wikimedia.org/wikidatawiki/ For reference, a recent bz2 version of it is 143 GB, but I suspect that we just want to filter out the non-lexeme stuff.
If you go to https://dumps.wikimedia.org/wikidatawiki/entities/ you can find dumps of only the Lexemes (the files named latest-lexemes..). Depending on the format, that is between 0.3 and 1.1 GB compressed. The referenced items in Wikidata would not be included, though.
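If one starts from the full entity dump instead, the lexemes could be filtered out line by line. A minimal sketch, assuming the dump's one-entity-per-line layout (one large JSON array: a `[` line, then one entity object per line with a trailing comma, then `]`); check a few lines of the real decompressed dump before relying on this assumption:

```python
import json

def iter_lexemes(lines):
    """Yield entities of type 'lexeme' from a Wikidata JSON entity dump.

    Assumes the dump layout of one JSON entity per line inside one big
    array (first line '[', last line ']', entities with trailing
    commas). Verify this layout against a real dump before use.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        entity = json.loads(line)
        if entity.get("type") == "lexeme":
            yield entity

# Tiny in-memory stand-in for the (decompressed) dump; the IDs are
# taken from this thread, the layout is my assumption.
sample = [
    "[",
    '{"type": "item", "id": "Q63116"},',
    '{"type": "lexeme", "id": "L375630"},',
    "]",
]
ids = [e["id"] for e in iter_lexemes(sample)]
print(ids)  # ['L375630']
```

For the real file, the same generator could be fed from `bz2.open(path, "rt")` so the dump never has to be fully decompressed on disk.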
I'd like to nominate @vrandezo to work on this issue. 😀
In our first meeting we discussed various lexicon formats and the use cases for them. Wikidata already has a flexible format and a passionate community contributing to it. We could bootstrap our effort by contributing to it, instead of starting a new lexicon under Unicode.

Before we settle on Wikidata, we need to answer the questions raised above (licensing, data quality, and tooling).