This repo aggregates resources for the Greek language, covering various Natural Language Processing, Understanding, and Generation needs.
Greek is spoken by the majority of the population in two countries.
X | Country - ISO Language code |
---|---|
CY | |
GR | |
Morphological and syntactic annotations of a Greek corpus. This Greek UD treebank is used by many other pretrained open-source components.
Manually annotated: lemmas, dependencies, POS, features.
Genres: news, wiki, spoken
Sources: public domain, Wikinews articles, European Parliament session texts.
Corpus size: 2,521 sentences / 61,673 tokens.
https://universaldependencies.org/treebanks/el_gdt/index.html
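UD treebanks are distributed as CoNLL-U files, where each token line has ten tab-separated columns. A minimal sketch of reading that format with the standard library only is shown below. The sample sentence and its annotations are hand-written for illustration and are not taken from the treebank; `read_conllu` is a hypothetical helper, not part of any package listed here.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines in this sketch
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Illustrative sample only (hand-written, not a treebank excerpt).
SAMPLE = "\n".join([
    "# text = Η Αθήνα είναι πόλη .",
    "\t".join(["1", "Η", "ο", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "Αθήνα", "Αθήνα", "PROPN", "_", "_", "4", "nsubj", "_", "_"]),
    "\t".join(["3", "είναι", "είμαι", "AUX", "_", "_", "4", "cop", "_", "_"]),
    "\t".join(["4", "πόλη", "πόλη", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["5", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

To read a downloaded treebank file, pass an open file handle (opened with `encoding="utf-8"`) instead of `SAMPLE.splitlines()`.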
Greek text typically requires removal of accents and diacritics before processing. Some newer tokenizers include this step, but earlier ones do not. https://legacy.cltk.org/en/latest/greek.html
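When a tokenizer does not strip accents itself, the step can be done beforehand with Unicode normalization. The sketch below uses only Python's standard `unicodedata` module; `strip_accents` is a hypothetical helper name, and whether to also lowercase the text depends on the downstream model.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents and diacritics (monotonic and polytonic) from text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn"), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the result is returned in NFC form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("Καλημέρα κόσμε"))  # Καλημερα κοσμε
print(strip_accents("ἄνθρωπος"))        # ανθρωπος
```

Note that this removes all combining marks, including the dialytika (ϊ becomes ι), which may or may not be desirable for a given task.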
Spacy lemmatizer (trainable lemmatizer)
Depending on the situation, we might need a different corpus tokenization. The sources below include general tokenizers for word, sentence, and paragraph tokenization.
Spacy Tokenizer. A pipeline component for Greek, senter, is also available for sentence segmentation.
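To illustrate what word and sentence tokenization involve for Greek, here is a rough rule-based sketch using only the standard library. It is not how Spacy or any other listed tool works internally; `split_sentences` and `split_words` are hypothetical helpers. One Greek-specific detail it handles is that the question mark is written as ";".

```python
import re

# Sentence-final marks: full stop, exclamation mark, ellipsis, and the Greek
# question mark, written either as ";" (U+003B) or as U+037E.
_SENT_END = re.compile(r"(?<=[.!;\u037e\u2026])\s+")

# A token is either a run of word characters or a single punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> list:
    """Split on whitespace that follows a sentence-final mark."""
    return [s for s in _SENT_END.split(text.strip()) if s]


def split_words(sentence: str) -> list:
    """Separate words from punctuation."""
    return _TOKEN.findall(sentence)


text = "Πού πας; Πάω σπίτι. Τέλεια!"
for sent in split_sentences(text):
    print(split_words(sent))
```

This naive approach breaks on abbreviations such as "π.χ." and treats every ";" as a sentence boundary, which is one reason to prefer the trained components listed here for real corpora.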
Spacy offers other helpful components: morphologizer, dependency parser, attribute ruler.
Source | Supported labels | Link |
---|---|---|
Spacy | EVENT, GPE, LOC, ORG, PERSON, PRODUCT | Spacy models |
Spark NLP | ||
Stanza | ||
AUEB | LOC, ORG, PERSON | gr-nlp-toolkit transformer-based |
Package | Details | Link |
---|---|---|
Spark NLP | Multilingual (wrapped from Hugging Face) | |
Transformers | Multilingual | |
Cross-lingual QA dataset: XQuAD
BERT model pretrained on Greek corpus only.
bert-base-greek-uncased-v1
List of 144,000 Classical Greek proper nouns