ElasticSearch dictionary files for island.is project

Release

Because of the way we handle versions of these dictionary files in the indexer/migration project we must keep an incremental single part versioning scheme.

Releases must then contain at least the .tar.gz file.

Files

File names MUST satisfy this pattern [a-z0-9], i.e. no punctuation or whitespace and MUST NOT exceed 23 in length. As this can cause issues in AWS ES when these files are added as a package and associated to the domain.

stopwords.txt

List of words that should not be indexed (by default)

Format: one word per line, with all inflections.

Examples: https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball

keywords.txt

Words that should not be stemmed.

Format: one word per line

synonyms.txt

Synonyms of words.

Format: We use the solr format. See https://www.elastic.co/guide/en/elasticsearch/reference/7.4/analysis-synonym-tokenfilter.html

Examples: https://github.com/alphagov/search-api/blob/master/config/schema/synonyms.yml

stemmer.txt

A stemmer dictionary.

See https://www.elastic.co/guide/en/elasticsearch/reference/7.4/analysis-stemmer-override-tokenfilter.html

We generate icelandic stemmer.txt from the Augmented Format of the BÍN database (https://bin.arnastofnun.is/DMII/LTdata/k-format/ ). More information about the generation itself later.

hyphenpatterns.txt

A precomputed hyphenation pattern file.

Which uses XML-based hyphenation patterns to find potential subwords in compound words.

.txt is used in the filename due to limitation in AWS ES and package association, but this is file contains valid XML which the analyzer will recognise regardless of the filename ending.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hyp-decomp-tokenfilter.html

More information about this functionality and source later.

hyphenwhitelist.txt

A whitelist for tokens produced in hyphenation decompounder token filter.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hyp-decomp-tokenfilter.html

We generate icelandic hyphenwhitelist.txt from the Augmented Format of the BÍN database (https://bin.arnastofnun.is/DMII/LTdata/k-format/ ). More information about the generation itself later.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
is		is
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ElasticSearch dictionary files for island.is project

Release

Files

stopwords.txt

keywords.txt

synonyms.txt

stemmer.txt

hyphenpatterns.txt

hyphenwhitelist.txt

About

Releases 4

Packages

Contributors 3

island-is/elasticsearch-dictionaries

Folders and files

Latest commit

History

Repository files navigation

ElasticSearch dictionary files for island.is project

Release

Files

stopwords.txt

keywords.txt

synonyms.txt

stemmer.txt

hyphenpatterns.txt

hyphenwhitelist.txt

About

Resources

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 3

Packages