Because of the way the indexer/migration project handles versions of these dictionary files, we must keep an incremental, single-part versioning scheme.
Each release must therefore contain at least the .tar.gz file.
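
As a rough illustration of the versioning rule above, the sketch below computes the next release version from a list of existing ones. It assumes that "single part" means a plain integer such as `7` (no dots, no `v` prefix) and that numbering starts at 1; the function name `next_version` is hypothetical and is not part of the indexer/migration project.

```python
def next_version(existing_tags):
    """Return the next single-part version for a list of existing release tags."""
    versions = []
    for tag in existing_tags:
        # Assumption: a valid tag is a plain ASCII integer, e.g. "7".
        if not (tag.isascii() and tag.isdigit()):
            raise ValueError(f"not a single-part version: {tag!r}")
        versions.append(int(tag))
    # Assumption: the first release is version 1.
    return str(max(versions, default=0) + 1)


if __name__ == "__main__":
    print(next_version(["1", "2", "3"]))
```

Comparing the tags as integers rather than strings keeps the ordering correct once the version reaches two digits.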
File names MUST match the pattern [a-z0-9] (i.e. no punctuation or whitespace) and MUST NOT exceed 23 characters in length, as violating either rule can cause issues in AWS ES when these files are added as packages and associated with the domain.
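
The naming rule can be checked before a release is cut. The following is a minimal sketch, not the project's own tooling. It assumes the pattern and the 23-character limit apply to the part of the name before the extension (so `stemmer` in `stemmer.txt`), since the files themselves carry a `.txt` suffix. The function name `is_valid_dictionary_filename` is hypothetical.

```python
import re
from pathlib import PurePosixPath

MAX_NAME_LENGTH = 23
NAME_PATTERN = re.compile(r"[a-z0-9]+")


def is_valid_dictionary_filename(filename):
    """Check a dictionary file name against the naming rule."""
    # Assumption: the rule applies to the name without its extension.
    stem = PurePosixPath(filename).stem
    if len(stem) > MAX_NAME_LENGTH:
        return False
    # fullmatch rejects uppercase, punctuation, whitespace and empty names.
    return NAME_PATTERN.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ["stemmer.txt", "hyphenwhitelist.txt", "stop-words.txt"]:
        print(name, is_valid_dictionary_filename(name))
```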
List of words that should not be indexed (by default).
Format: one word per line, with all inflections.
Words that should not be stemmed.
Format: one word per line
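
The "one word per line" format used by the word lists above can be sanity-checked with a few lines of Python. This is only a sketch: it assumes blank lines are acceptable and ignored, and that a line containing internal whitespace is an error. The function name `parse_word_list` and the sample words are illustrative.

```python
def parse_word_list(text):
    """Parse a one-word-per-line file and return the words in order."""
    words = []
    for number, raw_line in enumerate(text.splitlines(), start=1):
        word = raw_line.strip()
        if not word:
            continue  # Assumption: blank lines are ignored.
        if len(word.split()) != 1:
            raise ValueError(f"line {number}: expected one word, got {word!r}")
        words.append(word)
    return words


if __name__ == "__main__":
    print(parse_word_list("og\n\nað\n"))
```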
Synonyms of words.
Format: We use the Solr synonym format. See https://www.elastic.co/guide/en/elasticsearch/reference/7.4/analysis-synonym-tokenfilter.html
Examples: https://github.com/alphagov/search-api/blob/master/config/schema/synonyms.yml
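
To make the Solr format concrete: each line is either a comma-separated group of equivalent terms (`foo, bar`) or an explicit mapping (`i-pod, i pod => ipod`), and lines starting with `#` are comments. The parser below is a simplified sketch for inspecting or linting a synonyms file, not the parser Elasticsearch uses. The function name `parse_solr_synonyms` and the sample terms are illustrative.

```python
def parse_solr_synonyms(text):
    """Parse Solr-style synonym rules into (sources, targets) pairs."""
    rules = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comments.
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            sources = [term.strip() for term in left.split(",") if term.strip()]
            targets = [term.strip() for term in right.split(",") if term.strip()]
        else:
            # Equivalent synonyms: modelled here as every term mapping to the whole group.
            sources = [term.strip() for term in line.split(",") if term.strip()]
            targets = list(sources)
        rules.append((sources, targets))
    return rules


if __name__ == "__main__":
    sample = "# sample rules\ni-pod, i pod => ipod\nfoo, bar\n"
    for sources, targets in parse_solr_synonyms(sample):
        print(sources, "->", targets)
```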
A stemmer dictionary.
We generate the Icelandic stemmer.txt from the Augmented Format of the BÍN database (https://bin.arnastofnun.is/DMII/LTdata/k-format/). More information about the generation itself follows later.
A precomputed hyphenation pattern file.
The hyphenation decompounder token filter uses these XML-based hyphenation patterns to find potential subwords in compound words.
The .txt extension is used in the filename due to a limitation in AWS ES package association, but the file contains valid XML, which the analyzer will recognise regardless of the filename extension.
See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hyp-decomp-tokenfilter.html
More information about this functionality and its source follows later.
A whitelist for tokens produced by the hyphenation decompounder token filter.
See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hyp-decomp-tokenfilter.html
We generate the Icelandic hyphenwhitelist.txt from the Augmented Format of the BÍN database (https://bin.arnastofnun.is/DMII/LTdata/k-format/). More information about the generation itself follows later.
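
For orientation, the sketch below shows how the hyphenation pattern file and the whitelist might be wired into an Elasticsearch `hyphenation_decompounder` filter definition. The parameter names `hyphenation_patterns_path` and `word_list_path` come from the Elasticsearch documentation linked above. Treating the whitelist as the filter's `word_list_path` is an assumption, and the paths and the filter name `icelandic_decompounder` are hypothetical placeholders rather than the values used by the indexer/migration project.

```python
import json


def build_decompounder_filter(patterns_path, whitelist_path):
    """Build a hyphenation_decompounder filter definition as a plain dict."""
    return {
        "type": "hyphenation_decompounder",
        # XML hyphenation patterns (stored with a .txt extension).
        "hyphenation_patterns_path": patterns_path,
        # Assumption: the whitelist is the list of subwords the filter may emit.
        "word_list_path": whitelist_path,
    }


if __name__ == "__main__":
    settings = {
        "analysis": {
            "filter": {
                # Hypothetical filter name and placeholder paths.
                "icelandic_decompounder": build_decompounder_filter(
                    "path/to/hyphenationpatterns.txt",
                    "path/to/hyphenwhitelist.txt",
                )
            }
        }
    }
    print(json.dumps(settings, indent=2))
```

On AWS ES, the paths would instead refer to the associated packages, so the actual values depend on how the files were uploaded and associated with the domain.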