Warning: SyntaxDot supersedes sticker2.
sticker2 is a sequence labeler using Transformer networks. sticker2 models can be trained from scratch or using pretrained models, such as BERT or XLM-RoBERTa.
In principle, sticker2 can be used to perform any sequence labeling task, but so far the focus has been on:
- Part-of-speech tagging
- Morphological tagging
- Topological field tagging
- Lemmatization
- Dependency parsing
- Named entity recognition
The easiest way to get started with sticker2 is to use a pretrained model.
- Input representations:
- Word pieces
- Sentence pieces
- Flexible sequence encoder/decoder architecture, which supports:
- Simple sequence labels (e.g. POS, morphology, named entities)
- Lemmatization, based on edit trees
- Dependency parsing
- Simple API to extend to other tasks
- Models representations:
- Transformers
- Pretraining from BERT and XLM-RoBERTa models
- Multi-task training and classification using scalar weighting.
- Model distillation
- Deployment:
- Standalone binary that links against PyTorch's
libtorch
- Very liberal license
- Standalone binary that links against PyTorch's
sticker2 is still under heavy development. However, models are reusable and the API is stable for every y in version 0.y.z.
sticker uses techniques from or was inspired by the following papers:
- The model architecture and training regime was largely based on 75 Languages, 1 Model: Parsing Universal Dependencies Universally. Dan Kondratyuk and Milan Straka, 2019, Proceedings of the EMNLP 2019 and the 9th IJCNLP.
- The tagging as sequence labeling scheme was proposed by Dependency Parsing as a Sequence Labeling Task. Drahomíra Spoustová, Miroslav Spousta, 2010, The Prague Bulletin of Mathematical Linguistics, Volume 94.
- The idea to combine this scheme with neural networks comes from Viable Dependency Parsing as Sequence Labeling. Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez, 2019, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- The encoding of lemmatization as edit trees was proposed in Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. Grzegorz Chrupała, 2008, PhD dissertation, Dublin City University.
You can report bugs and feature requests in the sticker2 issue tracker.
sticker2 is licensed under the Blue Oak Model License version 1.0.0. The list of contributors is also available.
- sticker2 is developed by Daniël de Kok & Tobias Pütz.
- The Python precursor to sticker was developer by Erik Schill.
- Sebastian Pütz and Patricia Fischer reviewed a lot of code across the sticker projects.