LOT Summer school 2018 - Language technology for low-resource languages

yvesscherrer/lot

Course description

A large part of recent research in language technology (LT) is restricted to a small number of languages. While more and more datasets are created, made available, and used for English and a few other languages, the vast majority of the world's languages are hardly ever the object of LT research. In this course, we will introduce and discuss several definitions of so-called 'low-resource languages', and we will examine how LT systems (such as taggers or parsers) can be developed for such languages despite the challenging data situation. In particular, we will discuss how linguistic annotations or models can be transferred from a resource-rich to a resource-poor language. In this setting, we have to distinguish cases where the two languages are etymologically closely related from cases where they are not. We will also see how these methods can be applied to 'special' types of low-resource languages such as historical language varieties, dialects, and sociolects, whose automatic processing faces similar challenges.

Day-to-day program

Monday

Definitions of low-resource languages in linguistics and computational linguistics

Overview of the main language technology applications and their resource requirements

Tuesday

Annotation

Data transfer vs. model transfer

Data transfer approaches: annotation projection, training data translation, ...

Wednesday

Model transfer approaches: plain model transfer, delexicalization, relexicalization, cross-lingual clusters and embeddings

Thursday

Closely related languages and language varieties - definitions, problems and solutions

  • Delphine Bernhard & Anne-Laure Ligozat (2013): Hassle-free POS-Tagging for the Alsatian Dialects. In: Marcos Zampieri & Sascha Diwersy (eds.): Non-Standard Data Sources in Corpus-Based Research, Shaker, ZSM Studien. https://hal.archives-ouvertes.fr/hal-00860790

  • Yves Scherrer & Achim Rabus (2017): Multi-source morphosyntactic tagging for Spoken Rusyn. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects. http://www.aclweb.org/anthology/W/W17/W17-1210.pdf

Friday

Multilingual modelling and zero-shot learning
