First Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

Shared Task

Our 2021 shared task has officially concluded. Stay tuned for the overview paper!
AmericasNLP 2021 Shared Task mailing list
AmericasNLP 2021 Shared Task GitHub (data, evaluation script, baseline)
Registration form.
Information on Individual Language Pairs

What?

The AmericasNLP 2021 Shared Task on Open Machine Translation is a competition aimed at encouraging the development of machine translation (MT) systems for indigenous languages of the Americas. Participants will build systems that translate between Spanish and an indigenous language.

Why?

Many of the indigenous languages of the Americas are so-called low-resource languages: the parallel data with other languages that is needed to train MT systems is limited. This means that many approaches designed for translating between high-resource languages, such as English and Chinese, are not directly applicable or perform poorly. Additionally, many indigenous languages exhibit linguistic properties that are uncommon among the languages frequently studied in natural language processing (NLP). For instance, many are polysynthetic, which poses an additional difficulty. AmericasNLP wants to motivate researchers to take on the challenge of developing MT systems for indigenous languages.

How?

AmericasNLP invites the submission of MT results obtained by systems built for indigenous languages. Participants can use the training and development data we provide, but there are no limits on what participants can use, which is why we refer to our shared task as open MT. If participants want to translate additional data to improve their systems, that's great! If they want to use pretrained models, that's great, too! The only limitation is that we ask participants to not have the test input translated by hand.
The main metric of the shared task is ChrF++ (Popović, 2017). Participants can enter the competition with as many language pairs as they like, and systems for every language pair will be evaluated separately.
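To give a feel for what a chrF++-style score measures, here is a minimal sketch in plain Python. It is a simplification written for illustration, not the official evaluation script from the shared task GitHub: the function name `chrf_pp`, the default n-gram orders (characters up to 6, words up to 2), the value of `beta`, and the way precision and recall are averaged across orders are all assumptions of this sketch. Use the provided evaluation script for any scores you report.

```python
from collections import Counter


def ngrams(items, n):
    """Count all contiguous n-grams of length n in a string or list."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified sentence-level chrF++-style score in the range [0, 100].

    Assumption of this sketch: character n-grams are taken over the text
    with spaces removed, word n-grams over whitespace-split tokens, and
    precision/recall are averaged over all n-gram orders before the
    F-beta score is computed.
    """
    pairs = []
    hyp_chars = hypothesis.replace(" ", "")
    ref_chars = reference.replace(" ", "")
    for n in range(1, char_order + 1):
        pairs.append((ngrams(hyp_chars, n), ngrams(ref_chars, n)))
    hyp_words, ref_words = hypothesis.split(), reference.split()
    for n in range(1, word_order + 1):
        pairs.append((ngrams(hyp_words, n), ngrams(ref_words, n)))

    precisions, recalls = [], []
    for hyp_counts, ref_counts in pairs:
        hyp_total = sum(hyp_counts.values())
        ref_total = sum(ref_counts.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # this order is longer than one of the texts; skip it
        # Clipped matches: each n-gram counts at most as often as in the reference.
        matches = sum((hyp_counts & ref_counts).values())
        precisions.append(matches / hyp_total)
        recalls.append(matches / ref_total)

    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

With `beta` set to 2, recall is weighted more heavily than precision. An output identical to the reference scores 100, and an output that shares no characters or words with the reference scores 0.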
We provide an evaluation script and a baseline MT system to help participants get started quickly. If you are interested in this shared task, register here.

System submission

Please send all your system outputs to katharina[dot]kann[at]colorado[dot]edu . The subject of your email should be "AmericasNLP2021; Shared Task Submission; <TEAM NAME>". The content of your submission email should be as follows:

Line 1: Team name
Line 2: Names of all team members
Line 3: Language codes for all languages you are sending submissions for in order of your choice (we will use that to double-check that we got all files you intended to send)
[optional] Line 4: A link to a GitHub repository with code that can be used to reproduce your results. This is not required in order to participate in the shared task but is strongly encouraged.
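The email layout above can be sketched as a small helper that assembles the subject and body. This is only an illustration: the function name `build_submission_email`, the separators used within lines 2 and 3 (comma and space), and the language codes `aaa` and `bbb` are placeholders assumed for this sketch, not values prescribed by the organizers.

```python
def build_submission_email(team_name, members, language_codes, repo_url=None):
    """Return (subject, body) for a submission email in the layout described above."""
    subject = f"AmericasNLP2021; Shared Task Submission; {team_name}"
    lines = [
        team_name,                  # Line 1: team name
        ", ".join(members),         # Line 2: all team members (separator is an assumption)
        " ".join(language_codes),   # Line 3: language codes (separator is an assumption)
    ]
    if repo_url:
        lines.append(repo_url)      # Optional line 4: link to a code repository
    return subject, "\n".join(lines)


# Placeholder values for illustration only.
subject, body = build_submission_email(
    "CUBoulder", ["First Member", "Second Member"], ["aaa", "bbb"]
)
print(subject)
print(body)
```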

Please attach all output files to your email as a single zip file, named after your team, e.g., "CUBoulder.zip". Within that zip file, the individual files should be named "<LANGUAGE_CODE>.results.<VERSION>". The language code should be the same as used in the corresponding training set names. The version number is in case you want to submit the outputs of multiple systems; it should be a single digit (please don't submit more than 9 options per language!). Each output file should contain one sentence per line (so there should be 1004 lines in total). Sentences should not be tokenized.
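Before sending, it may help to check the output files and build the zip file programmatically. The following is a rough sketch under stated assumptions: the helper names `check_output_file` and `package_submission` are hypothetical, the version is accepted as any single digit, and the 1004-line count is assumed to apply to each output file (it is exposed as a parameter in case that reading is wrong).

```python
import re
import zipfile
from pathlib import Path

# "<LANGUAGE_CODE>.results.<VERSION>", with VERSION assumed to be any single digit.
NAME_PATTERN = re.compile(r"^[^.]+\.results\.\d$")


def check_output_file(path, expected_lines=1004):
    """Return a list of problems found in one output file (empty if none)."""
    path = Path(path)
    problems = []
    if not NAME_PATTERN.match(path.name):
        problems.append(f"{path.name}: name should look like <LANGUAGE_CODE>.results.<VERSION>")
    lines = path.read_text(encoding="utf-8").splitlines()
    if len(lines) != expected_lines:
        problems.append(f"{path.name}: {len(lines)} lines, expected {expected_lines}")
    if any(not line.strip() for line in lines):
        problems.append(f"{path.name}: contains empty lines")
    return problems


def package_submission(team_name, files, out_dir="."):
    """Write all output files into a single zip named after the team."""
    zip_path = Path(out_dir) / f"{team_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            # Store files flat, without their directory components.
            archive.write(file_path, arcname=Path(file_path).name)
    return zip_path
```

A typical use would be to run `check_output_file` on every file, fix anything it reports, and only then call `package_submission("CUBoulder", files)`.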

Which languages?

The following language pairs are featured in the AmericasNLP 2021 Shared Task:

Spanish–Hñähñu
Spanish–Wixarika
Spanish–Nahuatl
Spanish–Guaraní
Spanish–Bribri
Spanish–Rarámuri
Spanish–Quechua
Spanish–Aymara
Spanish–Shipibo-Konibo
Spanish–Asháninka

Spanish is always the source language: systems are evaluated on translating from Spanish into an indigenous language.

IMPORTANT DATES

~~Release of pilot data and evaluation script: December 16, 2020~~
~~Reveal of featured languages: December 22, 2020~~
~~Release of training data and baseline: January 01, 2021~~
~~Release of first batch of development sets: January 15, 2021~~
~~Release of second batch of development sets and test input: March 01, 2021~~
~~Submission of translations (shared task deadline): March 15, 2021. Extended: March 20, 2021~~
~~Announcement of results: March 18, 2021. Extended: March 29, 2021~~
Submission of system description papers: April 01, 2021
Notification of acceptance: April 15, 2021
Camera-ready papers due: April 26, 2021
Workshop: June 11, 2021

All deadlines are 11:59 pm UTC -12h.
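Since "UTC -12h" is easy to misread, here is a small sketch that converts a deadline into UTC using Python's standard library. The helper name `deadline_in_utc` is hypothetical; the date used in the example is the system description paper deadline listed above.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset corresponding to "UTC -12h".
UTC_MINUS_12 = timezone(timedelta(hours=-12))


def deadline_in_utc(year, month, day):
    """Return 11:59 pm (UTC -12h) on the given date, expressed in UTC."""
    local_deadline = datetime(year, month, day, 23, 59, tzinfo=UTC_MINUS_12)
    return local_deadline.astimezone(timezone.utc)


# System description papers: April 01, 2021
print(deadline_in_utc(2021, 4, 1).isoformat())  # 2021-04-02T11:59:00+00:00
```

In other words, each deadline falls at 11:59 am UTC on the day after the listed date.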

Organizers

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Anna Currey, Raymundo Isidro Alavez, Vishrav Chaudhary, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager, Thang Vu, Graham Neubig, Katharina Kann

Contact: americasnlp-sharedtask-organizers@googlegroups.com

References

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation.

Platinum	Silver	Bronze
	Institute of Computational Linguistics, University of Zurich; NAACL Emerging Region Funding; Google Research	Snorkel AI; Comunidad Elotl