AmericasNLP 2024 Shared Task 1: Machine Translation Systems for Indigenous Languages

What?

The AmericasNLP 2024 Shared Task on machine translation systems for Indigenous languages is a competition aimed at encouraging the development of machine translation (MT) systems for Indigenous languages of the Americas. Participants will build systems that translate between Spanish and an Indigenous language.

Why?

Many of the Indigenous languages of the Americas are so-called low-resource languages: the parallel data with other languages that is needed to train MT systems is limited. This means that many approaches designed for translating between high-resource languages, such as English and Chinese, are not directly applicable or perform poorly. Additionally, many Indigenous languages exhibit linguistic properties that are uncommon among the languages frequently studied in natural language processing (NLP). For instance, many are polysynthetic, which poses an additional difficulty. The goal of the AmericasNLP 2024 shared task on machine translation systems for Indigenous languages is to motivate researchers to take on the challenge of developing MT systems for Indigenous languages.

How?

AmericasNLP invites the submission of MT results obtained by systems built for Indigenous languages. Participants can use the training and development data we provide, but there are no limits on which additional resources they use. If participants want to translate additional data to improve their systems, that's great! If they want to use pretrained models, that's great, too! The only limitation is that we ask participants not to have the test input translated by hand and not to train on the development or test sets. The main metric of the shared task is ChrF++ (Popović, 2017). Participants can enter the competition with as many language pairs as they like. Systems will be evaluated separately for every language pair, and the overall average score will be used to determine the shared task’s winner. We provide an evaluation script and a baseline MT system to help participants get started quickly.
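
To give a feel for what the main metric measures, here is a minimal, simplified sketch of a sentence-level ChrF++-style score. It is not the evaluation script provided by the organizers and will not reproduce official scores exactly. The function name `simple_chrf_pp` and the defaults (character n-grams up to order 6, word n-grams up to order 2, beta = 2, whitespace ignored for character n-grams) are assumptions chosen to mirror common ChrF++ settings.

```python
from collections import Counter


def char_ngrams(text, n):
    # Assumption: whitespace is dropped before extracting character n-grams.
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def word_ngrams(text, n):
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def simple_chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified sentence-level ChrF++-style score in the range 0..100."""
    extractors = [(char_ngrams, n) for n in range(1, char_order + 1)]
    extractors += [(word_ngrams, n) for n in range(1, word_order + 1)]

    precisions, recalls = [], []
    for extract, n in extractors:
        hyp = extract(hypothesis, n)
        ref = extract(reference, n)
        if not hyp or not ref:
            # Skip n-gram orders that one of the two sentences is too short for.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))

    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    print(simple_chrf_pp("hola mundo", "hola mando"))
```

Because the score is built from character n-grams, a hypothesis that gets most of a long, morphologically complex word right still earns partial credit, which is one reason character-based metrics are often preferred for polysynthetic languages.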

System Submission

Please send all your system outputs to americas.nlp.workshop@gmail.com. The subject of your email should be "AmericasNLP2024_SharedTask1; Shared Task Submission; <TEAM NAME>". The content of your submission email should be as follows:

Line 1: Team name
Line 2: Names of all team members
Line 3: Language codes for all languages you are sending submissions for in order of your choice (we will use that to double-check that we got all files you intended to send)
[optional] Line 4: A link to a GitHub repository with code that can be used to reproduce your results. This is not required in order to participate in the shared task, but we strongly encourage it.
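
The email layout above can be assembled programmatically. The following is a small sketch, not an official tool: the helper name `build_submission_email`, the comma separator between member names, the space separator between language codes, and the placeholder values in the usage example are all assumptions. Only the subject format and the line order come from the instructions above.

```python
def build_submission_email(team_name, members, language_codes, repo_url=None):
    """Return (subject, body) for a shared task submission email."""
    subject = f"AmericasNLP2024_SharedTask1; Shared Task Submission; {team_name}"
    lines = [
        team_name,                  # Line 1: team name
        ", ".join(members),         # Line 2: all team members (separator is an assumption)
        " ".join(language_codes),   # Line 3: language codes (separator is an assumption)
    ]
    if repo_url:
        lines.append(repo_url)      # Line 4 (optional): link to a GitHub repository
    return subject, "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only; use the language codes
    # from the training set names in a real submission.
    subject, body = build_submission_email(
        "TheTranslators", ["First Member", "Second Member"], ["lang1", "lang2"]
    )
    print(subject)
    print(body)
```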

Please attach all output files to your email as a single zip file, named after your team, e.g., "TheTranslators.zip". Within that zip file, the individual files should be named "<LANGUAGE_CODE>.results.<VERSION>". The language code should be the same as the one used in the corresponding training set names. The version number is there in case you want to submit the outputs of multiple systems; it should be a single digit (please don't submit more than 9 options per language!). Each output file should contain one sentence per line. Sentences should not be tokenized.
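
The naming rules above can be checked before sending. The sketch below validates file names against the "<LANGUAGE_CODE>.results.<VERSION>" pattern and bundles the files into a zip named after the team. The function name `package_outputs`, its parameters, and the choice to reject any file that does not match the pattern are assumptions for illustration, not part of the official instructions.

```python
import re
import zipfile
from pathlib import Path

# "<LANGUAGE_CODE>.results.<VERSION>" with a single-digit version.
RESULT_NAME = re.compile(r"^(?P<code>[^.]+)\.results\.(?P<version>[0-9])$")


def package_outputs(output_dir, team_name, destination="."):
    """Validate output file names and bundle them into <team_name>.zip."""
    paths = sorted(p for p in Path(output_dir).iterdir() if p.is_file())
    if not paths:
        raise ValueError(f"no output files found in {output_dir}")

    versions_per_language = {}
    for path in paths:
        match = RESULT_NAME.match(path.name)
        if match is None:
            raise ValueError(f"unexpected file name: {path.name}")
        versions_per_language.setdefault(match["code"], []).append(match["version"])

    for code, versions in versions_per_language.items():
        if len(versions) > 9:
            raise ValueError(f"more than 9 versions submitted for {code}")

    zip_path = Path(destination) / f"{team_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store files at the top level of the archive, without directories.
            archive.write(path, arcname=path.name)
    return zip_path
```

A call such as `package_outputs("outputs", "TheTranslators")` would write "TheTranslators.zip" to the current directory, assuming a hypothetical "outputs" folder that contains only result files. Keep the destination separate from the output folder so that a previously created zip is not picked up as an output file.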

Which languages?

The following language pairs are featured in the shared task:

Hñähñu–Spanish
Wixarika–Spanish
Nahuatl–Spanish
Guaraní–Spanish
Bribri–Spanish
Rarámuri–Spanish
Quechua–Spanish
Aymara–Spanish
Shipibo-Konibo–Spanish
Asháninka–Spanish
Chatino–Spanish

Spanish is always the source language: systems are evaluated on translating from Spanish into an Indigenous language.

Important Dates

Release of pilot data: January 29, 2024
Release of training and development sets: February 5, 2024
Release of baseline systems and baseline results: February 12, 2024
Release of test inputs: April 1, 2024
Submission of results (shared task deadline): April 10, 2024
Announcement of winners: April 12, 2024
Submission of system description papers: April 19, 2024
Notification of acceptance: April 22, 2024
Camera-ready papers due: April 26, 2024

All deadlines are 11:59 pm UTC-12 (AoE, Anywhere on Earth).

Organizers

Abteen Ebrahimi, Arturo Oncevay, Pavel Denisov, Robert Pugh, Ona de Gibert Bonet, Raúl Vázquez, Manuel Mager, Rolando Coto-Solano, Katharina von der Wense, Shruti Rijhwani

Contact: americas.nlp.workshop@gmail.com

References

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation.

We thank our sponsors

Platinum: Amazon Web Services
Gold: Google Research
Bronze: Aditu
