Second AmericasNLP Competition: Speech-to-Text Translation for Indigenous Languages of the Americas

IMPORTANT: Details about the AmericasNLP competition at NeurIPS
Winners of the AmericasNLP 2022 competition
Registration form

What?

The Second AmericasNLP Competition on Speech-to-Text Translation for Indigenous Languages of the Americas is an official NeurIPS 2022 competition aimed at encouraging the development of machine translation (MT) systems for Indigenous languages of the Americas. The overall goal is to develop new speech-to-text translation technology for Indigenous languages. Participants will build systems for three tasks: (1) automatic speech recognition (ASR) for an Indigenous language (Task 1), (2) text-to-text translation from an Indigenous language into a high-resource language (Task 2), and (3) speech-to-text translation from an Indigenous language into a high-resource language (Task 3, our main task).

Why?

Many Indigenous languages of the Americas are so-called low-resource languages: the parallel data with other languages that is needed to train speech-to-text MT systems is limited. As a result, many approaches designed for translating between high-resource languages (such as English, Spanish, or Portuguese) are not directly applicable or perform poorly. Additionally, many Indigenous languages exhibit linguistic properties that are uncommon among the languages frequently studied in natural language processing (NLP); for example, many are polysynthetic or tonal. This makes the task even more difficult. We want to motivate researchers to take on the challenge of developing speech-to-text MT systems for Indigenous languages.

How?

We invite submissions of speech-to-text MT results (as well as results for the subtasks of ASR and text-to-text translation) obtained by systems built for Indigenous languages. We will provide training and evaluation data to the participants. There are no limits on the outside resources, such as additional data or pre-trained systems, that participants can use, with the exception of the datasets listed here. This should go without saying, but we ask that participants do not translate (or transcribe, in the case of ASR) the test input by hand. The main metrics of this competition are chrF (Popović, 2015) for Tasks 2 and 3 and character error rate for Task 1. Participants can submit results for as many language pairs as they like, but only teams that participate in all language pairs for a task will enter the official ranking. We provide an evaluation script and a baseline MT system to help participants get started quickly. If you are interested in this competition, please register here.
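
For orientation, the two kinds of metrics can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the official evaluation script: the function names are hypothetical, the chrF variant ignores whitespace, uses character n-grams up to length 6 with beta = 2 (assumed defaults, not taken from this page), and skips n-gram orders for which either string is too short. Official scores should be computed with the evaluation script provided by the organizers.

```python
from collections import Counter


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams; whitespace is dropped (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(ref: str, hyp: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in the range [0, 1].

    Precision and recall are averaged over n-gram orders 1..max_n and then
    combined into an F-beta score. max_n=6 and beta=2 are assumed defaults.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = char_ngrams(ref, n), char_ngrams(hyp, n)
        if not ref_counts or not hyp_counts:
            continue  # one of the strings is shorter than n characters
        overlap = sum((ref_counts & hyp_counts).values())
        precisions.append(overlap / sum(hyp_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

For example, `character_error_rate("kitten", "sitting")` returns 0.5 (three edits over six reference characters). A lower character error rate is better, whereas a higher chrF score is better.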

Tracks

The competition will have two tracks:

External data and pre-trained models are allowed. In this track, we want teams to train the best system possible. To this end, they can use any external data they can find or create. The only constraint is the list of prohibited datasets.
Only pre-trained models are allowed. In this track, teams can use the provided dataset, Spanish/Portuguese monolingual data, and well-established pre-trained models (models published at any ML venue).

Both tracks are treated equally, and the prizes described below apply to both of them.

System Submission

UPDATE! The official submission leaderboards can be found here:

You can already submit results for the development data. As soon as the dataset input is released, teams can also start uploading their final submissions!

Languages

The following language pairs are featured in the NeurIPS–AmericasNLP 2022 competition:

Bribri–Spanish
Guaraní–Spanish
Kotiria–Portuguese
Wa'ikhana–Portuguese
Quechua–Spanish

For all pairs, the Indigenous language is the source language, and the high-resource language is the target language.

Pilot Data

Kotiria [kotiria_pilot.tar.gz]
Wa'ikhana [waikhana_pilot.tar.gz]

Evaluation script: evaluate.py

Data and Baseline System

A script to download the datasets for the competition, an evaluation script, and the official baselines can be found in our GitHub repository.
The TEST INPUTS for ASR are now online! [TEST_FILES]
UPDATE! Test MT inputs are now online! [mt_test_inputs.zip]

Prizes

As long as the best-performing systems beat our baselines, the corresponding teams will be awarded the following prizes:

Task 1: $500 for the best team
Task 2: $500 for the best team
Task 3 (main task): $1000 for the best team, $500 for the second best team, $300 for the third best team
The prizes are valid for both tracks of this competition.

Important Dates

~~Release of pilot data and evaluation script: May 23, 2022~~

~~Release of training and development data and baseline systems: June 6, 2022~~

~~Release of test input/start of evaluation phase (ASR and Speech-to-text): September 20, 2022~~

Submission of translations by participants/end of competition (ASR and Speech-to-text translation): ~~September 30, 2022~~ October 14, 2022

Release of test input/start of evaluation phase (MT tasks): September 15, 2022

Submission of translations by participants/end of competition (Machine Translation): October 25, 2022

Announcement of results: October 29, 2022

Competition track meeting at NeurIPS (virtual event): December 2022

All deadlines are at 11:59 pm UTC-12 ("anywhere on Earth").

Organizers

Manuel Mager, Katharina Kann, Abteen Ebrahimi, Arturo Oncevay, Rodolfo Zevallos, Adam Wiemerslage, Pavel Denisov, John E. Ortega, Kristine Stenzel, Aldo Alvarez, Luis Chiruzzo, Rolando Coto-Solano, Hilaria Cruz, Sofía Flores-Solórzano, Ivan Vladimir Meza Ruiz, Alexis Palmer, Ngoc Thang Vu

Contact: americas.nlp.workshop@gmail.com

References

Maja Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation.