Challenges in achieving a corpus infrastructure to advance research in Computational Linguistics and Natural Language Processing in Native American languages
Abstract: Natural Language Processing researchers and computational linguists frequently express disappointment and frustration over the lack of corpora in endangered languages that they can use to train and test their language models. This hindrance is caused in large part by the dwindling number of speakers and language keepers who can create new data such as stories, prayers, political speeches, and everyday conversation. Coupled with this is a severe lack of capacity among speakers of endangered languages to prepare a corpus, including a shortage of transcribers, annotators, and translators. What can NLP researchers do to help create and facilitate corpora in these languages? Collaborating with communities to increase their capacity to develop corpora alongside community members would be a first step. Furthermore, teaching basic programming courses in local high schools and colleges, working with legacy materials in language archives, and doing fieldwork to collect data alongside community members would greatly enhance the creation of endangered language corpora for NLP.
|
Challenges and Opportunities in NLP for Under-represented Languages
Abstract: Natural language processing (NLP) technology has seen tremendous improvements in recent years, but most of these successes have been concentrated in languages with large amounts of data. In this talk, I will discuss challenges and potential solutions on the way to scaling NLP to more of the world's 7000 languages. In particular, I will highlight recent progress in NLP for African languages and present methods that are applicable to languages with limited data, such as employing alternative sources of data and multi-modal information.
|