Is Neural Machine Translation viable for low-resource languages? an experimental study of the Irish language

Abstract

Transformer-based Neural Machine Translation (NMT) models are Large Language Models (LLMs) designed to translate between two or more languages. They are typically most successful for high-resource languages, that is, languages with plentiful online text corpora, such as English, Spanish, or French. In contrast, languages with limited corpora, such as Basque, Pashto, or Ojibwe, are known as low-resource languages and tend to be overlooked or underrepresented. One such language is Irish (Gaeilge), which had approximately 1.9 million total speakers as of 2022 and has an extremely limited pool of publicly available datasets and machine translation systems. In response to this shortage, we created three bilingual English-Irish datasets and three transformer models for translating from English to Irish. We then evaluated our models with four automatic evaluation metrics (BLEU, TER, chrF, and METEOR), and they demonstrated promising results across all our datasets.
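For readers unfamiliar with character-level metrics such as chrF, the following is a minimal sketch of the general idea: compare character n-grams of a system translation against a reference and combine precision and recall into an F-score. It is a simplified illustration, not the thesis's evaluation code and not the reference chrF implementation. The function names, the whitespace handling, and the example strings are assumptions made here for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (a simplifying assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1].

    Precision and recall are averaged over n-gram orders 1..max_n,
    skipping orders for which either string is too short, then
    combined into an F-beta score (beta=2 weights recall more).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side has no n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical example strings, not taken from the thesis datasets.
reference = "Dia duit"
print(simple_chrf("Dia duit", reference))   # identical strings
print(simple_chrf("Dia dhuit", reference))  # near match
```

Character-level scoring of this kind gives partial credit for near-miss word forms, which is one reason such metrics are often reported alongside word-level metrics like BLEU.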

Description

University of Minnesota M.S. thesis. July 2025. Major: Computer Science. Advisor: Ted Pedersen. 1 computer file (PDF); viii, 66 pages.

Suggested Citation

Quigley, Jack. (2025). Is Neural Machine Translation viable for low-resource languages? an experimental study of the Irish language. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/277324.
