Researchers sidestepped the lack of parallel data for these languages by gathering monolingual web text, which they then used to build a multilingual unlabeled text dataset containing more than one million sentences. The authors (all affiliated with Google Research) acknowledged that Google Translate’s limited menu of languages has historically skewed “European,” despite high speaker populations of languages spoken in Africa, South and Southeast Asia, as well as the indigenous languages of the Americas. A May 2022 paper, Building MT Systems for the Next Thousand Languages, detailed efforts to create datasets for more than 1,500 languages, develop MT for those languages, analyze system outputs, and identify frequent errors.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |