Experiments in automatic transcription of a rare language
Thomas Pellegrini, Lori Lamel.
This work investigates automatic transcription of rare languages,
where rare means that there are limited resources available in
electronic form. In particular, it describes experiments on word
decompounding for Amharic as a means of compensating for the lack
of textual data. A corpus-based decompounding
algorithm has been applied to a 4.6M-word corpus. Compounding in
Amharic was found to result from the addition of prefixes and
suffixes. Using seven frequent affixes reduces the out-of-vocabulary
(OOV) rate from 7.0% to 4.8% and the total number of lexemes from
133k to 119k. Preliminary attempts at recombining the morphemes
into words result in a slight decrease in word error rate relative
to that obtained with a full-word representation.
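
The affix splitting, OOV measurement, and recombination summarized above can be illustrated with a minimal sketch. It is not the authors' algorithm: the affix lists, the toy transliterated strings, the `+` boundary marker, the minimum stem length, and all function names are assumptions made for illustration only, since the abstract does not list the seven affixes it uses.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical affix inventory (toy Latin-script strings). The seven affixes
# used in the paper are not given in the abstract.
PREFIXES: Tuple[str, ...] = ("ye", "le")
SUFFIXES: Tuple[str, ...] = ("oc", "na")


def split_affixes(word: str, min_stem: int = 3) -> List[str]:
    """Split off at most one prefix and one suffix, marked with '+'."""
    units: List[str] = []
    stem = word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) - len(p) >= min_stem:
            units.append(p + "+")          # 'ye+' means "attaches to the right"
            stem = stem[len(p):]
            break
    tail = None
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            tail = "+" + s                 # '+oc' means "attaches to the left"
            stem = stem[:-len(s)]
            break
    units.append(stem)
    if tail is not None:
        units.append(tail)
    return units


def recombine(units: Iterable[str]) -> List[str]:
    """Glue '+'-marked units back onto their neighbouring stems."""
    words: List[str] = []
    attach = False  # True when the previous unit was a prefix awaiting a stem
    for u in units:
        core = u.strip("+")
        if (attach or u.startswith("+")) and words:
            words[-1] += core
        else:
            words.append(core)
        attach = u.endswith("+")
    return words


def oov_rate(tokens: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are absent from the vocabulary."""
    tokens = list(tokens)
    return sum(t not in vocab for t in tokens) / len(tokens)


def relative_reduction(before: float, after: float) -> float:
    """Relative decrease, e.g. for the OOV rate or the lexicon size."""
    return (before - after) / before


# Toy training and test data (not real corpus statistics).
train = ["bet", "yebet", "betoc"]
test = ["yebetoc", "bet"]

word_vocab = set(train)
unit_vocab = {u for w in train for u in split_affixes(w)}
test_units = [u for w in test for u in split_affixes(w)]

word_oov = oov_rate(test, word_vocab)        # unseen full word counts as OOV
unit_oov = oov_rate(test_units, unit_vocab)  # its parts were all seen
```

On this toy data the unseen word `yebetoc` is out of vocabulary at the word level, but all of its units appear in the training data, which is the effect the abstract reports at corpus scale. Applying `relative_reduction` to the reported figures gives roughly a 31% relative drop in OOV rate (7.0% to 4.8%) and roughly an 11% smaller lexicon (133k to 119k).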