Corpus Studies in Word Prediction
By Trnka, Keith; McCoy, Kathleen F.; ASSETS 2007 - The Ninth International ACM SIGACCESS Conference on Computers and Accessibility, pp. 195-202
Publication Date: October 15-17, 2007
Outline of the development of a word-prediction system to enhance the communication rate of people with disabilities who use Augmentative and Alternative Communication (AAC) devices. The basis of the system is a language model trained on a large corpus of data. The model predicts the next word of input based on what the user has already typed. The system is evaluated by calculating theoretical keystroke savings and correcting poor predictions. Training and testing were done on a wide variety of corpora, including conversational speech transcriptions, emails from AAC users, and articles from the online magazine Slate. Three tests were used to investigate the effects of training data for each corpus: in-domain, using the same corpus for training and testing; out-of-domain, using the training sets of all corpora except the one used for testing; and mixed-domain, using the training sets of all corpora and evaluating on the testing set of each corpus. Topic modeling, which looks for patterns of words that tend to occur together and automatically categorizes them into topics, was implemented on one corpus. The study found that training on a combination of in-domain and out-of-domain data is often more beneficial than training on either set alone, and that topic modeling is portable even when applied to very different text.
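To make the evaluation setup in the abstract concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a simple word-frequency model as a stand-in for the trained language model, a prediction window of five words, and a cost of one keystroke for selecting a prediction (taken to also insert the following space). The function names (train_unigram, predict, keystroke_savings, training_tokens) are hypothetical and chosen only for illustration. Keystroke savings is computed here as the percentage of keystrokes avoided relative to typing every character plus a space.

```python
from collections import Counter


def train_unigram(tokens):
    """Count word frequencies; an assumed stand-in for the trained language model."""
    return Counter(tokens)


def predict(model, prefix, window=5):
    """Return up to `window` known words starting with `prefix`, most frequent first."""
    matches = [w for w in model if w.startswith(prefix)]
    matches.sort(key=lambda w: (-model[w], w))  # frequency, then alphabetical
    return matches[:window]


def keystrokes_for_word(model, word, window=5):
    """Keystrokes needed if the user selects the word as soon as it is predicted."""
    for typed in range(len(word) + 1):
        if word in predict(model, word[:typed], window):
            return typed + 1  # letters typed so far plus one selection keystroke
    return len(word) + 1  # never predicted: every letter plus a space


def keystroke_savings(model, test_tokens, window=5):
    """Percentage of keystrokes saved relative to typing each word plus a space."""
    baseline = sum(len(w) + 1 for w in test_tokens)
    if baseline == 0:
        return 0.0
    used = sum(keystrokes_for_word(model, w, window) for w in test_tokens)
    return 100.0 * (baseline - used) / baseline


def training_tokens(train_sets, target, condition):
    """Assemble training data for one of the three conditions named in the abstract.

    `train_sets` maps a corpus name to its list of training tokens (assumed layout).
    """
    if condition == "in-domain":
        return list(train_sets[target])
    if condition == "out-of-domain":
        return [t for name, toks in train_sets.items() if name != target for t in toks]
    if condition == "mixed-domain":
        return [t for toks in train_sets.values() for t in toks]
    raise ValueError(f"unknown condition: {condition}")


if __name__ == "__main__":
    # Toy data only; the corpus names and contents are invented for illustration.
    train_sets = {
        "speech": "the cat sat on the mat".split(),
        "email": "see you on the weekend".split(),
    }
    test = "the cat sat".split()
    for condition in ("in-domain", "out-of-domain", "mixed-domain"):
        model = train_unigram(training_tokens(train_sets, "speech", condition))
        print(condition, round(keystroke_savings(model, test), 1))
```

In this sketch, a word that is never predicted costs the same as typing it in full, so savings can only come from words the model has seen in training. That is one way to see why the choice of training corpus matters for the three conditions.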
Published by: Association for Computing Machinery (Website: http://www.acm.org)
SIGACCESS (ACM Special Interest Group on Accessible Computing) (Website: http://www.sigaccess.org)
Link to text: http://www.cis.udel.edu/~mccoy/sig-nlp-fall07/Trnka-corpus_study.pdf

