Computers employ speech recognition techniques in many scenarios to enhance the user's experience. For example, speech recognition is performed by personal computers (PCs), smartphones, and personal digital assistants (PDAs), among others, to give the user a touch-free way to control the computer and/or enter data. The growth of speech recognition has been facilitated by the use of statistical language models to increase speech recognition performance. One constraint of statistical language models is that they require a large volume of training data to provide acceptable speech recognition performance. Often the training data contains inaccuracies or inconsistencies that ultimately degrade the speech recognition performance of the statistical language models trained on it. The present concepts relate to automated data cleanup that increases the consistency of training data for use in speech recognition and/or related fields.