ASR technologies allow computers equipped with microphones to interpret human speech, either to transcribe it or to control a device. For example, a speaker-independent name dialer for mobile phones is one of the most widely distributed ASR applications in the world. Another example is speech-controlled vehicular navigation systems. A TTS synthesizer is a computer-based system that is designed to read text aloud by automatically creating sentences through a Grapheme-to-Phoneme (GTP) transcription of the sentences. The process of assigning phonetic transcriptions to words is called Text-to-Phoneme (TTP) or GTP conversion. Language processing modules, such as ASR modules, TTS synthesis modules, and language understanding and translation modules, generally utilize models that are trained using text data of some type.
In typical ASR or TTS systems, there are several data-driven language processing modules that have to be trained using text-based training data. For example, in data-driven syllable detection, the model may be trained using a manually annotated database. Data-driven approaches (e.g., neural networks, decision trees, n-gram models) are also commonly used for modeling language-dependent pronunciations in many ASR and TTS systems. The model is typically trained using a database that is a subset of a pronunciation dictionary containing GTP or TTP entries. One of the reasons for using just a subset is that it is impossible to create a dictionary containing the complete vocabulary for most languages. Yet another example of a trainable module is text-based language identification, in which the model is usually trained using a database that is a subset of a multilingual text corpus consisting of text entries in the target languages.
Additionally, the digital signal processing technique of vector quantization, which may be applicable to any number of applications, for instance ASR and TTS systems, utilizes a database. The database contains a representative set of actual data that is used to compute a codebook, which can define the centroids or a meaningful clustering in the vector space. Using vector quantization, an infinite variety of possible data vectors may be represented using the relatively small set of vectors contained in the codebook. Traditional vector quantization and clustering techniques designed for numerical data cannot be directly applied in cases where the data consists of text strings. The method described in this document provides an easy approach for clustering text data. Thus, it can be considered a technique for enabling vector quantization of text strings.
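For numerical data, the codebook idea described above can be illustrated with a minimal sketch. It assumes plain Lloyd (k-means) iterations with Euclidean distance and initial centroids taken from the first k data vectors; the function names `nearest` and `train_codebook` and the sample data are hypothetical and are not taken from the method described in this document.

```python
import math

def nearest(codebook, v):
    """Return the index of the codebook vector closest to v (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], v))

def train_codebook(data, k, iterations=20):
    """Compute k centroids with plain Lloyd (k-means) iterations.

    Assumption: the first k data vectors serve as the initial centroids.
    """
    codebook = [list(v) for v in data[:k]]
    for _ in range(iterations):
        # Assign every data vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            clusters[nearest(codebook, v)].append(v)
        # Move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                codebook[i] = [sum(c) / len(members) for c in zip(*members)]
    return codebook

# Illustrative two-dimensional data forming two obvious groups.
data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
        (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
codebook = train_codebook(data, k=2)

# Any new vector is represented by the index of its nearest codebook entry.
index = nearest(codebook, (9.0, 9.5))
```

The sketch relies on averaging and on a numerical distance, neither of which is defined for raw text strings, which is why such techniques cannot be applied directly to text data.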
The performance of the models mentioned above depends on the quality of the text data used in the training process. As a result, the selection of the database from the text corpus plays an important role in the development of these language processing modules. In practice, the database contains a subset of the entire corpus and should be as small as possible for several reasons. First, the larger the database, the greater the amount of time required to develop it and the greater the potential for errors or inconsistencies in creating it. Second, for decision tree modeling, the model size depends on the database size and thus impacts the complexity of the system. Third, the database size may require balancing among other resources. For example, in the training of a neural network, the number of entries for each language should be balanced to avoid a bias toward a certain language. Fourth, a smaller database requires less memory and enables faster processing and training.
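The balancing requirement mentioned for neural network training can be sketched as follows. This is only one possible approach, assumed here for illustration: each language is randomly downsampled to the entry count of the smallest language. The function name `balance_by_language`, the `(text, language)` pair format, and the sample entries are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_language(entries, seed=0):
    """Downsample every language to the entry count of the smallest language.

    entries: list of (text, language) pairs (an assumed format).
    """
    by_lang = defaultdict(list)
    for text, lang in entries:
        by_lang[lang].append(text)
    # The least-represented language sets the per-language quota.
    target = min(len(texts) for texts in by_lang.values())
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    balanced = []
    for lang in sorted(by_lang):
        for text in rng.sample(by_lang[lang], target):
            balanced.append((text, lang))
    return balanced

# Illustrative entries: three English strings and two Finnish strings.
entries = [("hello", "en"), ("world", "en"), ("speech", "en"),
           ("hei", "fi"), ("puhe", "fi")]
balanced = balance_by_language(entries)
```

Random downsampling of this kind equalizes the counts but pays no attention to which entries are kept, so it does not by itself address the quality of the selected data.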
The database selection from a corpus is currently performed arbitrarily or using decimation on a sorted data corpus. Another option is to do the selection manually. However, this requires a skilled professional, is very time consuming, and the result cannot be considered optimal. As a result, the information provided by the database is not optimized. The arbitrary selection method depends on random selections from the entire corpus without consideration for any underlying characteristics of the text data. The decimation selection method uses only the first characters of the strings, and thus does not guarantee good performance. Thus, what is needed is a method and a system for optimally selecting entries for a database from a corpus in such a manner that the context coverage of the entire corpus is maximized while the size of the database is minimized.
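The two automatic baseline methods described above can be sketched as follows. The sketch assumes that "decimation on a sorted data corpus" means sorting the entries alphabetically and keeping every n-th one; the function names `select_arbitrarily` and `select_by_decimation` and the sample corpus are hypothetical.

```python
import random

def select_arbitrarily(corpus, size, seed=0):
    """Pick `size` entries at random, ignoring all properties of the text."""
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    return rng.sample(corpus, size)

def select_by_decimation(corpus, size):
    """Sort the corpus and keep every n-th entry (assumed meaning of decimation)."""
    if not 0 < size <= len(corpus):
        raise ValueError("size must be between 1 and the corpus size")
    ordered = sorted(corpus)
    step = len(ordered) // size
    return ordered[::step][:size]

# Illustrative corpus of eight text strings.
corpus = ["banana", "apple", "cherry", "apricot",
          "blueberry", "avocado", "cranberry", "bilberry"]

random_subset = select_arbitrarily(corpus, 4)
decimated_subset = select_by_decimation(corpus, 4)
# decimated_subset is ["apple", "avocado", "bilberry", "cherry"]
```

Because alphabetical order is dominated by the leading characters, the decimated subset spreads over the initial letters of the corpus but takes no account of the contexts that occur deeper inside the strings.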