In a large-vocabulary speech recognition system, a text-to-phone component is required to convert text into its correct phone sequence. The system then selects the corresponding phonetic models and constructs recognition models based on that phone sequence. The size and complexity of the text-to-phone component vary widely across languages. For a language with more regular pronunciation than English, such as Spanish, a few hundred rules are enough to convert text to phones accurately. For a language with more irregular pronunciation, such as English, tens of thousands of rules are required. English needs about as many pronunciation rules as it has words, so the rule set is in effect a pronunciation dictionary. We use a typical English pronunciation dictionary with 70,955 entries, which occupies 1,826,302 bytes in ASCII form.
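As a minimal sketch of the dictionary-based text-to-phone conversion described above, the following Python fragment looks up each word of a phrase and concatenates the phones. The three entries, the phone symbols, and the function names are illustrative assumptions and are not taken from the 70,955-entry dictionary. The second function simply works out the average entry size implied by the figures given.

```python
# Toy pronunciation dictionary. The words and phone symbols below are
# illustrative assumptions, not entries from the dictionary in the text.
PRONUNCIATIONS = {
    "open": ["ow", "p", "ah", "n"],
    "web": ["w", "eh", "b"],
    "page": ["p", "ey", "jh"],
}


def text_to_phones(text):
    """Return the phone sequence for a whitespace-separated phrase."""
    phones = []
    for word in text.lower().split():
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(PRONUNCIATIONS[word])
    return phones


def average_entry_bytes(total_bytes, entries):
    """Average storage per dictionary entry, in bytes."""
    return total_bytes / entries


# Figures from the text: 1,826,302 bytes for 70,955 entries.
AVG_BYTES = average_entry_bytes(1_826_302, 70_955)
```

With the figures above, each entry averages a little under 26 bytes in ASCII form.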
It is highly challenging to implement the text-to-phone dictionary in an embedded system, such as part of a computer chip, where memory space is at a premium. One such use is building speech recognition models for a phrase used in a voice-controlled web browser. One starts with the text and looks up each word in a pronunciation dictionary to obtain its phones. Once the phones are identified and the phone sequence for each word is determined, the HMM (hidden Markov model) for each phone is selected. A dictionary of 1.8 Mbytes is too large a storage requirement for an embedded system. While compression techniques such as the asymptotically optimal Lempel-Ziv coding can reduce the size, data compressed in this way cannot be computed on or searched directly. The compressed data must be decompressed before a dictionary lookup can be performed. One must therefore have enough memory to hold the decompressed dictionary during the lookup, which defeats the purpose of saving memory space.
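The limitation of general-purpose compression can be illustrated with a short Python sketch. It uses the standard `zlib` module, whose DEFLATE format combines an LZ77 variant with Huffman coding, as a stand-in for Lempel-Ziv coding. The dictionary entries and the `lookup` helper are hypothetical. The point is only that the whole compressed blob has to be inflated before a single word can be found.

```python
import zlib

# Hypothetical dictionary text, one "word phones" entry per line.
ENTRIES = {"open": "ow p ah n", "page": "p ey jh", "web": "w eh b"}
RAW = "\n".join(f"{w} {p}" for w, p in sorted(ENTRIES.items())).encode("ascii")

# Compress the dictionary as a single stream.
COMPRESSED = zlib.compress(RAW, 9)


def lookup(compressed_blob, word):
    """Find the phones for `word`, or None if the word is absent.

    The compressed stream offers no way to jump to one entry, so the
    entire dictionary is decompressed into memory first.
    """
    text = zlib.decompress(compressed_blob).decode("ascii")
    for line in text.splitlines():
        head, _, phones = line.partition(" ")
        if head == word:
            return phones.split()
    return None
```

Every call to `lookup` allocates a buffer the size of the uncompressed dictionary, so the peak memory requirement is no smaller than it would be without compression.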
In addition, dictionary lookup usually requires data structures beyond the dictionary itself. For example, once the dictionary is loaded into memory, which in this example requires 1.8 Mbytes, a very large array of address pointers is also required to search it.
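One plausible form of such a pointer array is a table holding the byte offset of every entry, which allows a binary search over the loaded dictionary. The sketch below assumes newline-terminated entries sorted by word and a 4-byte pointer per entry. The text specifies neither, and all names are hypothetical.

```python
def build_index(blob):
    """Return the byte offset at which each newline-terminated entry starts."""
    offsets, pos = [], 0
    while pos < len(blob):
        offsets.append(pos)
        pos = blob.index(b"\n", pos) + 1
    return offsets


def word_at(blob, offset):
    """Return the word (bytes up to the first space) of the entry at `offset`."""
    return blob[offset:blob.index(b" ", offset)]


def find(blob, offsets, word):
    """Binary-search the offset table; return the phones or None."""
    key = word.encode("ascii")
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if word_at(blob, offsets[mid]) < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(offsets) and word_at(blob, offsets[lo]) == key:
        start = offsets[lo] + len(key) + 1  # skip the word and the space
        end = blob.index(b"\n", offsets[lo])
        return blob[start:end].decode("ascii").split()
    return None


def index_overhead_bytes(entries, pointer_bytes=4):
    """Size of the offset table, assuming fixed-width pointers."""
    return entries * pointer_bytes


# Hypothetical three-entry dictionary, sorted by word.
BLOB = b"open ow p ah n\npage p ey jh\nweb w eh b\n"
OFFSETS = build_index(BLOB)
```

Under the 4-byte assumption, the 70,955-entry dictionary would need 283,820 bytes of pointers, roughly 15% on top of the 1,826,302 bytes of dictionary text.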