1. Field of the Invention
The present invention relates generally to speech recognition systems, and more particularly, to a method of using efficient dictionary and grammar structures.
2. Description of the Related Art
A speech recognition system receives an audio stream as input and filters it to extract and isolate the sound segments that are speech. The speech recognition engine then analyzes the speech sound segments by comparing them to a defined pronunciation dictionary, a grammar recognition network and an acoustic model.
Sublexical speech recognition systems are usually equipped with a way to compose words and sentences from more fundamental units. For example, in a speech recognition system based on phoneme models, pronunciation dictionaries can be used as look-up tables to build words from their phonetic transcriptions. A grammar recognition network can then interconnect the words. Due to their complexity, grammar recognition networks are seldom represented as look-up tables and are instead usually represented by graphs. However, grammar recognition network graphs can be complicated structures that are difficult to handle and represent. Although there is no fixed standard for graphical representations of grammar recognition networks, one structure in current use is the Hidden Markov Model Toolkit (HTK) Standard Lattice Format (SLF).
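By way of illustration only, the look-up table role of a pronunciation dictionary described above can be sketched as follows. The words, the ARPAbet-style phoneme symbols and the names PRONUNCIATIONS and transcribe are assumptions introduced for this sketch and are not taken from any particular system.

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence.
# The entries and ARPAbet-style symbols are illustrative assumptions.
PRONUNCIATIONS = {
    "call": ["K", "AO", "L"],
    "home": ["HH", "OW", "M"],
}


def transcribe(words):
    """Expand a word sequence into one flat phoneme sequence by table look-up."""
    phonemes = []
    for word in words:
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


print(transcribe(["call", "home"]))
```

A word absent from the table raises an error rather than being silently skipped, which makes a gap in the dictionary visible at look-up time.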
SLF can be used to represent both multiple recognition hypotheses, in the form of a word lattice, and a grammar recognition network for speech recognition. The format is composed of various fields, the most relevant of which are the node and link fields. Together, these two fields define the grammar graph. Each node represents one of the vertices of the graph and each link represents one of the graph arcs. The words in the grammar can be associated with either the nodes or the links. The links can also be associated with N-gram likelihoods, word transition probabilities and acoustic probabilities.
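As a minimal sketch of the node and link fields, the following reads a small SLF-style text into a node table and a link list. It handles only a simplified subset of the format, with words attached to nodes; the lattice content, the likelihood values and the name parse_slf are assumptions made for illustration.

```python
# Simplified SLF-style text. Field letters follow HTK usage (N/L counts,
# I node id, W word, J link id, S start node, E end node, l language-model
# log likelihood); the lattice content itself is made up for illustration.
SLF_TEXT = """\
VERSION=1.0
N=4 L=3
I=0 W=!NULL
I=1 W=call
I=2 W=home
I=3 W=!NULL
J=0 S=0 E=1 l=0.0
J=1 S=1 E=2 l=-0.7
J=2 S=2 E=3 l=0.0
"""


def parse_slf(text):
    """Return (nodes, links) from a minimal subset of SLF."""
    nodes, links = {}, []
    for line in text.splitlines():
        fields = dict(item.split("=", 1) for item in line.split())
        if "I" in fields:  # node definition: id -> word
            nodes[int(fields["I"])] = fields.get("W")
        elif "J" in fields:  # link definition: (start, end, likelihood)
            links.append(
                (int(fields["S"]), int(fields["E"]), float(fields.get("l", 0.0)))
            )
    return nodes, links


nodes, links = parse_slf(SLF_TEXT)
print(nodes)
print(links)
```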
In the context of efficient grammar graph representation, one disadvantage of SLF is the explicitness with which it lists nodes and links. Fundamentally, when words are associated with the nodes, each SLF node is able to represent one and only one word. By the same token, each link represents one and only one transition between nodes. This explicitness makes it difficult for a human reader to interpret the contents of the grammar and, more importantly, requires the speech recognition system to handle a large memory object.
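The growth caused by this explicitness can be sketched with a simple count. The sketch assumes a grammar made of consecutive slots of alternative words, with every word given its own node and consecutive slots fully connected without shared null nodes; the function name explicit_size and the example words are hypothetical.

```python
def explicit_size(slots):
    """Count nodes and links when every word gets its own node.

    `slots` is a list of word-alternative lists. Assumes consecutive slots
    are fully connected and that no shared null nodes are used.
    """
    nodes = sum(len(slot) for slot in slots)
    links = sum(len(a) * len(b) for a, b in zip(slots, slots[1:]))
    return nodes, links


grammar = [["call", "dial"], ["alice", "bob", "carol"], ["now", "later"]]
print(explicit_size(grammar))
```

Under these assumptions, the link count grows with the product of the sizes of neighboring slots, so adding one more name to the middle slot adds four links as well as one node.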
Another disadvantage of SLF is its lack of association with other elements of the recognition system, particularly the pronunciation dictionary. The interaction between the grammar recognition network and the pronunciation dictionary depends on the specific implementation of the speech recognition process. However, as long as the grammar recognition network and the pronunciation dictionary are separate entities, the speech recognition system can operate in undesirable ways. For example, errors in the pronunciation dictionary are not visible from the grammar recognition network, and vice versa. Furthermore, it can be difficult to have changes made in one reflected in the other.
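One consequence of keeping the two structures separate is that a mismatch between them must be found by an explicit cross-check. A minimal sketch of such a check, assuming the grammar words are available as a list and the dictionary as a word-keyed mapping (the function name and data are hypothetical), is:

```python
def missing_pronunciations(grammar_words, dictionary):
    """Return grammar words that have no entry in the pronunciation dictionary."""
    return sorted(set(grammar_words) - set(dictionary))


# Illustrative data only: "office" appears in the grammar but not the dictionary.
grammar_words = ["call", "home", "office"]
dictionary = {"call": ["K", "AO", "L"], "home": ["HH", "OW", "M"]}
print(missing_pronunciations(grammar_words, dictionary))
```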
In view of the foregoing, there is a need for a more efficient method that can represent a unified layered dictionary and grammar structure.