1. Field of the Invention
The invention relates to speech recognition systems and, more particularly, to large-vocabulary speech recognition systems. The system described herein is suitable for use in systems providing interactive natural language discourse.
2. The Prior Art
Speech recognition systems convert spoken language to a form that is tractable by a computer. The resultant data string may be used to control a physical system, may be output by the computer in textual form, or may be used in other ways.
An increasingly popular use of speech recognition systems is to automate transactions requiring interactive exchanges. An example of a system with limited interaction is a telephone directory response system in which the user supplies information of a restricted nature such as the name and address of a telephone subscriber and receives in return the telephone number of that subscriber. An example of a substantially more complex such system is a catalogue sales system in which the user supplies information specific to himself or herself (e.g., name, address, telephone number, special identification number, credit card number, etc.) as well as further information (e.g., nature of item desired, size, color, etc.) and the system in return provides information to the user concerning the desired transaction (e.g., price, availability, shipping date, etc.).
Recognition of natural, unconstrained speech is very difficult. The difficulty is increased when there is environmental background noise or a noisy channel (e.g., a telephone line). Computer speech recognition systems typically require the task to be simplified in various ways. For example, they may require the speech to be noise-free (e.g., by using a good microphone), they may require the speaker to pause between words, or they may limit the vocabulary to a small number of words. Even in large-vocabulary systems, the vocabulary is typically defined in advance. The ability to add words to the vocabulary dynamically (i.e., during a discourse) is typically limited, or even nonexistent, due to the significant computing capabilities required to accomplish the task on a real-time basis. The difficulty of real-time speech recognition is dramatically compounded in very large-vocabulary applications (e.g., tens of thousands of words or more).
One example of an interactive speech recognition system under current development is the SUMMIT speech recognition system being developed at M.I.T. This system is described in Zue, V., Seneff, S., Polifroni, J., Phillips, M., Pao, C., Goddeau, D., Glass, J., and Brill, E., "The MIT ATIS System: December 1993 Progress Report," Proc. ARPA Human Language Technology Workshop, Princeton, N.J., March 1994, among other papers. Unlike most other systems, which are frame-based (the frame typically being a 10 ms portion of speech), the SUMMIT speech recognition system is segment-based, the segment typically being a speech sound or phone.
In the SUMMIT system, the acoustic signal representing a speaker's utterances is first converted into an electrical signal for signal processing. The processing may include filtering to enhance subsequent recognizability of the signal, remove unwanted noise, etc. The signal is converted to a spectral representation, then divided into segments corresponding to hypothesized boundaries of individual speech sounds. The network of hypothesized segments is then passed to a phonetic classifier whose purpose is to seek to associate each segment with a known "phone" or speech sound identity. Because of uncertainties in the recognition process, each segment is typically associated with a list of several phones, with probabilities associated with each phone. Both the segmentation and the classification are performed in accordance with acoustic models for the possible speech sounds.
The end product of the phonetic classifier is a "lattice" of phones, each phone having a probability associated therewith. The actual words spoken at the input to the recognizer should form a path through this lattice. Because of the uncertainties of the process, there are usually on the order of millions of possible paths to be considered, each of different overall probability. A major task of the speech recognizer is to associate the segments along paths in the phoneme lattice with words in the recognizer vocabulary to thereby find the best path.
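The path search described above can be illustrated with a minimal sketch. It assumes a toy lattice in which each arc carries a phone label and a probability; the node numbering, the phone labels, the probabilities, and the `best_path` helper are all hypothetical and are not taken from the SUMMIT system. The sketch finds only the most probable phone sequence; a real recognizer would additionally restrict the search to paths that spell out words in its lexical network.

```python
import math

# Toy phone lattice: LATTICE[node] -> list of (next_node, phone, probability).
# Node ids, phones, and probabilities are illustrative assumptions.
LATTICE = {
    0: [(1, "f", 0.7), (1, "th", 0.3)],
    1: [(2, "ow", 0.6), (2, "aa", 0.4)],
    2: [(3, "n", 0.8), (3, "m", 0.2)],
    3: [],
}


def best_path(lattice, start, end):
    """Return (log_prob, phones) of the most probable path from start to end.

    Assumes node ids are topologically ordered (arcs lead to higher ids).
    """
    best = {start: (0.0, [])}
    for node in sorted(lattice):
        if node not in best:
            continue  # node is not reachable from start
        score, phones = best[node]
        for nxt, phone, prob in lattice[node]:
            cand = (score + math.log(prob), phones + [phone])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[end]


log_prob, phones = best_path(LATTICE, 0, 3)
```

Log probabilities are summed rather than multiplying raw probabilities so that long paths do not underflow.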
In prior art systems, such as the SUMMIT system, the vocabulary or lexical representation is a "network" that encodes all possible words that the recognizer can identify, all possible pronunciations of these words, and all possible connections between these words. This vocabulary is usually defined in advance, that is, prior to attempting to recognize a given utterance, and is usually fixed during the recognition process. Thus, if a word not already in the system's vocabulary is spoken during a recognition session, the word will not successfully be recognized.
The structure of current lexical representation networks does not readily lend itself to rapid updating when large vocabularies are involved, even when done on an "off-line" basis, that is, in the absence of speech input. In particular, in prior art lexical representations of the type exemplified by the SUMMIT recognition system, the lexical network is formed as a number of separate pronunciation networks, one for each word in the vocabulary, together with links establishing the possible connections between words. The links are placed based on phonetic rules. In order to add a word to the network, all words presently in the vocabulary must be checked in order to establish phonetic compatibility between the respective nodes before the links are established. This is a computationally intensive problem whose difficulty increases as the size of the vocabulary increases. Thus, the word addition problem is a significant issue in phonetically-based speech recognition systems.
In present speech recognition systems, a precomputed language model is employed during the search through the lexical network to favor sequences of words which are likely to occur in spoken language. The language model can provide the constraint to make a large vocabulary task tractable. This language model is generally precomputed based on the predefined vocabulary, and thus is generally inappropriate for use after adding words to the vocabulary.
A. Objects of the Invention
Accordingly, it is an object of the invention to provide an improved speech recognition system.
A further object of the invention is to provide a speech recognition system which facilitates the rapid addition of words to the vocabulary of the system.
Still a further object of the invention is to provide an improved speech recognition system which facilitates vocabulary addition during the speech recognition process without appreciably slowing the speech recognition process or disallowing use of a language model.
Yet another object of the invention is to provide a speech recognition system which is particularly suited to active vocabularies on the order of thousands of words and greater and total vocabularies of millions of words and greater.
Still a further object of the invention is to provide a speech recognition system which can use constraints from large databases without appreciably slowing the speech recognition process.
In accordance with the present invention, the lexical network containing the vocabulary that the system is capable of recognizing includes a number of constructs (defined herein as "word class" nodes, "phonetic constraint" nodes, and "connection" nodes) in addition to the word begin and end nodes commonly found in speech recognition systems. (A node is a connection point within the lexical network. Nodes may be joined by arcs to form paths through the network. Some of the arcs between nodes specify speech segments, i.e., phones.) These constructs effectively precompile and organize both phonetic and syntactic/semantic information and store it in a readily accessible form in the recognition network. This enables the rapid and efficient addition of words to the vocabulary, even in a large vocabulary system (i.e., thousands of active words) and even on a real-time basis, i.e., during interaction with the user. The present invention preserves the ability to comply with phonetic constraints between words and to use a language model in searching the network to thereby enhance recognition accuracy. Thus, a large vocabulary interactive system (i.e., one in which inputs by the speaker elicit responses from the system which in turn elicit further input from the speaker) such as a catalogue sales system can be constructed. The effective vocabulary can be very large (i.e., millions of words) without requiring a correspondingly large active (random access) memory because not all the words in it need be "active" (that is, connected into the lexical recognition network) at once.
In accordance with the present invention, the vocabulary is categorized into three classes. The most frequently used words are precompiled into the lexical network; typically, there will be several hundred of such words, connected into the lexical network with their phonetically permissible variations. Words of lesser frequency are stored as phonemic baseforms. A baseform represents an idealized pronunciation of a word, without the variations which in fact occur from one speaker to another and in varying context. The present invention may incorporate several hundred thousand of such baseforms, from which a word network may rapidly be constructed in accordance with the present invention. The least frequently used words are stored as spellings. New words are entered into the system as spellings (e.g., from an electronic database which is updated periodically). To make either one of the least frequently used words or a completely new word active, the system first creates a phonemic baseform from the spelling. It then generates a pronunciation network from the phonemic baseform in the manner taught by the present invention.
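The three vocabulary classes can be pictured with the following minimal sketch. The word lists, the `baseform_for` helper, and in particular the letter-to-phoneme table are hypothetical placeholders; the text above does not specify how a baseform is derived from a spelling, so the one-letter-per-phoneme mapping is only a stand-in for whatever spelling-to-sound method is actually used.

```python
# Tier 1: most frequent words, already compiled into the lexical network.
PRECOMPILED = {"the", "is", "what"}

# Tier 2: less frequent words, stored as phonemic baseforms.
BASEFORMS = {"phone": ["f", "ow", "n"]}

# Tier 3 (and brand-new words) arrive as spellings only. This toy table is an
# assumption standing in for a real spelling-to-phoneme conversion.
TOY_LETTER_TO_PHONEME = {"z": "z", "i": "ih", "p": "p"}


def spelling_to_baseform(spelling):
    """Placeholder conversion: one phoneme per letter (illustrative only)."""
    return [TOY_LETTER_TO_PHONEME[ch] for ch in spelling]


def baseform_for(word):
    """Return (tier, baseform) describing what is needed to activate a word.

    Precompiled words need no further work, so their baseform is None.
    """
    if word in PRECOMPILED:
        return ("precompiled", None)
    if word in BASEFORMS:
        return ("baseform", BASEFORMS[word])
    return ("spelling", spelling_to_baseform(word))
```

In this reading, only the second and third tiers require a pronunciation network to be generated at activation time.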
The phonetic constraint nodes (referred to hereinafter as PC nodes or PCNs) organize the inter-word phonetic information in the network. A PC node is a tuple, PC (x, y, z . . . ), where x, y, z are constraints on words that are, or can be, connected to the particular node. For example, x may specify the end phone of a word; y the beginning phone of a word with which it may be connected in accordance with defined phonetic constraints; and z a level of stress required on the following syllable. While tuples of any desired order (the order being the number of constraints specified for the particular PCN) may be used, the invention is most simply described by tuples of order two, e.g., PCN (x, y). Thus, PCN (null, n) may specify a PCN to which a word with a "null" ending (e.g., the dropped "ne" in the word "phone") is connected and which in turn will ultimately connect to words beginning with the phoneme /n/.
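An order-two PC node can be represented very compactly. The following is a minimal sketch; the field names and the `accepts_following_word` helper are hypothetical, and Python's `None` is used here to stand for the "null" ending.

```python
from typing import NamedTuple, Optional


class PCN(NamedTuple):
    """Order-two phonetic constraint node, PCN (x, y)."""
    end_phone: Optional[str]   # x: end phone of the preceding word; None = "null"
    next_begin_phone: str      # y: required beginning phone of the following word


def accepts_following_word(pcn, first_phone):
    """True if a word beginning with first_phone may follow through this PCN."""
    return pcn.next_begin_phone == first_phone


# PCN (null, n): a word whose final /n/ was dropped, followed by a word in /n/.
pcn = PCN(end_phone=None, next_begin_phone="n")
```

Because the tuple is hashable, it can serve directly as a dictionary key when the nodes are stored in a lookup table.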
Word Class Nodes (referred to hereinafter as WC nodes or WCNs) organize the syntactic/semantic information in the lexical network and further facilitate adding words to the system vocabulary. Examples of word class nodes are parts-of-speech (e.g., noun, pronoun, verb) or semantic classes (e.g., "last name", "street name", or "zip code"). Both the words that form the base vocabulary of the speech recognition system (and therefore are resident in the lexical network to define the vocabulary that the system can recognize), as well as those that are to be added to this vocabulary are associated with predefined word classes.
Words are incorporated into the lexical network by connecting their begin and end nodes to WC nodes. The WCNs divide the set of words satisfying a particular PCN constraint into word classes. There may be a general set of these word classes, e.g., nouns, pronouns, verbs, "last name", "street name", "zip code", etc. available for connection to the various PCNs. On connecting a specific instance of a set member (e.g., "noun") to a PCN, it is differentiated by associating a further, more specific characteristic to it, e.g., "noun ending in /n/", "noun ending in 'null'", etc. Each specific instance of a WCN connects to only one particular PCN. So, for example, there may be a "noun" WCN connected to the (null, n) PCN which is separate from a "noun" WCN connected to the (vowel, n) PCN. To qualify for connection to a given WC node, a word must not only be of the same word class as the WC node to which it is to be connected, but must also satisfy the phonetic constraint of the PCN to which that WC node is connected, e.g., noun ending in "null".
The PC nodes are interconnected through word connection nodes (hereinafter referred to as CONN nodes) which define the allowable path between the end node of a word and the begin node of a following word. Effectively, CONN nodes serve as concentrators, that is, they link those PC nodes which terminate a word with those PC nodes which begin a succeeding word which may follow the preceding word under the phonetic constraints that are applicable. These constraints are effectively embedded in the WC nodes, the PC nodes, the CONN nodes, and their interconnections.
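One possible reading of how these node types cooperate is sketched below. The `LexicalNetwork` class, its dictionary layout, and the example words and phone labels are hypothetical simplifications: each WCN is identified by its word class together with the PCN it connects to, and a CONN node is represented implicitly by the phone that the ending PCN requires and the beginning WCN supplies. The point of the sketch is that adding a word touches only the WCNs it attaches to, not the other words in the vocabulary.

```python
from collections import defaultdict


class LexicalNetwork:
    """Toy network in which words attach to WCNs and WCNs are keyed by PCN."""

    def __init__(self):
        # (word_class, (end_phone, next_phone)) -> words ending at that WCN
        self.end_wcns = defaultdict(set)
        # (word_class, begin_phone) -> words beginning at that WCN
        self.begin_wcns = defaultdict(set)

    def add_word(self, word, word_class, begin_phone, end_pcns):
        # Only dictionary lookups are needed; no existing word is examined.
        self.begin_wcns[(word_class, begin_phone)].add(word)
        for pcn in end_pcns:
            self.end_wcns[(word_class, pcn)].add(word)

    def followers(self, word, word_class):
        """Words reachable via end WCN -> PCN -> CONN -> PCN -> begin WCN."""
        result = set()
        for (cls, (end_phone, next_phone)), words in self.end_wcns.items():
            if cls != word_class or word not in words:
                continue
            for (next_cls, begin_phone), next_words in self.begin_wcns.items():
                if begin_phone == next_phone:  # both sides meet at one CONN node
                    result |= next_words
        return result


net = LexicalNetwork()
# None stands for the "null" ending (final /n/ dropped before a following /n/).
net.add_word("phone", "noun", "f", [(None, "n"), ("n", "t")])
net.add_word("number", "noun", "n", [("r", "t")])
net.add_word("table", "noun", "t", [("l", "n")])
```

Note that `followers` iterates over WCN keys rather than over individual words, so its cost grows with the number of node types and not with the vocabulary size.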
In order to add a word to the lexical network, it is necessary first to create a pronunciation network for that word. A given word will typically be subject to a number of different pronunciations, due in part to the phonetic context in which it appears (e.g., the end phoneme /n/ in the word "phone" may be dropped when the following word begins with an /n/ (e.g., "number")), and in part to other factors such as speaker dialect, etc. Variations in pronunciation which are due to differing phonetic context are commonly modeled by standard rules which define, for each phoneme, the ways in which it may be pronounced, depending on the surrounding context. In the present invention, network fragments corresponding to the operation of these rules on each phoneme are precompiled into binary form and stored in the system, indexed by phoneme. The precompiled network fragments include labels specifying allowed connections to other fragments and associations with PCNs. These labels are of two types: the first refers to the phoneme indexes of other pronunciation networks; the second refers to specific branches within the pronunciation networks which are allowed to connect to the first pronunciation network. Pronunciation networks for phonemes precompiled according to this method allow the rapid generation of pronunciations for new words to thereby facilitate word addition dynamically, i.e., during the speech recognition process itself.
In adding a word to the lexical network, the word is associated with a phonemic baseform and a word class. Its pronunciation network is generated by choosing the network fragment associated with phonemes in the phonemic baseform of the word and then interconnecting the fragments according to the constraints at their end nodes. The ensuing structure is a pronunciation network typically having a multiplicity of word begin and word end nodes to allow for variation in the words which precede and follow. The resultant pronunciation network is linked to the word class nodes in the manner described above.
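The assembly of a pronunciation network from per-phoneme fragments can be sketched as follows. This is a simplified illustration under stated assumptions: the `FRAGMENTS` table, its single context rule (final /n/ may be dropped before a following /n/), and the `pronunciations` helper are hypothetical, and the network is flattened into a list of alternative paths rather than stored as a graph in binary form.

```python
# Hypothetical precompiled fragments, indexed by phoneme. Each branch is
# (surface_phone, required_next_phoneme); None means "no constraint" and an
# empty surface phone means the phoneme is not realized.
FRAGMENTS = {
    "f": [("f", None)],
    "ow": [("ow", None)],
    "n": [("n", None), ("", "n")],  # /n/ may be dropped before another /n/
}


def pronunciations(baseform):
    """Return a list of (phones, end_constraint) paths for a phonemic baseform.

    end_constraint is the phoneme the following word must begin with
    (None if any following word is allowed); it plays the role of the
    y element of a PCN (x, y).
    """
    paths = [([], None)]
    for i, phoneme in enumerate(baseform):
        nxt = baseform[i + 1] if i + 1 < len(baseform) else None
        new_paths = []
        for phones, _ in paths:
            for surface, required_next in FRAGMENTS[phoneme]:
                if nxt is not None and required_next not in (None, nxt):
                    continue  # branch not allowed in this word-internal context
                # Only the last phoneme's constraint applies across the word boundary.
                end_constraint = required_next if nxt is None else None
                new_paths.append((phones + ([surface] if surface else []), end_constraint))
        paths = new_paths
    return paths


phone_paths = pronunciations(["f", "ow", "n"])
```

Each resulting path would end in its own word end node, which is then linked to the WCN whose PCN matches the path's end constraint.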
In the present invention, the words are organized by word class, and each added word is required to be associated with a predefined word class in order to allow use of a language model based on word classes during the search of the lexical network, even with added words. Predefined words are not required to belong to a word class; they may be treated individually. The language model comprises functions which define the increment to the score of a path on leaving a particular word class node or word end node or on arriving at a particular word class node or word begin node. A function may depend on both the source node and the destination node.
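A class-based score increment of this kind might look like the sketch below. The transition table, the probabilities, the fallback value for unseen transitions, and the helper names are all illustrative assumptions; the text specifies only that the increment may depend on the source node and the destination node.

```python
import math

# Hypothetical log-probability increments for moving from a source word class
# to a destination word class.
CLASS_TRANSITION = {
    ("<start>", "digit"): math.log(0.9),
    ("digit", "street name"): math.log(0.8),
}
UNSEEN = math.log(1e-4)  # assumed penalty for transitions not in the table

WORD_CLASS = {"five": "digit", "main": "street name"}


def transition_increment(src_class, dst_class):
    """Score increment on leaving src_class and arriving at dst_class."""
    return CLASS_TRANSITION.get((src_class, dst_class), UNSEEN)


def path_score(words):
    """Sum the class-transition increments along a word sequence."""
    score, prev = 0.0, "<start>"
    for word in words:
        cls = WORD_CLASS[word]
        score += transition_increment(prev, cls)
        prev = cls
    return score


# A word added at run time needs only a class label to be scored.
WORD_CLASS["elm"] = "street name"
```

Because the table is indexed by class, the added word "elm" receives the same transition scores as the predefined word "main" without any recomputation of the model.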
In accordance with the present invention, constraints from electronic databases are used to make the large vocabulary task tractable. The discourse history of speech frequently can also provide useful information as to the likely identification of words yet to be uttered. In the present invention, the discourse history is used in conjunction with a database to invoke different language models and different vocabularies for different portions of the discourse. In many applications, the system will first pose a question to the user with words drawn from a small-vocabulary domain. The user's response to the question is then used to narrow the vocabulary that needs to be searched for a subsequent discourse involving a large domain vocabulary. As an example, in a catalogue sales system, the system may need to determine the user's address. The system will first ask: "What is your zip code?", and then use the response to fill in the "street name" word class from street names found in the database that have the same zip code as the stated address. The street names so determined are quickly added to the system vocabulary, and street names previously in the vocabulary are removed to provide the requisite room in active memory for the street names to be added and to reduce the size of the network to be searched. The language model, i.e., the probabilities assigned to the various street names so selected, may be established a priori or may be based on other data within the system database, e.g., the number of households on each street, or a combination of these and other information items. Similarly, the system may ask for the caller's phone number first, then use the response and an electronic phonebook database to add to the vocabulary the names and addresses corresponding to the hypothesized phone numbers.
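The zip code example can be sketched as a simple swap of one word class. The database contents, the household counts, and the `activate_street_names` helper are invented for illustration; the probabilities are derived from household counts, which is one of the options mentioned above.

```python
# Hypothetical database: zip code -> {street name: number of households}.
STREETS_BY_ZIP = {
    "02139": {"main": 120, "vassar": 30},
    "10001": {"broadway": 500},
}


def activate_street_names(active_vocab, zip_code):
    """Replace the 'street name' class with the streets found in zip_code.

    Previously active street names are dropped, and each new name receives
    a probability proportional to its household count.
    """
    households = STREETS_BY_ZIP.get(zip_code, {})
    total = sum(households.values())
    active_vocab["street name"] = {
        name: count / total for name, count in households.items()
    }
    return active_vocab


vocab = {"street name": {"broadway": 1.0}, "digit": {"five": 1.0}}
activate_street_names(vocab, "02139")
```

Other word classes, such as "digit" in this toy vocabulary, are left untouched by the swap.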
The extensive use of large electronic databases while interacting with the user necessitates an efficient database search strategy, so that the recognition process is not slowed appreciably. In accordance with the present invention, hash tables are employed to index the records in the database, and only that information which is needed for the task at hand is stored with the hash tables.
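A minimal sketch of such an index is given below, using a Python dictionary (which is itself a hash table). The record layout, the sample data, and the `build_index` helper are hypothetical; the sketch shows only the idea of keeping, per hashed key, just the fields needed for the task at hand.

```python
# Hypothetical database records; all values are invented sample data.
RECORDS = [
    {"name": "smith", "street": "main", "zip": "02139", "history": "orders"},
    {"name": "jones", "street": "vassar", "zip": "02139", "history": "returns"},
    {"name": "lee", "street": "broadway", "zip": "10001", "history": "orders"},
]


def build_index(records, key, keep):
    """Hash records by `key`, storing only the fields listed in `keep`."""
    index = {}
    for record in records:
        index.setdefault(record[key], []).append(
            tuple(record[field] for field in keep)
        )
    return index


# Index for the "street name" task: zip code -> street names only.
streets_by_zip = build_index(RECORDS, key="zip", keep=["street"])
```

A lookup such as `streets_by_zip["02139"]` then takes expected constant time regardless of the number of records in the database.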