The present invention relates to a method that builds phrase grammars from a corpus of speech, text, phonemes or any kind of symbolic input (herein "the corpus").
It has long been a goal of computing systems to interact with human users in the users' own natural language. That is, rather than restricting the user to predetermined syntactic commands, it would be preferable to have the user express a command in the most natural way for the user and to have a computer comprehend the command. Although modern computing systems have improved remarkably in their ability to recognize spoken words, comprehension of speech still is limited because these systems cannot ascribe meanings to the commands.
Significant advances have been made in the ability of modern computing systems to acquire phrases from a corpus. For example, acquisition techniques are disclosed in U.S. patent application Ser. No. 08/960,291, entitled "Automatic Generation of Superwords," filed Oct. 29, 1997. Other examples may be found in E. Giachin, "Phrase Bigram for Continuous Speech Recognition," Proc. ICASSP, pp. 225-228 (1995), and K. Ries, et al., "Improved Language Modeling by Unsupervised Acquisition of Structure," Proc. ICASSP, pp. 193-196 (1995).
Additionally, advances have been made in the ability of such systems to classify words that possess similar lexical significance. The inventors, for example, have developed a clustering technique as disclosed in co-pending U.S. patent application Ser. No. 207,326 entitled "Automatic Clustering of Tokens from a Corpus of Speech," the disclosure of which is incorporated herein. Clustering processing also is disclosed in Kneser, et al., "Improved Clustering Techniques for Class-Based Statistical Language Modeling," Eurospeech (1993) and in McCandless, et al., "Empirical Acquisition of Word and Phrase Classes in the Atis Domain," Third European Conf. Speech Comm. Tech. (1993).
While phrase acquisition and clustering techniques improve the ability of a computing system to comprehend speech, neither technique alone can build a structure model from a corpus of speech or text. Accordingly, there is a need in the art for a method for building a linguistic model from a corpus of speech or text.
The present invention provides a method that combines clustering techniques and phrase acquisition techniques in a closed-loop optimization method to build complex linguistic models from a corpus. A set of features is initialized from the corpus. Thereafter, the method determines, according to a predetermined cost function, whether to process the features by phrase clustering processing or by phrase grammar learning processing. If phrase clustering processing is performed, the method then processes an interstitial set of features, comprising both the old features and the newly established clusters, by phrase grammar learning processing. The features obtained as an output of phrase grammar learning are re-indexed as the set of features for a subsequent iteration. The method may be repeated over several iterations to build a hierarchical linguistic model.
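By way of illustration only, the iterative flow summarized above may be sketched as follows. The sketch is not the disclosed implementation. The function names (`cluster_step`, `phrase_step`, `clustering_pays_off`, `build_model`), the fixed class map, the pair-frequency phrase rule, and the stand-in cost test are all assumptions introduced for the example; the specification does not define the cost function or the learning procedures at this point.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Corpus = List[List[str]]


def cluster_step(corpus: Corpus, classes: Dict[str, str]) -> Tuple[Corpus, Set[str]]:
    """Toy clustering (assumption): rewrite tokens using a fixed class map."""
    rewritten = [[classes.get(tok, tok) for tok in sent] for sent in corpus]
    used = {classes[tok] for sent in corpus for tok in sent if tok in classes}
    return rewritten, used


def phrase_step(corpus: Corpus, min_count: int = 2) -> Tuple[Corpus, Set[str]]:
    """Toy phrase learning (assumption): fuse the most frequent adjacent pair."""
    pairs = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
    if not pairs:
        return corpus, set()
    (a, b), count = pairs.most_common(1)[0]
    if count < min_count:
        return corpus, set()
    phrase = f"{a}_{b}"
    merged: Corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == (a, b):
                out.append(phrase)  # replace the pair with one phrase token
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged, {phrase}


def clustering_pays_off(corpus: Corpus, classes: Dict[str, str]) -> bool:
    """Stand-in for the cost function (assumption): cluster only if a token would change."""
    return any(tok in classes for sent in corpus for tok in sent)


def build_model(corpus: Corpus, classes: Dict[str, str],
                iterations: int = 3) -> Tuple[Corpus, Set[str]]:
    # Initialize the feature set from the corpus vocabulary.
    features: Set[str] = {tok for sent in corpus for tok in sent}
    for _ in range(iterations):
        interstitial = set(features)
        if clustering_pays_off(corpus, classes):
            corpus, clusters = cluster_step(corpus, classes)
            interstitial |= clusters  # old features plus newly established clusters
        corpus, phrases = phrase_step(corpus)
        features = interstitial | phrases  # feature set for the next iteration
    return corpus, features
```

For a three-sentence corpus such as "fly to boston", "fly to denver", "drive to boston" with both city names mapped to a hypothetical `<CITY>` label, the first iteration of this sketch clusters the cities and learns the phrase `to_<CITY>`, and the second iteration builds `fly_to_<CITY>` on top of it, which gives a small example of the hierarchical structure referred to above.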