The present invention relates generally to speech recognition systems for electronic commerce (E-commerce) and voice commerce (V-commerce). More particularly, the invention relates to a speech model adaptation system based on a reduced dimensionality representation of a speaker population. The adaptation system adapts to a new user's speech very rapidly in an unsupervised mode, and provides speaker identification and speaker verification functions as a byproduct.
Electronic commerce promises to change the way goods and services will be sold in the decades to come. Currently, electronic commerce is conducted over the Internet using personal computers connected through an Internet service provider to the Internet, where a wide variety of different commerce opportunities are made available. Using suitable browser software, the user communicates with an E-commerce server or host computer to obtain information about products and services or to engage in a commercial transaction.
E-commerce raises a number of important issues. High on the list is security. The E-commerce system must ensure, within reasonably practicable limits, that the party contacting the E-commerce server is who he or she claims to be. Current technology relies heavily on keyboard-entered user I.D. and password information for user verification.
Although great strides have been made in making personal computers and Web browsers easier for the average consumer to use, much room for improvement remains. For example, many users would prefer a speech-enabled interface that would allow them to interact with the server by speaking. This has not heretofore been practical for a number of reasons.
First, speech recognizers can require a great deal of training by an individual user prior to use. This training process is called speaker adaptation. For E-commerce and V-commerce applications, speaker adaptation is a significant problem because spoken interactions are unsupervised (the server does not know in advance what the speaker will say next) and the spoken transaction is typically quite short, yielding very little adaptation data to work with.
Second, even if it were possible to perform adequate speaker adaptation with a small amount of adaptation data, the system still needs to store the adapted speech models for that user, in order to take advantage of them the next time the user accesses the system. In a server-based application that will be used by many users, it is thus necessary to store adapted models for all users. Current technology makes this quite difficult because the speech models are typically quite large. The large size of a speech model carries two costs: a great deal of storage space is required if these models are to be stored at the server, and an unacceptably long data transmission time may be required if these models must be shipped between client and server.
In addition to the foregoing difficulties with current recognition technology, voice commerce or V-commerce carries another difficulty: speaker verification. Whereas keyboard-entered user I.D.s and passwords can be used to offer some level of security in conventional E-commerce transactions, V-commerce transactions are another matter. Although a system could be developed to use conventional user I.D.s and passwords spoken aloud, these present potential security risks due to the possibility of the voice transaction being overheard.
The present invention addresses all of the foregoing problems through a system and method that carries out automatic speech model adaptation while the transaction is being processed. The system relies upon a dimensionality reduction technique that we have developed to represent speaker populations as reduced-dimensionality parameters that we call eigenvoice parameters.
When the user interacts with the system by speaking, speech models of that user's speech are constructed and then processed by dimensionality reduction to generate a set of eigenvoice parameters. These parameters may be placed or projected into the eigenspace defined by the speaker population at large. From this placement or projection, the system rapidly develops a set of adapted speech models for that user. Although the speech models themselves may be fairly large, the eigenvoice parameters may be expressed as only a few dozen floating point numbers. These eigenvoice parameters may be readily stored as a member of a record associated with that user for recall and use in a subsequent transaction.
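The projection step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a speaker's models are flattened into a single "supervector", that the eigenspace is given as a population mean plus a small set of orthonormal eigenvoice basis vectors, and that placement is a plain orthogonal projection. The names `project`, `reconstruct`, `MEAN`, and `EIGENVOICES` are hypothetical, the numbers are toy values, and the actual system may estimate the parameters by a different method.

```python
from typing import List, Sequence

Vector = List[float]


def project(supervector: Sequence[float],
            mean: Sequence[float],
            eigenvoices: Sequence[Sequence[float]]) -> Vector:
    """Return one weight per eigenvoice (the 'eigenvoice parameters').

    Assumes the eigenvoices are orthonormal, so each weight is the dot
    product of the mean-centered supervector with one basis vector.
    """
    centered = [x - m for x, m in zip(supervector, mean)]
    return [sum(e_i * c_i for e_i, c_i in zip(e, centered))
            for e in eigenvoices]


def reconstruct(weights: Sequence[float],
                mean: Sequence[float],
                eigenvoices: Sequence[Sequence[float]]) -> Vector:
    """Rebuild an adapted supervector from the stored weights."""
    adapted = list(mean)
    for w, e in zip(weights, eigenvoices):
        for i, e_i in enumerate(e):
            adapted[i] += w * e_i
    return adapted


# Toy data (hypothetical): a 4-dimensional supervector and 2 eigenvoices.
MEAN = [1.0, 2.0, 3.0, 4.0]
EIGENVOICES = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

new_speaker = [1.5, 1.0, 3.2, 4.1]
weights = project(new_speaker, MEAN, EIGENVOICES)
adapted = reconstruct(weights, MEAN, EIGENVOICES)

print(weights)   # two numbers are stored per user instead of the full model
print(adapted)   # adapted model constrained to lie in the eigenspace
```

In this toy case only the two weights need to be kept in the user's record; the adapted model can be rebuilt from them whenever the user returns, which is the storage and transmission saving the passage describes.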
The reduced dimensionality parameters (eigenvoice parameters) represent the user as a unique speaker, and not merely as a set of speech models. Thus, these parameters may serve as a "fingerprint" of that speaker for speaker verification and speaker identification purposes. Each time the user accesses the system, the system computes reduced dimensionality parameters based on the user's input speech. These parameters are compared with those previously stored and may be used to verify the speaker's authenticity. Alternatively, the system may compare the newly calculated parameters with those already on file to ascertain the identity of the speaker even if he or she has not otherwise provided that information.
The system and method of the invention thus open many doors for E-commerce and V-commerce opportunities. For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.
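The comparison of newly computed parameters against stored ones might look like the following sketch. The text does not say how the comparison is made, so the Euclidean distance, the acceptance threshold, and the names `verify`, `identify`, and `enrolled` are all assumptions chosen for illustration; the user records are toy values.

```python
import math
from typing import Dict, Optional, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two eigenvoice parameter vectors
    (an assumed metric; other measures could be used instead)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(claimed_id: str,
           new_params: Sequence[float],
           enrolled: Dict[str, Sequence[float]],
           threshold: float) -> bool:
    """Speaker verification: accept only if the new parameters lie close
    to the parameters stored for the claimed identity."""
    stored = enrolled.get(claimed_id)
    return stored is not None and distance(new_params, stored) <= threshold


def identify(new_params: Sequence[float],
             enrolled: Dict[str, Sequence[float]],
             threshold: float) -> Optional[str]:
    """Speaker identification: return the closest enrolled user, or None
    if nobody on file is within the threshold."""
    if not enrolled:
        return None
    best_id = min(enrolled, key=lambda uid: distance(new_params, enrolled[uid]))
    if distance(new_params, enrolled[best_id]) <= threshold:
        return best_id
    return None


# Hypothetical user records holding previously stored parameters.
enrolled = {
    "alice": [0.5, -1.0],
    "bob": [-0.8, 0.3],
}
THRESHOLD = 0.25  # assumed value; a real system would tune this

observed = [0.45, -0.9]  # parameters computed from the current utterance
print(verify("alice", observed, enrolled, THRESHOLD))
print(verify("bob", observed, enrolled, THRESHOLD))
print(identify(observed, enrolled, THRESHOLD))
```

Verification answers "is this the claimed user?" with a single comparison, while identification searches all records on file, so the same stored parameters support both functions.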