Computerized speech processing systems can be used for automated speech recognition (understanding what is being said), speaker identification (who is speaking), and speaker verification (authenticating that the speaker really is who he or she claims to be). An important element in developing computerized speech processing systems is the collection and annotation of speech data for training or evaluating the acoustic-phonetic models used during continuous speech processing. In continuous speech processing, the words and phrases flow into one another naturally without artificial pauses.
In order to build robust models, speech from hundreds, perhaps thousands, of individual speakers must be collected. This is an arduous and time-consuming task, particularly if the system includes models for processing speech spoken in different languages.
Besides the variability in the linguistic groupings of the speech data, another important factor to consider while collecting speech training data is the variability in the acoustic characteristics of the environments where the speech is produced and collected. In the prior art, a large effort has gone into collecting speech data using public (analog) telephone networks. There, variable acoustic characteristics can be attributed to background noise, telephone handsets, transmission lines, switching equipment, and the like.
More recently, speech applications have moved to the "desk-top." Modern high-speed PCs, including lap-top computers, can be configured with microphones, loudspeakers, and sound cards to acquire and reproduce speech signals. The computers can be interconnected by a (digital) network such as the Internet. Standard network interfaces such as the World Wide Web (the "Web") can be used to transmit and receive digitized speech signals between users all over the world.
Clearly, models generated from speech data collected via telephone networks are of minimal use in Web-based speech processing systems. For example, the acoustic characteristics of computer microphones connected to digital sound cards bear little resemblance to those of analog telephone handsets. Also, background noise and communication channels are quite different for telephone and Web-based networks.
Most prior art speech collection techniques for desk-top applications have required the speakers offering their speech to be present at the collection site. This means a trained individual must also be present to supervise the collection process. The acoustic environment at the collection site is unlikely to be representative of the environment in which the application will actually be used, which results in a mismatch between the training data and the conditions of actual use. Also, the collection of data for specific sets of speakers, such as native speakers of a foreign language, may impose additional logistical constraints.
Therefore, it is desired to provide means for collecting speech data using an all-digital network such as the Internet. Furthermore, it is desired that standard network interfaces such as the World Wide Web can be used to interact with speakers while collecting speech training data. It is also desired that the speech collection mechanism be widely distributed so that speech data for a large number of speakers can readily be collected.