1. Field of the Invention
The present invention relates to a speech data analysis system and, more specifically, to a system that correlates a speaker signal source and a normalized signal, comprising measurements of input acoustic data, to a database of language, dialect, accent, and/or speaker attributes in order to create a detailed transcription of the input acoustic data.
2. Description of the Related Art
Speech transcription is an evolving area of technology served by several disparate technologies, each targeted at a subset of the problem. Individual systems and applications focus on and attempt to solve their own problems, including speech-to-text, phrase and word recognition, language recognition, and speaker identification. However, each of these approaches applies only rudimentary signal processing techniques, and none is able to achieve high levels of accuracy without a large amount of training.
Automatic Speech Recognition (“ASR”) systems convert spoken words into text, and include applications as diverse as call routing, voice dialing, and data entry, as well as advanced speech-to-text processing software packages. These systems are often based on a language model and require domain training, in which a user trains the system to recognize the user's specific voice, accent, and/or dialect. Although effective, domain training places several limitations on the application of the approach, both in the specific speech domain it can serve and in how much confidence the user has in the product. Additionally, where a significant amount of training is required, the time and effort involved can be a substantial barrier to adoption by new users.
In addition to training requirements, ASR systems continue to suffer from less-than-perfect accuracy, with some estimating a current peak effectiveness of only 80-90%. In other words, of every ten words converted to text, one or two on average are incorrect. Although ASR systems can greatly increase productivity, the need to correct converted speech reduces the productivity gain that could otherwise be achieved.
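By way of illustration only, the "one or two words in ten" figure can be expressed as a word error rate. The following is a minimal sketch, assuming that effectiveness is measured as one minus the word error rate (word-level edit distance divided by the number of reference words); the text above does not state which metric underlies the 80-90% estimate. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: this is one common way to score transcription output;
    it is not necessarily the metric behind the 80-90% estimate.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ten-word utterance with two misrecognized words.
spoken = "please send the quarterly report to the finance team today"
converted = "please send the quarterly resort to the finance teen today"

wer = word_error_rate(spoken, converted)
accuracy = 1.0 - wer
print(f"word error rate: {wer:.0%}, accuracy: {accuracy:.0%}")
```

In this hypothetical case two of the ten words must be corrected by hand, which corresponds to the lower end of the 80-90% range.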
There is, therefore, a continued need for a system employing a class of signal processing methods that recover speech attributes associated with the speaker and with what is spoken, with improved accuracy and without the need for excessive domain training.