This invention relates generally to a system and method for improving the accuracy of audio searches. More specifically, the present invention relates to employing a plurality of acoustic and language models to improve the accuracy of audio searches.
Call recording or telephone recording systems, which store audio recordings in a digital file or format, have existed for many years. Typically, these recording systems rely upon individuals, generally using telephone networks, to record or leave messages with a computerized recording device, such as a residential voice mail system. However, as the technology of conference calls, customer service, and other telephone systems has advanced, call recording systems are now employed on a variety of systems ranging from residential and commercial voice mail to customer service to emergency (911) services. These recording systems are often implemented in environments where the recorded calls include speakers of many languages, dialects, and accents.
As the use of call recording systems has expanded, the database of recorded calls has also expanded. For many call recording systems, such as emergency (911) calls, a database of emergency calls must be maintained for activities such as retrieving evidence or training purposes. Over time, these databases can become quite large, storing enormous amounts of data and audio files from numerous and typically unknown callers. Although calls may be identified by recorder ID, channel number, duration, time, and date in the database, the content of the audio file may be unknown without listening to the call records individually. However, the content of audio files in a call recording database is often of particular interest for research, training, or evidence gathering. Unfortunately, searching audio files for keywords or content subjects is difficult and extremely time-consuming unless the searching is performed using automatic speech recognition technology. Traditional systems for searching audio files convert audio files in a database into a searchable format using an automatic speech recognition system. The speech recognition system employs a single model, representing a language such as English, to perform the conversion. Once a searchable format of the database is created, the database is searched for keywords or subject matter and the searching system returns a set of search results called hits. The search results indicate the location, along with other possible information, of each hit in the database such that each hit may be located and heard. The search results may also indicate or flag each audio file in the database containing at least one hit.
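By way of illustration only, the traditional keyword search described above may be sketched as follows. The sketch assumes that a single-model speech recognition system has already converted each audio file into a list of time-stamped words; the names `Word`, `Hit`, and `search_transcripts` are hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and its offset into the audio file (hypothetical)."""
    text: str
    start_sec: float


@dataclass(frozen=True)
class Hit:
    """One search result: which file, which keyword, and where it occurs."""
    file_id: str
    keyword: str
    start_sec: float


def search_transcripts(transcripts, keywords):
    """Search already-converted transcripts for keywords.

    transcripts: dict mapping a file ID to a list of Word objects, assumed
                 to be the output of a single-model recognizer.
    keywords:    iterable of keyword strings (matched case-insensitively).

    Returns (hits, flagged_files), where flagged_files lists every file
    containing at least one hit.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for file_id, words in transcripts.items():
        for word in words:
            token = word.text.lower()
            if token in wanted:
                hits.append(Hit(file_id, token, word.start_sec))
    flagged_files = sorted({hit.file_id for hit in hits})
    return hits, flagged_files
```

In such a sketch, each hit carries the file identifier and time offset needed to locate and hear the matching audio, and the flagged list identifies each audio file containing at least one hit. Any word that the recognizer converts incorrectly is absent from the transcript and therefore cannot be returned as a hit.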
Unfortunately, typical systems manage to identify only a small portion of this audio information. This is because of the formidable task of using speech recognition technology to recognize the wide variety of pronunciations, accents, and speech characteristics of native and non-native speakers of a particular language or multiple languages. Keywords are often missed in searching because audio files are not accurately converted by the automatic speech recognition system or indexing engine. Therefore, due to the large number of unknown voices on a call recording system and the different pronunciations, accents, speech characteristics, and languages possible in any given audio file in a call recording database, traditional searching techniques have provided less than optimal search results.