Large organizations, such as commercial organizations, financial organizations or public safety organizations, conduct numerous interactions with customers, users, suppliers or other people or entities on a daily basis. Many of these interactions are vocal, or at least comprise a vocal component, such as the audio part of a video or face-to-face interaction. In order to gain insight into the data conveyed by these interactions, the interactions are captured and often recorded.
The interactions can be used for a multiplicity of purposes, including but not limited to quality assurance of the handling personnel, gaining insight into the customers' needs, obtaining a better understanding of the strengths and weaknesses of the organization, and more.
However, achieving many of these purposes requires knowing what was said in the interaction. Since listening to or manually transcribing a large volume of interactions is impractical, the text must be obtained automatically using speech to text methods.
Developing a speech recognition engine is a complex task that requires expertise in a multiplicity of subjects, including linguistics, phonology, signal processing, pattern recognition, and others. Developing speech recognition for call center environments presents even further challenges, including handling spontaneous speech, a very large vocabulary, multiple and unknown speakers having a wide variety of accents, a noisy environment, low audio quality due to compression of the audio input, and others.
In addition, a speech recognition system must be adapted and updated to the specific environment of a call center, including the equipment used, common vocabulary, domain, required accuracy, and other factors. Some factors, in particular the vocabulary used in the call center, may require frequent updates, for example when names of new products or competitors come into use.
The main existing technologies for obtaining text from audio include phonetic search and speech to text.
Phonetic search relates to indexing the audio and producing a lattice of phonemes from an audio input. The lattice can then be searched for any required words or terms.
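By way of illustration only, searching a phoneme lattice can be sketched as follows. The sketch assumes a strongly simplified lattice holding one set of candidate phonemes per time slot, rather than a full graph with timing and scores; the function name `search_lattice`, the phoneme symbols and the sample data are hypothetical and do not describe any particular product.

```python
from typing import List, Set

# Simplified lattice (an assumption of this sketch): one set of
# candidate phonemes per time slot, with no scores or timing.
Lattice = List[Set[str]]


def search_lattice(lattice: Lattice, query: List[str]) -> List[int]:
    """Return every start slot at which each query phoneme appears
    among the candidates of the corresponding consecutive slot."""
    if not query:
        return []
    hits = []
    for start in range(len(lattice) - len(query) + 1):
        if all(p in lattice[start + i] for i, p in enumerate(query)):
            hits.append(start)
    return hits


# Hypothetical lattice produced once, at indexing time.
lattice = [{"k", "g"}, {"ae", "eh"}, {"t", "d"}, {"s"}, {"k"}, {"ae"}, {"b"}]

# Terms are converted to phonemes only at search time, so a term that
# was unknown during indexing can still be looked up later.
print(search_lattice(lattice, ["k", "ae", "t"]))  # "cat"
print(search_lattice(lattice, ["k", "ae", "b"]))  # "cab"
```

Because each slot keeps several candidates, a similarly-sounding query such as `["g", "eh", "d"]` also matches at slot 0 in this sample, which suggests why short or similar-sounding terms tend to produce spurious hits.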
The advantages of phonetic search include: rapid implementation and deployment; low CPU consumption for indexing; reduced dependence of the phonetic indexing on the particular language spoken in the audio or on the domain, relative to speech to text; easy switching between languages; low maintenance and tuning requirements; and high detection rate, also referred to as high recall rate or low false negative rate. In addition, the words that can be searched for are not required to be known in advance, so that terms that become known at a later time can be searched for within an earlier produced lattice.
The disadvantages of phonetic search include: relatively slow search for terms, compared to search on text; a relatively large number of false positives for similarly-sounding or short terms, i.e., medium precision; proprietary and unreadable output format which does not support free search and forces the user to use proprietary search, i.e., the search engine associated with the indexing product; and high storage requirements.
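For reference, the detection (recall) rate and the precision can be computed from counts of true positives, false negatives and false positives using the standard definitions. The following is a minimal sketch; the counts in the usage lines are hypothetical and are not measurements of any system.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of actual occurrences of a term that were detected."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported detections that were correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Hypothetical example: a term is spoken 100 times; a search reports
# 150 hits, of which 90 are correct (10 occurrences are missed and
# 60 hits are spurious).
r = recall(true_pos=90, false_neg=10)
p = precision(true_pos=90, false_pos=60)
```

In this hypothetical case the recall is high because few occurrences are missed, while the precision is only moderate because many of the reported hits are false positives.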
Speech to text relates to providing the full transcription of an audio input. The advantages of speech to text include obtaining the full text spoken within the audio, thus enabling: detailed analysis; automatic discovery; rapid searching for words; and compact storage.
The disadvantages of speech to text include: low detection rate, i.e., high false negative rate; high CPU consumption for indexing, relative to phonetic search; high dependence on language and domain, which may require specific development and frequent updates; and a long deployment and tuning process. In addition, speech to text techniques do not enable searching for words which were unknown at the time the audio was indexed, such as out-of-vocabulary terms.
Thus, neither of these methods meets the need to obtain text with both high accuracy and a high detection rate from large volumes of captured or recorded vocal interactions.
There is therefore a need for a method and apparatus for speech recognition. The speech recognition should provide high accuracy relative to phonetic search, be efficient in both processing speed and storage requirements, enable fast adaptation to various environments, and allow easy updating in response to changes in an environment.