Spoken document retrieval (SDR) systems can be used for searching and filtering large datasets of speech recordings. Generally, an element (e.g., a word, phrase or sentence) is located by searching for the corresponding textual element in a transcription of the audio content. A common platform for SDR systems is the Solr platform (see https://en.wikipedia.org/wiki/Apache_Solr), which can enable a fast search over large amounts of textual data. Known systems and methods can transcribe an audio recording using large vocabulary continuous speech recognition (LVCSR) to produce, for example, a transcription of the audio recording, and/or can search for an element in the transcription.
However, using an LVCSR can have a number of drawbacks. For example, LVCSR systems can be unable to detect and/or transcribe out-of-vocabulary (OOV) words or phrases. Generally, if a word is not present in the LVCSR's vocabulary, an LVCSR-based system or method misrecognizes the word, typically as one or more in-vocabulary words, and, as a result, user requests and queries related to the word fail. Generally, a user's query or request includes a request to search for an element or term in a set of audio recordings or transcriptions thereof.
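The OOV failure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the tiny vocabulary, the product name "zentrix", the confusion table standing in for acoustic misrecognition, and the helper names `toy_transcribe`, `build_index` and `search`. It does not represent any particular LVCSR engine or the Solr API; it only shows why a text search over transcriptions can miss a term that was spoken but is absent from the recognizer's vocabulary.

```python
from collections import defaultdict

# Hypothetical recognizer vocabulary; a real LVCSR vocabulary is far larger.
VOCABULARY = {"i", "want", "to", "switch", "zen", "tricks", "mobile"}

# Hypothetical stand-in for acoustic confusion: an OOV word is emitted
# as the in-vocabulary words that sound closest to it.
CONFUSIONS = {"zentrix": ["zen", "tricks"]}


def toy_transcribe(spoken_words):
    """Return a transcript that can only contain in-vocabulary words."""
    transcript = []
    for word in spoken_words:
        if word in VOCABULARY:
            transcript.append(word)
        else:
            # OOV word: replaced by similar-sounding words, or an unknown marker.
            transcript.extend(CONFUSIONS.get(word, ["<unk>"]))
    return transcript


def build_index(transcripts):
    """Build a simple inverted index: word -> set of recording ids."""
    index = defaultdict(set)
    for recording_id, words in transcripts.items():
        for word in words:
            index[word].add(recording_id)
    return index


def search(index, term):
    """Return the sorted ids of recordings whose transcript contains the term."""
    return sorted(index.get(term.lower(), set()))


# What was actually said in two hypothetical calls.
spoken = {
    "call_001": "i want to switch to zentrix mobile".split(),
    "call_002": "i want to switch".split(),
}

transcripts = {rid: toy_transcribe(words) for rid, words in spoken.items()}
index = build_index(transcripts)

print(search(index, "switch"))   # in-vocabulary term: both calls are found
print(search(index, "zentrix"))  # OOV term: no call is found, although call_001 contains it
```

In this sketch the query for "zentrix" returns no recordings even though the word was spoken in call_001, because the transcript holds only the in-vocabulary substitutes.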
Unfortunately, OOV words are often the basis of, or included in, user queries or requests. Examples include requests related to customer service or market management, in which new products, new rival companies and/or other named entities might be the user's main interest. Names of new products or new companies may not yet be included in an LVCSR's vocabulary and may therefore cause OOV errors.
Moreover, transcribing large amounts of audio content (e.g., conversations or phone calls) to text can require costly computational resources and/or be time-consuming. For example, due to the costly transcription process, known systems may transcribe only a portion of the relevant audio content (e.g., 30%) and therefore fail to cover (or search in) all relevant audio content, e.g., all customer phone calls.