The invention generally relates to the field of digital audio processing and more specifically to a method and apparatus for processing a continuous audio stream containing human speech related to at least one particular transaction. The invention further relates to a multi-user speech recognition or voice control system.
Business transactions are increasingly conducted by telephone. One example is audio logs of call center dialogues, which have to be accessed in order to locate specific transactions. Another example is logs which are stored on audio tapes and can be accessed by scanning the corresponding tape archives.
Beyond that, it is to be expected that in the future many transactions such as teleshopping or telebanking will be handled by automatic transaction systems that use text-to-speech synthesis to communicate with a customer. Another substantial and still growing share of transactions takes place in telephone conversations between two human individuals, in particular two individuals speaking different languages.
A particular class of transactions is those that are legally binding. It is current practice to record the underlying interactions on audio tape in order to keep a log of each interaction. If the parties later disagree about an intended transaction, these logs can be used as evidence. At present such tapes are labeled with just a date and a customer or employee identifier, which makes locating and indexing the audio log of a specific transaction extraordinarily laborious.
Prior efforts to automate the indexing of such audio material, e.g. using prior art speech recognition technology, failed because of the large variability in the speech styles and dialects of the human individuals engaged in those interactions.
Another field of application is multi-user speech recognition systems (SRSs) where two or more speakers are located in the same room, e.g. typical mixed conversations during personal meetings or the like which are to be transcribed using SRS technology. A similar situation is the command language used in an aircraft cockpit, where the pilot and the co-pilot operate the aircraft via voice control. As modern SRSs have to be trained for different users, these systems have so far not been able to switch automatically between the different speakers.