In speech recognition systems and other speech-based systems, a Text-to-Speech (TTS) audio stream is generally created by a TTS engine. A TTS engine takes text data and converts the text into spoken words in an audio stream, which may then be played back on a variety of audio production devices. The audio stream includes an audio waveform and may include other data related to the audio waveform. When used in conjunction with speech recognition circuitry that recognizes a user's speech or speech utterances, a TTS engine allows an ongoing spoken dialog between a user and a speech-based system, such as for performing speech-directed work.
Those skilled in the art recognize that a phoneme is the smallest segmental unit of sound employed in a language to form meaningful contrasts between utterances. In the English language, for example, there are approximately 44 phonemes, which when used in combinations may form every word in the English language. A TTS engine generally performs the conversion from text to an audio stream by splitting each word in the text string into a sequence of the word's component phonemes. Then the units of sound for each of the phonemes in the sequence are connected in sequential order into an audio stream that can be played on a variety of sound production devices.
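The conversion described above can be illustrated with a minimal sketch. It assumes a toy pronunciation lexicon and a toy inventory of per-phoneme sound units. The names `LEXICON`, `UNITS`, `text_to_phonemes`, and `synthesize` are hypothetical, and the sample values are placeholders rather than real audio. An actual TTS engine is considerably more elaborate.

```python
# Hypothetical toy lexicon: word -> sequence of phoneme labels (illustrative only).
LEXICON = {
    "pick": ["P", "IH", "K"],
    "two": ["T", "UW"],
}

# Hypothetical unit inventory: phoneme label -> audio samples (placeholder values).
UNITS = {
    "P": [0.1, 0.2],
    "IH": [0.3, 0.3, 0.2],
    "K": [0.0, -0.1],
    "T": [0.2],
    "UW": [0.4, 0.4, 0.3],
}


def text_to_phonemes(text):
    """Split each word of the text into its component phonemes."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def synthesize(text):
    """Connect the sound units of the phonemes, in order, into one waveform."""
    waveform = []
    for phoneme in text_to_phonemes(text):
        waveform.extend(UNITS[phoneme])
    return waveform
```

In this sketch, `synthesize("pick two")` simply concatenates the units for P, IH, K, T, and UW in that order.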
When a TTS engine generates a TTS audio waveform from text, the TTS engine may output metadata that corresponds to the generated audio waveform. This metadata generally contains a text representation of each phoneme provided in the audio stream and may also provide an indication of the position of the phoneme in the audio waveform (i.e. where the phoneme occurs when the audio waveform is produced for listening).
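One way to picture such metadata is as a list of phoneme labels paired with sample offsets into the waveform. The sketch below is an assumption made for illustration only: the `PhonemeMark` record, the `build_metadata` and `phoneme_at` helpers, and the use of sample offsets as the position unit are hypothetical and do not reflect the output format of any particular TTS engine.

```python
from dataclasses import dataclass


@dataclass
class PhonemeMark:
    label: str  # text representation of the phoneme
    start: int  # sample offset where the phoneme begins
    end: int    # sample offset just past the phoneme (exclusive)


def build_metadata(units):
    """Build position metadata from (label, sample_count) pairs in playback order."""
    marks = []
    position = 0
    for label, sample_count in units:
        marks.append(PhonemeMark(label, position, position + sample_count))
        position += sample_count
    return marks


def phoneme_at(marks, sample_index):
    """Return the label of the phoneme playing at sample_index, or None."""
    for mark in marks:
        if mark.start <= sample_index < mark.end:
            return mark.label
    return None
```

With metadata of this kind, a receiver could determine which phoneme occupies any given point in the waveform.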
TTS engines, and the creation of audio streams from text data, have been widely used in a variety of communication technologies, such as automated systems that provide audio feedback and/or instructions to a user. They have also been used in speech-based work environments to provide workers with audio instructions related to tasks the workers are to perform. In these systems, a worker is typically equipped with a portable terminal device that receives data from a management computer over a communication network. The link between the terminal device and the management computer or central system is usually a wireless link, such as a Wi-Fi link. The data generally comprises instructions for the worker, either in text or audio format. In these systems, the terminal may convert received text data to an audio stream, or the management computer may convert the text to an audio stream prior to transmitting the instructions to the terminal. The generated audio stream may include an audio waveform and metadata associated with the audio waveform, and may be generated using a TTS engine, audio recordings, or a combination.
Generally, the audio stream is produced as sound for the worker through use of a communication component that is in communication with the management computer and/or the terminal device. The communication component may be, for example, a headset having a speaker for production and a microphone for voice input, or similar devices. The audio stream, which includes an audio waveform and has the instructions in audio format, is received by the communication component and produced as sound or speech for the worker.
Conventional systems and methods for producing sound involve playing the contents of a storage buffer containing the received audio waveform once a predetermined amount of data has been received. In optimal conditions, playback of the audio waveform by a conventional system will consume more time than it takes to receive a subsequent audio waveform and provide it to a production buffer. Hence, the transition from the audio waveform being produced to the playback of the subsequent audio waveform should occur without any indication of the transition that is noticeable to the user of the terminal device and any communication component.
However, in conventional systems, a delay in the reception of data, such as a delay from a wireless link, may lead to the situation where audio playback or production of a received audio waveform completes before a subsequent audio stream and audio waveform have been fully received into the buffer. This delay in buffering the audio waveforms often leads to what can be generally described as “choppy” production of sound for the user. Other common descriptions of this occurrence include “skipping,” “popping,” “stuttering,” etc. In short, the delay causes a pause in the production of sound, during which production must wait for a subsequent audio stream and audio waveform to be received into the buffer. As mentioned, the skipping in the production is caused by a failure to fully buffer the subsequent audio waveform before production of the previous audio waveform ends. In many communication systems, these breaks in production may be caused by delays in receiving and/or processing the received audio streams, such as over a wireless communication link.
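The timing relationship behind these breaks can be sketched with a small simulation. It assumes that the audio arrives as equal-length chunks, that playback of the first chunk starts as soon as it arrives, and that a chunk cannot start playing before it has been fully received. The function name `find_underruns` and its parameters are hypothetical and chosen only for illustration.

```python
def find_underruns(arrival_times, chunk_duration):
    """Return (chunk_index, gap_seconds) for each chunk that arrived too late.

    arrival_times: time in seconds at which each chunk is fully received,
                   in playback order.
    chunk_duration: playback length of every chunk in seconds (assumed equal).
    """
    gaps = []
    # The first chunk starts playing the moment it arrives.
    play_end = arrival_times[0] + chunk_duration
    for index in range(1, len(arrival_times)):
        arrival = arrival_times[index]
        if arrival > play_end:
            # Playback ran dry: production must wait for this chunk.
            gaps.append((index, arrival - play_end))
            play_end = arrival + chunk_duration
        else:
            # Chunk was buffered in time; playback continues seamlessly.
            play_end += chunk_duration
    return gaps
```

In this sketch, an empty result corresponds to the optimal conditions described earlier, where each chunk is received before the previous one finishes playing. Any non-empty result marks a point at which the listener would hear a break.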
In communication systems that involve producing sound that includes spoken words or speech, the skipping that is due to delay in the system can result in unintelligible or inaccurate sound being produced for a user of the communication component. Depending on the specific application of the communication system that transmits audio feedback and/or instructions to a user, an unintelligible or inaccurate production of audio in the system can render a conventional system unusable for its intended purpose. Overall, the effects of the errors in production described may be considered to affect the quality of the produced sound for a user of the communication component, leading to degraded intelligibility, clarity, usability and/or accuracy.
As discussed, in conventional systems, any delay in receiving and/or processing a subsequent audio waveform leads to skipping. Some techniques can be used to address this issue. Compressing the waveform reduces the time it takes to transfer the waveform and reduces the likelihood that a delay will interrupt playback. However, this is not always adequate and does not address intelligibility when a dropout does occur.
Another technique is to buffer all of or a portion of the waveform on the receiving side before starting playback. The downside of this approach is that it can cause a delay before playback is started while the receiver waits for the waveform to be received. However, this delay is unnecessary in cases when the waveform is transferred at a faster rate than it is being played, so it would be desirable to eliminate it when possible.
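The trade-off can be made concrete with a short calculation. The sketch below assumes constant transfer and playback rates, which real wireless links do not guarantee, and the function name `min_prebuffer_delay` and its parameters are hypothetical. Under that assumption, playback never outruns reception if it is delayed by the transfer time minus the playback duration, and no delay is needed when that difference is zero or negative.

```python
def min_prebuffer_delay(size_bytes, playback_bytes_per_s, transfer_bytes_per_s):
    """Smallest start-up delay (seconds) that avoids running out of data.

    Assumes both rates are constant for the whole waveform.
    """
    playback_duration = size_bytes / playback_bytes_per_s
    transfer_time = size_bytes / transfer_bytes_per_s
    # If the waveform arrives at least as fast as it plays, no waiting is needed.
    return max(0.0, transfer_time - playback_duration)
```

For example, a 32,000-byte waveform played at 16,000 bytes per second needs no pre-buffering on a 32,000 bytes-per-second link, but needs a 2-second head start on an 8,000 bytes-per-second link.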
Another technique used to address this issue is for the receiver to repeat a portion of the audio. When the receiver of some systems does not receive the next segment of the waveform to be played in time (i.e. before it finishes playing what it has received), it repeatedly plays the last segment of audio that it has received to fill time until it receives the next portion of the waveform. This can prevent the audio from dropping out, but when the portion of the waveform that is repeated is not stationary or periodic, it can produce uneven sounds (clicks and stuttering).
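The repeat behavior can be sketched as follows. The model is a simplification made for illustration: playback time is divided into equal slots, each segment fills one slot, and `arrival_slot[i]` gives the first slot at which segment `i` is available. The function name `play_with_repeat` and this slot model are assumptions, not a description of any specific receiver.

```python
def play_with_repeat(segments, arrival_slot):
    """Return the sequence of segments produced, one entry per playback slot.

    When the next segment has not arrived by the slot in which it is due,
    the last received segment is repeated until the next one is available.
    """
    output = []
    slot = 0
    last = None
    for segment, arrives in zip(segments, arrival_slot):
        # Fill the time with repeats of the last segment while waiting.
        while last is not None and slot < arrives:
            output.append(last)
            slot += 1
        # Nothing is produced before the very first segment arrives.
        slot = max(slot, arrives)
        output.append(segment)
        last = segment
        slot += 1
    return output
```

For instance, if segment C of the sequence A, B, C is two slots late, the listener hears A, B, B, B, C. The audio never drops out, but the repeated B is audible as a stutter unless that segment happens to be stationary or periodic.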
For a wireless headset in industrial environments, when transaction rates are high, the average latency (of delivering verbal instructions to the user wearing a wireless headset) can have a meaningful effect on the value of the system. It can also affect worker acceptance of the system.
Intelligibility and smoothness are also important to the system value and worker acceptance. Difficult-to-understand and/or choppy audio can cause worker delays and can adversely affect worker acceptance of the system.
Accordingly, there is a need, unmet by conventional communication systems, to address unintelligible or inaccurate production of sound from audio waveforms and speech due to delay in receiving and/or processing in the communication component.