Voice-directed workflow systems allow workers to communicate verbally with a computer system. These systems may be used in warehouses or distribution centers to improve safety and efficiency for tasks such as picking, receiving, replenishing, and/or shipping.
Voice-directed workflow systems typically require a worker to wear a headset equipped with a microphone and earphone. Voice commands are transmitted to the worker via the earphone and spoken responses from the worker are received by the microphone. In this way, a worker may be directed to perform a task and respond with their progress by speaking established responses into the microphone at certain points in an established workflow dialog.
Speech recognition is part of a voice-directed workflow system. Speech recognition is the translation of spoken words into text/data via a computing device. A computing device configured for speech recognition is known as a speech recognizer.
Speech recognition is a challenging problem for a variety of reasons. First, the speech recognizer must detect speech versus background noise. For example, the speech recognizer must recognize that a sound represents speech rather than a breath. Next, the speech recognizer must compare the speech input to words and/or phrases in a vocabulary typically specific to the application (i.e., application vocabulary). Here, the speech recognizer may use the workflow dialog to help determine what was said.
Often, for a particular workflow dialog, the expected responses are limited to a range of possible responses, or even a single expected response. For example, if a worker is given a picking task with the prompt, “pick two,” and the worker is expected to confirm the picking task with the response “two,” then the speech that occurs after the prompt may be expected to match a voice template for “two.” In general, a workflow has an associated application vocabulary consisting of voice templates for the vocabulary words, sounds, or phrases necessary to carry out the tasks associated with the workflow.
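The way a workflow dialog narrows the recognizer's choices can be illustrated with a minimal sketch. It assumes the recognizer has already produced a match score for each vocabulary word (higher is better); the function name, the score values, and the `min_score` threshold are hypothetical and are not taken from any particular system.

```python
from typing import Dict, Iterable, Optional


def pick_expected_response(
    scores: Dict[str, float],
    expected: Iterable[str],
    min_score: float = 0.5,
) -> Optional[str]:
    """Return the best-scoring expected response, or None if none is good enough.

    scores    -- hypothetical match score per vocabulary word (higher is better)
    expected  -- responses the workflow dialog allows at this point
    min_score -- assumed acceptance threshold
    """
    # Only words the dialog expects at this step are considered at all.
    candidates = {w: scores[w] for w in expected if w in scores}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= min_score else None


# After the prompt "pick two", only the response "two" is expected, so a
# slightly higher score for "ten" does not change the result.
scores = {"two": 0.81, "ten": 0.84, "nine": 0.10}
print(pick_expected_response(scores, ["two"]))
```

Restricting the candidates before choosing the best score is one simple way to use the dialog; a real recognizer may instead bias its scores toward the expected response rather than exclude other words outright.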
Voice templates (i.e., speech templates or templates) are voice patterns for particular words or phrases stored in memory. The voice templates may be specific to a user in speaker-dependent recognition systems. Alternatively, the voice templates may be for all users (i.e., generic) in speaker-independent recognition systems. In either case, the speech recognizer determines how closely the received speech matches a stored voice template to determine what was most likely spoken.
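As an illustration of template matching, the sketch below compares received speech against each stored voice template and reports the closest one. The text does not specify a matching method; dynamic time warping (DTW) is used here only as a familiar stand-in, and each template is assumed to be a short sequence of one-dimensional feature values. All names and values are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(speech, templates):
    """Return (word, distance) for the stored template closest to the speech."""
    scored = ((word, dtw_distance(speech, tmpl)) for word, tmpl in templates.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical templates: one feature sequence per vocabulary word.
templates = {"five": [5.0, 5.0, 5.0], "nine": [9.0, 9.0, 9.0]}
word, distance = best_match([5.0, 5.0, 6.0], templates)
print(word, distance)
```

A smaller distance means a closer match, so the word with the smallest distance is taken as what was most likely spoken. When two templates yield nearly equal distances, the recognizer has little basis for choosing between them.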
Since everyone's speech may be different, custom voice templates may be created. To create a custom voice template for a word, a user may be prompted (e.g., through a display) to provide speech samples (e.g., by repeatedly saying a word). It is common to require workers new to a voice-directed workflow system to train the system for their voice by creating voice templates for a variety of words and/or sounds.
A problem arises when the voice templates created by a worker are not distinct enough for a speech recognizer to distinguish them from the voice templates for other words in the application vocabulary. For example, some workers may pronounce the word, “five,” and the word, “nine,” similarly. This may result in voice templates created for the word, “five,” that are very similar to the voice template for the word, “nine.”
Voice template similarity may erode the speech recognizer's performance. For example, a worker may be asked to repeat what they have said, which may reduce productivity and cause frustration. Errors may also occur because similar-sounding numbers may be confused (e.g., a 5 recorded when a 9 was intended, or vice versa).
Therefore, a need exists for analysis during the creation of a voice template (i.e., during training) to ensure that a created voice template is not similar to (or does not match with) any other stored voice templates. If a similarity is found, then a user may be prompted to create a new, more distinct, voice template for the word. This dynamic training analysis may improve user experience and accuracy for voice-directed workflow systems.
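A training-time similarity check of this kind might be sketched as follows. The sketch assumes each voice template is a fixed-length feature vector and uses Euclidean distance as the similarity measure; both are simplifying assumptions, as are the function names and the `min_distance` threshold. The list of attempts stands in for prompting the user again after a rejected template.

```python
from math import dist


def find_confusable(word, new_template, stored, min_distance):
    """Return stored words (other than `word`) whose template is too close."""
    return [
        other
        for other, tmpl in stored.items()
        if other != word and dist(new_template, tmpl) < min_distance
    ]


def train_word(word, attempts, stored, min_distance=1.0):
    """Store the first attempted template that is distinct from all others.

    Each entry in `attempts` represents one training attempt by the user.
    Returns True if a sufficiently distinct template was stored.
    """
    for candidate in attempts:
        if not find_confusable(word, candidate, stored, min_distance):
            stored[word] = candidate
            return True
        # Too similar to another word: a real system would prompt the user
        # here to repeat the word more distinctly.
    return False


# Hypothetical stored template for "nine" and two attempts at "five".
stored = {"nine": [0.9, 0.1, 0.4]}
attempts = [[0.9, 0.2, 0.4], [0.2, 0.8, 0.5]]
accepted = train_word("five", attempts, stored, min_distance=0.5)
print(accepted, stored["five"])
```

In this example the first attempt lies too close to the stored template for “nine” and is rejected, while the second is distinct enough to be stored. How large `min_distance` should be depends on the feature representation and would have to be tuned for a given recognizer.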