There can be a lot happening in the passenger compartment of a modern automobile. Although the driver's full attention is needed for the task of driving safely, there can be many distractions, such as complex entertainment systems, electronic navigation systems with their colorful displays and text-to-speech (TTS) audio outputs, smart phones, and email devices, all of which can now be controlled by automatic speech recognition (ASR). There is a need to minimize these distractions to preserve the driver's attention to the road ahead.
While operating such in-car devices, most distraction seems to be visual, caused by the driver looking at the operated device (its screen, buttons, etc.). In the specific case of automotive ASR, some systems allow processing of the recognized text completely without a display, relying on audio output only, while other systems use a display showing the full edited text. Eyes-free text composition without a text display generally leads to lower distraction levels than ASR systems that use a text display, but more errors may go unnoticed by the user and remain in the composed text. This can be due, for example, to different words sounding alike when spoken by a speech synthesizer. It may be possible to offset this problem, for example, by using an automatic audio output disambiguation method, but under some circumstances it may still be worthwhile to show the driver a limited-content text display.
When entering, navigating and editing text in text processing applications such as ASR systems, the most frequent user interface operations typically are mapped directly to physical controls such as dedicated buttons or rotary knobs. Other operating functions can be activated by traversing and selecting from menus, which substantially increases visual and cognitive load. Other user control mechanisms can include recognizing handwritten gestures (e.g., using a touchpad) or recognizing voice commands. Two advantages of dedicated physical controls are their robustness to any kind of noise and their low potential for visual and cognitive distraction. But there can only be so many physical controls in actual automotive setups before the dashboard becomes too complex.
One of the application control functions in a text processing application, such as processing ASR text, is control of the text insertion point—the cursor. There are two main cursor modes which can be thought of as “insert-after” and “replace.”
The insert-after mode can be described in the context of an audio playback arrangement, typically produced by a text-to-speech (TTS) system, which always reads aloud the active text item when the active text item changes, e.g., during navigation among items or after entering new text. This way, the text last pronounced by the system, such as “please buy bananas,” is naturally followed by the user's dictation, e.g., “and oranges.” This holds as well for other input modalities such as handwriting recognition. Insert-after mode is a natural choice when operating an eyes-free text processing system without a text display, or with a system using a text display that shows the complete dictated text. In the former case, the user maintains a “mental cursor” at the point just after the text last dictated. In the latter case, the display and behavior resemble word processing and e-mail programs with which users are already familiar, and which by default implement an insert-after cursor.
The replace mode insertion point cursor is an alternative to insert-after mode. New input text, such as a new dictation result, replaces the active text item (except possibly at the beginning or end of the entire text, where artificial beginning- and end-of-message markers may be placed). Under certain conditions, replace mode may be more natural than insert-after mode; for instance, when a display is used that only shows the active text item. Replace mode offers the benefit that replacing a whole active text item by re-dictating it is quicker than in insert-after mode, where the user first needs to delete the old text. One drawback of replace mode is that inserting new text inside a block of existing text requires either a switch to insert-after mode or the use of a voice command such as “Insert <new text>”.
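The difference between the two cursor modes can be illustrated with a minimal sketch. It assumes, purely for illustration, that a message is modeled as a list of text items with one active item; the names `TextBuffer`, `CursorMode`, and `enter` are hypothetical and are not taken from any particular system. Beginning- and end-of-message markers are not modeled; an empty or out-of-range active position simply falls back to insertion.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CursorMode(Enum):
    INSERT_AFTER = "insert-after"
    REPLACE = "replace"


@dataclass
class TextBuffer:
    """Hypothetical model: a message as a list of text items plus an active index."""
    items: List[str] = field(default_factory=list)
    active: int = -1  # -1 means "before the first item" (assumed convention)
    mode: CursorMode = CursorMode.INSERT_AFTER

    def enter(self, text: str) -> None:
        """Apply newly dictated (or handwritten) text according to the cursor mode."""
        has_active_item = 0 <= self.active < len(self.items)
        if self.mode is CursorMode.REPLACE and has_active_item:
            # Replace mode: new input overwrites the active item.
            self.items[self.active] = text
        else:
            # Insert-after mode (or nothing to replace): new input is placed
            # right after the active item and becomes the new active item.
            self.active += 1
            self.items.insert(self.active, text)

    def text(self) -> str:
        return " ".join(self.items)


# Insert-after: dictation continues after the item last read aloud.
dictated = TextBuffer()
dictated.enter("please buy bananas")
dictated.enter("and oranges")

# Replace: re-dictating overwrites the active item in one step.
corrected = TextBuffer(
    items=["please buy bananas", "and oranges"],
    active=0,
    mode=CursorMode.REPLACE,
)
corrected.enter("please buy apples")
```

In this sketch, correcting the first item under insert-after mode would take two operations (delete, then dictate), whereas replace mode needs only one; conversely, adding an item in the middle of the message under replace mode would require switching modes or an explicit insert command.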
Both cursor insertion modes require the user to understand which mode is being used. If user expectations do not match the editing mode, undesired pieces of text may accidentally remain part of the message in insert-after mode, and intended text may be accidentally deleted in replace mode. This can be offset by: (1) not switching these cursor modes dynamically, but rather deciding to use only one of the modes in a deployed system; (2) sufficient user feedback indicating what is happening when new text is inserted into existing text (e.g. TTS announcing that an original text item has been replaced under replace mode); or (3) getting the user accustomed to which mode is active at which time.
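Option (2) above, user feedback on each edit, might look roughly like the following sketch. It is only an illustration under stated assumptions: the function name `apply_with_feedback`, the mode strings, and the announcement wording are all hypothetical, and the returned announcement string stands in for whatever a real system would pass to its TTS output.

```python
def apply_with_feedback(items, active, new_text, mode):
    """Apply new_text to a list of text items and describe what happened.

    Returns (new_items, new_active, announcement). The announcement is a
    placeholder for a prompt that a TTS system could speak to the user.
    """
    if mode not in ("insert-after", "replace"):
        raise ValueError(f"unknown cursor mode: {mode!r}")

    items = list(items)  # work on a copy; the caller's list is not modified
    if mode == "replace" and 0 <= active < len(items):
        old_text = items[active]
        items[active] = new_text
        # Tell the user explicitly that existing text was overwritten.
        announcement = f'Replaced "{old_text}" with "{new_text}".'
    else:
        active += 1
        items.insert(active, new_text)
        # Tell the user that the existing text was kept and new text added.
        announcement = f'Inserted "{new_text}".'
    return items, active, announcement


replaced = apply_with_feedback(
    ["please buy bananas"], 0, "please buy apples", "replace"
)
inserted = apply_with_feedback(
    ["please buy bananas"], 0, "and oranges", "insert-after"
)
```

Here the announcement differs by mode, so a user who expected insertion but hears that an item was replaced has a chance to notice the mismatch and undo the edit.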