During the last several years, Internet telephony systems, or voice over IP (VoIP), have become feasible as a result of a maturing standards base, including ITU standard H.323 and the IETF standard RFC 3261, otherwise known as the Session Initiation Protocol (SIP).
Other non-SIP protocols have greatly extended the use of VoIP, most notably Skype with its proprietary protocol, but also through applications such as Google Talk using standardized XMPP/Jingle, and AOL's AIM Phoneline.
SIP, Skype, Jingle, and telephones (including mobile and BlackBerries) are hereby collectively known as communications devices.
The main attraction associated with VoIP is the potential to use the massive, free-market Internet infrastructure in lieu of tightly regulated legacy telephony infrastructure. The key difference is that Internet service charges are determined by the aggregate bandwidth consumed between the customer and Internet Service Provider, whereas the traditional telephony charges are based on geographic distance between parties.
With scalable, low-cost Internet service provision now ubiquitous in all offices and most homes, the ability to layer voice communications over existing IP traffic means that companies and individuals pay their Internet Service Provider (ISP) for data capacity only, without regard for where the terminating party is located.
An additional attraction of VoIP is the ability to offer voice services or applications via software servers external to traditional private branch exchanges (PBXs), meaning that additional services can be offered to customers or integrated with other software applications without requiring additional or new hardware.
However, one characteristic of a voice service accessed via a VoIP or traditional telephone is that interaction with the service is limited to the audio domain. More specifically, the user is able to hear instructions, prompts, messages or other audio data, and is able to control the service via the keypad or voice recognition. Keypad interaction generally either injects DTMF tones into the audio (in-band), or sends a control signal (out-of-band) via a standard protocol such as RFC 2833, or both.
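As an illustration of the out-of-band case, the following minimal Python sketch packs a single RFC 2833 telephone-event payload: an 8-bit event code, an end bit, a reserved bit, a 6-bit volume, and a 16-bit duration in RTP timestamp units. The function name encode_telephone_event and its default volume are assumptions made for this example, and the surrounding RTP header is omitted.

```python
import struct

# Event codes defined by RFC 2833: digits 0-9 map to 0-9, '*' is 10,
# '#' is 11, and the four extra keys A-D are 12-15.
DTMF_EVENTS = {ch: code for code, ch in enumerate("0123456789*#ABCD")}


def encode_telephone_event(key, duration, end=False, volume=10):
    """Pack one 4-byte telephone-event payload (RTP header not included).

    The function name and the default volume are illustrative assumptions.
    """
    event = DTMF_EVENTS[key]
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # Second byte: E (end) bit, R (reserved) bit left at 0, 6-bit volume.
    flags = (0x80 if end else 0) | volume
    return struct.pack("!BBH", event, flags, duration)


# Final packet for a '#' key press lasting 800 timestamp units
# (100 ms at an 8000 Hz clock).
payload = encode_telephone_event("#", duration=800, end=True)
print(payload.hex())  # 0b8a0320
```

Whichever transport is used, the service receives at most one of sixteen event codes per key press.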
The combination of audio for output, and voice recognition and/or DTMF for input defines the telephone user interface (TUI).
Use of DTMF generated by a VoIP or traditional telephone's keypad enables the user to navigate through an interactive voice response (IVR) service, but limits the control to the twelve keys found on most keypads (0 to 9, plus ‘*’ and ‘#’). Some telephones (e.g., Mitel 5340) offer additional programmable “soft keys” that can be mapped to service-specific functionality, but the soft keys typically use proprietary protocols that are only understood by a limited number of PBXs.
Presenting the user with lists of items in a voice service is a particularly difficult task for a TUI, in that selection of items from lists requires mapping keys to items, yet the items may not be readily describable via audio. An example is selecting one of several voicemails to play. To do this, a voice service must be able to audibly identify a voicemail and map it to a telephone keypad button. However, the number of voicemails may be greater than the number of available keys. Additionally, a voicemail is not readily described in audio. Characteristics that could be used to identify a voicemail of interest might include the caller's identification (typically a destination ID, or DID), the caller's display name, the time of the call, and the duration of the voicemail.
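The key-mapping problem can be sketched in a few lines of Python. The example below assumes, purely for illustration, that the digits 1 to 9 select items and that a list longer than nine entries is split into pages; the function name build_menu_pages and the sample voicemail labels are hypothetical.

```python
def build_menu_pages(items, keys="123456789"):
    """Split items into pages, mapping each item on a page to one keypad digit."""
    pages = []
    for start in range(0, len(items), len(keys)):
        chunk = items[start:start + len(keys)]
        # zip stops at the shorter sequence, so a partial page uses fewer keys.
        pages.append(dict(zip(keys, chunk)))
    return pages


# 14 hypothetical messages do not fit on the nine selection keys at once.
voicemails = [f"voicemail {n}" for n in range(1, 15)]
pages = build_menu_pages(voicemails)
print(len(pages))      # 2
print(pages[1]["1"])   # voicemail 10
```

Even with paging, each item still has to be described aloud before the user can know which key to press.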
The most relevant piece of information is frequently, but not always, the caller's DID or display name. The former is time-consuming to read out (e.g., “1” “6” “1” “3” “5” “5” “5” “1” “2” “3” “4”) and may not be immediately recognizable to the user. The latter can be very difficult for a service to read in a manner that will mean anything to the user (e.g., a telephone carrier may have assigned John and Jane Doe a call display of “DOEJANDJ”).
Similarly, other voicemail characteristics, such as date and duration, suffer from the fact that it may take more time to listen to the description of the message than it takes to actually listen to a short message.
Consequently, most telephone services present a list of items one item at a time, and the user has the opportunity to interact with each list item before proceeding to the next. However, such sequential presentation means that a user must potentially listen to the description and/or options associated with many items before encountering one of interest. The voice service developer is intrinsically constrained in the options that he can present to the user, and voice services have had to limit input to such data as can be expressed via the 12 DTMF digits, or awkwardly via the letters commonly printed on the keypad (e.g., the “2” key has “ABC”, the “3” key has “DEF”, the “4” key has “GHI”, etc.). However, the common “Directory” or “Dial-by-Name” feature of a PBX reveals this to be a generally unsatisfying way to enter data.
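One reason Dial-by-Name input is unsatisfying is that several letters share each key, so different names can collapse to the same digit string. The following Python sketch shows the effect using the standard keypad lettering; the helper names and the sample directory entries are hypothetical.

```python
from collections import defaultdict

# Letters commonly printed on a telephone keypad.
KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}
LETTER_TO_DIGIT = {letter: digit
                   for digit, letters in KEYPAD.items()
                   for letter in letters}


def name_to_digits(name):
    """Return the key sequence a caller would press to spell the name."""
    return "".join(LETTER_TO_DIGIT[c] for c in name.upper() if c in LETTER_TO_DIGIT)


def build_directory(names):
    """Group names by the digit string they map to."""
    directory = defaultdict(list)
    for name in names:
        directory[name_to_digits(name)].append(name)
    return dict(directory)


directory = build_directory(["Doe", "Foe", "Smith"])
print(directory["363"])    # ['Doe', 'Foe']
print(directory["76484"])  # ['Smith']
```

When two entries share a key sequence, the service has to fall back on reading out the candidates one at a time.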
Short Messaging Service (SMS) users are familiar with using a mobile telephone keypad to type text by rapidly and repeatedly pressing a specific digit to cycle through a stack of letters, numbers and symbols associated with that key (“2” yields “A”, “22” yields “B”, “222” yields “C”, etc.). However, this method is not suitable for a voice service, as DTMF can only represent 16 possible characters (the digits 0 to 9, “*”, “#”, and four additional reserved keys).
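The multi-tap scheme described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each key's stack holds its letters followed by the digit itself, and the pause that separates one character from the next on a handset is modeled here as a space. The function name decode_multitap is hypothetical.

```python
# Assumed character stack per key: the printed letters, then the digit itself.
MULTITAP = {"2": "ABC2", "3": "DEF3", "4": "GHI4", "5": "JKL5",
            "6": "MNO6", "7": "PQRS7", "8": "TUV8", "9": "WXYZ9"}


def decode_multitap(presses):
    """Decode space-separated runs of a repeated key, e.g. '44 444' -> 'HI'."""
    text = []
    for run in presses.split():
        key = run[0]
        if key not in MULTITAP or set(run) != {key}:
            raise ValueError(f"invalid run: {run!r}")
        stack = MULTITAP[key]
        # Repeated presses cycle through the stack and wrap around at the end.
        text.append(stack[(len(run) - 1) % len(stack)])
    return "".join(text)


print(decode_multitap("2 22 222"))  # ABC
print(decode_multitap("44 444"))    # HI
```

The decoder depends on knowing where one run ends and the next begins, which the handset infers from timing rather than from the key presses themselves.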
Voice recognition offers greater flexibility for input, but is still an emerging technology that requires extensive training and/or programmer-defined dictionaries, and still appears to be susceptible to speaker dialects and accents, subject matter vernacular and unusual target names. Additionally, natural language speech is very open-ended—users frequently utter speech that is not anticipated by the developers of a voice service, with the result that utterances that are logical to the user are unrecognizable to the service, leading to a frustrating user experience. The continual effort to accommodate regional and personal differences in vocabulary and speech patterns is an enormously expensive proposition for a service developer, and requires expertise that is probably not held by the service developer.
Furthermore, unpronounceable texts (such as passwords) are poor candidates for voice recognition.