A voice response (VR) system allows a human user to listen to spoken information generated by a computer system. The user enters dual tone multi-frequency (DTMF) tones, or speaks commands, to navigate through the functions of such a VR system.
The implementation of VR systems that respond to tones or spoken commands is well known, but these systems are designed with the assumption that humans will be providing the commands to a computer over a communication link. Furthermore, these systems are typically designed to use human speech in the form of stored audio files that are played over the telephone line in order to communicate with the outside world. Communication with VR systems is thus normally via an analog interface. U.S. Pat. Nos. 4,071,888 and 4,117,263 are representative of basic patents in the field of VR systems. Modern VR systems are largely similar to the centralized systems described in these patents.
In contrast to VR systems, electronic mail (email) employs digital electronic signals for communications between users. Messages are encoded as numbers and sent from place to place over digital computer networks. Furthermore, email can be used to exchange voice messages in the form of digital audio files. However, the interface between email software systems and the underlying network is digital—not analog.
As a result of this analog-digital interface dichotomy, there is currently virtually no integration between voicemail and email. Since voicemail is the most common application of VR systems today, it is the best example. Accessing a voicemail system using a telephone handset, a user may listen to commands and send DTMF (Touchtone®) responses in order to listen to, save, forward, and delete their voicemail messages. However, commercial voicemail systems have a limited message capacity (both in time and space), and the lack of a digital interface in voicemail systems makes integration of voicemail with email and digital audio difficult. Not only is voicemail management using traditional dial-in systems cumbersome, it can be expensive, as cellular and mobile phone users must often incur peak-rate phone charges to access their voicemail. In addition, if the user has multiple telephones with voicemail accounts, then each voicemail account must be checked with a separate phone call, and the user must manage each voicemail box separately. Voicemail is therefore a transient, untrustworthy, and cumbersome medium for communication.
Note that email and voicemail systems both use a “store & forward” model for message delivery. It would thus be desirable to construct a bridge between them (allowing voicemail to reach the Internet and Internet audio messages to reach the phone system), which should enable a number of applications of great utility to be implemented. For example, if voicemail messages were available on a user's computer in digital form and freely available for distribution via email, users of voicemail systems would gain the following benefits: (1) voicemail messages could be captured securely and permanently, just like any other type of computer file; (2) voicemail messages could be distributed and used wherever digital audio files are used, in particular, for transmission to remote locations via email (note the cost of retrieving email remotely is far lower than the long distance charges or peak roaming charges that may be incurred to make calls to voicemail); and (3) because no direct connection is required to a modem, except at one location (the server), users would be able to receive voicemail on non-telephone devices, i.e., with the same devices used for email.
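The bridging idea above, in which a retrieved voicemail travels as an ordinary email attachment, can be sketched with Python's standard library. This is only an illustration under stated assumptions: the addresses, the file name, and the one-second silent WAV payload are hypothetical placeholders, and no message is actually sent.

```python
import io
import wave
from email.message import EmailMessage

def make_silent_wav(seconds=1, rate=8000):
    """Build a small mono 16-bit WAV in memory (stand-in for a retrieved voicemail)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(rate)       # 8 kHz, a typical telephone sampling rate
        w.writeframes(b"\x00\x00" * rate * seconds)
    return buf.getvalue()

def voicemail_to_email(audio_bytes, sender, recipient):
    """Wrap digital audio in an email message with an audio/wav attachment."""
    msg = EmailMessage()
    msg["From"] = sender            # hypothetical address
    msg["To"] = recipient           # hypothetical address
    msg["Subject"] = "New voicemail"
    msg.set_content("A voicemail message is attached as a digital audio file.")
    msg.add_attachment(audio_bytes, maintype="audio", subtype="wav",
                       filename="voicemail.wav")
    return msg

message = voicemail_to_email(make_silent_wav(),
                             "server@example.com", "user@example.com")
```

Once the message object exists, it could be handed to any store-and-forward mail transport, so the recipient's device needs no telephone connection.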
The prior art identifies the value of integrating voicemail with computers and in particular, personal computers (PCs). U.S. Pat. No. 6,339,591, for example, describes a system for sending voicemail messages over the Internet, using proprietary methods (i.e., not email). The most likely configuration that might be used to integrate voicemail with the computer network would effect this integration at the centralized voicemail switch. In such a system, because voicemail messages are stored as digital audio files on the voicemail switch and because that switch is on the computer network, those voicemail messages might then be made available to computers on the network.
U.S. Pat. No. 5,822,405 discloses a method of using a PC or other device equipped with a special modem to retrieve voicemail over a telephone line and store each message in a file on the computer; however, this patent makes no mention of digital distribution of the voicemail messages retrieved. This patent comes close to solving the central problem of interacting between a computer and a VR system, namely the need to use speech recognition in many cases, but room for improvement exists. For example, improvements can be made in the analysis of the audio signals received by a user's computer, and no utility is provided in this prior art patent for the digital distribution of the retrieved messages.
Where voicemail messages are to be saved for later use in a conventional voicemail system, the voicemail messages are kept stored within the voicemail system. For example, U.S. Pat. Nos. 6,295,341; 4,327,251; 6,337,977; and 6,341,160 describe such systems. Even when computers are employed, the messages are generally kept in the answering device (as disclosed in U.S. Pat. No. 6,052,442). U.S. Pat. No. 6,335,963 even teaches that email be employed for notifying a user of voicemail, but not for delivery of the messages themselves.
There is much use made of voice recognition in VR applications, but in almost all these applications, voice recognition is used by a computer to recognize the content of a human voice speaking on the telephone (e.g., as taught in U.S. Pat. Nos. 6,335,962; 6,330,308; 6,208,966; 5,822,405; and 4,060,694). Such human voice recognition techniques are computationally expensive. Readily available human voice recognition applications compare real-time spoken words against a stored dictionary. Because of variations in the human spoken word and variations in the quality of the communications channels, the comparison of a spoken word with a dictionary of words must take into account variations in both the length and the spectral characteristics of the human speech being recognized. Thus, solving the problem of human speech recognition in real-time consumes significant computational resources, which effectively limits human speech recognition to applications that run on fast, relatively expensive computers. Where non-standard audio recognition methods are used, they are typically restricted to narrow applications, as disclosed in U.S. Pat. Nos. 6,324,499; 6,321,194; and 6,327,345.
It should be noted that VR systems often emulate (i.e., “speak”) the human voice, but do not produce it. Instead, they use stored audio files that are played over the telephone communication link. Therefore, the speech that these VR systems produce is identically spoken every time it is played. The recognition of repetitive identical audio signatures is, in fact, a much simpler problem to solve than the problem of recognizing actual spoken human voice produced by a variety of speakers. It would be preferable to provide a system employing such techniques for recognizing stored audio file speech, thereby enhancing computational performance and enabling less expensive processors to be employed.
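To illustrate why an identically replayed prompt is easier to recognize than live speech, the following minimal sketch slides a stored signature over an incoming sample stream and scores each alignment by normalized correlation. It is not the method of any cited patent. The function names, the 0.95 threshold, and the synthetic test signal are all assumptions chosen for illustration.

```python
import math

def normalized_correlation(a, b):
    """Cosine similarity of two equal-length sample sequences (gain-independent)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_signature(stream, signature, threshold=0.95):
    """Return the best-matching offset of `signature` in `stream`, or None
    if no alignment reaches the (assumed) threshold."""
    n = len(signature)
    best_offset, best_score = None, threshold
    for offset in range(len(stream) - n + 1):
        score = normalized_correlation(stream[offset:offset + n], signature)
        if score >= best_score:
            best_offset, best_score = offset, score
    return best_offset

# Hypothetical stored prompt fragment (synthetic, not real speech).
signature = [math.sin(0.3 * i) * math.sin(0.05 * i) for i in range(200)]

# The same fragment as heard over the line: half the gain, slight noise,
# preceded and followed by silence.
replayed = [0.5 * s + 0.001 * ((i % 7) - 3) for i, s in enumerate(signature)]
stream = [0.0] * 50 + replayed + [0.0] * 50

offset = find_signature(stream, signature)
```

Because the replayed audio has a fixed length and shape, a single fixed-length comparison per offset suffices. No time warping or speaker-independent modeling is attempted, which is where much of the cost of human speech recognition arises.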
Another issue with conventional voice-recognition methods applied to VR applications is that the recognition of whole words and phrases can involve considerable latency. In VR applications, it is preferable to keep recognition latency to a minimum to avoid lost audio and poor response. Reduced processing overhead within the application will allow latency to be reduced within the recognition system.
In the prior art, voice recognition is always preceded by a learning step, where the recognizing computer system processes speech audio to build a recognizer library. Many VR and voice recognition inventions include such a learning process, which may be used to teach the computer what to say, what tones to send, or what words to recognize (e.g., as disclosed in U.S. Pat. Nos. 6,345,250; 6,341,264; and 5,822,405). It should be noted that in the prior art, when a system is learning words to be recognized, the learning method is independent of the context of the audio being learned. That is to say, the recognition method stands alone and can distinguish between a word being recognized and all other words (at least theoretically). It would thus be desirable to provide a computer-driven VR system wherein the learning method is simplified to take into account the invariant nature of the messages and the known context of their expression, to require fewer computational resources to be employed.
Much prior art in the field of automatic control of VR systems with a computer depends upon the calling computer knowing the context of the VR system at all times. For example, the application described in U.S. Pat. No. 6,173,042 assumes that the VR system works identically every time, and that tones can be input to the VR system at any time. The prior art recognizes that the context of recognition is important (e.g., as disclosed in U.S. Pat. No. 6,345,254). It would be desirable to provide a programming language to describe VR interactions, which includes a syntax powerful enough to express such context in a general manner.
Many VR control applications (such as described in U.S. Pat. No. 5,822,405) use some form of interpreted programming language to tell the application how to drive the remote VR system. In the prior art however, the scripting language is of a very restricted syntax, specific to its application (for example, voicemail retrieval). In order to build a general purpose VR response system, it would be helpful to have a programming language that is sufficiently powerful to address a wide range of VR applications (e.g., retrieval of stock quotes, airline times, or data from an online banking application).
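One way to picture a script that expresses context is a small state machine in which the same recognized prompt can mean different things, or nothing, depending on the current state. The sketch below is purely hypothetical: the state names, prompt labels, and DTMF strings are invented for illustration and do not reproduce the syntax of any scripting language in the cited art.

```python
# Hypothetical script: in each state, a recognized prompt maps to
# (DTMF tones to send, next state).
SCRIPT = {
    "greeting": {"enter_password": ("1234#", "main_menu")},
    "main_menu": {
        "you_have_new_messages": ("1", "playing"),
        "no_new_messages": ("*", "done"),
    },
    "playing": {"end_of_message": ("7", "main_menu")},
}

def run_script(script, prompts, start="greeting", final="done"):
    """Walk the script over a sequence of recognized prompts.

    Returns the state reached and the list of tone strings sent.
    """
    state, sent = start, []
    for prompt in prompts:
        if state == final:
            break
        transitions = script[state]
        if prompt not in transitions:
            # A prompt that is not expected in the current context is ignored.
            continue
        tones, state = transitions[prompt]
        sent.append(tones)
    return state, sent

final_state, tones_sent = run_script(
    SCRIPT,
    ["enter_password", "you_have_new_messages",
     "end_of_message", "no_new_messages"],
)
```

Because each state lists only the prompts expected in that context, the recognizer needs to distinguish among a handful of candidates at a time. The same table-driven form could describe a stock-quote or banking dialogue by swapping in a different script.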
Another aspect of the learning process that can have a major impact on its efficiency is the user interface (UI). A UI that is too generalized may result in complex manipulations of the interface being required to achieve full control of the learning process. Such a situation arises often when the learning portion of an invention's embodiment is performed with a general purpose tool, as in U.S. Pat. No. 5,822,405. It would be desirable to provide a computer-driven VR system, wherein the UI is specifically adapted to enable easy navigation and control of all of the aspects of the VR system, including any learning method required.
A different issue with conventional voice recognition methods applied to VR applications is that the recognition of whole words and phrases can involve considerable latency. It would be desirable to provide a computer-driven VR system, wherein recognition latency is kept to a minimum to avoid lost audio content and poor response.
When designing a VR control application (such as described in U.S. Pat. No. 5,822,405) it may be necessary to develop some form of interpreted programming language, to tell the application how to drive the remote VR system. In the prior art, however, the scripting language is of a very restricted syntax, specific to its application (for example, voicemail retrieval). In order to build a general purpose VR response system, it would be desirable to employ a programming language that is sufficiently powerful and more general in nature to address a wide range of VR applications (e.g., retrieval of stock quotes, airline times, or for accessing data in an online banking application). If a bridge such as that noted above can be built between voicemail and the Internet, it would make voicemail as easy to review, author, and send, as email. Voicemail, originating in the telephone system, might be integrated directly with messages created entirely on the Internet using an audio messaging application.
Many integrated messaging systems have been built. These systems seek to integrate some combination of voicemail, text messaging, and email into one interface. However, the prior art with respect to unified messaging (UM) is exclusively concerned with creating a closed universe within which the system operates. Such systems, although at times elegant, do not cater to users who have a need to access voicemail from different voicemail systems (such as from home and from work), through an Internet connection. For example, U.S. Pat. No. 6,263,052 archives the voice messages within the voicemail system. It would be desirable to enable the voicemail messages to be available on the computer network, thereby enabling a user to reply to those messages offline, and to forward the reply to the original caller using email, or to make a voicemail response that is delivered by the computer system. If integrated messaging systems could interface directly with any VR system over the public switched telephone network (PSTN), then UM would become easier to apply, and would also become more useful.
Often after voicemail messages are received, a user will wish to reply to such messages. It is convenient for the user to be able to reply to the voicemail at their leisure, and have the reply forwarded to the original sender as another voicemail. Such a system is described in U.S. Pat. No. 6,263,052.
In the prior art it is assumed that if two computers are to communicate with each other they will do so using some form of digital encoding, and that if they are using a telephone line to communicate they will modulate a signal on that line with an audio signal that follows the structure of the digital sequence they wish to communicate. U.S. Pat. Nos. 4,196,311 and 3,937,889 are exemplary of such art. On the other hand, humans communicate with each other over the telephone using analog, not digital, communications. However, if two computer systems, each equipped with voice recognition and the ability to communicate using analog voice communications, were placed in communication with each other in a peer-to-peer configuration, a useful form of two-way communication might result. If the recognition of audio from one computer can drive a program on the other computer, which can in turn send audio responses to the first computer, then secure encoded communications might be effected by use of a normal telephone voice call.
Clearly, it would be desirable to provide a software system, running on a suitably equipped computer, which can be flexibly programmed and easily taught to navigate a VR system using audio signature recognition and which can download chosen audio segments to the computer system as digital audio files. Such a system will preferably enable the automatic scheduled retrieval of audio files from the VR system and enable these files to be automatically forwarded via email to the intended recipient, over the Internet.
It would further be desirable for digital audio files to be played over the telephone system and to leave voicemail messages that can be played directly by the recipient. Yet another desirable feature of such a system would be the use of computationally efficient waveform recognition algorithms to maximize the number of telephone lines that can be simultaneously supported by one computer.
It would still be further desirable to provide flexible interfaces, functions, and programming language to enable general purpose applications to interface with the VR retrieval and forwarding system. Such a system would automatically recognize duplicate audio files (i.e., files which have been downloaded twice from the same VR system), and provide means for the user to prepare digital audio files as replies to received messages, or as new voice messages, and to have those digital audio files delivered via email or over the phone line, to the intended recipient.
Further desirable features of such a system would include means for teaching the software to recognize new audio signatures and to incorporate them into a program script, and such learning processes should be enabled both locally (at a computer with a modem), and remotely (by employing a computer and a modem receiving commands via email from a remote computer). It would further be desirable to provide a system that enables two computers to communicate over an audio communications channel, to achieve an audio encoded computer-to-computer communications system.