1. Field of the Invention
The present invention relates to the field of data processing systems having an audio user interface and is applicable to electronic commerce. More specifically, the present invention relates to various improvements, features, mechanisms, services and methods for improving the audio user interface aspects of a voice interface (e.g., telephone-based) data processing system as well as improvements directed to automatic data gathering.
2. Related Art
As computer systems and telephone networks modernize, it has become commercially feasible to provide information to users or subscribers over audio user interfaces, e.g., telephone and other audio networks and systems. These services allow users, e.g., “callers,” to interface with a computer system for receiving and entering information. A number of these types of services utilize computer implemented automatic voice recognition tools to allow a computer system to understand and react to callers' spoken commands and information. This has proven to be an effective mechanism for providing information because telephone systems are ubiquitous, familiar to most people and relatively easy to use, understand and operate. When connected, the caller listens to information and prompts provided by the service and can speak to the service giving it commands and other information, thus forming an audio user interface.
Audio user interface systems (services) typically contain a number of special words, or command words, herein called “keywords,” that a user can say and then expect a particular predetermined result from the service. In order to provide novice users with information regarding the possible keywords, audio menu structures have been proposed and implemented. However, keyword menu structures for audio user interfaces, in contrast to those for graphical user interfaces, raise a number of unique issues that need to be resolved in order to provide a pleasant and effective user experience. One audio menu structure organizes the keywords in a hierarchical structure with root keywords and leaf (child) keywords. However, this approach is problematic because hierarchical structures are difficult to navigate in an audio user interface framework: it is very difficult for a user to know where in the menu structure he/she is at any time. These problems become worse as the hierarchy deepens. Also, because the user must rely on memory when selecting between two or more choices, audio user interfaces have no effective mechanism for giving the user a big-picture view of the entire menu structure, as a graphical user interface can. Therefore, it would be advantageous to provide a menu structure that avoids the above problems and limitations.
Another approach uses a listing of keywords in the menu structure and presents the entire listing to each user so that the user can recognize and select the desired keyword. However, this approach is also problematic because experienced users, who become familiar with the keywords as they use the service, do not require a recitation of all keywords. Forcing experienced users to hear a keyword listing in this fashion can lead to bothersome, frustrating and tedious user experiences. It would be advantageous to provide a menu structure that avoids or reduces the above problems and limitations.
Moreover, when using audio user interfaces (e.g., speech), many users do not know when it is their turn to speak and can become confused and frustrated when they talk during times when the service is not ready to process their speech. During these periods, their speech is ignored, which damages their experience. Alternatively, novice users may never speak because they do not know when they should. It would be advantageous to provide a service offering a speech recognition mechanism that avoids or reduces the above problems and limitations.
Additionally, computer-controlled data processing systems having audio user interfaces can automatically generate synthetic speech. By generating synthetic speech, an existing text document (or sentence or phrase) can automatically be converted to an audio signal and rendered to a user over an audio interface, e.g., a telephone system, without requiring human or operator intervention. In some cases, synthetic speech is generated by concatenating existing speech segments to produce phrases and sentences. This is called speech concatenation. A major drawback of speech concatenation is that it sounds choppy due to the acoustical nature of the segment junctions. This type of speech often lacks many of the characteristics of human speech and therefore does not sound natural or pleasing. It would be advantageous to provide a method of producing synthetic speech using speech concatenation that avoids or reduces the above problems and limitations.
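By way of illustration only, the junction problem described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular prior system: short sine tones stand in for recorded speech segments, and the function names (tone, concatenate, junction_jumps) and the 8000 Hz sample rate are hypothetical choices made for the example.

```python
import math

def tone(freq_hz, n_samples, rate=8000):
    """Hypothetical stand-in for a recorded speech segment: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]

def concatenate(segments):
    """Naive speech concatenation: place the segments back to back."""
    out = []
    for seg in segments:
        out.extend(seg)
    return out

def junction_jumps(segments):
    """Amplitude jump between the last sample of one segment and the
    first sample of the next, for each segment boundary."""
    return [abs(b[0] - a[-1]) for a, b in zip(segments, segments[1:])]

# Three assumed segments joined without any smoothing at the boundaries.
segments = [tone(200, 50), tone(310, 50), tone(440, 50)]
speech = concatenate(segments)
jumps = junction_jumps(segments)
```

In this sketch, each segment ends partway through a cycle while the next segment starts at zero amplitude, so the jump at each junction is larger than any sample-to-sample step inside a segment. Abrupt discontinuities of this kind are one possible source of the choppy quality noted above.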
Furthermore, callers often request certain content to be played over the audio user interface. For instance, news stories, financial information, or sports stories can be played over a telephone interface to the user. While this content is being delivered, users often speak to other people, e.g., to comment about the content, or otherwise say words into the telephone that are not intended for the service. However, the service processes these audible signals as if they were possible keywords or commands intended by the user. This causes falsely triggered interruptions of the content delivery. Once the content is interrupted, the user must navigate through the menu structure to restart the content. Once the content is restarted, the user must also listen again to information that he/she has already heard. It would be advantageous to provide a content delivery mechanism within a data processing system using an audio user interface that avoids or reduces the above problems and limitations.
Additionally, in using audio user interfaces, there are many environments and conditions that lead to or create poor voice recognition. For instance, noisy telephone or cell phone lines and conditions can cause the service to not understand the user's commands. Poor voice recognition directly degrades and/or limits the user experience. Therefore, it is important that a service recognize when poor voice recognition environments and conditions are present. It is not adequate to merely interrupt the user during these conditions; rather, the manner in which a service deals with these conditions is important for maintaining a pleasant user experience.
Also, many data processing systems having audio user interfaces can provide commercial applications to and for the caller, such as the sale of goods and services, advertising and promotions, financial information, etc. It would be helpful, in these respects, to have the caller's proper name and address during the call. Modern speech recognition systems are not able to obtain a user's name and address with the 100 percent reliability needed to conduct transactions. It is desirable to provide a service that could obtain callers' addresses automatically and economically.