1. Field of Art
The present invention relates generally to speech recognition-based systems, such as automated directory assistance systems, and, more specifically, to methods and apparatuses for training speech-recognition-based systems.
2. Related Art
Directory assistance services typically require a large number of human operators manning telephones in numerous call centers that are spread out across the country. Sometimes these call centers have to operate twenty-four hours a day, seven days a week and 365 days a year. Because it takes so much manpower and so many resources to provide these services, directory assistance service companies realize that substantial savings can be achieved by replacing human operators and call centers with speech recognition-enabled Directory Assistance Automation (DAA) systems. When they are properly trained, configured and deployed, DAA systems use speech-recognition technology to automatically receive, recognize and respond to some portion of the directory assistance call traffic without human intervention, thereby saving directory assistance providers substantial time, effort and money, and allowing the providers to apply those resources to call traffic that cannot be automated.
Directory assistance requests for business telephone numbers typically account for as much as 80% of all directory assistance calls, the remaining 20% involving requests for residential listings. Moreover, it has been observed by those in the industry that only a small fraction of all business listings contained in a given business listing database account for most of the directory assistance call traffic. In other words, out of all the hundreds of thousands of businesses in this country that would be included in any business-listing database, it is only a relatively small number of well-known companies in that database (e.g., Walmart, Sears and American Airlines) that account for the overwhelming majority of directory assistance calls. This small set of frequently requested business telephone numbers is often referred to as the set of frequently requested numbers (FRNs). Because requests for FRNs represent such a large share of all directory assistance calls, directory assistance service providers typically try to automate some portion of the FRN call traffic first.
In the last decade, there have been many tremendous advances in speech recognition technology, both in terms of its functionality and performance. Nevertheless, many significant problems still exist when it comes to using conventional speech recognition technology in certain commercial applications. DAA is a prime example of a commercial application where these problems arise and, heretofore, have not been adequately addressed by the conventional systems and methods.
A typical directory assistance request in the United States requires some kind of human intervention. Usually, the call is initiated when a caller seeking a telephone number for a particular listing dials 411. When the connection is established, the caller is usually asked to speak the names of a city and state for the listing (also called the locality). The caller's response is usually recorded and stored in a computerized system. Next, the caller will be prompted to state the name of the business or person the caller wishes to reach, usually with the phrase, “What Listing?” or some variant of the same. Again, a computer system records and stores the caller's response. Typically, the caller's two responses (or utterances) are then forwarded to an operator who searches a telephone number database for the relevant telephone number. The database search usually involves a fast, pattern-matching algorithm that returns a relatively small list of telephone numbers deemed by the system to be the most likely candidates for the requested listing. The operator must quickly scan this list and select a unique telephone number to release to the caller, or, if necessary, ask the caller for additional disambiguating information, such as a street name or address for the listing.
There are at least two significant problems with the typical system described above. First, it still involves using a human operator to scan a list of likely candidates and select the correct telephone number to release to the caller. Second, and more importantly, directory assistance callers rarely ask for a listing by saying it exactly as it appears in the telephone directory. Instead, callers frequently leave out parts of a listing (e.g., by saying “Sears” instead of “Sears, Roebuck and Company Department Store”). They add descriptive words that are not part of the listing (e.g., “K-Mart Department Store on Main Street” instead of “K-Mart”). They insert extraneous filler words (e.g., “Uh, I want the number for Sears, please”). They abbreviate listings (e.g., “DMV” for “Department of Motor Vehicles”). They also use other names entirely (e.g., “The phone company” instead of “Verizon”). Often, callers are even speaking to other people while the DAA system is recording their responses to the prompts (e.g., “Hey, hold on a second, I'm on the phone . . . give me the number for Verizon, please.”).
Since it is extremely difficult to predict exactly what a caller will say when prompted for a listing, directory assistance providers typically supply the DAA system with some kind of finite-state grammar (FSG), comprising the most frequently occurring user requests (and their associated FRNs) from a random sample of real-world directory assistance calls. The grammars are created by recording callers' utterances (also known as training tokens) and storing them in a database along with the correct responses as determined by the human operators who took the calls. Then the DAA system is trained against the grammar of tokens and correct responses until the DAA system can do a reasonably good job of recognizing and responding to some specified target percentage of caller utterances that are likely to be received.
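By way of illustration only, the grammar-building step described above might be sketched as follows. The sketch assumes that each training token has already been transcribed to text and paired with the FRN released by the operator. The function names `build_grammar` and `coverage`, the `max_entries` parameter, and the sample telephone numbers are hypothetical and are not taken from any particular DAA system.

```python
from collections import Counter


def build_grammar(tokens, max_entries):
    """Build a simple phrase -> FRN lookup from training tokens.

    tokens: iterable of (transcribed_utterance, frn) pairs (an assumed,
    simplified stand-in for recorded utterances and operator responses).
    max_entries: hypothetical cap on the number of phrases kept.
    """
    # Count how often each normalized (phrase, FRN) pair occurs.
    counts = Counter((utt.strip().lower(), frn) for utt, frn in tokens)
    grammar = {}
    # Keep the most frequently occurring requests first.
    for (utt, frn), _ in counts.most_common():
        if len(grammar) >= max_entries:
            break
        if utt not in grammar:
            grammar[utt] = frn
    return grammar


def coverage(grammar, tokens):
    """Fraction of tokens whose phrase maps to the operator's FRN."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens to evaluate")
    hits = sum(
        1 for utt, frn in tokens if grammar.get(utt.strip().lower()) == frn
    )
    return hits / len(tokens)


# Hypothetical sample of transcribed calls and released numbers.
sample = [
    ("Sears", "555-0100"),
    ("sears", "555-0100"),
    ("Walmart", "555-0101"),
    ("walmart", "555-0101"),
    ("Sears department store", "555-0100"),
    ("the phone company", "555-0102"),
]
grammar = build_grammar(sample, max_entries=2)
```

In this toy sample, a two-entry grammar handles only the exact phrasings “sears” and “walmart”; the longer and alternative phrasings fall outside the grammar, which is the limitation discussed in the paragraphs that follow.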
With conventional systems, practical difficulties still arise, however, when providers try to determine the amount and nature of training data that should be included in the grammar to achieve a target automation rate (i.e., the percentage of correct versus incorrect automated responses). The graph in FIG. 1 illustrates the rate at which the FRN set size grows with the call traffic coverage that is desired. The size of the FSG that must be used to train the system grows in proportion to the growth of the FRN set. As shown in FIG. 1, the increases in traffic coverage begin to level out significantly after a certain number of FRNs are added to the FSG. The fact that continuing to increase the size of the FSG begins to have less and less impact on the traffic coverage achieved presents a computational problem for higher automation levels (unless one designs an annoyingly deep hierarchical call flow that asks a series of disambiguating questions). The problem is compounded by the fact that, as stated above, there are so many ways in which different callers will ask for the same listing. Measurements on actual customer calls show that only 40% of the users ask for a given listing in one particular way. Even if one allows for 10 different ways of asking for each listing, the resulting grammar only covers about 66% of the queries. Thus, it can be extremely difficult for the directory assistance provider to determine the optimal number of FRNs to automate, the optimal number of training tokens needed to train the system to handle the FRNs to be automated, and the optimal allocation of the training tokens across all of the automated FRNs in order to meet a specified target automation rate.
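The diminishing returns described above can be made concrete with a short calculation. The following sketch assumes only a list of request counts, one per FRN (or one per phrasing of a listing). It computes the cumulative share of traffic covered as items are added in order of popularity. The function names and the sample counts are hypothetical and chosen purely for illustration; they are not the measurements cited above.

```python
def cumulative_coverage(counts):
    """Return the share of traffic covered by the top 1, 2, ... items.

    counts: request counts per item (e.g., per FRN), in any order.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    covered = 0
    curve = []
    for count in ordered:
        covered += count
        curve.append(covered / total)
    return curve


def items_needed(counts, target):
    """Smallest number of top items whose coverage reaches target.

    Returns None if the target cannot be reached.
    """
    for n, share in enumerate(cumulative_coverage(counts), start=1):
        if share >= target:
            return n
    return None


# Hypothetical, illustrative request counts for four items.
example_counts = [50, 25, 15, 10]
curve = cumulative_coverage(example_counts)
```

With these illustrative counts, the first item covers half of the traffic, while each later item adds a smaller share than the one before it. Reaching a higher coverage target therefore requires disproportionately more items, which mirrors the leveling-off shown in FIG. 1.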
In addition, callers are often unsure about the physical location of a particular business and, therefore, an exact combination of the recognized locality and listing is often not present in the system. As a result, the requested FRN cannot be obtained from the DAA database, which means the call cannot be automated. For these reasons, the FSG approach described above does not scale well to higher automation rates.
Accordingly, a need exists for more efficient methods of estimating the nature and amount of training data required to achieve a target automation rate. In particular, the industry needs more robust and dependable ways to determine the optimal number of responses (such as FRNs) to automate, the optimal number of training tokens required to train the system, and the optimal allocation of those training tokens across the set of responses to be automated. Such a system would be even more useful if it were adapted to operate efficiently even though the training tokens do not exactly match the callers' spoken words.