1. Field of the Invention
The present invention is directed to a system that identifies and stores new phonetically based identifiers, such as names for a voice dialer and, more particularly, uses a dictionary and a word generator to produce candidates from a limited text input device, such as a telephone DTMF key pad or a spelling recognizer where there are potentially multiple candidates for the letters of a name, to produce name candidates one of which is selected by a speech recognizer.
2. Description of the Related Art
In speech-controlled systems, that is, systems where the human voice is the primary or only mode of user input, human speech is processed by a subsystem called a speech recognizer (or simply a recognizer), which may contain both software and hardware components. A typical speech-controlled system obtains a speech input (called an utterance) from a human user and uses the speech recognizer subsystem to determine which words were spoken (called the recognized text); it then uses those words to determine the actions to be carried out. Of course the recognized text will not always correctly match the utterance, since speech recognizers are still imperfect.
The current state of the art in speech recognition technology does not permit so-called "open-set" recognition, in which the human user may say anything at all and the speech recognizer determines the correct word sequence. Instead, every system that uses a speech recognizer must supply a description of the possible word sequences that the system expects to hear from the user; we call these possibilities the in-set utterances. The manner in which the in-set utterances are specified depends on the speech recognizer.
The present invention is concerned with conventional recognizers that require, as part of the specification of in-set utterances, a description of the pronunciation of each word in those utterances. The pronunciation of each word is typically provided as a phonetic spelling, a transcription of the pronunciation in a phonetic alphabet. For example, the word "phone" could be specified as being pronounced "f ow n", where "f", "ow", and "n" are elements of the alphabet. There are several phonetic alphabets, but any particular recognizer of this class uses only one.
These systems typically provide to the recognizer a list of all the words that occur in the in-set utterances, along with one or more phonetic spellings of each word. In the so-called speaker-independent systems with which the present invention is concerned, multiple phonetic spellings of a word are often necessary because of differences in the way people pronounce words; an example is "tomayto" and "tomahto".
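As an illustration only, the word list described above might be represented as a mapping from each word to one or more phonetic spellings. The following Python sketch assumes a space-separated phone notation in the style of the "f ow n" example; the names `LEXICON`, `pronunciations`, and `recognizer_entries` are hypothetical, and the phone symbols shown for "tomato" are illustrative rather than taken from any particular recognizer.

```python
# Hypothetical pronunciation lexicon: each word maps to one or more
# phonetic spellings, each a space-separated sequence of phone symbols.
# The phone symbols are illustrative assumptions, not a recognizer's alphabet.
LEXICON = {
    "phone": ["f ow n"],
    "tomato": ["t ah m ey t ow", "t ah m aa t ow"],  # "tomayto" / "tomahto"
}


def pronunciations(word):
    """Return the phonetic spellings known for a word, or [] if unknown."""
    return LEXICON.get(word.lower(), [])


def recognizer_entries(words):
    """Flatten words into (word, phone list) pairs, one per pronunciation."""
    return [(w, p.split()) for w in words for p in pronunciations(w)]
```

A speaker-independent system would supply every pair returned by `recognizer_entries` to the recognizer, so that either pronunciation of a word can be matched.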
The maximum number of distinct words usable at any one time depends on the particular recognizer. For simple recognizers, the maximum may be only a few dozen, or even fewer. More complex recognizers can handle hundreds or thousands of words at a time. When each utterance consists of only a single word, some recognizers can handle a few tens of thousands of words. Recognizers that handle multi-word "continuous speech" utterances are currently restricted to a few thousand or tens of thousands of words at most.
As already mentioned, a speech recognition application must identify in advance all the legitimate "in-set" utterances. However, in certain applications it would be beneficial to provide the user with the ability to add new in-set utterances in the course of using the application.
For example, consider an application that permits a user to place telephone calls simply by speaking the name of the person desired. The user might say "Call John Jones". The system responds "Dialing John Jones at 555-1234" and completes the call. Such a system is called a voice dialer.
Suppose that there is a need to provide a "personalized" voice dialing service, where a user may speak a name from a personal list, unique to that user. In other words, each user has a personal address book containing a list of names and associated phone numbers, and each user's address book is distinct from that of other users. The application must first identify the user to tell which address book to use; only after the correct address book is identified can the application provide the correct list of in-set utterances to the speech recognizer.
What is needed is a system that will allow the addition of new names, with associated phone numbers, to the personal address books using only a telephone, without a computer terminal or keyboard or any other device at all, and without the need for human intervention in any way. What is more particularly needed is a system that acquires from the user, over the telephone, enough information to create a phonetic spelling of the name to be added (because that phonetic spelling must be provided to the recognizer for subsequent recognitions from this user's address book).
The reason that this problem is difficult is that a system cannot simply ask the user to pronounce the name to be added and process that utterance with a speech recognizer, since, by definition, we don't know the name to be added.
The present invention assumes that a conventional name dictionary is available, that is, a list of a large number of the most common names (perhaps several hundred thousand, covering about 95% of the population) with one or more phonetic spellings for each. However, the entire dictionary cannot be provided to the speech recognizer because it contains too many possible utterances. Moreover, the name that the user wishes to add may not be in the name dictionary, since it is impossible to compile an exhaustive list of names.
Notice that in this example the system does not actually need the English spelling of the name to be added (although having that spelling would suffice). The voice dialer does not need a text representation of the names in an address book since it never interacts with the user except over the telephone; it only needs a phonetic representation of each name (which is what must be loaded into the speech recognizer) and, for each name, the associated number to dial.
For the purposes of simplicity the discussion herein will continue to use this example as a typical one for our problem; that is, the specific problem is to obtain, by telephone only, the phonetic spelling of a name. But the general problem is to determine, using a limited character set input device, such as a telephone, a phonetic spelling of a word or phrase from a set much larger than can be managed by the speech recognizer, where the set (in general) is not completely known in advance.
Given a text representation of a name, that is, its spelling, it is conventional to determine an adequate phonetic spelling. For the fairly rare name that is not in the name dictionary, a conventional text-to-phoneme heuristic (e.g., the so-called Navy rules) that finds a reasonable phonetic transcription given a text word is used. With this approach, only an extremely rare name will yield a phonetic transcription so poor that recognition is impossible.
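The lookup-then-fallback flow described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `NAME_DICTIONARY` is a one-entry stand-in for a real name dictionary, and `letter_to_sound` is a crude one-phone-per-letter placeholder, not the Navy rules or any other published text-to-phoneme heuristic. All names and phone symbols are hypothetical.

```python
# Toy stand-in for a name dictionary (hypothetical entry).
NAME_DICTIONARY = {"jones": ["jh ow n z"]}

# Crude one-phone-per-letter table. This is only a placeholder for a real
# text-to-phoneme heuristic; it is NOT an implementation of the Navy rules.
LETTER_PHONES = {
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f", "g": "g",
    "h": "hh", "i": "ih", "j": "jh", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "ow", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "ah",
    "v": "v", "w": "w", "x": "k s", "y": "y", "z": "z",
}


def letter_to_sound(name):
    """Placeholder fallback: map each letter to a single fixed phone."""
    return " ".join(LETTER_PHONES[ch] for ch in name.lower() if ch in LETTER_PHONES)


def phonetic_spellings(name):
    """Prefer dictionary pronunciations; fall back to the heuristic."""
    known = NAME_DICTIONARY.get(name.lower())
    if known:
        return list(known)
    return [letter_to_sound(name)]
```

A name found in the dictionary returns its stored spellings unchanged, while an unknown name always yields exactly one heuristic transcription.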
There are a number of different ways that a system can obtain a text representation of a name over the telephone.
One method is to recognize letter spelling using a speech recognizer. This is essentially a speech recognition problem with only twenty-six "words." A phonetic spelling for each letter is created, any sequence of letters is permitted as a legitimate utterance, and the user is asked to spell the name. The problem with this method is that speech recognition of the alphabet is extremely poor, since (a) all letters but one consist of a single syllable, giving the recognizer little chance at differentiation, and (b) many letters sound very much alike except for subtle distinctions difficult to detect with current recognition technology. Using letter spelling in conjunction with a dictionary of names when the word being spelled is in the dictionary works better, but still not well enough for all applications, such as voice-based dialing, because it is highly possible that a surname will not be in the dictionary.
To overcome the low accuracy of recognizing letter spelling, the user can be instructed to spell using a "phonetic alphabet" of the form Alpha, Bravo, Charlie, Delta, and so forth. This greatly improves spelling accuracy, but has the drawback that the user must learn the twenty-six equivalents for the letters of the alphabet.
Another possibility is to use Dual-Tone Multi-Frequency (DTMF) keys, sometimes called TOUCHTONE keys, for spelling. There are several conventional approaches used for spelling with DTMF keys, using two key presses for each letter. For example, first press the key that contains the letter, then press 1 if the letter is first on the key, 2 if the letter is second on the key, and 3 if the letter is third on the key. So, for example, letter A is entered as 21, letter K is entered as 52, and letter S is entered as 73. The star key is typically used to denote the end of the name. The difficulty with this scheme is that it is slow, tedious, and error-prone, even with practice.
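The two-press scheme just described can be sketched in Python as follows. The keypad layout is an assumption: it uses the classic arrangement without Q or Z, which is consistent with S being the third letter on key 7 in the example above. The names `KEYPAD` and `decode_two_key` are hypothetical.

```python
# Assumed classic keypad layout (no Q or Z), consistent with S being the
# third letter on key 7 in the example above.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PRS", "8": "TUV", "9": "WXY",
}


def decode_two_key(presses):
    """Decode a two-presses-per-letter DTMF string terminated by '*'."""
    if not presses.endswith("*"):
        raise ValueError("input must end with the star key")
    body = presses[:-1]
    if len(body) % 2 != 0:
        raise ValueError("each letter needs exactly two key presses")
    letters = []
    for i in range(0, len(body), 2):
        key, pos = body[i], body[i + 1]
        if key not in KEYPAD or pos not in "123":
            raise ValueError(f"invalid key pair: {key}{pos}")
        # The second press selects the first, second, or third letter on the key.
        letters.append(KEYPAD[key][int(pos) - 1])
    return "".join(letters)
```

Under this layout, entering the five-letter name JONES takes eleven key presses (5163623273*), which illustrates why the scheme is slow and tedious.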
What is needed is a system that will overcome the above-described problems.
It is an object of the present invention to provide a system that inputs phonetic spellings of a voice recognizable name using a limited character input device which provides an input corresponding to a number of different possible words.
It is another object of the present invention to use a telephone to input spellings of a word via a key pad or voice.
It is a further object of the present invention to provide a voice dialer that dials using names.
It is also an object of the present invention to use speech recognition with letter-by-letter spelling of a voice recognizable name.
It is another object of the present invention to allow applications, such as call routers, to be provisioned in a way that does not require a system administrator to use a screen interface to set up the names.
The above objects can be attained by a system that uses a limited text input device to narrow the possibilities for the selection of a phonetically based name used in a voice dialer. The system allows a user to enter a DTMF or voice spelled signature of a name; a signature is a sequence of alpha or numeric digits which has a number of possible interpretations and which can be called a multiple word possibility input sequence. Since the signature could actually represent any of a multiplicity of names, a dictionary is used to generate likely possibilities or candidates for the phonetic spelling of the word. A word generator generates additional likely possibilities from the signature. A speech recognizer picks the best representation from the list of names associated with the signature based on a spoken version of the name. The selection possibilities can be narrowed by asking the user if a high probability phonetic spelling candidate is the name. A first name and last name procedure, in which the signatures produce separate candidate lists, is used to provide an entry in an address book that is used for voice dialing.
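The dictionary step of this summary, in which a signature is expanded into candidate phonetic spellings, can be illustrated with a short Python sketch. It covers only a one-key-per-letter DTMF signature and a dictionary lookup; the word generator and the speech recognizer that makes the final selection are not modeled. The letter-to-key mapping (including the placement of Q on 7 and Z on 9), the toy dictionary entries, the phone symbols, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed letter-to-key mapping; placing Q on 7 and Z on 9 is an assumption.
KEY_LETTERS = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_KEY = {ch: key for key, letters in KEY_LETTERS.items() for ch in letters}

# Toy name dictionary with hypothetical entries and phone symbols.
NAME_DICTIONARY = {
    "JONES": ["jh ow n z"],
    "LONER": ["l ow n er"],
    "SMITH": ["s m ih th"],
}


def signature(name):
    """Return the one-key-per-letter DTMF signature of a name."""
    return "".join(LETTER_TO_KEY[ch] for ch in name.upper() if ch in LETTER_TO_KEY)


def build_index(name_dictionary):
    """Group (name, phonetic spelling) pairs by the name's signature."""
    index = defaultdict(list)
    for name, spellings in name_dictionary.items():
        for spelling in spellings:
            index[signature(name)].append((name, spelling))
    return index


def candidates(index, sig):
    """Return the dictionary candidates that share the given signature."""
    return index.get(sig, [])
```

With this toy dictionary, `candidates(build_index(NAME_DICTIONARY), "56637")` returns both the JONES and LONER entries, since the two spellings share one signature. A list of this kind is small enough to hand to a recognizer, which would then choose among the candidates using the spoken name.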
These together with other objects and advantages which will be subsequently apparent, reside in the details of construction and operation as more fully hereinafter described and claimed, reference being had to the accompanying drawings forming a part hereof, wherein like numerals refer to like parts throughout.