This application claims the priority of German patent document 197 09 518.6, filed Mar. 10, 1997, the disclosure of which is expressly incorporated by reference herein.
The invention relates to a method and apparatus for real-time speech input of a destination address into a navigation system.
German patent document DE 196 00 700 describes a target guidance system for a motor vehicle in which a fixedly mounted circuit, a contact field circuit or a voice recognition apparatus can be used as an input device. The document, however, does not deal with the vocal input of a target address in a target guidance system.
Published European patent application EP 0 736 853 A1 likewise describes a target guidance system for a motor vehicle. The speech input of a target address in a target guidance system is, however, not the subject of this document.
Published German patent application DE 36 08 497 A1 describes a process for the speech-controlled operation of a telecommunication apparatus, in particular a car telephone. A disadvantage of this process is that it does not address the special problems that arise in the speech input of a target address in a target guidance system.
Not yet prepublished German patent application P 195 33 541.4-52 discloses a method and apparatus of this type for the automatic control of one or more devices by speech commands or by speech dialogue in real time. Input speech commands are recognized by a speech recognition device comprising a speaker-independent speech recognition engine and a speaker-dependent additional speech recognition engine; the command recognized with the highest recognition probability is identified as the input speech command, and the functions of the device or devices associated with this speech command are initiated. The speech command or speech dialogue is formed on the basis of at least one syntax structure, at least one basic command vocabulary and, if necessary, at least one speaker-specific additional command vocabulary. The syntax structures and basic command vocabularies are presented in speaker-independent form and are established in real time. The speaker-specific additional vocabulary is input and/or modified by the respective speaker, with an additional speech recognition engine that operates according to a speaker-dependent recognition method being trained by each speaker, in training phases during and outside real-time operation, to the speaker-specific features of that speaker by inputting the additional command at least once. The speech dialogue and/or control of the devices is developed in real time as follows:
Speech commands input by the user are fed to a speaker-independent speech recognition engine operating on the basis of phonemes and to the speaker-dependent additional speech recognition engine, where they are subjected to feature extraction, checked for the presence of additional commands from the additional command vocabulary, and classified in the speaker-dependent additional speech recognition engine on the basis of the features extracted therein.
Then the classified commands and syntax structures of the two speech recognition engines, recognized with a certain probability, are assembled into hypothetical speech commands, and these are checked and classified for their reliability and recognition probability in accordance with the syntax structure provided.
Thereafter, the hypothetical speech commands are checked for their plausibility in accordance with specified criteria and, of the hypothetical speech commands recognized as plausible, the one with the highest recognition probability is selected and identified as the speech command input by the user.
Finally, the functions of the device to be controlled that are associated with the identified speech command are initiated and/or answers are generated in accordance with a predetermined speech dialogue structure to continue the speech dialogue.
According to this document, the method described can also be used to operate a navigation system, with a destination address being input by entering letters or groups of letters in a spelling mode, and with the user being able to set up a list for storing destination addresses for the navigation system using names and abbreviations that can be determined in advance.
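The selection stage of this prior-art method can be illustrated with a short Python sketch. It assumes that each hypothetical speech command already carries a recognition probability and that the "specified criteria" for plausibility can be expressed as a predicate; `Hypothesis`, `select_command`, `plausible_in_state` and the set of expected commands are hypothetical names chosen for illustration, not taken from the cited application.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Hypothesis:
    """A hypothetical speech command assembled from both recognition engines."""
    text: str
    probability: float  # recognition probability, assumed to lie in [0, 1]


def select_command(
    hypotheses: Iterable[Hypothesis],
    is_plausible: Callable[[Hypothesis], bool],
) -> Optional[Hypothesis]:
    """Return the plausible hypothesis with the highest recognition probability,
    or None if no hypothesis passes the plausibility check."""
    plausible = [h for h in hypotheses if is_plausible(h)]
    if not plausible:
        return None
    return max(plausible, key=lambda h: h.probability)


# Hypothetical plausibility criteria: a probability floor plus the set of
# commands the dialogue expects in its current state.
EXPECTED = {"start navigation", "store address", "call up address"}


def plausible_in_state(h: Hypothesis) -> bool:
    return h.probability >= 0.3 and h.text in EXPECTED


candidates = [
    Hypothesis("store address", 0.62),
    Hypothesis("start navigation", 0.55),
    Hypothesis("stop radio", 0.80),  # most probable, but not expected here
]
best = select_command(candidates, plausible_in_state)  # the "store address" hypothesis
```

Filtering for plausibility before taking the maximum is what lets a less probable but plausible command win over a more probable one that makes no sense in the current dialogue state.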
The disadvantage of this method is that the special properties of the navigation system are not discussed, and only the speech input of a destination location by means of a spelling mode is described.
The object of the invention is to provide an improved method and apparatus of the type described above, in which the special properties of a navigation system are taken into account and the speech input of a destination address is simplified.
Another object of the invention is to provide such an arrangement which enables faster speech input of a destination address in a navigation system, improving operator comfort.
These and other objects and advantages are achieved by the method and apparatus according to the invention for speech input of destination addresses in a navigation system, which uses a known speech recognition device, such as described for example in the document referred to above, comprising at least one speaker-independent speech-recognition engine and at least one speaker-dependent additional speech-recognition engine. The method according to the invention makes possible various input dialogues for speech input of destination addresses. In a first input dialogue (hereinafter referred to as "destination location input"), the speaker-independent speech recognition device is used to recognize destination locations spoken in isolation and, if such a destination location is not recognized, to recognize continuously spoken letters and/or groups of letters. In a second input dialogue (hereinafter referred to as "spell destination location"), the speaker-independent speech recognition engine is used to recognize continuously spoken letters and/or groups of letters. In a third input dialogue (hereinafter referred to as "coarse destination input"), the speaker-independent speech-recognition engine is used to recognize destination locations spoken in isolation and, if such a destination location is recognized, to recognize continuously spoken letters and/or groups of letters. In a fourth input dialogue (hereinafter referred to as "indirect input"), the speaker-independent speech recognition engine is used to recognize continuously spoken numbers and/or groups of numbers. In a fifth input dialogue (hereinafter referred to as "street input"), the speaker-independent speech-recognition device is used to recognize street names spoken in isolation and, if the street name spoken in isolation is not recognized, to recognize continuously spoken letters and/or groups of letters.
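The fallback used by the "destination location input" and "street input" dialogues (isolated word first, continuously spoken letters if that fails) can be sketched as an ordered list of recognition modes. The recognizers below are toy string-matching stand-ins assumed purely for illustration; in the system described, these modes would be served by the speaker-independent engine with the lexicon loaded for the active dialogue. All function and constant names are hypothetical.

```python
from typing import Callable, Optional, Sequence

# A recognition mode maps an utterance to a recognized lexicon entry, or None.
Mode = Callable[[str], Optional[str]]


def run_input_dialogue(utterance: str, modes: Sequence[Mode]) -> Optional[str]:
    """Try each recognition mode in order; the first successful result wins."""
    for mode in modes:
        result = mode(utterance)
        if result is not None:
            return result
    return None


# Toy stand-in for the lexicon of the active dialogue (hypothetical).
PLACE_LEXICON = {"Flensburg", "Stuttgart", "Ulm"}


def isolated_word(utterance: str) -> Optional[str]:
    """Accept the utterance only if it is a whole entry of the lexicon."""
    return utterance if utterance in PLACE_LEXICON else None


def spelled_letters(utterance: str) -> Optional[str]:
    """Join continuously spoken letters (e.g. "U L M") and look the word up."""
    word = "".join(utterance.split()).lower()
    for entry in PLACE_LEXICON:
        if entry.lower() == word:
            return entry
    return None


# "destination location input": isolated word first, spelling as fallback.
DESTINATION_LOCATION_INPUT = [isolated_word, spelled_letters]
# "spell destination location": spelling mode only.
SPELL_DESTINATION_LOCATION = [spelled_letters]
```

With this arrangement, "spell destination location" is the same dialogue runner with a single mode, so dialogues that share a fallback strategy do not duplicate control logic.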
By means of the input dialogues described above, the navigation system is supplied with verified destination addresses, each comprising a destination location and a street. In a sixth input dialogue (hereinafter referred to as "call up address"), in addition to the speaker-independent speech-recognition engine, the speaker-dependent additional speech-recognition engine is used to recognize keywords spoken in isolation. In a seventh input dialogue (hereinafter referred to as "store address"), a keyword spoken in isolation by the user is assigned a destination address entered by the user, so that during the input dialogue "call up address" a destination address associated with the corresponding recognized keyword is transferred to the navigation system.
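The interplay of "store address" and "call up address" amounts to a mapping from keywords to verified destination addresses. The sketch below assumes the keyword has already been recognized by the speaker-dependent engine and is available as a plain string; the class and method names and the example address are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DestinationAddress:
    """A verified destination address: destination location plus street."""
    location: str
    street: str


class AddressBook:
    """Hypothetical store mapping user keywords to destination addresses."""

    def __init__(self) -> None:
        self._entries: Dict[str, DestinationAddress] = {}

    def store_address(self, keyword: str, address: DestinationAddress) -> None:
        # "store address": assign the entered destination address to the keyword.
        self._entries[keyword] = address

    def call_up_address(self, keyword: str) -> Optional[DestinationAddress]:
        # "call up address": return the address for the recognized keyword,
        # or None if the keyword has not been stored.
        return self._entries.get(keyword)


book = AddressBook()
book.store_address("home", DestinationAddress("Flensburg", "Musterstraße 1"))
```

The address returned by `call_up_address` is what would be handed to the navigation system.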
The method according to the invention is based primarily on the fact that the entire admissible vocabulary for a speech-recognition device is not loaded into the speech-recognition device at the moment it is activated; rather, at least one required lexicon is generated from the entire possible vocabulary during real-time operation and is loaded into the speech-recognition device as a function of the input dialogue required for executing an operating function. There are more than 100,000 locations in the Federal Republic of Germany that can serve as vocabulary for the navigation system. If this entire vocabulary were loaded into the speech-recognition device, the recognition process would be extremely slow and prone to error. A lexicon generated from this vocabulary comprises only about 1500 words, so that recognition is much faster and the recognition rate higher.
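A minimal sketch of generating only the lexicon that the activated input dialogue needs, instead of loading the whole vocabulary, is shown below. The entry fields, the two selection rules and the town and street names are assumptions for illustration; the default `p=1000` mirrors the "p" largest cities used for the basic lexicon in the embodiment described later.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class VocabularyEntry:
    name: str
    kind: str                       # "place" or "street" (assumed classification)
    population: int = 0             # used for places only
    location: Optional[str] = None  # used for streets only: the owning place


def lexicon_for_dialogue(
    vocabulary: Sequence[VocabularyEntry],
    dialogue: str,
    selected_location: Optional[str] = None,
    p: int = 1000,
) -> List[str]:
    """Build just the lexicon required by the activated input dialogue."""
    if dialogue == "destination location input":
        # The p most populous places, largest first.
        places = sorted(
            (e for e in vocabulary if e.kind == "place"),
            key=lambda e: e.population,
            reverse=True,
        )
        return [e.name for e in places[:p]]
    if dialogue == "street input":
        # Only the streets of the destination location already chosen.
        return [
            e.name
            for e in vocabulary
            if e.kind == "street" and e.location == selected_location
        ]
    raise ValueError(f"no lexicon rule for dialogue {dialogue!r}")


# Toy vocabulary with made-up names and figures.
VOCABULARY = [
    VocabularyEntry("Town A", "place", population=50_000),
    VocabularyEntry("Town B", "place", population=2_000),
    VocabularyEntry("Town C", "place", population=30_000),
    VocabularyEntry("Street X", "street", location="Town A"),
    VocabularyEntry("Street Y", "street", location="Town B"),
]
```

Only the returned list would be loaded into the recognizer, so its search space stays at lexicon size rather than vocabulary size.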
At least one destination file, which contains all possible destination addresses and certain additional information for the possible destination addresses of a guidance system and is stored in at least one database, serves as the data basis for the method according to the invention. From this destination file, lexica are generated that comprise at least parts of the destination file, with at least one lexicon being generated in real time as a function of at least one activated input dialogue. It is especially advantageous for the destination file to contain additional information for each stored destination location, for example political affiliation or an additional naming component, postal code or postal code range, telephone area code, state, population, geographic code, phonetic description, or lexicon membership. This additional information can then be used to resolve ambiguities or to accelerate the search for the desired destination location.
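The use of additional information to resolve ambiguities can be sketched as successive filters over the candidates that share a recognized place name. The record fields are a hypothetical subset of the destination file entries; the Flensburg values are taken from Table 1, and the two entries named Neustadt illustrate a place name that occurs in more than one state. The function name `resolve` is illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Destination:
    """Subset of the additional information stored per destination location."""
    place_name: str
    naming_component: str = ""                      # e.g. "in Holstein"
    state: str = ""
    postal_range: Optional[Tuple[int, int]] = None  # inclusive bounds


def resolve(
    candidates: Sequence[Destination],
    recognized_name: str,
    state: Optional[str] = None,
    postal_code: Optional[int] = None,
) -> List[Destination]:
    """Narrow the places sharing a recognized name using additional information."""
    matches = [c for c in candidates if c.place_name == recognized_name]
    if state is not None:
        matches = [c for c in matches if c.state == state]
    if postal_code is not None:
        matches = [
            c for c in matches
            if c.postal_range is not None
            and c.postal_range[0] <= postal_code <= c.postal_range[1]
        ]
    return matches


DESTINATION_FILE = [
    Destination("Flensburg", "", "Schleswig-Holstein", (24900, 24999)),
    Destination("Neustadt", "an der Weinstraße", "Rheinland-Pfalz"),
    Destination("Neustadt", "in Holstein", "Schleswig-Holstein"),
]
```

If more than one candidate remains, the dialogue could ask the user for one more attribute (state, postal code, naming component) and call `resolve` again with it.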
Instead of the phonetic description, a transcription of the phonetic description in the form of a chain of indices, depending on the implementation of the transcription, can be used for the speech-recognition device. In addition, an automatic phonetic transcription can be provided that performs a rule-based conversion of names present in orthographic form into a phonetic description, using a table of exceptions. Lexicon membership can be entered only if the corresponding lexica have been generated from the destination file in an "off-line editing mode," separately from the actual operation of the navigation system, and have been stored in the (at least one) database, for example a CD-ROM or a remote database at a central location that can be accessed by suitable communications devices such as a mobile radio network. Generating the lexica in the "off-line editing mode" makes sense only if sufficient storage space is available in the (at least one) database, and is especially suitable for lexica that are required very frequently. In particular, a CD-ROM or an external database can be used as the database for the destination file, since in this way the destination file can always be kept up to date.
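The automatic phonetic transcription mentioned above (rule-based conversion with a table of exceptions) can be illustrated with a greedy longest-match sketch. The handful of rules and their output symbols are hypothetical toy values loosely modelled on the notation in Table 1, not a usable German rule set; the single exception entry reuses the phonetic description of Flensburg from Table 1.

```python
from typing import Dict, List

# Toy grapheme-to-phoneme rules (hypothetical; a real rule set is far larger).
RULES: Dict[str, str] = {
    "sch": "S",
    "ch": "x",
    "ei": "aI",
    "ie": "i:",
    "e": "E",
    "u": "U",
}
MAX_RULE_LEN = max(len(grapheme) for grapheme in RULES)

# Table of exceptions: names whose pronunciation the rules would get wrong.
# The Flensburg entry is the phonetic description given in Table 1.
EXCEPTIONS: Dict[str, str] = {
    "Flensburg": "|fl'Ens|bUrk|",
}


def transcribe(name: str) -> str:
    """Convert an orthographic name to a phonetic string.

    Exceptions take precedence; otherwise the longest matching rule is applied
    at each position, and letters without a rule are copied unchanged.
    """
    if name in EXCEPTIONS:
        return EXCEPTIONS[name]
    word = name.lower()
    out: List[str] = []
    i = 0
    while i < len(word):
        for length in range(MAX_RULE_LEN, 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])  # no rule for this letter: keep it as is
            i += 1
    return "".join(out)
```

Without the exception, these toy rules would render Flensburg as "flEnsbUrg", missing the final devoicing and the stress and syllable marks; this is the kind of case the table of exceptions is meant to catch.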
At the moment, not all possible place names in the Federal Republic of Germany have been digitized and stored in a database. Similarly, a corresponding street list is not available for all locations. Therefore it is important to be able to update the database at any time. An internal nonvolatile storage area of the navigation system can also be used as the database for the (at least one) lexicon generated in the "off-line editing mode."
To facilitate more rapid speech entry of a desired destination address into the navigation system, a basic vocabulary comprising at least one basic lexicon generated from the destination file is loaded following the initialization phase of the navigation system or, if the nonvolatile internal storage is sufficiently large, each time the database is changed. This basic lexicon can be generated in the "off-line editing mode." The basic lexicon can be stored in the database in addition to the destination file or can be stored in a nonvolatile internal memory area of the navigation system. Alternatively, generation of the basic lexicon can wait until after the initialization phase. Dynamic generation of lexica during real-time operation of the navigation system, in other words during operation, offers two important advantages. Firstly, it makes it possible to put together any desired lexica from the data stored in the (at least one) database, and secondly, considerable storage space is saved in the (at least one) database, since not all of the lexica required for the various input dialogues need to be stored there prior to activation of the speech-recognition engine.
In the embodiment described below, the basic vocabulary comprises two lexica generated in the "off-line editing mode" and stored in the (at least one) database, and two lexica generated following the initialization phase. If the speech-recognition device has sufficient working memory, the basic vocabulary is loaded into it after the initialization phase, in addition to the admissible speech commands for the speech dialogue system, as described in the above-mentioned German patent application P 195 33 541.4-52. Following the initialization phase and pressing of the PTT (push-to-talk) button, the speech dialogue system then allows various information to be input in order to control the devices connected to the speech dialogue system, to perform the basic functions of a navigation system, and to enter a destination location and/or a street as the destination address for the navigation system. If the speech-recognition device has insufficient RAM, the basic vocabulary is not loaded into it until a suitable operating function that accesses the basic vocabulary has been activated.
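The choice between loading the basic vocabulary right after initialization and deferring it until an operating function needs it can be sketched as follows. `capacity_words` is a hypothetical stand-in for the recognizer's free working memory, measured in words, and the class name is illustrative; the sketch does not model which other vocabularies a real system would unload to make room.

```python
from typing import Callable, List, Optional


class BasicVocabularyLoader:
    """Decides when the basic vocabulary is loaded into the recognizer.

    `capacity_words` is a hypothetical stand-in for the recognizer's free
    working memory, measured in words.
    """

    def __init__(self, capacity_words: int, build: Callable[[], List[str]]) -> None:
        self.capacity_words = capacity_words
        self._build = build
        self.loaded: Optional[List[str]] = None

    def after_initialization(self, basic_size: int) -> None:
        # Enough working memory: load the basic vocabulary immediately.
        if basic_size <= self.capacity_words:
            self.loaded = self._build()

    def on_operating_function(self, uses_basic_vocabulary: bool) -> None:
        # Otherwise load it only once a function that needs it is activated.
        if uses_basic_vocabulary and self.loaded is None:
            self.loaded = self._build()
```

Passing the vocabulary as a `build` callable keeps the deferred case cheap: nothing is assembled until a function that accesses the basic vocabulary is actually activated.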
The basic lexicon, stored in the (at least one) database, comprises the "p" largest cities in the Federal Republic of Germany, with the parameter "p" set to 1000 in the embodiment described. This provides direct coverage of approximately 53 million inhabitants of the FRG, or 65% of the population. The basic lexicon comprises all locations with more than 15,000 inhabitants. A regional lexicon, also stored in the database, contains "z" names of regions and areas such as Bodensee, Schwäbische Alb, etc.; in the embodiment described, the regional lexicon comprises about 100 names. The regional lexicon is used to find well-known areas and conventional regional names. These names cover combinations of place names that can be generated and loaded as a new regional lexicon after the local or regional name has been spoken. An area lexicon, generated only after initialization, comprises "a" dynamically loaded place names in the vicinity of the current vehicle location, so that even smaller places in the immediate vicinity can be addressed directly; in the embodiment described, the parameter "a" is set to 400.
This area lexicon is updated at certain intervals while driving, so that locations in the immediate vicinity can always be addressed directly. The current vehicle location is reported to the navigation system by a positioning system known from the prior art, for example a global positioning system (GPS). The lexica described above are assigned to the speaker-independent speech-recognition engine. A name lexicon, which is not generated from the destination file and is assigned to the speaker-dependent speech-recognition engine, comprises approximately 150 keywords from the user's personal address list, spoken by the user. Each keyword is then assigned a specific destination address from the destination file by means of the input dialogue "store address." These destination addresses are transferred to the navigation system by speech input of the associated keywords using the input dialogue "call up address." This results in a basic vocabulary of about 1650 words that are recognized by the speech-recognition device and can be entered as words spoken in isolation (place names, street names, keywords).
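Selecting the "a" places nearest to the current vehicle position for the area lexicon can be sketched with a great-circle (haversine) distance, assuming the geographic code is stored as longitude, latitude in degrees, as the Table 1 example for Flensburg suggests. The function names are hypothetical, and a production system would more likely use a spatial index than a full scan. The constant at the end adds up the approximate lexicon sizes given above.

```python
import heapq
import math
from typing import List, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

Place = Tuple[str, Tuple[float, float]]  # (name, (longitude, latitude)) in degrees


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def area_lexicon(places: Sequence[Place], vehicle: Tuple[float, float], a: int = 400) -> List[str]:
    """Names of the `a` places closest to the vehicle position, nearest first."""
    lon, lat = vehicle
    nearest = heapq.nsmallest(a, places, key=lambda p: haversine_km(lon, lat, *p[1]))
    return [name for name, _ in nearest]


# Approximate sizes from the text: basic (p), regional (z), area (a), name keywords.
BASIC_VOCABULARY_SIZE = 1000 + 100 + 400 + 150  # = 1650
```

Calling `area_lexicon` again with each new position fix, at whatever interval the system uses, keeps the lexicon centred on the vehicle.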
Provision can also be made for transferring addresses from an external data source, for example a PDA (personal digital assistant) or a laptop computer, to the speech dialogue system or to the navigation system by data transfer, and for integrating them as an address lexicon in the basic vocabulary. Normally, no phonetic descriptions of the address data (name, destination location, street) are stored in the external data sources. Nevertheless, in order to transfer these data into the vocabulary of a speech-recognition device, an automatic phonetic transcription of these address data, especially the names, must be performed. Assignment to the correct destination location is then performed using a table.
For the sample dialogues described below, a destination file must be stored in the (at least one) database of the navigation system that contains a data set according to Table 1 for each place known to the navigation system. Depending on the storage location and availability, parts of the information entered can also be missing. However, this relates only to data used to resolve ambiguities, for example additional naming component, county, telephone area code, etc. If address data from an external data source are used, the address data must be supplemented accordingly. Especially important are the word subunits for the speech-recognition device, which operates as a hidden Markov model (HMM) speech recognition engine.
TABLE 1
Description of Entry                                  Example
Place Name                                            Flensburg
Political Affiliation or Additional Naming Component  --
Postal Code or Postal Code Range                      24900-24999
Telephone Area Code                                   0461
County                                                Flensburg, county
State                                                 Schleswig-Holstein
Population                                            87,526
Geographic Code                                       9.43677, 54.78204
Phonetic Description                                  |fl'Ens|bUrk|
Word Subunits for HMM Speech-Recognition Device       f[LN]le e[LN] n[C] s b[Vb] U[Vb]r k, or as indices: 101 79 124 117 12 39 35 82 68
Lexicon Membership                                    3, 4, 78, ...
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.