1. Field of the Invention
This invention relates generally to speech recognition technology and is particularly concerned with portable, intelligent, interactive devices responsive to non-speaker specific commands or instructions.
2. Description of the Related Art
An example of conventional portable interactive speech recognition equipment is a speech recognition toy. For example, the speech recognition toy disclosed in Japanese Laid-Open Publication S62-253093 contains a plurality of pre-registered commands that are objects of recognition. The equipment compares the voice signals emitted by the children or others who are playing with the toy to voice signals pre-registered by a specific speaker. If the perceived voice matches one or more of the pre-registered signals, the equipment generates a predetermined electrical signal corresponding to the matched voice command, and causes the toy to perform specific operations based on the electrical signal.
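The matching behavior described above can be illustrated with a minimal sketch. It is not taken from the cited publication: the feature vectors, the Euclidean distance measure, the threshold value, and all names (`TEMPLATES`, `SIGNALS`, `MATCH_THRESHOLD`, `recognize`) are hypothetical and chosen only to show the compare-then-signal flow.

```python
import math

# Hypothetical pre-registered templates: command -> feature vector recorded
# from one specific speaker. A real device would use time-series features.
TEMPLATES = {
    "hello": [0.9, 0.1, 0.3],
    "walk": [0.2, 0.8, 0.5],
}

# Hypothetical mapping from a matched command to the signal that drives the toy.
SIGNALS = {"hello": "PLAY_GREETING", "walk": "START_MOTOR"}

# Assumed maximum distance that still counts as a match.
MATCH_THRESHOLD = 0.25


def recognize(features):
    """Return the signal for the closest template, or None if nothing matches."""
    best_command, best_distance = None, float("inf")
    for command, template in TEMPLATES.items():
        distance = math.dist(features, template)
        if distance < best_distance:
            best_command, best_distance = command, distance
    if best_distance <= MATCH_THRESHOLD:
        return SIGNALS[best_command]
    return None
```

Under these assumptions, an input close to the registered speaker's "hello" template yields the greeting signal, while an input that is far from every template yields no signal at all.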
However, because these toys rely on a particular individual's speaking characteristics (such as intonation, inflection, and accent) captured at a particular point in time and recognize only a prestored vocabulary, they quite frequently fail to recognize words and expressions spoken by another person, and are apt not to tolerate even slight variations in pronunciation by the registered speaker. These limitations typically lead to misrecognition or nonrecognition errors which may frustrate or confuse users of the toy, especially children, which, in turn, leads to disuse once the initial novelty has worn off. Further, speaker and word pre-registration is extremely time-consuming and cumbersome, since every desired expression must be individually registered on a one-by-one basis prior to use by a new speaker.
One potential solution may be to incorporate into such devices non-specific speech recognition equipment which uses exemplars from a large population of potential speakers (e.g. 200+ individuals). This technology does a much better job of correctly recognizing a wide range of speakers, but it too is limited to a predefined vocabulary. Moreover, unlike that of speaker-specific recognition equipment, the predefined vocabulary cannot be altered by the user to suit individual needs or tastes. Further, proper implementation of these non-speaker specific techniques for suitably large vocabularies requires copious amounts of memory and processing power currently beyond the means of most commercially available personal computers and digital assistants, as typically each pre-registered word, along with every speaker variation thereof, must be consulted in order to determine a match. Accordingly, conventional non-speaker specific recognition simply does not provide a practical recognition solution for the ultra-cost sensitive electronic toy, gaming or appliance markets.
Moreover, although non-specific speech recognition devices can nevertheless achieve relatively high recognition rates for a range of typical users, they cannot always achieve high recognition rates for all types of users. For example, voice characteristics such as intonation and pitch vary widely depending on the age and sex of the speaker. A speech recognition device attuned to adult-style speech may achieve extremely high recognition rates for adults but may fail miserably with toddlers' voices. Further, conventional non-specific speaker speech recognition could be used by a wide range of people for a wide range of purposes. Consider the case of a speech recognition device used in an interactive toy context. In this scenario, the degree and type of interaction must be rich and developed enough to handle a wide age range, from the toddler speaking his or her first words to mature adolescents, and all the conversation content variations and canned response variations must accommodate this broad range of users in order to enhance the longevity and commercial appeal of such a recognition toy. However, as already discussed, only limited memory and processing resources can be devoted to speech recognition if such a speech recognition device is to be cost effective and reasonably responsive. So, heretofore a trade-off between hardware costs and responsiveness versus interactivity has been observed in non-specific speaker voice recognizers.
It is, therefore, an object of the present invention to implement an interactive speech recognition method and apparatus that can perform natural-sounding conversations without increasing the number of pre-registered words or canned responses characteristic of conventional canned matching type speech recognition. Moreover, it is a further object of the present invention to incorporate recognition accuracy and features approaching non-specific speaker speech recognition in a device relatively simple in configuration, low in price, easily manufactured, and easily adaptable to suit changing needs and uses. It is yet a further object of the present invention to provide a highly capable, low-cost interactive speech recognition method and apparatus which can be applied to a wide range of devices such as toys, game machines and ordinary electronic devices.
It is still a further object of the present invention to improve non-specific speaker recognition rates for a wider range of voices than heretofore could be accommodated using conventional memory constructs. It is even a further object of the present invention that a wider range of conversation responses and detected phrases be accommodated on an as-needed basis.