Conventional computer workstations use a “desktop” interface to organize information on a single display screen. Users of such workstations may work with multiple documents on the desktop interface. However, many users find it more desirable to organize multiple documents on a work surface than to focus attention on multiple windows on a fixed display screen. A physical work surface is usually much larger than the screen display, and multiple documents may be placed on the surface for quick access and review. For many users, it is more natural and convenient to look for a desired document by visual inspection in a physical workspace than to click on one of a number of windows on a display screen. Because it is desirable to shift positions throughout a long workday, many users also prefer to physically pick up a document and review it in a reclining position instead of remaining in a fixed, upright seated position to review the document on a fixed display screen.
In an article entitled “The Computer for the 21st Century,” which appeared in the September 1991 issue of Scientific American, Mark Weiser has proposed “electronic pads” that may be spread around on a desk like conventional paper documents. These pads are intended to be “scrap computers”, analogous to scrap paper and having no individualized identity or importance. The pads can be picked up and used anywhere in the work environment to display an electronic document and receive freehand input with a stylus.
An article in the Sep. 17, 1993 issue of Science included a further description of Weiser's ideas. In the article, electronic pads are characterized as thin note pads with a flat screen on top that a user could scribble on. The pads are intended to be as easy to use as paper. Instead of constantly opening and closing applications in the on-screen windows of a single desktop machine, a user could stack the pads around an office like paper and have all of them in use for various tasks at the same time.
The need has long been felt to generate text data (e.g., character codes in a word processor) by voice input, and to integrate voice input with pen-based computers. An article headlined “Pen and Voice Unite” in the October 1993 issue of Byte Magazine, for example, describes a system for responding to alternating spoken and written input. In the system described, the use of pen and voice alternates as the user either speaks or writes to the system. The article states that using a pen to correct misrecognitions as they occur could make an otherwise tedious dictation system acceptable. According to the article, misrecognized words could be simply crossed out. The article suggests, however, that a system in which the use of pen and voice alternates as the user either speaks or writes to the system is less interesting than possibilities that could arise from simultaneous speech and writing.
The effectiveness and convenience of generating text data by voice input depend directly on the accuracy of the speech recognizer. Frequent misrecognitions make generation of text data tedious and slow. To minimize the frequency of misrecognitions, it is desirable to provide voice input of relatively high acoustic quality to the speech recognizer. A signal with high acoustic quality has relatively consistent reverberation and amplitude, and a high signal-to-noise ratio. Such quality is conventionally obtained by receiving voice input via a headset microphone. A microphone mounted on a headset may be held in close, fixed proximity to the mouth of a person dictating voice input and thus receive voice input of relatively high acoustic quality.
Head-mounted headsets have long been viewed with disfavor for use with speech recognition systems, however. In a 1993 article entitled “From Desktop Audio to Mobile Access: Opportunities for Voice in Computing,” Christopher Schmandt of MIT's Media Laboratory concludes that “obtaining high-quality speech recognition without encumbering the user with head mounted microphones is an open challenge which must be met before recognition will be accepted for widespread use.” In the Nov. 7, 1994 issue of Government Computer News, an IBM marketing executive in charge of IBM speech and pen products is quoted as saying that, inter alia, a way must be found to eliminate head-mounted microphones before natural-language computers become pervasive. The executive is further quoted as saying that the headset is one of the biggest inhibitors to the acceptance of voice input.
Being tethered to a headset is especially inconvenient when the user needs to frequently walk away from the computer. Even a cordless headset can be inconvenient. Users may feel self-conscious about wearing such a headset while performing tasks away from the computer.
The long-felt need for a system incorporating speech recognition capability with pen-based computers has remained largely unfulfilled. Such a system presents conflicting requirements for high acoustic quality and freedom of movement. Pen-based computers are often used in collaborative environments having high ambient noise and extraneous speech. In such environments, the user may be engaging in dialogue and gesturing in addition to dictation. Although a headset-mounted microphone would maintain acoustic quality in such situations, the headset is inconsistent with the freedom of movement desired by users of pen-based computers.
It has long been known to incorporate a microphone into a stylus for the acquisition of high-quality speech in a pen-based computer system. For example, the June 1992 edition of IBM's Technical Disclosure Bulletin discloses such a system, stating that speech is a way of getting information into a computer, either for controlling the computer or for storing and/or transmitting the speech in digitized form. However, the Bulletin states that such a system is not intended for applications where speech and pen input are required simultaneously or in rapid alternation.
IBM's European patent application EP 622724, published in 1994, discloses a microphone in a stylus for providing speech data to a pen-based computer system. According to the application, speech data output from the microphone and received at the pen-based computer system can be recognized by suitable voice recognition programs in the system to produce operational data and control information for the application programs running in the computer system. See, for example, column 18, lines 35-40 of EP 622724.
To provide an example of a “suitable voice recognition program,” the EP 622724 application refers to U.S. patent application Ser. No. 07/968,097, which matured into U.S. Pat. No. 5,425,129 to Garman. At column 1, lines 54-56, the '129 patent states the object of providing a speech recognition system with word spotting capability, which allows the detection of indicated words or phrases in the presence of unrelated phonetic data.
Due to the shortcomings of available references such as the aforementioned June 1992 IBM Technical Disclosure Bulletin, EP 622724 European application, and 07/968,097 U.S. application, the need remains to integrate high-quality voice input for dictation with pen-based computers. This need is especially unfulfilled for a system that includes voice dictation capability with a number of electronic pads. Such a system would need to maintain acoustic quality while providing voice input to a desired one of the pads, which are spread out on a work surface, possibly in a noisy collaborative environment, at varying distances from the source of voice input.
The need also remains for a speech recognizer that may be automatically activated when voice input is desired. U.S. Pat. No. 5,857,172 to Rozak, for example, identifies a difficulty encountered with conventional speech recognizers in that such speech recognizers are either always listening and processing input or not listening. When such a speech recognizer is active and listening, all audio input is processed, even undesired audio input in the form of background noise and inadvertent comments by a speaker. As discussed above, such undesired audio input may be expected to be especially problematic in collaborative environments where pen-based computers are often used. The '172 patent discloses a manual activation system having a designated hot region on a video display for activating the speech recognizer. Using this manual system, the user activates the speech recognizer by positioning a cursor within the hot region. It would be desirable, however, for a speech recognizer to be automatically activated without the need for a specific selection action by the user.
Speech recognition may be viewed as one possible subset of message recognition. As defined in Merriam-Webster's WWWebster Dictionary (Internet edition), a message is “a communication in writing, in speech, or by signals.” U.S. Pat. No. 5,502,774, issued Mar. 26, 1996 to Bellegarda, discloses a “message recognition system” using both speech and handwriting as complementary sources of information. Speech recognizers and handwriting recognizers are both examples of message recognizers.
A need remains for message recognition with high accuracy. Misrecognitions by conventional message recognizers, when frequent, make generation of text data tedious and slow. Conventional language, acoustic, and handwriting models respond to user training (e.g., corrections of misrecognized words) to adapt and improve recognition with continued use.
In conventional message recognition systems, user training is tied to a specific computer system (e.g., hardware and software). Such training adapts language, acoustic, and/or handwriting models that reside in data storage of the specific computer system. The adapted models may be manually transferred to a separate computer system, for example, by copying files to a magnetic storage medium. The inconvenience of such a manual operation limits the portability of models in a conventional speech recognition system. Consequently, a user of a second computer system is often forced to duplicate training that was performed on a first computer system. In addition, rapid obsolescence of computer hardware and software often renders adapted language, acoustic, and/or handwriting models unusable after the user has invested a significant amount of time to adapt those models.
Conventionally, user training is also tied to a specific type of message recognizer. Correction of misrecognized words typically improves only the accuracy of the particular type of message recognizer being corrected. U.S. Pat. No. 5,502,774 to Bellegarda discloses a multi-source message recognizer that includes both a speech recognizer and a handwriting recognizer. The '774 patent briefly states, at column 4, lines 54-58, that the training of respective source parameters for speech and handwriting recognizers may be done globally using weighted sum formulae of likelihood scores with weighted coefficients. However, the need remains for a multi-source message recognizer that can apply user correction to appropriately perform training for multiple types of recognition, by applying the correction of a message misrecognition only to those models employed in the type of message recognition to which the training is directly relevant.
Conventional message recognition systems include a language model and a type-specific model (e.g., acoustic or handwriting). As discussed in U.S. Pat. No. 5,839,106, issued Nov. 17, 1998 to Bellegarda, conventional language models rely upon the classic N-gram paradigm to define the probability of occurrence, within a spoken vocabulary, of all possible sequences of N words. Given a language model consisting of a set of a priori N-gram probabilities, a conventional speech recognition system can define a “most likely” linguistic output message based on an acoustic input signal. As the '106 patent points out, however, the N-gram paradigm does not contemplate word meaning, and limits on available processing and memory resources preclude the use of models in which N is made large enough to incorporate global language constraints. Accordingly, models based purely on the N-gram paradigm can utilize only limited information about the context of words to enhance recognition accuracy.
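By way of illustration only, the N-gram paradigm may be sketched for the case N=2 (a bigram model). The sketch below is not taken from the '106 patent; the function name train_bigram, the start-of-sentence marker, the toy corpus, and the use of unsmoothed maximum-likelihood counts are all assumptions made for the example. It shows that the probability assigned to a word depends only on the immediately preceding word, which is the limited context noted above.

```python
from collections import Counter


def train_bigram(sentences):
    """Return prob(word, prev): an unsmoothed maximum-likelihood bigram estimate.

    Hypothetical helper for illustration; real recognizers use smoothing
    and far larger corpora.
    """
    context_counts = Counter()  # how often each word appears as a predecessor
    pair_counts = Counter()     # how often each (predecessor, word) pair appears
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(word, prev):
        # P(word | prev) = count(prev, word) / count(prev); 0.0 for unseen contexts
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob


# Toy corpus (assumed data, not from any cited reference)
corpus = ["recognize speech", "recognize speech now", "wreck a nice beach"]
p = train_bigram(corpus)

likely = p("speech", "recognize")    # pair seen every time "recognize" occurs
unlikely = p("beach", "recognize")   # pair never seen in the corpus
```

In this sketch the model prefers “speech” after “recognize” purely because of co-occurrence counts; nothing about the meaning of either word, or about words earlier in the document, enters the estimate.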
Semantic approaches to language modeling have been disclosed. Generally speaking, such techniques attempt to capture meaningful word associations within a more global language context. For example, U.S. Pat. No. 5,828,999, issued Oct. 27, 1998 to Bellegarda et al., discloses a large-span semantic language model which maps words from a vocabulary into a vector space. After the words are mapped into the space, vectors representing the words are clustered into a set of clusters, where each cluster represents a semantic event. After clustering the vectors, a first probability that a first word will occur (given a history of prior words) is computed. The probability is computed by calculating a second probability that the vector representing the first word belongs to each of the clusters; calculating a third probability of each cluster occurring in a history of prior words; and weighting the second probability by the third probability.
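The weighting step described above may be sketched as follows, purely as an illustration. The sketch assumes that the per-cluster products are summed over all clusters to obtain the first probability; the function name word_probability and the numeric values are hypothetical, and the manner in which the '999 patent derives each probability from the vector space is not reproduced here.

```python
def word_probability(word_given_cluster, cluster_given_history):
    """Combine per-cluster probabilities into one word probability.

    word_given_cluster[k]    : the "second probability" for cluster k
    cluster_given_history[k] : the "third probability" for cluster k
    Assumption: the weighted terms are summed over clusters.
    """
    if len(word_given_cluster) != len(cluster_given_history):
        raise ValueError("one value per cluster is required in each list")
    return sum(
        p_word * p_cluster
        for p_word, p_cluster in zip(word_given_cluster, cluster_given_history)
    )


# Hypothetical values for three semantic clusters (assumed, not from the patent)
second = [0.20, 0.01, 0.05]  # word's association with each cluster
third = [0.7, 0.2, 0.1]      # each cluster's likelihood given the prior words

first = word_probability(second, third)  # 0.20*0.7 + 0.01*0.2 + 0.05*0.1
```

Under these assumed values, the cluster most strongly suggested by the history contributes most of the resulting probability, which is how a large-span history can favor words that an N-gram window alone would not.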
The '106 patent discloses a hybrid language model that is developed using an integrated paradigm in which latent semantic analysis is combined with, and subordinated to, a conventional N-gram paradigm. The integrated paradigm provides an estimate of the likelihood that a word, chosen from an underlying vocabulary, will occur given a prevailing contextual history.
A prevailing contextual history may be a useful guide for performing general semantic analysis based on a user's general writing style. However, a specific context of a particular document may be determined only after enough words have been generated in that document to create a specific contextual history for the document. In addition, the characterization of a contextual history by a machine may itself be inaccurate. For effective semantic analysis, the need remains for accurate and early identification of context.