The current trend in portable, e.g. hand-held, terminals drives the evolution strongly towards intuitive and natural user interfaces. In addition to text, images and sound (for example speech) can be recorded at a terminal either for transmission or to control a preferred local or remote (i.e. network-based) functionality. Moreover, payload information can be transferred over the cellular and adjacent fixed networks such as the Internet as binary data representing the underlying text, sound, images, and video. Modern miniature gadgets like mobile terminals or PDAs (Personal Digital Assistants) may thus carry versatile control input means such as a keypad/keyboard, a microphone, different movement or pressure sensors, etc., in order to provide the users thereof with a UI (User Interface) truly capable of supporting the greatly diversified data storage and communication mechanisms.
Notwithstanding the ongoing leap in communication and information technology, some more traditional data storage solutions such as dictating machines seem to maintain considerable usability value, especially in specialized fields such as law and medical sciences wherein documents are regularly created on the basis of verbal discussions and meetings, for example. It is likely that verbal communication is still the fastest and most convenient method of expression for most people, and by dictating a memo instead of typing it considerable time savings can be achieved. This issue also has a language-dependency aspect; writing Chinese or Japanese is obviously more time-consuming than writing most of the western languages, for example. Further, dictating machines and modern counterparts thereof, like sophisticated mobile terminals and PDAs with a sound recording option, can be cleverly utilized in conjunction with other tasks, for example while having a meeting or driving a car, whereas manual typing normally requires a major part of the executing person's attention and definitely cannot be performed while driving a car, etc.
Until the last few years, though, dictation apparatuses have not served all the public needs so well; information may admittedly be easily stored even in real-time by just recording the speech signal via a microphone, but often the final archive form is textual and someone, e.g. a secretary, has been ordered to manually clean up and convert the recorded raw sound signal into a final record in a different medium. Such an arrangement unfortunately requires a lot of additional (time-consuming) conversion work. Another major problem associated with dictation machines arises from their analogue background and simplistic UI; modifying already stored speech is cumbersome, and with many devices still utilizing magnetic tape as a storage medium certain edit operations, like inserting a completely new speech portion within the originally stored signal, cannot be done. Meanwhile, modern dictation machines utilizing memory chips/cards may comprise limited speech editing options, but these are still available only through a rather awkward UI comprising only a minimum size and quality LCD (Liquid Crystal Display) screen, etc. Transferring stored speech data to another device often requires manual intervention, i.e. the storage medium (cassette/memory card) must be physically moved.
Computerized speech recognition systems have been available to a person skilled in the art for a while now. These systems are typically implemented as application-specific internal features (embedded in a word processor, e.g. Microsoft Word XP version), stand-alone applications, or application plug-ins to an ordinary desktop computer. The speech recognition process involves a number of steps that are basically present in all existing algorithms; see FIG. 1 for illustration. Namely, the speech emitted by a speaking person is first captured 102 via a microphone or a corresponding transducer and converted into digital form with necessary pre-processing 104 that may refer to dynamics processing, for example. Then the digitized signal is input to a speech recognition engine 106 that divides the signal into smaller elements like phonemes based on sophisticated feature extraction and analysis procedures. The recognition software can also be tailored 108 to each user, i.e. software settings are user-specific. Finally, the recognized elements forming the speech recognition engine output, e.g. control information and/or text, are used as an input 110 for other purposes; the output may simply be shown on the display, stored to a database, translated into another language, used to execute a predetermined functionality, etc.
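The ordering of the steps of FIG. 1 can be illustrated with a minimal sketch. It is not an actual recognition algorithm: the names `UserProfile`, `preprocess`, `toy_engine` and `run_pipeline` are hypothetical, the engine is a trivial stand-in that merely labels samples, and the gain-and-clip operation is only one assumed example of dynamics processing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Samples = List[float]


@dataclass
class UserProfile:
    # Hypothetical user-specific settings (step 108); the fields are assumptions.
    gain: float = 1.0
    corrections: Dict[str, str] = field(default_factory=dict)


def preprocess(samples: Samples, profile: UserProfile) -> Samples:
    # Step 104: crude dynamics processing, apply gain and clip to [-1, 1].
    return [max(-1.0, min(1.0, s * profile.gain)) for s in samples]


def toy_engine(samples: Samples) -> List[str]:
    # Step 106 stand-in: labels each sample instead of extracting phonemes.
    return ["loud" if abs(s) >= 0.5 else "quiet" for s in samples]


def run_pipeline(
    captured: Samples,
    engine: Callable[[Samples], List[str]],
    profile: UserProfile,
    sink: Callable[[str], None],
) -> str:
    # `captured` represents the digitized microphone signal of step 102.
    conditioned = preprocess(captured, profile)
    elements = engine(conditioned)
    # Step 108: apply the user-specific substitutions to the engine output.
    text = " ".join(profile.corrections.get(e, e) for e in elements)
    # Step 110: hand the result to whatever consumes it (display, database, ...).
    sink(text)
    return text


profile = UserProfile(gain=2.0, corrections={"quiet": "soft"})
delivered: List[str] = []
result = run_pipeline([0.1, 0.3, -0.9], toy_engine, profile, delivered.append)
```

Passing the engine and the output sink as parameters reflects that the surrounding steps stay the same regardless of which recognition engine is used or what the output is used for.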
Publication U.S. Pat. No. 6,266,642 discloses a portable unit arranged to perform spoken language translation in order to ease communication between two entities having no common language. Either the device itself contains all the necessary hardware and software for executing the whole translation process or it merely acts as a remote interface that initially funnels, by utilizing either a telephone or a videoconference call, the input speech into the translation unit for processing, and later receives the translation result for local speech synthesis. The solution also comprises a processing step during which speech misrecognitions are minimized by creating a number of candidate recognitions or hypotheses from which the user may, via a UI, select the correct one or just confirm the predefined selection.
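The hypothesis selection step described above can be sketched as follows. The sketch assumes that each candidate recognition carries a numeric confidence score and that the predefined selection is the highest-scoring candidate; neither assumption is taken from the cited publication, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A candidate recognition paired with an assumed confidence score.
Hypothesis = Tuple[str, float]


def rank_hypotheses(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    # Order the candidates from most to least likely.
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)


def resolve(hypotheses: List[Hypothesis], user_choice: Optional[int] = None) -> str:
    # Return the text the user picked via the UI, or the predefined (top-ranked)
    # candidate when the user merely confirms without choosing.
    ranked = rank_hypotheses(hypotheses)
    if not ranked:
        raise ValueError("no candidate recognitions available")
    index = 0 if user_choice is None else user_choice
    if not 0 <= index < len(ranked):
        raise IndexError("user choice is outside the candidate list")
    return ranked[index][0]


candidates = [("wreck a nice beach", 0.41), ("recognize speech", 0.52)]
confirmed = resolve(candidates)       # user confirms the predefined selection
overridden = resolve(candidates, 1)   # user picks the second-ranked candidate
```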
Despite the many advances the aforementioned and other prior art arrangements suggest for overcoming difficulties encountered in speech recognition and/or machine translation processes, some problems remain unsolved, especially in relation to mobile devices. Problems associated with traditional dictation machines were already described hereinbefore. Further, mobile devices such as mobile terminals or PDAs are often relatively small-sized and light apparatuses that cannot include a large-sized display, versatile UI, top-notch processing capability/memory capacity or the highest-speed transceiver available, which are commonly present in many bigger devices such as desktop computers. Such features, although not absolutely necessary for carrying out the original purpose of a portable device, i.e. transferring speech or storing calendar and other personal information, would be beneficial from the viewpoint of speech-to-text conversions and machine translation. Data format conversions and translations are computationally intensive and consume a lot of memory space; these factors will inevitably also lead to higher battery consumption, causing a more general usability problem with mobile devices that should advantageously always work independently of location. Still further, the existing information transfer capabilities of the mobile devices may be insufficient for transferring all the necessary data to and from the remote station right from the beginning, or the transfer capacity, although in theory adequate, may not be available for the purpose in full scale due to other ongoing information transfer with higher priority, for example.