The currently widespread availability of conference call services has provided a highly convenient alternative to face-to-face meetings for many business, educational and other purposes. Scheduling of such meetings can often be performed automatically through commonly available calendar applications for computers and work stations, while additional time for travel to a meeting location can be avoided entirely or reduced to travel to locally available facilities. In this latter regard, it is speculated that cost savings provided by conference call services are increasing at a substantial rate as the persons who may be involved in a given aspect of an enterprise, and who may need to hold such conferences (often referred to as teleconferences), become more geographically diverse and scattered throughout the world. By the same token, it becomes much more likely that a given participant in a given teleconference will speak with an accent that reduces the likelihood of being correctly understood. Such misunderstanding hinders the effectiveness of the teleconference and presents the possibility of generating incorrect or inconsistent information among teleconference participants.
While additional facilities for teleconferences such as visual aids in the form of drawings or slides and video capabilities are known and technically feasible where the conference is performed through networked computers or terminals, such capabilities may or may not be immediately available to all participants who may find it preferable or sometimes necessary to participate through wired or wireless telephone links that may or may not have display or non-voice interface capabilities. That is, while provision of graphic information and/or the image of a speaker during a teleconference may increase the likelihood of the speaker being correctly understood, such facilities may not be available to all participants and, in any event, do not fully answer the problem of a speaker being correctly understood by all teleconference participants, especially when a participant may speak with a particularly heavy accent.
More generally, incorporating the medium of speech into input and output interfaces for various devices, including data processing systems, has proven problematic for many years. Many sophisticated approaches have been attempted, with the degree of success limited largely by difficulties in accommodating heavily accented speech. Speech synthesizers, at the current state of the art, are widely used as output interfaces and, in many applications, are quite successful, although vocabulary is often quite limited and emulation of accents, while currently possible, is not normally provided. The more sophisticated types of speech synthesizers having relatively comprehensive vocabularies are referred to as text-to-speech (TTS) devices.
Developing speech responsiveness for use as an input interface, however, has proven substantially more difficult, particularly in regard to accommodating accents. Simple devices that must distinguish only a small number of commands and items of input information often require a given speaker to pronounce each of the words that are to be recognized, so that a spoken command or item of information can be matched against a recorded version of the pronunciation. More sophisticated voice recognition systems take a similar approach but at the level of personalized phonemes (e.g., phonemes as spoken by a given individual), which can then be stitched together to reconstruct words that can be recognized. As can be readily understood, such systems are highly processing-intensive if they must be able to recognize and differentiate a large vocabulary of words. Error rate reduction is extremely difficult in such systems due to variations in the sound of phonemes when pronounced together with other phonemes. Teleconferences present a particularly difficult application for either of these types of systems, since speakers who are widely distributed geographically, or who may have different cultural backgrounds and/or primary languages, will generally represent a wide variety of accents, while a large and esoteric vocabulary is likely to be used.
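By way of illustration only, the matching of a spoken command against a speaker's recorded pronunciations, as in the simple devices described above, might be sketched as follows. The sketch makes several assumptions that are not taken from the description: each utterance is reduced to a short sequence of numeric feature values, dynamic time warping is used as the comparison measure so that differences in speaking rate are tolerated, and the names `dtw_distance`, `recognize` and `ENROLLED_TEMPLATES` are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric feature sequences.

    DTW is an assumed comparison measure chosen for this sketch; it lets one
    sequence be stretched or compressed in time relative to the other.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances while b holds
                cost[i][j - 1],      # b advances while a holds
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the enrolled word whose recorded template is nearest."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical enrollment data: one recorded feature sequence per command,
# captured from the particular speaker who will later use the device.
ENROLLED_TEMPLATES = {
    "start": [1, 3, 5, 3, 1],
    "stop": [5, 4, 1, 1, 0],
}

# The same speaker saying "start" more slowly (some values repeated).
slow_start = [1, 1, 3, 5, 5, 3, 1]
best_match = recognize(slow_start, ENROLLED_TEMPLATES)
```

Because each template is tied to one speaker's own pronunciation, a sketch of this kind also suggests why such devices do not extend readily to unenrolled or heavily accented speakers: every stored template must be compared against each utterance, and a speaker whose pronunciation differs from all stored templates is simply assigned to the least distant one.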