Speech recognition techniques are known in the art. Many such techniques utilize models that are selected as a function of a polyphone network. For example, a network of phones (wherein a “phone” is generally understood to comprise a speech sound considered as a physical event without regard to its possible phonological status in the sound system of a language) that comprise at least the primary phonemes (wherein a “phoneme” is generally understood to comprise an abstract unit of the phonological system of a language that corresponds to a set of similar speech sounds that are perceived to be a single distinctive sound in the language) of a given spoken language (or specific dialect thereof) can comprise an acoustic foundation upon which one derives a set of models that a corresponding speech recognition engine can utilize to effect recognition of a sample of uttered speech.
Speech recognition platforms (such as, for example, a cellular telephone) that support recognition of multiple languages are also known. Unfortunately, current multilingual (and/or multi-dialect) automatic speech recognition technologies face a number of practical constraints that appear to significantly limit widespread commercial application of such platforms. One problem involves the cost of acquiring, and/or the relative availability of, relevant language resources and expertise. For example, while many hundreds of different acoustic phones are known to be utilized globally in one language or another, any given language (or dialect) tends to use as phonemes only a relatively small subset of this larger pool. The limited availability of the resources and technology-savvy expert knowledge required to linguistically characterize a language for speech recognition engineering purposes (including, but not limited to, specifically identifying those phonemes that appropriately characterize a given language) is an impediment to broad language coverage. Further, the time and expense of creating, finding, or otherwise acquiring appropriate acoustic speech data (with associated transcriptions, lexica, and so forth) of acceptable quality and quantity to permit training of speech models often make such endeavors commercially infeasible, especially for consumer populations that represent a relatively small speaker group and/or an emerging market.
Computational resource limitations present another problem. A common prior art approach combines a plurality of monolingual speech recognition systems to consolidate sufficient capability to support multiple languages and/or dialects. With such an approach, however, requirements for both the necessary language resources and computational resources increase substantially in proportion to the number of supported languages and/or dialects. The costs and/or form factor constraints associated with such needs can again influence designers away from including languages that correspond to smaller speaker populations and/or smaller present marketing opportunities.
To attempt to meet such problems, some effort has been made to consider sharing acoustic models across a plurality of languages and/or dialects. Such an approach typically requires alteration to the fundamental approach by which models are developed. For example, while models that result from the approaches noted earlier tend to preserve substantially all defined phonetic contrasts in a given set of targeted languages or dialects, such models also poorly support any attempt to exploit any cross-language allophonic coincidences that might also exist. In addition, the set of models is often too large and too finely differentiated for cost-effective and efficient multilingual or multi-dialect automatic speech recognition needs. One approach to sharing employs an acoustic feature-based data-driven method to achieve a kind of phone merger: acoustic models from a collection of monolingual speech recognition systems are compared and acoustically similar models are merged. Such an approach does tend to reduce total memory requirements to some extent, but it is also relatively indiscriminate with respect to the grouping of acoustically similar phones (for example, it tends to readily permit the merger of acoustically similar but phonologically contrastive models). This approach also fails in large measure to supply acoustic models of phones for which little or no data is conveniently available. One suggested alteration to this latter approach has been to constrain the acoustic data-driven method as a function of language knowledge. This alteration seeks to retain the reduced memory requirements while also attempting to address acoustic confusability of distinct phones. Unfortunately, however, this attempt at improvement still remains largely dependent upon the availability and quality of language data.
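By way of illustration only, the data-driven phone merger described above, and its knowledge-constrained variant, might be sketched as follows. This is a minimal sketch under stated assumptions and not the method of any particular prior art system: each acoustic model is reduced to a hypothetical mean feature vector, similarity is taken to be Euclidean distance against a hypothetical threshold, and the language knowledge is represented as a hypothetical set of phone pairs known to be phonologically contrastive.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_models(models, threshold, contrastive=frozenset()):
    """Greedily group acoustically similar phone models.

    models      -- dict mapping (language, phone) to a mean feature vector
                   (an assumed, highly simplified stand-in for a real model)
    threshold   -- maximum distance to a group centroid for a merge (assumed)
    contrastive -- set of frozenset({key_a, key_b}) pairs that language
                   knowledge says must not share a model (assumed encoding)
    Returns a list of groups, each a list of (language, phone) keys.
    """
    clusters = []
    for key, vec in models.items():
        for cluster in clusters:
            close = distance(vec, cluster["centroid"]) <= threshold
            blocked = any(
                frozenset((key, member)) in contrastive
                for member in cluster["members"]
            )
            if close and not blocked:
                # Update the running mean, then record the new member.
                n = len(cluster["members"])
                cluster["centroid"] = [
                    (c * n + v) / (n + 1)
                    for c, v in zip(cluster["centroid"], vec)
                ]
                cluster["members"].append(key)
                break
        else:
            clusters.append({"centroid": list(vec), "members": [key]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical toy data: English /p/ and /b/ lie close together acoustically.
toy_models = {
    ("en", "p"): [1.0, 1.0],
    ("en", "b"): [1.2, 1.1],
    ("es", "p"): [1.1, 1.0],
    ("en", "a"): [5.0, 5.0],
}

# Purely acoustic merging groups the contrastive English /p/ and /b/ together.
unconstrained = merge_models(toy_models, threshold=0.5)

# Supplying language knowledge keeps them apart while still sharing /p/
# across the two languages.
knowledge = {frozenset({("en", "p"), ("en", "b")})}
constrained = merge_models(toy_models, threshold=0.5, contrastive=knowledge)
```

In this toy example the unconstrained merge yields two groups and the constrained merge yields three, which reflects the trade-off noted above: the constraint avoids merging contrastive phones, but only to the extent that the contrast information is actually available.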
Another impediment to fielding a commercially useful result is the present practice of representing phone information with a transcription system that favors unique font sets and/or unusual characters. Such characters are typically non-alphanumeric, printable characters (such as, for example, ‘@’, ‘>’, ‘\’, ‘|’, ‘=’, ‘{’, ‘^’, ‘?’, ‘*’, ‘(’, and so forth) that have special control interpretations in one or more relevant computer command and control protocols and that are therefore commonly used for special purposes in command line scripting. For example, some characters that some transcription systems use to present phone information are also interpreted by Unix command line interpreters as specific Unix commands. This unfortunate proclivity further complicates the matter of attempting to provide a flexible, efficient polylingual speech recognition method and apparatus.
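The transcription problem noted above can be illustrated with a short check that flags phone symbols containing any of the characters listed in the preceding paragraph. This is a minimal sketch only; the function name and the sample phone inventory are hypothetical and are not drawn from any particular transcription system.

```python
# The characters listed above as having special control interpretations.
SPECIAL_CHARS = set("@>\\|={^?*(")


def risky_symbols(phone_symbols):
    """Map each phone symbol to the listed special characters it contains.

    Symbols that contain none of the listed characters are omitted.
    """
    flagged = {}
    for symbol in phone_symbols:
        hits = sorted(set(symbol) & SPECIAL_CHARS)
        if hits:
            flagged[symbol] = hits
    return flagged


# Hypothetical phone inventory mixing plain and problematic symbols.
inventory = ["a", "@", "r\\", "t_h", "?", "sh"]
flagged = risky_symbols(inventory)
```

Here three of the six hypothetical symbols are flagged, and each would need quoting or escaping before it could safely be passed through a command line script.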