The present invention relates to apparatus and methods for speech to text conversion using automatic speech recognition, and has various aspects.
Automatic speech recognition, as such, is known from, for example, "Automatic Speech Recognition" by Kai-Fu Lee, Kluwer Academic Publishers 1989.
Conventional known systems for converting speech to text involving automatic speech recognition are desktop stand-alone systems, in which each user needs his or her own system. Such known speech to text conversion systems have been produced by such companies as International Business Machines, Kurzweil Applied Intelligence Inc and Dragon Systems.
When performing automatic speech recognition (ASR), adaptation is known to improve system performance. Adaptation is a mathematical process in which descriptive models are fine-tuned. In particular, speaker adaptation adapts models to better fit the speech characteristics of the speaker, and language adaptation adapts models to the word usage of the speaker.
When performing ASR adaptation, performance is judged by the accuracy of the resulting text and the time required to produce it. Improving performance is primarily related to improving accuracy, though improvement is also achieved when the required computation time is reduced.
Known systems for Automatic Speech Recognition (ASR) model the acoustical patterns of speech and the word patterns of the language used. Although speech recognition is performed using both speech and language models within a statistical framework, the two are constructed independently.
Acoustical modelling captures the nature of different sounds. A word can be described, via a pronouncing dictionary, as some combination of these sounds.
Language modelling captures the likelihood that a given word occurs in some context. It is necessary, in practice, to compile statistical likelihoods from large amounts of data collected over time.
Language models are adapted using millions of words, so the occasional or even regular dictation of a single individual would take a very long time to yield any benefit.
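By way of illustration only, the compilation of word likelihoods from collected text can be sketched as a relative-frequency estimate over word pairs. The function name, the sentence-start marker and the three-sentence corpus are assumptions made for this sketch; a practical language model would be compiled from far more data and would use smoothing.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical sentence-start marker.
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy corpus; a real system would compile statistics from millions of words.
corpus = [
    "the patient was admitted",
    "the patient was discharged",
    "the court was adjourned",
]
probs = bigram_probabilities(corpus)
```

With so little data the estimates are coarse, which is the difficulty noted above for adaptation from a single individual's dictation.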
Known ASR systems use pattern matching and other known techniques:
(1) to match acoustic speech patterns with sub-word units (typically phoneme related),
(2) to associate sub-word vectors with orthographic words (using a pronouncing dictionary),
(3) to represent and exploit the likelihood that a particular word will occur given its location relative to other surrounding words,
(4) to search for the best text sequence by examining all possible word sequences and selecting the one which best accords with the given acoustic utterance and the knowledge expressed in (1), (2) and (3) above.
Known ASR systems decode an acoustic pattern into a word sequence by appropriate use of this information. Adapting the recognition system requires both acoustic (sub-word parameter, as in (1)) and language (word usage statistics, as in (3)) adaptation. A pronouncing dictionary, as in (2), is usually static except that new words, i.e. those encountered in real use but absent from the system dictionary, must be added to it.
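The search in step (4) can be illustrated by a toy sketch that scores every candidate word sequence with an acoustic term and a language term and keeps the best. All scores, candidate words and names below are invented for illustration, and the pronouncing dictionary of step (2) is omitted. Real systems do not enumerate sequences exhaustively; they use pruned searches.

```python
import itertools
import math

# Hypothetical acoustic log-likelihoods for the candidates in each position.
acoustic = [
    {"recognise": math.log(0.6), "wreck a nice": math.log(0.4)},
    {"speech": math.log(0.5), "beach": math.log(0.5)},
]

# Hypothetical language-model probabilities for adjacent candidates.
language = {
    ("recognise", "speech"): 0.9,
    ("recognise", "beach"): 0.1,
    ("wreck a nice", "speech"): 0.2,
    ("wreck a nice", "beach"): 0.8,
}


def best_sequence(acoustic, language, floor=1e-6):
    """Return the sequence with the highest combined log score."""
    best, best_score = None, -math.inf
    for seq in itertools.product(*(slot.keys() for slot in acoustic)):
        # Acoustic evidence for each chosen candidate.
        score = sum(slot[word] for slot, word in zip(acoustic, seq))
        # Language evidence for each adjacent pair; unseen pairs get a floor.
        score += sum(
            math.log(language.get(pair, floor)) for pair in zip(seq, seq[1:])
        )
        if score > best_score:
            best, best_score = seq, score
    return best


result = best_sequence(acoustic, language)
```

Here the acoustic evidence alone cannot separate "speech" from "beach"; the language term decides between them.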
Known speech recognition technology is based on sub-word modelling. This requires each word to have a known pronunciation. Given that pronunciation, any word can be assimilated into a recognition system. In practice, words will occur for which no pronunciation is known in advance. So-called "Text-To-Speech" technology exists to invent a plausible pronunciation. However, such techniques are complicated and can be inaccurate, and they involve considerable hand-crafting effort.
The correct transcription of an audio recording to be used for adapting the ASR system is a word-for-word verbatim text transcript of the content of that speech recording.
The transcripts returned from audio typists may not match the word-for-word speech. For example, embedded instructions may have been interpreted (e.g. delete that sentence), information inserted (e.g. insert date), stylisation applied (date and number format), or obvious mistakes corrected (e.g. "U.S. President Abram Lincoln" might be manually corrected to "Abraham Lincoln"). When applied in ASR adaptation these variations between the speech and corrected text can cause errors.
Known ASR systems use speaker independent acoustic modelling. The models can be adapted through usage to improve the performance for a given speaker. Speaker dependent models are unpopular, because they require a user to invest time (usually one or two hours) before he or she can use the ASR system.
In a first aspect, the present invention relates to a speech to text convertor comprising a plurality of user terminals for recording speeches, at least one automatic speech recognition processor, and communication means operative to return the resulting texts to the respective user, in which at least one automatic speech recognition processor is adapted to improve recognition accuracy using data of the recorded speeches and the resulting texts, the data being selected dependent upon subject matter area.
This advantageously provides subject-matter area specific adaptation whereby data from previous users in a subject matter area is used to improve performance of automatic speech recognition processors for subsequent users in that subject matter area.
New users benefit from previous adaptation using data according to their subject matter area. Both occasional and regular users benefit from adaptation using data from others in their subject matter area.
Data for adaptation is preferably accumulated by pooling according to subject matter area prior to adaptation. In particular, given, say, hundreds or thousands of users over time but far fewer subject matter areas (say five or ten), data for adaptation is quickly accumulated by pooling according to subject matter area.
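A minimal sketch of such pooling is given below, assuming each record carries a subject matter label. The field names ("subject", "speech_file", "text") and the sample records are hypothetical.

```python
from collections import defaultdict


def pool_by_subject(records):
    """Group (speech file, text) pairs by subject matter area."""
    pools = defaultdict(list)
    for record in records:
        pools[record["subject"]].append(
            (record["speech_file"], record["text"])
        )
    return dict(pools)


# Hypothetical records from three different users.
records = [
    {"user": "u1", "subject": "legal", "speech_file": "a.wav", "text": "..."},
    {"user": "u2", "subject": "medical", "speech_file": "b.wav", "text": "..."},
    {"user": "u3", "subject": "legal", "speech_file": "c.wav", "text": "..."},
]
pools = pool_by_subject(records)
```

Because many users map onto few subject matter areas, each pool grows much faster than any single user's data.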
The subject matter areas can be various disciplines, such as legal, medical, electrical, accounting, financial, scientific and chemical subject matter areas; also personal correspondence and general business.
Preferably, language models are adapted dependent on which subject matter area they are used for, using data from that subject matter area. New words which occur in a subject matter area are acquired by providing a language model for each new word, which is subsequently adapted. The probabilities of word occurrences dependent on subject matter area are learnt and used for improved automatic speech recognition accuracy.
Preferably, each recorded speech has an indicator of subject matter area and the selection of data for adaptation is dependent upon the indicator. This indicator can be provided by the user or determined and applied subsequently.
Preferably, the data for adaptation can be selected dependent not only on subject matter area but also on the user's accent grouping. This can further improve accuracy of automatic speech recognition.
In a second aspect, the present invention relates to a speech to text convertor comprising a plurality of user terminals for recording speeches, at least one automatic speech recognition processor, and communication means operative to return the resulting texts to the respective user, in which at least one automatic speech recognition processor is adapted to improve recognition accuracy using data of the recorded speeches and the resulting texts, the data being selected dependent upon accent group.
Accent group specific adaptation advantageously enables data from previous users in an accent group to be used to improve performance of automatic speech recognition processors for subsequent users belonging to the same accent group. In particular, as a result of previous adaptation, acoustic models are closer to the new user's speech, giving improved performance.
Data for adaptation is preferably accumulated by pooling according to accent group prior to adaptation.
The accent groups can refer to country, region and/or city, e.g. United Kingdom, United States or any other specific accents or sub-accents.
Preferably, acoustic models are adapted dependent on which accent group they are used for, using data from that accent group.
Preferably, each recorded speech has an indicator of accent group. This indicator can be provided by the user or determined and applied subsequently.
Preferably, the data for adaptation can be selected dependent not only on accent grouping but also on subject matter area or other feature. This can further improve accuracy of automatic speech recognition.
In a third aspect, the present invention relates to a speech to text convertor comprising a plurality of user terminals for recording speeches, at least one automatic speech recognition processor, and communication means operative to return the resulting texts to the respective user, in which at least one automatic speech recognition processor is adapted to improve recognition performance using data of the recorded speeches and the resulting texts selected from more than one user.
The data is preferably selected from multiple users.
Each recorded speech preferably has an indicator and the data is selected dependent on the indicator.
An indicator comprises information about the recorded speech with which it is associated. The information can comprise the user's company, address and/or identity. The information can comprise information of the user's expected usage, such as the user's subject matter and/or identity of the user terminal used. The user terminal can be a telephone or microphone. The information can comprise information known about the user, for example from previous questioning, such as gender. The information can comprise processing instructions such as output format of resulting text and/or urgency rating.
The data is preferably recorded such that data for adaptation can be selected from all the recorded speeches and resulting texts from a user. Data can preferably also be selected from other recorded speeches and/or texts made by the user when not using the speech to text convertor.
The adaptation is preferably performed in a hierarchical manner. By requiring one or more indicated properties, data can be selected in various ways for adaptation. In particular, after a first adaptation, an additional indicated property or indicated properties can be required when selecting data for further adaptation. For example, first a particular accent country then also an accent region can be required; or for example, first accent group then also subject matter area can be required.
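The hierarchical selection described above can be sketched as successive filtering on indicated properties. The property names ("accent_country", "accent_region") and the sample records are assumptions made for illustration.

```python
def select_data(records, **required):
    """Keep only records whose indicators match every required property."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in required.items())
    ]


# Hypothetical indicator records attached to recorded speeches.
records = [
    {"id": 1, "accent_country": "UK", "accent_region": "Scotland"},
    {"id": 2, "accent_country": "UK", "accent_region": "Wales"},
    {"id": 3, "accent_country": "US", "accent_region": "Texas"},
]

# First adaptation: require the accent country only.
first_pass = select_data(records, accent_country="UK")
# Further adaptation: additionally require the accent region.
second_pass = select_data(first_pass, accent_region="Scotland")
```

Each further pass narrows the data, so the broader pass supplies volume while the narrower pass supplies specificity.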
The invention, in its various aspects, has the advantage that improvements in automatic speech recognition performance are shared between users, in particular by adapting the automatic speech recognition processors dependent on previous users' subject matter area and/or accent grouping.
As regards the invention in all its aspects:
When speech is recorded, associated identifiers of the user's identity and/or accent group and/or subject matter area are also stored. The identifiers can be selected by the user. In particular, the identifiers can be selected from predefined lists, for example, using a mouse or arrow-keys on a user's terminal to select from pull-down type lists. The last identifiers selected by the user can be stored as a future preference.
Preferably, upon the recorded speech being received by one of said automatic speech recognition processors, the identifier of accent group is used to select the acoustic models to be applied in automatic speech recognition and/or the identifier of subject matter area is used to select the language models to be applied in automatic speech recognition.
Preferably, the identifier of the user's identity can also be used to select the acoustic models to be applied. This has the advantage of further improving the accuracy of automatic speech recognition.
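One possible way of using the stored identifiers to select models is sketched below. The dictionary layout, the key names and the fallback order (user-specific acoustic models first, then accent group, then a default) are assumptions of this sketch rather than features stated above.

```python
def select_models(identifiers, acoustic_models, language_models):
    """Pick acoustic and language models from the stored identifiers."""
    user = identifiers.get("user_id")
    accent = identifiers.get("accent_group")
    subject = identifiers.get("subject_area")
    # Assumed preference: user-specific, then accent group, then default.
    acoustic = (
        acoustic_models.get(("user", user))
        or acoustic_models.get(("accent", accent))
        or acoustic_models["default"]
    )
    language = language_models.get(subject, language_models["default"])
    return acoustic, language


# Hypothetical model stores; the string values stand in for real models.
acoustic_models = {
    "default": "acoustic-generic",
    ("accent", "UK"): "acoustic-uk",
    ("user", "u42"): "acoustic-u42",
}
language_models = {"default": "lm-general", "legal": "lm-legal"}

new_user = {"user_id": "u7", "accent_group": "UK", "subject_area": "legal"}
known_user = {"user_id": "u42", "accent_group": "UK", "subject_area": "tax"}
```

A new user with no personal models still receives accent-specific and subject-specific models in this arrangement.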
Preferably, said at least one user terminal is remote from said at least one automatic speech recognition processor. Preferably, the speech to text convertor includes a server remote from said at least one user terminal, the server being operative to control transfer of recorded speech files to a selected automatic speech recognition processor.
Preferably, the or each user terminal communicates the recorded speech files to the remote server by electronic mail.
The term "electronic mail" is intended to include Internet "File Transfer Protocol" and "World Wide Web".
The text files resulting from automatic speech recognition are preferably sent to correction units. The correction units are preferably remote from the automatic speech recognition processors. Communications from the automatic speech recognition processors to each correction unit are preferably undertaken under the control of the server, and preferably by electronic mail. The correctors are preferably remotely distributed.
The corrector units can preferably communicate to said at least one user terminal by electronic mail.
The corrector unit preferably includes a visual display unit for display of the text and a manual interface, such as a keyboard and/or mouse and/or a foot pedal control, usable to then select text portions.
Correction is effected by the manual operator. The corrections can be recorded and transmitted back to the automatic speech recognition processor which undertook the automatic speech recognition for adaptation of the operation of the automatic speech recognition processor. These corrections are preferably sent by electronic mail. The adaptation has the effect of making the automatic speech recognition more accurate in future processing.
Data of recorded speeches and resulting texts are preferably screened for mismatches before adaptation. Speech words without corresponding text words are not used for adaptation. Mismatches are determined automatically when speech words and resulting text words, which are those after correction, do not satisfy a predetermined set of grammatical rules with sufficient accuracy.
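The predetermined grammatical rules are not detailed here. As a stand-in only, the sketch below aligns the recognised words with the corrected text using Python's difflib and keeps just the word runs present in both, so that spoken words with no corresponding text word are excluded. The function name and the example sentences are hypothetical.

```python
import difflib


def matched_word_spans(recognised, corrected):
    """Return runs of recognised words that also appear in the corrected text."""
    spoken = recognised.lower().split()
    written = corrected.lower().split()
    matcher = difflib.SequenceMatcher(None, spoken, written, autojunk=False)
    return [
        spoken[block.a:block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size > 0  # skip the zero-length terminator block
    ]


# The embedded instruction was spoken but does not appear in the final text.
recognised = "dear sir delete that sentence thank you for your letter"
corrected = "dear sir thank you for your letter"
usable = matched_word_spans(recognised, corrected)
```

Only the matched runs would then be passed on for adaptation; the spoken instruction "delete that sentence" is dropped.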
The recorded speech can be sent to the selected correction unit for correction of the text file resulting from automatic speech recognition. The server can control this selection. The choice of correction unit can depend on the accent of the speaker of the recorded speech; in particular, the files can be sent to a correction unit in an area where that accent is familiar, or to a correction unit where the particular human corrector is familiar with that accent.
If the resulting correct text includes text words not recognised by the automatic speech recognition processor, pronunciation dictionary entries can be created for them. The pronunciation dictionary entries are preferably created using text to phoneme conversion rules. Text words not previously recognised are identified by comparing each text word with those in a database of words, preferably stored in the automatic speech recognition processor.
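Identification of previously unrecognised words can be sketched as a comparison of each corrected text word against the stored word database. The tokenisation rule and all names below are assumptions. The text to phoneme conversion itself is not shown, since its rules are not specified here.

```python
import re


def find_new_words(corrected_text, known_words):
    """List words in the corrected text that are absent from the database."""
    # Simplistic tokenisation (an assumption): letters and apostrophes only.
    words = re.findall(r"[a-z']+", corrected_text.lower())
    new_words = []
    for word in words:
        if word not in known_words and word not in new_words:
            new_words.append(word)
    return new_words


# Hypothetical word database held by the recognition processor.
known_words = {"the", "patient", "was", "given"}
new_words = find_new_words(
    "The patient was given amoxicillin. Amoxicillin was tolerated.",
    known_words,
)
```

Each word returned would then need a pronunciation dictionary entry before it could be recognised in later dictation.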
The recorded speech is preferably continuous speech.
The server acts to control assignment of recorded speech files for processing to automatic speech recognition processors by queuing the received speech files and submitting them according to predetermined rules. This allows high utilisation of the available automatic speech recognition resources, according to an off-line or batch processing scheme.
Speech to text conversion can be done as a single fully automatic operation, or as a part-automatic and part-manual operation using the automatic speech recognition processor and corrector unit respectively.
Undertaking the speech to text conversion on a non-interactive, off-line basis avoids the user having to switch repeatedly between speech recording and speech correction tasks. This results in improved efficiency.
The predetermined rule or rules by which the server queues jobs can be according to urgency or user priority ratings.
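A minimal sketch of such queuing is given below, assuming a numeric urgency rating in which a lower number means more urgent, and first-come first-served ordering within the same rating. The class and method names are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Queue speech files and release them by urgency, then arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker for equal urgency

    def submit(self, speech_file, urgency):
        # Assumption: a lower urgency number means the job is more urgent.
        heapq.heappush(self._heap, (urgency, next(self._arrival), speech_file))

    def next_job(self):
        """Return the next speech file to process, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = JobQueue()
queue.submit("routine_letter.wav", urgency=5)
queue.submit("court_deadline.wav", urgency=1)
queue.submit("routine_memo.wav", urgency=5)
order = [queue.next_job() for _ in range(3)]
```

A free automatic speech recognition processor would call next_job whenever it becomes available, which keeps the processors busy under batch operation.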
The present invention relates in its various aspects both to apparatus and to corresponding methods.