This invention relates to a multiple speaker speech processing system for automatically converting the vocal statements of multiple speakers into a unitary, combined record, and more particularly to a system that creates a transcript of a proceeding among multiple speakers, accurately transcribing the words and tracking the identity of each speaker.
Stenographic and stenotype based reporting and transcriptions of meetings, conferences, hearings, and judicial proceedings are well known. Even the best systems and methods have significant drawbacks. The use of a court reporter, stenographer or other person to record and transcribe proceedings requires a highly trained individual. Years of practice are required for a reporter to reach a high level of competence. Even with the extensive training and experience, however, that individual suffers from normal human frailties. One such frailty is fatigue. Another is the susceptibility to repetitive stress injury and other medical conditions brought on by extended use of a stenotype machine. A further shortcoming is the frequent inability of a reporter to accurately transcribe overlapping conversations by two or more persons. Even experienced reporters have difficulty in such situations.
Computer software and stenotype machines with dedicated hardware are available that can convert the stenotyped input into text. These facilitate quicker availability of the transcript, including real-time conversion from stenotyped input into text. The patent literature describes devices for improving the accuracy of stenographic transcription such as that of Jackson et al., U.S. Pat. No. 5,745,875. This patent describes the simultaneous recording of proceedings by a human reporter and a speech recognition unit. The parallel conversions of speech to written record allow each to serve as a check against the other in real-time transcription. The reporter has the computer generated written words to compare against his or her stenotyped record. While the increased use of computers has streamlined the transcription process, the need for a reporter and the attendant problems have not been overcome.
Commercially available software for automatically converting speech to text is generally known as speech recognition software. Speech recognition quality ranges from poor, for speaker independent, limited vocabulary software, to reasonably good, for speaker dependent, trained software. A computer equipped with speaker independent software accepts speech input from any person and recognizes the 100,000 or so most commonly used words. Such software exhibits mediocre performance at best. A computer equipped with speaker dependent trained software starts out as speaker independent. The individual whose voice is to be recognized is asked to participate in a training session, whereby the programmed computer comes to recognize the individual's speech. As the individual continues to use the computer and correct its mistakes, the computer refines its ability to accurately translate the speech of that individual. Software implementation of trained systems exists in commercial packages such as Via Voice Gold (IBM Corp.), Naturally Speaking Deluxe (Dragon Systems, Inc.), and Kurzweil VoicePro (Alpha Software).
As such, speech recognition has not replaced a human reporter. While the reporter may make errors when transcribing, he or she easily outperforms even the best computer systems in environments with multiple speakers. The speech recognition computer must take a digital representation of human utterances, determine where in this representation words begin and end, and finally use some algorithm or mapping model to convert the representation of the individual words into recognized words. These tasks are extremely complex for a computer faced with multiple speakers.
Voice recognition software is generally not able to electronically recognize the identity of a speaker, i.e., tell one speaker from another. Nor does voice recognition software have the ability to deconstruct two overlapping vocal statements from two speakers, accurately reproduce written records of the statements and recognize who made them. The patent literature describes systems designed to record multiple speakers onto audio tape along with a tag indicating their identity. The art also describes systems that allow an audio tape to be recorded and synchronized to the keystrokes on the stenotype machine.
Individual microphones associated with dedicated transmitters, each transmitting on a different frequency, are known for the purpose of differentiating between speakers. U.S. Pat. No. 4,596,041 to Mack describes such a system with a plurality of demodulators each tuned to one of the frequencies of the transmitters. Once demodulated, each speaker's statements are recorded. A means of recording a time indication at the beginning and end of each statement is described, as well.
Because, for computer speech recognition, the problems of discriminating among speakers and correctly recognizing overlapping words from different speakers have not been solved, no currently existing methods or systems are known that can listen to a hearing, conference or any type of conversation, distinguish among speakers, and correctly transcribe the spoken words into a transcript of the proceedings with speakers correctly identified.
In accordance with the present invention, a system using speech recognition for the preparation of transcripts of multi-speaker proceedings uses individual microphones assigned to individual speakers. Each microphone has a distinguishing characteristic, channel or line that is electrically distinctive to uniquely identify a particular individual speaker from among all the speakers. Each statement at a microphone is transmitted to a computer with speech recognition software. Preferably, individual trained speech recognition components of the software convert to text the statements of the speakers. As used herein, “trained speech recognition software components” means either individual, trainable computer programs or portions or modules of a program, the portions or modules of which are capable of being trained to the speech of different individuals.
Conventionally, each microphone converts the sound from a particular speaker into an analog signal. In one preferred embodiment, each microphone is connected to a transmitter with its own assigned frequency. Signals representative of statements of speakers are transmitted in either analog or digital format. A multi-channel receiver has individual receiving sections tuned to the frequencies of the transmitters. These are connected with one or more computers running trained speaker dependent voice recognition software programming.
Alternatively, each microphone can be hard-wired to the remainder of the system, in which case the distinguishing characteristic of a particular speaker is the hard-wired channel on which the signal is transmitted. Certainly other methods of labeling the signal of a particular speaker's microphone can be employed. Whichever method of electrically distinguishing the statements from the microphones is used, the identification that this provides serves two purposes. It permits a speaker's statements to be directed to a software component trained to recognize her or his speech, and it allows the statements to be attributed correctly in the ultimate, assembled record.
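By way of illustration only, the two purposes served by the channel identification can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `Statement` and `route_statement`, the integer channel numbers, and the representation of a trained speech recognition software component as a plain callable are all hypothetical and do not appear in the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Statement:
    """One recognized statement, attributed to a speaker (hypothetical type)."""
    channel: int
    speaker: str
    text: str


def route_statement(
    channel: int,
    audio: str,
    speakers: Dict[int, str],
    recognizers: Dict[int, Callable[[str], str]],
) -> Statement:
    """Use the channel identity for both purposes named in the text.

    Assumption: each trained recognition component is modeled as a
    callable taking the audio and returning text. A real component
    would consume sampled audio, not a string.
    """
    # Purpose 1: direct the statement to the component trained on
    # the speaker assigned to this channel.
    text = recognizers[channel](audio)
    # Purpose 2: attribute the resulting text to that speaker.
    return Statement(channel=channel, speaker=speakers[channel], text=text)
```

In this sketch the channel number stands in for whichever electrically distinctive characteristic is used, whether an assigned transmitter frequency or a hard-wired line.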
In a preferred embodiment, a time stamping system is added that tracks the beginning and ending time of each speech segment. Once this timing data is combined with the speaker specific text, it is used to determine the order of assembly of the statements of the individual speakers into a combined transcript. A word-processing program assembles the text data into a transcript of the court proceeding, hearing, etc. In the present invention the transcript can be kept in electronic form, displayed on a computer monitor, printed, or otherwise manipulated and subsequently output.
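The use of the timing data to order the statements can likewise be sketched, again only as an illustration under stated assumptions. The names `Segment` and `assemble_transcript`, the use of seconds as the time unit, the ordering of overlapping statements by beginning time, and the "SPEAKER: text" line layout are hypothetical choices; the description does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Speaker-specific text with its time stamps (hypothetical type)."""
    speaker: str
    start: float  # beginning time of the speech segment, in seconds (assumed unit)
    end: float    # ending time of the speech segment, in seconds (assumed unit)
    text: str


def assemble_transcript(segments: List[Segment]) -> str:
    """Order the segments by their time stamps and join them into one record.

    Assumption: segments are ordered by beginning time, with ending
    time as a tie-breaker. Overlapping statements are therefore kept
    whole and listed in the order in which they began.
    """
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return "\n".join(f"{s.speaker}: {s.text}" for s in ordered)
```

Under these assumptions, a statement that begins while another speaker is still talking appears after the earlier statement in the combined transcript, and neither statement is split.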
The system can easily be used to record the spoken statements for later batch processing into text or, as described, for real-time speech processing. Finally, an audio recorder can be usefully incorporated to provide an audio backup of the proceeding.
The above and further features and advantages of the invention will be better understood from the following description of a preferred embodiment, when taken with the accompanying drawing.