The present invention relates generally to converting between text and speech, and specifically to converting speech to text in the presence of speech intonation.
Methods for converting between text and speech are known in the art. Text-to-speech conversion methods have been commercially produced for at least fifteen years, with improvements being made to the quality of the products as time has proceeded. Speech-to-text conversion is significantly more difficult to achieve than text-to-speech, and general-purpose, commercial speech-to-text systems have only been available in the last few years.
The Productivity Works, Inc., of Trenton, N.J., produces a “SoftVoice” text-to-speech product known as “SVTTS,” which analyzes text into phonemes, and generates speech from the phonemes. SoftVoice is a trademark of SoftVoice Inc. Tags and commands (which are not themselves converted to speech) may be embedded into the text so as to indicate to the SVTTS how the speech is to be generated. For example, there are tags for speaking in an English or Spanish accent, or in a whisper or speaking with a breathy quality.
IBM Corporation of Armonk, New York, produces a speech-to-text software package known as “ViaVoice.” ViaVoice is a registered trademark of International Business Machines Corporation. Preferably, the system uses a learning period, during which an operator is able to adjust to the system, and during which a computer upon which the system is installed becomes accustomed to the speech of the operator. During operation, the system converts speech to text, and inter alia, the system may be taught to recognize specific words and output them in a special format. For example, the system may be instructed to convert the spoken word “comma” to the punctuation mark “,”.
In an article titled “Super Resolution Pitch Determination of Speech Signals,” by Medan et al., in IEEE Transactions on Signal Processing 39:1 (January, 1991), which is incorporated herein by reference, the authors describe an algorithm giving extremely high resolution of pitch value measurements of speech. The algorithm may be implemented in real time to generate pitch spectral analyses.
In a book titled “Pitch Determination of Speech Signals” by W. Hess, (Springer-Verlag, 1983), which is incorporated herein by reference, the author gives a comprehensive survey of available pitch determination algorithms. The author points out that no single algorithm operates reliably for all applications.
It is an object of some aspects of the present invention to provide improved methods and apparatus for converting speech to text.
In preferred embodiments of the present invention, a speech/text processor automatically converts speech to text, while analyzing one or more non-verbal characteristics of the speech. Such non-verbal characteristics include, for example, the speed, pitch, and volume of the speech. The non-verbal characteristics are mapped to corresponding format characteristics of the text, which are applied by the speech/text processor in generating a text output. Such format characteristics can include, for example, font attributes such as different font faces and/or styles, character height, character width, character weight, character position, spacing between characters and/or words, and combinations of these characteristics. Text with such associated characteristics is herein termed expressive text, and cannot be generated by speech-to-text systems known in the art.
The expressive text produced from the speech may be used, for example, in an electronic mail transmission and/or to produce a hardcopy of the speech. Alternatively, the expressive text may be converted to a marked-up text, by a custom mark-up language or a standard mark-up language, such as HTML (hypertext mark-up language). Associating format characteristics with text to register non-verbal characteristics of speech is an innovative and extremely useful way of converting between speech and text, and overcomes limitations of speech-to-text and text-to-speech methods known in the art.
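By way of illustration only, the conversion of expressive text to marked-up text might be sketched as follows. The `ExpressiveWord` record, its `size_pt` and `bold` fields, and the `to_html` function are hypothetical names introduced for this sketch; the choice of HTML `span` elements with inline styles is likewise an assumption and is not a required form of the mark-up.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class ExpressiveWord:
    """One word of expressive text (hypothetical record for this sketch)."""
    text: str
    size_pt: int = 12   # character height, e.g. standing in for volume
    bold: bool = False  # character weight, e.g. standing in for emphasis


def to_html(words):
    """Render each word as a <span> whose inline style carries its format."""
    parts = []
    for w in words:
        style = f"font-size:{w.size_pt}pt"
        if w.bold:
            style += ";font-weight:bold"
        # escape() keeps characters such as '<' from being read as mark-up
        parts.append(f'<span style="{style}">{escape(w.text)}</span>')
    return " ".join(parts)


if __name__ == "__main__":
    print(to_html([ExpressiveWord("please"), ExpressiveWord("stop", 18, True)]))
```

In this sketch each word carries its own format, so a receiver that understands the mark-up can display, or later re-synthesize, the word with the recorded characteristics.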
In some preferred embodiments of the present invention, the expressive text generated by the speech/text processor is converted back to speech by a speech synthesizer. The speech synthesizer recognizes the format characteristics of the expressive text, and applies them to generate speech so as to reproduce the non-verbal characteristics originally analyzed by the speech/text processor. Alternatively, similar format characteristics may be generated using a suitable word processor program, so that text that is input using a keyboard is reproduced by the speech synthesizer with certain desired non-verbal characteristics.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for converting speech to text, including:
receiving a spoken input having a non-verbal characteristic; and
automatically generating a text output, responsive to the spoken input, having a variable format characteristic corresponding to the non-verbal characteristic of the spoken input.
Preferably, receiving the spoken input includes analyzing the spoken input to identify the non-verbal characteristic.
Preferably, receiving the spoken input includes determining words and boundaries between words, and generating the text output includes generating text corresponding to the words.
Preferably, the non-verbal characteristic includes at least one characteristic of the words selected from a group consisting of a speed, a pitch, and a volume of the words.
Preferably, receiving the spoken input includes determining parts of words and boundaries between parts of words in the spoken input, and the non-verbal characteristic includes at least one characteristic of the parts of the words selected from a group consisting of a speed, a pitch, and a volume of the parts of the words.
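As a minimal sketch, and assuming that the word boundaries have already been determined by a recognizer, per-word volume and speed might be measured as shown below. The function `word_features`, the use of root-mean-square amplitude as a measure of volume, and the use of characters per second as a measure of speed are assumptions made for this illustration; pitch determination is omitted, since it would rely on an algorithm such as those surveyed in the references cited above.

```python
import math


def word_features(samples, sample_rate, boundaries):
    """Measure volume and speed of each word.

    samples     -- sequence of audio amplitudes
    sample_rate -- samples per second
    boundaries  -- list of (word, start_seconds, end_seconds) tuples
    """
    features = []
    for word, start_s, end_s in boundaries:
        segment = samples[int(start_s * sample_rate):int(end_s * sample_rate)]
        # Root-mean-square amplitude as a simple proxy for volume
        rms = math.sqrt(sum(x * x for x in segment) / len(segment)) if segment else 0.0
        duration = end_s - start_s
        # Characters per second as a simple proxy for speaking speed
        speed = len(word) / duration if duration > 0 else 0.0
        features.append({"word": word, "rms": rms, "chars_per_s": speed})
    return features
```

The same measurement could be applied to parts of words by supplying boundaries between the parts rather than between whole words.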
Preferably, generating the text output includes encoding the text output as marked-up text.
Preferably, generating the text output includes generating the text output according to a predetermined mapping between the variable format characteristic and the non-verbal characteristic.
Further preferably, generating the text output includes normalizing a distribution of the non-verbal characteristic over a predetermined quantity of speech according to an adaptive mapping.
Alternatively, generating the text output includes generating the variable format characteristic according to a user-alterable absolute mapping.
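The difference between an adaptive mapping and an absolute mapping might be sketched as follows. This is an illustration under stated assumptions: the function names, the use of a mean and standard deviation to normalize the distribution, and the default `reference_db` and `span_db` values are all hypothetical and are not specified in the text.

```python
from statistics import mean, pstdev


def adaptive_scores(volumes_db):
    """Normalize volumes relative to their own distribution over the window.

    A quiet speaker and a loud speaker with the same relative variation
    receive the same scores.
    """
    mu = mean(volumes_db)
    sigma = pstdev(volumes_db)
    if sigma == 0:
        return [0.0] * len(volumes_db)
    return [(v - mu) / sigma for v in volumes_db]


def absolute_scores(volumes_db, reference_db=60.0, span_db=10.0):
    """Score volumes against fixed parameters that a user may alter."""
    return [(v - reference_db) / span_db for v in volumes_db]
```

Under the adaptive sketch, the windows `[40, 50, 60]` and `[60, 70, 80]` yield identical scores, whereas under the absolute sketch the second window scores higher throughout.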
Preferably, generating the text output according to the predetermined mapping includes generating the text output according to a quantized mapping, wherein a range of values of the non-verbal characteristic is mapped to a discrete variable format characteristic.
Alternatively, generating the text output according to the predetermined mapping includes generating the text output according to a continuous mapping, wherein a range of values of the non-verbal characteristic is mapped to a range of values of the variable format characteristic.
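The quantized and continuous mappings described above might be sketched as follows, assuming the non-verbal characteristic has already been reduced to a numerical score. The boundary values, style names, and point sizes are hypothetical defaults chosen for this sketch.

```python
import bisect


def quantized_style(score, boundaries=(-1.0, 1.0),
                    styles=("small", "normal", "bold")):
    """Quantized mapping: each range of scores maps to one discrete style."""
    # bisect_right returns the index of the range in which the score falls
    return styles[bisect.bisect_right(boundaries, score)]


def continuous_size(score, lo=-2.0, hi=2.0, min_pt=8.0, max_pt=24.0):
    """Continuous mapping: a range of scores maps linearly to a range of sizes."""
    clamped = max(lo, min(hi, score))
    fraction = (clamped - lo) / (hi - lo)
    return min_pt + fraction * (max_pt - min_pt)
```

With these defaults, a score of 0.0 yields the discrete style "normal" under the quantized mapping and a character height of 16.0 points under the continuous mapping.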
Preferably, automatically generating the text output includes:
applying the predetermined mapping at a transmitter;
encoding the text output with the variable format characteristic as a data bitstream at the transmitter;
transmitting the data bitstream from the transmitter to a receiver; and
decoding the data bitstream to generate the text output with the variable format characteristic at the receiver.
Preferably, applying the predetermined mapping at the transmitter includes altering the predetermined mapping at the transmitter.
Alternatively, automatically generating the text output includes:
encoding the text output and the non-verbal characteristic as a data bitstream at a transmitter;
transmitting the data bitstream from the transmitter to a receiver;
decoding the data bitstream at the receiver; and
applying the predetermined mapping at the receiver, responsive to the non-verbal characteristic encoded in the data bitstream, so as to generate the text output with the variable format characteristic.
Preferably, applying the predetermined mapping at the receiver comprises altering the predetermined mapping at the receiver.
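The two transmission alternatives above might be sketched as follows. The use of JSON encoded as UTF-8 bytes for the data bitstream, the function names, and the example mapping `volume_to_size` are assumptions introduced for illustration; no particular bitstream format is specified in the text.

```python
import json


def volume_to_size(volume_db):
    """Hypothetical predetermined mapping from volume to character height."""
    return 18 if volume_db >= 70 else 12


def send_formatted(words, mapping):
    """First alternative: apply the mapping at the transmitter, then encode."""
    records = [{"text": t, "size_pt": mapping(v)} for t, v in words]
    return json.dumps(records).encode("utf-8")


def receive_formatted(bitstream):
    """Decode text that already carries its format characteristic."""
    return [(r["text"], r["size_pt"]) for r in json.loads(bitstream.decode("utf-8"))]


def send_raw(words):
    """Second alternative: encode the text with the raw non-verbal characteristic."""
    records = [{"text": t, "volume_db": v} for t, v in words]
    return json.dumps(records).encode("utf-8")


def receive_raw(bitstream, mapping):
    """Decode, then apply a mapping chosen (and alterable) at the receiver."""
    return [(r["text"], mapping(r["volume_db"]))
            for r in json.loads(bitstream.decode("utf-8"))]
```

In the second alternative the receiver is free to substitute its own mapping, for example to display all text at one size, because the bitstream preserves the measured characteristic rather than a format already derived from it.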
Preferably, generating the text output includes varying at least one attribute of the text selected from a group consisting of font face, font style, character height, character width, character weight, character position, character spacing, kerning, fixed pitch, proportional pitch, strikethrough, underline, double underline, dotted underline, bold, bold italic, small capitals, toggle case, all capitals, and color.
Further preferably, generating the text output includes generating a custom-built font for the text output, having one or more variable features used to express the non-verbal characteristic.
There is further provided, in accordance with a preferred embodiment of the present invention, a speech/text processor, which is adapted to receive a spoken input having a non-verbal characteristic and to automatically generate a text output, responsive to the spoken input, having a variable format characteristic corresponding to the non-verbal characteristic of the spoken input.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer software product, including a computer-readable medium in which computer program instructions are stored, which instructions, when read by a computer, cause the computer to receive a spoken input having a non-verbal characteristic and to automatically generate a text output, responsive to the spoken input, having a variable format characteristic corresponding to the non-verbal characteristic of the spoken input.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for converting text to speech, including:
receiving a text input having a given variable format characteristic; and
synthesizing speech corresponding to the text input and having a non-verbal characteristic corresponding to the variable format characteristic of the text input.
Preferably, receiving the text input includes analyzing the text input to identify the given variable format characteristic.
Preferably, receiving the text input includes analyzing the text input to identify words and parts of words, and synthesizing speech includes synthesizing speech corresponding to the words and parts of words.
Preferably, the non-verbal characteristic includes at least one characteristic of the words selected from a group consisting of a speed, a pitch, and a volume of the words.
Further preferably, the non-verbal characteristic includes at least one characteristic of the parts of the words selected from a group consisting of a speed, a pitch, and a volume of the parts of the words.
Preferably, synthesizing speech includes synthesizing speech according to a predetermined mapping between the variable format characteristic and the non-verbal characteristic.
Further preferably, synthesizing speech according to the predetermined mapping includes generating speech according to a continuous mapping, wherein a range of values of the variable format characteristic is mapped to a range of values of the non-verbal characteristic.
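In the text-to-speech direction, a continuous mapping from a format characteristic to a non-verbal characteristic might be sketched as follows. The choice of character height as the format characteristic, of volume as the non-verbal characteristic, and of the numerical ranges shown are assumptions for this sketch; the output is a plan of per-word target volumes that a speech synthesizer would be expected to honor.

```python
def size_to_volume_db(size_pt, min_pt=8.0, max_pt=24.0,
                      min_db=50.0, max_db=80.0):
    """Map character height linearly onto a target volume in decibels."""
    clamped = max(min_pt, min(max_pt, size_pt))
    fraction = (clamped - min_pt) / (max_pt - min_pt)
    return min_db + fraction * (max_db - min_db)


def prosody_plan(formatted_words):
    """Turn (text, size_pt) pairs into (text, target_volume_db) pairs."""
    return [(text, size_to_volume_db(size)) for text, size in formatted_words]
```

With these defaults, text at 16 points is assigned a target volume midway between the minimum and the maximum, which is the inverse of the continuous speech-to-text mapping when the same ranges are used in both directions.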
Preferably, receiving the text input includes receiving input in which at least one of the attributes of the text, selected from a group consisting of font face, font style, character height, character width, character weight, character position, character spacing, kerning, fixed pitch, proportional pitch, strikethrough, underline, double underline, dotted underline, bold, bold italic, small capitals, toggle case, all capitals, and color, is varied.
Preferably, receiving the text input includes receiving a text input in a custom-built font having one or more variable features used to express the non-verbal characteristic.
There is further provided, in accordance with a preferred embodiment of the present invention, a speech/text processor, which is adapted to receive a text input having a given variable format characteristic, and to synthesize speech corresponding to the text input and having a non-verbal characteristic corresponding to the variable format characteristic of the text input.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer-readable medium in which computer program instructions are stored, which instructions, when read by a computer, cause the computer to receive a text input having a given variable format characteristic, and to synthesize speech corresponding to the text input and having a non-verbal characteristic corresponding to the variable format characteristic of the text input.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings, in which: