Text-to-speech (TTS) converters are devices that convert a text document to audible speech sounds. Such devices are useful for enabling vision-impaired individuals to use visible texts. Alternatively, TTS converters are useful for communicating information to any individual in situations where a visual display is not practical, as when the individual is driving or must focus his or her eyes elsewhere, or where a visual display is not present but an audio device, such as a telephone or radio, is present. Such visible texts may originate in tangible (e.g., paper) form and be converted to electronic digital data form by optical scanners and text recognizers. However, there is also a large source of electronic or computer-originating visual texts, such as electronic mail (Email), calendar/schedule programs, news and stock quote services and, most notably, the World Wide Web.
In the case of electronic originating texts, speech data may be separately generated, e.g., by digitizing the voice of a human reader of the text. However, digitized voice data consumes a large fraction of storage space and/or transmission capacity--far in excess of the original text itself. It is thus desirable to employ a TTS converter for electronic originating texts.
Generating speech from an electronic originating text intended for visual display presents certain challenges for the TTS converter designers. Most notably, information is present not only from the content of the text itself but also from the manner in which the text is presented, i.e., by capitalization, bolding, italics, listing, etc. Formatting and typesetting codes of a text normally cannot be pronounced. Punctuation marks, which themselves are not spoken, provide information regarding the text. In addition, the pronunciation of text strings, i.e., sequences of one or more characters, is subject to the context in which text is used. The prior art has proposed solutions in an attempt to overcome these problems.
U.S. Pat. No. 5,555,343 discloses a TTS conversion technique which addresses formatting and typesetting codes in a text, contextual use of certain visible characters and formats, and punctuation. A first predetermined table maps formatting and positioning codes, such as codes for generating bold, italics or underlined text, to speech commands for changing the speed or volume of the speech. A second predetermined table maps predetermined patterns of visible text, such as numbers separated by a colon (time) or numbers separated by slashes (date or directory), to replacement text strings. A third predetermined table maps punctuation, such as an exclamation point, to speech commands, such as a change in spoken pitch. An inputted text is scanned, and its spoken and non-spoken characters are mapped according to the tables before the text is inputted to a TTS converter.
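The three-table preprocessing described above can be illustrated with a minimal sketch. The formatting codes ("{B+}", "{B-}"), the bracketed speech commands, and the function name preprocess are all hypothetical and chosen only for illustration; they are not taken from the cited patent.

```python
import re

# Hypothetical table 1: formatting codes -> speech commands.
FORMAT_COMMANDS = {"{B+}": "[volume up]", "{B-}": "[volume normal]"}

# Hypothetical table 2: visible text patterns -> replacement text.
# Here, two numbers separated by a colon are read as a time.
PATTERN_REPLACEMENTS = [
    (re.compile(r"\b(\d{1,2}):(\d{2})\b"), r"\1 hours \2 minutes"),
]

# Hypothetical table 3: punctuation -> speech commands.
PUNCTUATION_COMMANDS = {"!": "[pitch up]"}


def preprocess(text):
    """Map patterns, formatting codes and punctuation before TTS conversion."""
    for pattern, replacement in PATTERN_REPLACEMENTS:
        text = pattern.sub(replacement, text)
    for code, command in FORMAT_COMMANDS.items():
        text = text.replace(code, command)
    for mark, command in PUNCTUATION_COMMANDS.items():
        text = text.replace(mark, command)
    return text


print(preprocess("{B+}Meet at 10:30!{B-}"))
```

In this sketch the pattern table is applied first so that the colon in a time is consumed before any punctuation mapping runs; the ordering is a design assumption of the example, not a requirement stated in the text.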
U.S. Pat. No. 5,634,084 discloses another TTS conversion technique. Inputted text is classified according to the context in which it appears. The classified text is then "expanded" by consulting one or more tables that translate acronym, initialism and abbreviation text strings to replacement text strings. The replacement text strings are converted to speech in much the same way as a human reader would convert the text strings. For example, the abbreviation text string "SF, CA" may be replaced with the text string "San Francisco California", the initialism "NASA" may be left unchanged, and the mixed initialism/acronym "MPEG" may be replaced with "m peg."
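The table-driven expansion step can be sketched as follows. This is only an illustration using the three examples given above; the table name EXPANSIONS and the function expand are hypothetical, and the context classification step of the cited patent is omitted.

```python
import re

# Hypothetical expansion table built from the examples in the text.
# "NASA" is deliberately absent, so it is left unchanged.
EXPANSIONS = {
    "SF, CA": "San Francisco California",
    "MPEG": "m peg",
}


def expand(text):
    """Replace known abbreviation strings with their spoken forms."""
    # Try longer keys first so that a long entry is not shadowed by a shorter one.
    keys = sorted(EXPANSIONS, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(k) + r"\b" for k in keys))
    return pattern.sub(lambda m: EXPANSIONS[m.group(0)], text)


print(expand("NASA released an MPEG clip in SF, CA"))
```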
The most important source of electronic text is the World Wide Web. Most of the electronic texts available from the World Wide Web are formatted according to the Hypertext Markup Language (HTML) standard. Unlike other electronic texts, HTML "source" documents, from which content text is displayed, contain embedded textual tags. The following is an illustrative segment of an HTML source document:
______________________________________
<!BODY BGCOLOR=#DBFFFF>
<body bgcolor=white>
<CENTER>
<map name="Main">
<area shape="rect"coords="157,12,257,112"href="Main.html">
<area shape="rect"coords="293,141,393,241"href="VRML.html">
<area shape="rect"coords="18,141,118,241"href="VRML.html">
<area shape="rect"coords="157,266,257,366"href="Main.html">
</map>
<img src="Images/Main.gif" usemap="#Main" border=0></img>
<br><br><br><br>
<b>
<font size=3 color=black>
Welcome to the VR workgroup of our company
</font>
<a href= "http://www.itri.org.tw"><font size=3 color=blue>ITRI</font></a>
<font size=3 color=black>/</font>
<a href= "http://www.ccl.itri.org.tw"><font size=3 color=blue>CCL</font></a>
<font size=3 color=black>. We have been<br>
developing some advanced technologies as follows.<br>
</b>
<ul>
<a href="Main.html">
<li><font size=3 color=blue>PanoVR</font>
</a>
<font size=3>(A panoramic image-based VR)</font><br>
<a href="VRML.html">
<li><font size=3 color=blue>CyberVR<font>
</a>
<font size=3>(A VRML 1.0 browser)</font><br>
</ul>
<br><br><a href="Winner.html"><img src= "Images/Winner.gif" border=no></img></a><br>
<a>
<br><br>
<font size=3 color=black>
<br>You are the <img src="cgi-bin/Count.cgi?df= vvr.dat"border=0 align=middle>th visitor<br>
</font>
<HR SIZE=2 WIDTH=480 ALIGN=CENTER>
(C) Copyright 1996 Computer and Communication Laboratory,<BR>
Industrial Technology Research Institute, Taiwan, R.O.C.
</BODY>
______________________________________
The HTML source document is formed entirely from displayable text characters. The HTML source document can be divided into content text and HTML tags. HTML tags are enclosed between the characters "<" and ">". There are two types of HTML tags, namely, start tags and end tags. A start tag starts with "<" and an end tag starts with "</". Thus, "<font size=3 color=black>" is a start tag for the tag "font" and "</font>" is an end tag for the tag "font". All other text is content text.
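The division into start tags, end tags and content text can be sketched with a small tokenizer. This is a simplified illustration, not a full HTML parser: the names TOKEN and split_html are hypothetical, anything beginning with "<" that is not an end tag is labeled a start tag (including comment-like constructs), and a ">" inside an attribute value is not handled.

```python
import re

# A tag is "<" ... ">"; everything else up to the next "<" is content text.
TOKEN = re.compile(r"<[^>]*>|[^<]+")


def split_html(source):
    """Return (kind, text) pairs, where kind is 'start', 'end' or 'text'."""
    tokens = []
    for match in TOKEN.finditer(source):
        piece = match.group(0)
        if piece.startswith("</"):
            tokens.append(("end", piece))
        elif piece.startswith("<"):
            tokens.append(("start", piece))
        elif piece.strip():
            # Whitespace-only runs between tags are dropped.
            tokens.append(("text", piece.strip()))
    return tokens


for kind, text in split_html("<font size=3 color=black> Welcome </font>"):
    print(kind, text)
```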
HTML tags impart meaning to content text encapsulated between a start tag and an end tag. Such "meaning" may be used by a display program, such as a web browser, to change attributes associated with the display, e.g., to display content text in a particular location of the display screen, with a particular color or font, a particular style (bold, italics, underline), etc. However, the choice as to which actual attributes, if any, to impart to the content text encapsulated between the start and end tags is entirely in the control of each browser. This enables a variety of browsers and display terminals with varying display capabilities to display the same content text, albeit somewhat differently from browser to browser and terminal to terminal. In this fashion, the HTML tags structure the content text, and that structure can be used for, among other things, altering the display of the content text. Note also a second property of HTML tags, namely, that the tags can be nested in a tree-like structure. For example, tags "<b>" and "<font size=3 color=black>" apply to the content text "Welcome to the VR workgroup of our company"; tags "<b>", "<a href="http://www.itri.org.tw">" and "<font size=3 color=blue>" apply to the content text "ITRI"; tags "<b>" and "<font size=3 color=black>" apply to the content text "/"; tags "<b>", "<a href="http://www.ccl.itri.org.tw">" and "<font size=3 color=blue>" apply to the content text "CCL"; tags "<b>" and "<font size=3 color=black>" apply to the content text ". We have been"; and tags "<b>" and "<br>" apply to the content text "developing some advanced technologies as follows."
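The nesting property can be illustrated by tracking open tags on a stack and recording, for each piece of content text, the tags that enclose it. The sketch below uses Python's standard html.parser module; the class TagContext, the function tags_per_text and the VOID set are hypothetical names. As a simplifying assumption, tags that normally have no end tag (such as "br" and "img") are not pushed on the stack, so this sketch does not report "<br>" as applying to following text in the way the passage above does.

```python
from html.parser import HTMLParser

# Assumption: these tags do not enclose content text, so they are not stacked.
VOID = {"br", "img", "hr", "area"}


class TagContext(HTMLParser):
    """Record each piece of content text with the tags currently open."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the matching tag and any unclosed tags nested inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.results.append((text, tuple(self.stack)))


def tags_per_text(source):
    parser = TagContext()
    parser.feed(source)
    parser.close()
    return parser.results


sample = (
    "<b><font size=3 color=black>Welcome</font>"
    '<a href="http://www.itri.org.tw"><font size=3 color=blue>ITRI</font></a></b>'
)
for text, tags in tags_per_text(sample):
    print(text, tags)
```

A stack is used because start and end tags form a tree: the tags applying to a piece of content text are exactly the tags still open when that text is reached.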
The above example of an HTML document is in the English language. However, the HTML standard supports display of documents in a variety of languages, including languages such as Chinese, Japanese and Korean, which use a large symbol set instead of a simple alphabet. Most users of the World Wide Web who access HTML documents primarily in a language other than English are familiar with certain common technical English language terms such as "Web," "World Wide Web," "HTML," etc. It is therefore not uncommon to find HTML documents available on the World Wide Web containing content texts that are composed mostly of a language other than English, such as Chinese, but also containing some standard technical English language terms.
Another aspect of languages other than English, such as Chinese, is that certain symbols of such languages may have multiple enunciations depending on the other symbols in the text string with which the symbol in question appears. The same is true for certain English language texts when a term in another language, such as Chinese, French or Hebrew, is phonetically transliterated to English.
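As an illustration of context-dependent enunciation, the Chinese character 行 is read "hang" in 银行 (bank) but "xing" in 行走 (to walk). The sketch below resolves such cases by preferring word-level entries over per-character defaults. The tiny lexicon, the numbered-tone pinyin notation and the function readings are assumptions made for this example only; the source does not prescribe this method.

```python
# Hypothetical mini-lexicon: word-level readings override character defaults.
WORD_READINGS = {
    "银行": ["yin2", "hang2"],
    "行走": ["xing2", "zou3"],
}
CHAR_DEFAULTS = {"银": "yin2", "行": "xing2", "走": "zou3", "去": "qu4"}


def readings(text, max_word_len=2):
    """Return one reading per symbol, preferring the longest known word."""
    out = []
    i = 0
    while i < len(text):
        # Try multi-symbol words first, longest candidate first.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                out.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No word matched here: fall back to the single-symbol default.
            out.append(CHAR_DEFAULTS.get(text[i], text[i]))
            i += 1
    return out


print(readings("去银行"))
print(readings("行走"))
```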
The conventional TTS converters described above are not well suited for converting HTML documents to speech. First, the HTML tags used by the browser to modify the positioning or attributes of the content text are themselves text and are thus not easily parsed or distinguished from the content text. In any event, the prior art TTS converters do not teach how to identify which content text should be assigned a particular intonation and speed when such content text is encapsulated by attribute or position indications such as HTML start and end tags, especially when such HTML tags can be nested in a tree-like structure. Second, the prior art TTS converters do not modify the enunciation of a particular symbol of a language whose enunciation can vary with the context in which the symbol is used. TTS converters are available for converting non-English texts, such as Chinese texts, to speech. However, such TTS converters can correctly convert only the text of that language and typically ignore text in another language, such as English.
Accordingly, it is an object of the present invention to overcome the disadvantages of the prior art.