1. Field of the Invention
The present invention relates to a document processing apparatus, a document processing method, and a document processing program in which digitized text information is extracted in order to make it more convenient to read the text aloud, and further relates to a recording medium for use therewith.
2. Description of the Related Art
Recently, as the Internet has become increasingly widespread, a large volume of digitized text data has been handled over networks. On the Internet, in particular, voluminous digitized text data is exchanged via web sites on the World Wide Web (WWW) or by e-mail. E-mail messages mainly contain plain text information. On a web site, on the other hand, text data is mainly described in HTML (Hyper Text Markup Language).
In HTML, a document in a text-data format has codes, called tags, embedded therein, which are also expressed using text data, and the tags can be used to define the document structure. A document described in HTML is read using viewer software supporting the document in order to view the document in a layout according to the document structure defined by the tags. Hereinafter, a document described in HTML is simply referred to as an “HTML document”.
The data format of text data exchanged over a network thus differs between e-mail messages and HTML documents, and a different viewer is required for each.
Occasionally, it may be necessary to extract sentences in a predetermined fashion, according to the document structure, from text data obtained over a network in this way. For example, in order to read a document aloud in a synthetic voice, the sections to be read aloud may have to be extracted automatically from the obtained text data. Likewise, in order to view a document on a display more conveniently, the selective extraction of desired sections may be automated.
In the related art, sentences are extracted from such HTML documents merely by removing the tag information.
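By way of illustration only, tag removal of this kind can be sketched as follows. The sketch uses the standard Python `html.parser` module; the names `TagStripper` and `strip_tags` and the sample document are hypothetical and are not taken from any particular related-art system.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the character data of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Every tag is ignored; only the text between tags is kept.
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of html_text with all tag information removed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


sample = ("<h1>News</h1><table><tr><td>Item</td><td>Price</td></tr></table>"
          "<p>Body text.</p>")
print(strip_tags(sample))  # prints: NewsItemPriceBody text.
```

In this sketch the heading, the table cells, and the body text run together in the output, since the structure carried by the tags is discarded along with the tags themselves.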
A typical viewer for viewing text data presents ruled lines by continuously repeating a symbol such as “*” or “−” on one line, or by using a symbol such as “|”, in a text-format document, such as an e-mail message. In this way, symbols can be used to form a table in a text-format document. When sentences are extracted from such a document, generally, the symbols used as a ruled line are simply segmented as a character string, and are not identified as a table.
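The following minimal sketch illustrates this behavior under stated assumptions. The function `naive_segments` stands in for line-by-line segmentation that treats a ruled line as an ordinary character string. The function `is_ruled_line` is a hypothetical heuristic, shown only to indicate what identifying such a line might involve; the rule characters and the minimum length of four are assumptions chosen for the example.

```python
# Characters assumed, for this example only, to be used for ruled lines.
# "\u2212" is the minus sign, which some documents use in place of "-".
RULE_CHARS = set("*-\u2212=")


def naive_segments(text):
    """Split text into non-empty lines with no regard to their role."""
    return [line for line in text.splitlines() if line.strip()]


def is_ruled_line(line, min_len=4):
    """Hypothetical check: a line made of one repeated rule character."""
    s = line.strip()
    return len(s) >= min_len and len(set(s)) == 1 and s[0] in RULE_CHARS


message = "\n".join([
    "**********",
    "| Item  | Price |",
    "**********",
    "Please see the table above.",
])

segments = naive_segments(message)
print(segments)                              # ruled lines appear as plain strings
print([is_ruled_line(s) for s in segments])  # prints: [True, False, True, False]
```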
In text data, typically, a quotation symbol such as “>” is used for quoting the document of others. In an e-mail response, for example, this quotation symbol may be added at the beginning of each line of an original e-mail message to indicate that the original message has been quoted.
There has been a system in the related art which is configured to identify a block including quoted sections and to show the quoted sections and the other sections distinctively in different colors. In this case, again, if a sentence is extracted from the quoted text, the sentence is taken out together with the quotation symbol, such as “>”.
A recently popular extension of e-mail is the so-called mail magazine, a system capable of transmitting the same information to multiple destinations at once. The transmitted e-mail often contains a large amount of information in addition to the text body, such as blocks of advertisements and a signature. Generally, it is difficult to remove such additional information from the text data to acquire only the text body information.
Furthermore, as described above, an HTML document uses tags to define the document structure, and the document is displayed by an appropriate viewer in a style according to the tags. As a result, a tag is generally used as a control code specifying visual functionality for display, that is, a layout. The positional functionality, such as whether the text section associated with a tag represents a table or a heading in the document, therefore may not be determinable from the tag, even in an HTML document.
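As an illustration of what separating a quotation symbol from the quoted sentence could involve, a minimal sketch is given below. The function `split_quote` and its regular expression are hypothetical; the sketch assumes that “>” is the only quotation symbol and that it appears only at the beginning of a line, optionally repeated for nested quotations.

```python
import re

# One or more ">" symbols at the start of a line, each optionally preceded
# by whitespace, followed by at most one separating space (an assumption).
QUOTE_RE = re.compile(r"^(?:\s*>)+\s?")


def split_quote(line):
    """Return (quotation depth, line text without the quotation symbols)."""
    match = QUOTE_RE.match(line)
    if match is None:
        return 0, line
    depth = match.group(0).count(">")
    return depth, line[match.end():]


reply = [
    "> > Shall we meet on Friday?",
    "> Friday is fine with me.",
    "See you then.",
]

for line in reply:
    print(split_quote(line))
# prints:
# (2, 'Shall we meet on Friday?')
# (1, 'Friday is fine with me.')
# (0, 'See you then.')
```

A “>” appearing in the middle of a line, as in “a > b”, is left untouched by this sketch because the pattern is anchored to the beginning of the line.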
In a typical apparatus for reading aloud an HTML document, therefore, the sections to be read aloud cannot be differentiated from the other sections of the HTML document by tags alone. In addition, an operator cannot specify which sections are to be read aloud.