1. Field of the Invention
The present invention relates to a technique for extracting web content and, more particularly, to a method of establishing a plain text document from an HTML document, wherein the plain text document has the contents most related to the title of the HTML document.
2. Description of the Prior Art
HTML documents are made readable by an internet browser, which displays them in the form of web pages. In comparison with plain text documents, HTML documents contain not only text but also tags and other forms of information, such as images or video clips. The internet browser displays the content of a web page according to the tags, so that the web content can carry rich and diverse information. However, because of the limited display size, it is inconvenient for users to read a complete web page on a portable computer device. Besides, in some applications only an important part of the content of the web page is needed, rather than the whole page. Therefore, techniques for extracting text from HTML documents have been developed to support such devices and applications.
In Taiwan patent 434492, titled “Hyper text-to-speech conversion method”, a hyper text markup language (HTML) analyzer is disclosed. The HTML analyzer reads and analyzes the input hyper text and divides it into text content, HTML tags for marking up the text structure, and articulation control commands for controlling the way of articulation. However, all the text in the HTML document is extracted without further processing.
Taiwan patent publication 200813763, titled “System and method for multithreading analyzing web page”, discloses a system that, based on a specific analyzing rule, determines whether an XML webpage contains a corresponding analyzing rule, then determines whether the XML webpage should be evaluated using an analyzing module, and determines whether the analyzed webpage meets the requirement of evaluation according to the criteria in the analyzing rule. The system increases the speed and efficiency of web content extraction. However, no detail of the analyzing rule is disclosed.
A well-known hyper text-to-pure text conversion technique includes the steps of: pre-processing the HTML document by omitting some irrelevant HTML elements; identifying the HTML element having the longest content in the pre-processed HTML document; shifting a current HTML element to a candidate HTML element whose content has a length longer than a predetermined threshold and whose interval from the longest HTML element is smaller than a predetermined threshold; repeating the shifting step for the HTML elements ahead of and behind the longest HTML element until there is no candidate HTML element; identifying the final current HTML elements in the two directions as a starting HTML element and an ending HTML element, respectively; and using the contents of the starting and ending HTML elements, and those of the HTML elements between them, as the content of the plain text document.
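The conversion steps described above can be illustrated with a minimal Python sketch. It is not taken from any cited reference and rests on several assumptions: the pre-processing step has already reduced the HTML document to a list of text blocks, one per HTML element; the "interval" is counted as the difference in element positions and is measured from the current element at each shift; and the names `expand`, `extract_main_text`, `min_len` and `max_gap` are hypothetical.

```python
from typing import List


def expand(blocks: List[str], start: int, step: int,
           min_len: int, max_gap: int) -> int:
    """Shift the current element from `start` in direction `step` (+1 or -1).

    Assumption: a candidate is the nearest element whose text is longer
    than `min_len` and whose distance from the current element is
    smaller than `max_gap`.
    """
    current = start
    while True:
        candidate = None
        for gap in range(1, max_gap):  # distance strictly below max_gap
            idx = current + step * gap
            if idx < 0 or idx >= len(blocks):
                break
            if len(blocks[idx]) > min_len:
                candidate = idx
                break
        if candidate is None:
            return current  # no candidate left: final current element
        current = candidate


def extract_main_text(blocks: List[str], min_len: int = 20,
                      max_gap: int = 3) -> str:
    """Join the blocks between the starting and ending elements."""
    if not blocks:
        return ""
    # The element with the longest content anchors the extraction.
    longest = max(range(len(blocks)), key=lambda i: len(blocks[i]))
    start = expand(blocks, longest, -1, min_len, max_gap)
    end = expand(blocks, longest, +1, min_len, max_gap)
    return "\n".join(blocks[start:end + 1])
```

In this sketch, every block lying between the starting and ending elements is kept regardless of its own length, and a short block at either edge of the article is never selected as a candidate.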
The hyper text-to-pure text conversion technique described above has the following drawbacks:
1. The plain text document may contain irrelevant sentences or words.
2. The first or last paragraph of the article in the webpage may be too short to be extracted.
3. Although the longest HTML element usually contains the important content of the webpage, there are exceptions. For example, the content of a news article is the most important content of a webpage but may be shorter than the content of an advertisement or of a list of hyperlinked news titles in the same webpage. In this case, the plain text document may contain only irrelevant sentences or words.
Therefore, extracting the text from the webpage without further processing does not solve the problems of the prior art. Although the traditional extraction technique can establish a plain text document containing selected content from the webpage, the selected content is likely to be irrelevant. An extraction technique that can establish a plain text document whose content is closely relevant to the title of the webpage is therefore desired.