1. Field of the Invention
The invention relates to a method for converting web pages to plain text, and more particularly to a method and system for converting Hypertext Markup Language web pages to plain text.
2. Description of the Related Art
With the popularity of the Internet, people have grown used to obtaining information and searching for data through the Internet, for example by going directly to websites to browse web pages of news, articles, etc. At present, web pages are mostly written in the Hypertext Markup Language (hereinafter referred to as HTML).
A newer way of providing information over networks is Really Simple Syndication (hereinafter referred to as RSS). RSS allows users to subscribe to information content of interest and to have the most recent information on web pages sent to them in real time. Specifically, to read RSS content, a user installs an RSS reader on a user terminal and then uses the RSS reader to subscribe to various RSS feeds or channels provided by websites. The RSS reader checks the subscribed RSS feeds regularly for updates; that is, at a user-determined interval, the RSS reader automatically downloads summaries (each including, e.g., a title, a description, and a URL or link) of the latest news or articles on the subscribed feeds to the user terminal, so that the user has up-to-date information on the subscribed feeds. If the user is interested in any new content or update, the user can click the corresponding summary to follow the associated URL or link to the corresponding HTML web page and browse the full text of the new content.
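By way of illustration only, the summary-retrieval behavior of an RSS reader described above may be sketched as follows. The sketch assumes an RSS 2.0 style feed in which each item carries title, link, and description elements. The sample feed, its URLs, and the names SAMPLE_FEED and extract_summaries are hypothetical and are not taken from any particular RSS reader. A real reader would download the feed over the network at the user-determined interval; here the feed is an in-memory string so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; every value below is invented for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/articles/1.html</link>
      <description>Short summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/articles/2.html</link>
      <description>Short summary of the second article.</description>
    </item>
  </channel>
</rss>"""


def extract_summaries(feed_xml, seen_links):
    """Return summaries of items whose link has not been seen before.

    feed_xml   -- the feed document as a string
    seen_links -- a set of links already delivered; it is updated in place
    """
    root = ET.fromstring(feed_xml)
    summaries = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if link in seen_links:
            continue  # already delivered on an earlier check
        seen_links.add(link)
        summaries.append({
            "title": item.findtext("title", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            "link": link,
        })
    return summaries


if __name__ == "__main__":
    seen = set()
    # First check: both items are new, so both summaries are listed.
    for summary in extract_summaries(SAMPLE_FEED, seen):
        print(summary["title"], "->", summary["link"])
    # Second check of the unchanged feed: nothing new is reported.
    print(len(extract_summaries(SAMPLE_FEED, seen)))
```

In this sketch the set of previously seen links stands in for whatever update-tracking a given reader actually uses; the link in each summary is what the user would follow to reach the full HTML web page.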
However, whether the user connects to a website directly or uses the RSS reader to browse an HTML web page, the HTML web page contains a great deal of information (such as advertisements, caption links to other web content, website information, etc.) that is irrelevant to the main content of the web page. This extraneous information may slow the user's reading of the main content and makes it difficult for the user to quickly comprehend that content.