With the growing popularity of wireless Internet access, there is a compelling need to present the wide variety of existing web pages on mobile or portable devices such as wireless telephones, pagers, and personal digital assistants (PDAs). However, of the vast number of web pages accessible over the Internet, only a few are specifically designed for mobile or portable devices. The majority of web pages existing today are written in Hypertext Markup Language (HTML), which many portable devices cannot render. For example, typical wireless telephones can only access pages written in Wireless Markup Language (WML) or Handheld Device Markup Language (HDML). Accordingly, in order to display an existing HTML page on a portable device, the HTML page needs to be transformed into a WML or HDML page that is viewable on the portable device.
One of the most challenging problems in transforming an HTML page is separating the main content included in the HTML page from auxiliary HTML data that surrounds the main content. The main content includes meaningful information that should be displayed to users of mobile devices. For example, in a news web page, the main content includes the news story, together with its title and/or headline. The auxiliary HTML data may include formatting code embedded in the text and some text segments that are not a part of the main content. The formatting code (also known as markup tags) may be used, for example, to define the page layout, fonts and graphic elements, as well as the hypertext links to other documents on the Internet. The auxiliary text segments include information that does not need to be displayed to the users of mobile devices (e.g., text in the header of the page, navigation links, inset boxes with text, etc.).
A typical HTML page includes alternating markup tag segments and text segments. Because many portable devices cannot render HTML, the markup tags need to be removed from the HTML page when transforming the HTML page into a WML or HDML page. In addition, the auxiliary text segments need to be separated from the main content. However, since HTML is fundamentally a formatting language, it carries no semantic information that identifies what the text segments on an HTML page contain. For example, a block of data that looks like “<b>xxxxxx</b>” might be a date, an author, a headline, or an advertisement, and there is no mechanism to determine the content of this block of data without a human reading it. As a result, it is difficult to identify the portions of the HTML page that need to be marked up for display on a mobile device.
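The alternation of markup tag segments and text segments described above can be illustrated with a minimal Python sketch. It is an illustration only, not a method taken from this document: the names `TAG_PATTERN`, `split_segments`, and `text_only` are hypothetical, and the regular expression is a deliberate simplification that assumes well-formed tags and ignores comments, scripts, and attribute values containing ">".

```python
import re

# Hypothetical simplification: a tag is "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"(<[^>]+>)")


def split_segments(html):
    """Split a page into (kind, value) pairs, where kind is 'tag' or 'text'."""
    segments = []
    # The capturing group makes split() return the tags as well as the text.
    for piece in TAG_PATTERN.split(html):
        if not piece.strip():
            continue  # drop empty strings and whitespace between adjacent tags
        kind = "tag" if TAG_PATTERN.fullmatch(piece) else "text"
        segments.append((kind, piece.strip()))
    return segments


def text_only(html):
    """Remove the markup tags and keep only the text segments."""
    return [value for kind, value in split_segments(html) if kind == "text"]


# Two blocks with identical markup but different meaning (a date, an ad).
sample = "<p><b>June 5</b></p><p><b>Big Sale Today</b></p>"
print(split_segments("<b>xxxxxx</b>"))
print(text_only(sample))
```

Removing the tags in this way is mechanical. What the sketch cannot do is tell which of the two bold text segments in `sample` belongs to the main content, because both are wrapped in exactly the same markup.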