This section provides background information related to the present disclosure which is not necessarily prior art.
Hypertext Markup Language (HTML) is a text markup language widely adopted on the World Wide Web (WWW). HTML enables web browsers to show web pages in a structured manner by using a series of tags.
For example, the following HTML text is shown in an INTERNET EXPLORER (IE) browser as the page in FIG. 1, which includes a table of 8 fields built with HTML.
<TR bgColor="#f2f8ff">
  <TD noWrap><a href="/search.aspx?q=learning&p=Seed&b=0"> learning </a></TD>
  <TD><A href="http://www.cnplayer.com/upload/2006/2/13/200621323483592551238218.torrent" target=_blank>CPA2005 learning material-Accounting economic laws tax laws ISO classical material </A></TD>
  <TD noWrap><a href="http://bbs.fkee.com/" target=_blank> relevant discussion </a></TD>
  <TD><A href="http://www.cnplayer.com/bt/study/210591.htm" target=_blank> View </A></TD>
  <TD align="center"><b><font color=red>147</font></b></TD>
  <TD align="center"><b><font color=red>734</font></b></TD>
  <TD align="center"><font color=red>1354M</font></TD>
</TR>
In the preceding HTML text, <TR></TR>, <TD></TD> and <A></A> are HTML tags. The text has the following characteristics: the information between <TR> and </TR> is the content of the second row of the table shown in FIG. 1; the information between each pair of <TD> and </TD> is one field of that table; and every piece of information displayed in a field of FIG. 1 is enclosed between a ">" and a "<", that is, between the end of one tag and the start of the next.
These characteristics are not specific to the preceding HTML text; they are shared by most web pages that present information in the form of tables. Some pages use different tags, but the basic characteristics remain the same.
To sum up, HTML texts have the following basic characteristics:
1) the format of the HTML texts is indicated with tags;
2) the HTML texts must follow a defined grammar in order to express web information with the tags;
3) in a web page in the form of a table, information enclosed by a pair of <TR> and </TR> indicates a row of the table;
4) in the web page in the form of the table, information enclosed by a pair of <TD> and </TD> indicates a field in a row of the table;
5) in the web page in the form of the table, every piece of information shown in a field is enclosed between a ">" and a "<"; and
6) the HTML tags are case-insensitive.
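Characteristics 3) to 6) can be illustrated with a short sketch. The example below is not part of the disclosure; it assumes Python and its standard html.parser module, and the names TableRowParser, extract_rows and SAMPLE are hypothetical. SAMPLE is an abridged version of the row shown above, with one tag pair written in lower case to exercise characteristic 6).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <TD> field, grouped by <TR> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows, each a list of field strings
        self._row = None     # fields of the row being read, or None
        self._field = None   # text chunks of the field being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <TR> and <tr> match.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._field = []

    def handle_data(self, data):
        # Displayed text sits between the ">" of one tag and the next "<".
        if self._field is not None:
            self._field.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._field is not None:
            self._row.append("".join(self._field).strip())
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._field = None


def extract_rows(html_text):
    """Return the table rows found in html_text as lists of field strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Abridged, hypothetical variant of the sample row.
SAMPLE = (
    '<TR bgColor="#f2f8ff"><TD noWrap><a href="/search.aspx?q=learning">'
    ' learning </a></TD><td align="center"><b><font color=red>147</font></b></td>'
    '<TD align="center"><font color=red>1354M</font></TD></TR>'
)

if __name__ == "__main__":
    print(extract_rows(SAMPLE))
```

Each inner list corresponds to one <TR> row and each string to one <TD> field, regardless of the case in which the tags were written.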
A web browser parses the HTML tags and displays web information in the format designated by the tags. The HTML tags form a set of keywords. Different browser versions support different versions of HTML. When parsing HTML text, a browser first analyzes the grammar of the text, then executes any dynamic content in it, and finally displays the formatted web information to the user.
A web browser has the following characteristics:
1) support for static pages composed of HTML texts;
2) support for dynamic script languages, such as JavaScript, Dynamic HTML (DHTML), etc.;
3) support for web clients posting data to servers and for servers getting data from the web clients;
4) support for dynamic web technologies, such as Active Server Pages (ASP, pages containing script code), JavaServer Pages (JSP), JavaBeans, etc., where JSP is a dynamic web page standard promoted by Sun Microsystems and established by multiple corporations, and a JavaBean is a Java class whose objects provide a certain function and can process a service by encapsulating attributes and actions;
5) grammar parsing of the HTML texts only, without semantic analysis of the HTML texts; and
6) the functions of a web page display tool only, without classification and aggregation of the web information.
In practical applications, a user may need to extract web information of interest from web pages for classification and aggregation. The most common conventional methods for this purpose analyze the contents of web files according to the keywords in them. Such methods include the keyword complete match method, the keyword fuzzy match method and the regular expression algorithm.
The keyword complete match method treats the keyword to be matched as the string to search for and the web file to be processed as the source string, and analyzes the source web text with a string matching algorithm. This method is suitable for extracting a small amount of information. When massive data must be handled, the string matching algorithm takes too much time and extends poorly.
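A minimal sketch of a keyword complete match is given below for illustration only. It assumes Python and its built-in str.find method as the string matching algorithm; the function name find_keyword and the sample string are hypothetical and do not come from any particular conventional system.

```python
def find_keyword(source, keyword):
    """Return the start offsets of every exact occurrence of keyword in source."""
    if not keyword:
        return []
    offsets = []
    start = source.find(keyword)
    while start != -1:
        offsets.append(start)
        # Resume one character later so overlapping occurrences are also found.
        start = source.find(keyword, start + 1)
    return offsets


if __name__ == "__main__":
    # Hypothetical source string standing in for a web file.
    page = "<TD>learning</TD><TD>CPA2005 learning material</TD>"
    print(find_keyword(page, "learning"))
```

In this sketch the whole source string is scanned once for every keyword, so the work grows with both the size of the web files and the number of keywords.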
The keyword fuzzy match algorithm is an improvement on the keyword complete match algorithm. Although it provides better extensibility, it takes no less time.
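The fuzzy match algorithm is not specified further here. As one possible illustration, the sketch below assumes that fuzzy matching means accepting words whose similarity to the keyword reaches a threshold, and uses SequenceMatcher from Python's standard difflib module as the similarity measure. The function name fuzzy_find, the threshold of 0.8 and the sample text are assumptions made for this example.

```python
import difflib
import re


def fuzzy_find(source, keyword, threshold=0.8):
    """Return the words in source whose similarity to keyword is at least threshold."""
    words = re.findall(r"[A-Za-z0-9]+", source)
    matches = []
    for word in words:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        ratio = difflib.SequenceMatcher(None, keyword.lower(), word.lower()).ratio()
        if ratio >= threshold:
            matches.append(word)
    return matches


if __name__ == "__main__":
    # Hypothetical text standing in for the contents of a web file.
    text = "CPA2005 learning material, learnings and lessons"
    print(fuzzy_find(text, "learning"))
```

Every word of the source is still compared with the keyword, which is consistent with the observation that the fuzzy method takes no less time than the complete match.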
The regular expression algorithm offers no advantage when handling varied web information.
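For illustration, a regular expression approach might look like the sketch below. It assumes Python's standard re module, and the pattern, the function name extract_text and the sample row are hypothetical. The pattern relies on the characteristic that displayed information is enclosed between a ">" and a "<".

```python
import re

# Text that sits between the ">" closing one tag and the "<" opening the next.
TEXT_BETWEEN_TAGS = re.compile(r">([^<>]+)<")


def extract_text(html_text):
    """Return the non-empty pieces of text found between tags in html_text."""
    pieces = (m.strip() for m in TEXT_BETWEEN_TAGS.findall(html_text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    # Hypothetical row; "x.htm" is a placeholder link target.
    row = (
        '<TD align="center"><b><font color=red>147</font></b></TD>'
        '<TD><A href="x.htm" target=_blank> View </A></TD>'
    )
    print(extract_text(row))
```

A pattern of this kind is tied to one layout. It cannot tell which field a piece of text belongs to, and pages that mark up their information differently need different patterns.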
What the three methods have in common is that they parse the contents of the web information in order to extract it. However, the algorithms of all three methods have high time complexity and poor extensibility.