With the electronic information explosion caused by the Internet, a huge amount of diversified information has accumulated on the Web and continues to grow at a staggering rate. Helping Internet users find useful information in this enormous pool is a challenging task.
Information retrieval (IR) is the science of searching for information in a set of documents. It can be further divided into searching for a piece of information contained in documents, searching for the documents themselves, searching for metadata that describe documents, and searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images, or data. Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents. Originating from these two long-established research disciplines, a web search engine (e.g., Google or Baidu) is a document retrieval system designed specifically to help find information stored on the Web. It allows a user to ask for content that meets specific criteria (typically content containing a given word or phrase) and to retrieve a list of items that match those criteria. Recently, a new type of web search engine, the vertical search engine, has become popular on the Web. Using information extraction or web mining technologies, it extracts structured information from a highly refined database or from websites about a specific topic, in order to provide more accurate and valuable information to people interested in a particular area.
In all these information retrieval and information extraction solutions of the Internet era, web page filtering plays an important role, whether in a general or vertical web search engine or in a specific web mining system.
Technically, the process of web page filtering consists mainly of two steps: first, selecting proper and efficient web page features for the specific filtering purpose; and second, modeling filtering mechanisms based on the selected features. In terms of the selected features, current approaches to web page filtering can be roughly classified into four categories: content-based filtering, PageType-based filtering, link-based filtering, and extended-anchor-based filtering. These four categories are briefly introduced below.
Content-based approach: This approach derives directly from information retrieval research [1]-[2]. It is a query-dependent algorithm, i.e., it assigns a similarity score to each web page whenever a query is submitted. Its basic idea is that the words appearing in a web page are used to retrieve the relevant web pages; for example, higher scores are given to web pages that contain the query terms early in the document or in a large or boldfaced font. Based on the Vector Space Model (VSM), the cosine measure can be adopted to compute the similarity between a web page and the corresponding query, and relevant web pages are then filtered according to the similarity scores.
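The cosine measure under the VSM can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and whitespace tokenization; the function names and the threshold value are hypothetical and chosen only for illustration, and no weighting for term position or font is attempted.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pages(pages, query, threshold=0.1):
    """Return (url, score) pairs whose score reaches the threshold, best first.

    `pages` maps a URL to its text; `threshold` is an assumed cut-off.
    """
    query_vec = term_vector(query)
    scored = [(url, cosine_similarity(term_vector(text), query_vec))
              for url, text in pages.items()]
    kept = [(url, score) for url, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Because the score depends on the query vector, it has to be recomputed for every submitted query, which is what makes this approach query dependent.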
PageType-based approach: Most Internet users can recognize the document type to which a particular web page belongs just by casually looking at it. This leads to the conclusion that a human's evaluation of a web page is based not only on its content but also on its format and design information. From this observation, the content of a web page together with its structural characteristics is employed in a rule-based classifier for web page type classification. The basic structural characteristics include typical pairs of a tag and strings, the size and number of inline images, the kind and number of links, and URL strings. Based on the internal features (e.g., anchor text, keywords, title, URL, etc.) of similar web pages, a machine-learning-based method can also be adopted for web page classification.
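A rule-based classifier over structural characteristics might look like the following sketch. It counts only links and inline images and inspects the URL string; the page types, the rules, and the thresholds are all hypothetical and serve only to show the shape of such a classifier.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class StructuralFeatures(HTMLParser):
    """Count a few structural characteristics of a page: links and inline images."""

    def __init__(self, page_url):
        super().__init__()
        self.host = urlparse(page_url).netloc
        self.counts = {"internal_links": 0, "external_links": 0, "images": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            target_host = urlparse(attrs["href"]).netloc
            # A link without a host, or with the page's own host, is internal.
            is_external = bool(target_host) and target_host != self.host
            key = "external_links" if is_external else "internal_links"
            self.counts[key] += 1
        elif tag == "img":
            self.counts["images"] += 1


def classify_page_type(page_url, html):
    """Apply illustrative rules (assumed, not taken from the source) to the counts."""
    parser = StructuralFeatures(page_url)
    parser.feed(html)
    counts = parser.counts
    total_links = counts["internal_links"] + counts["external_links"]
    path = urlparse(page_url).path.lower()
    if "/product" in path and counts["images"] >= 1:
        return "product"
    if total_links >= 5 and counts["images"] == 0:
        return "index"
    return "other"
```

In a machine-learning variant, the same counts could serve as input features instead of being tested by hand-written rules.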
Link-based approach: Since the Web is a hyperlinked collection, its link structure, in addition to the textual content of the individual pages, contains information that can and should be utilized for web page filtering. Based on an assumed “random surfer” model of a web user's browsing behavior, a link-based method has been proposed for ranking web page importance. It uses the link structure of the Web to calculate a quality ranking for each web page, called the PageRank score, which is computed by weighting each in-link to a page in proportion to the quality of the page containing the in-link. Since the ranking score of a web page is determined solely by the page's location in the Web's graph structure (information external to the web page), it is query independent and can be computed ahead of query time. Finally, the rank values from the content-based and link-based methods are combined to determine the final score, which measures the relevance of the web page to the subject.
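The PageRank computation can be sketched as a simple power iteration. The damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without out-links, and the linear combination in `combined_score` are assumptions made for this illustration; the source does not specify how the two rank values are combined.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over `links`, a dict of page -> list of out-links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly (assumption).
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n
                    for page in pages}
        for page, targets in links.items():
            if targets:
                # Each in-link is weighted by the rank of the page containing it.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def combined_score(content_score, link_score, weight=0.5):
    """Hypothetical linear combination of a content-based and a link-based score."""
    return weight * content_score + (1 - weight) * link_score
```

Since `pagerank` takes no query as input, its result can be computed once for the whole graph and reused for every query.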
Extended-anchor-based approach: When exploiting the hyperlink structure of the Web for web page filtering, the text appearing on a link, i.e., the anchor text, can also be utilized for web page ranking. Anchor text can be associated not only with the page that the link is on but also with the page that the link points to. In the second case especially, anchor text often provides a more accurate description of a web page than the page itself; it also helps in searching non-text information, such as images, programs, and databases, and expands the search coverage with fewer downloaded documents. Based on the above considerations, an extended-anchor-based approach for web page filtering has been proposed. First, all the anchor text that appears in web pages and navigates a web browser from the top home page to each target web page is collected to build the extended anchor list. Then, the keywords appearing in the extended anchor list are employed for filtering the target web page.
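The two steps above can be sketched as follows. The sketch assumes that the navigation path is the shortest click path from the home page, found by breadth-first search, and that a page passes the filter when every keyword occurs in the collected anchor text; both choices, and all names, are hypothetical.

```python
from collections import deque


def extended_anchor_list(links, home, target):
    """Collect the anchor texts along a shortest click path from `home` to `target`.

    `links` maps a source URL to a list of (anchor_text, target_url) pairs.
    Returns None when the target cannot be reached from the home page.
    """
    queue = deque([(home, [])])
    seen = {home}
    while queue:
        page, anchors = queue.popleft()
        if page == target:
            return anchors
        for anchor_text, next_page in links.get(page, []):
            if next_page not in seen:
                seen.add(next_page)
                queue.append((next_page, anchors + [anchor_text]))
    return None


def matches_keywords(anchors, keywords):
    """True when every keyword occurs as a word in the extended anchor list."""
    words = {word.lower() for text in anchors for word in text.split()}
    return all(keyword.lower() in words for keyword in keywords)
```

For a site whose home page links to a "Products" page that in turn links to a "Laptop Computers" page, the extended anchor list of the last page would contain both anchor texts, so a filter on "laptop" can match the page without downloading it.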
However, the existing web page filtering solutions have disadvantages. First, the information retrieval models adopted by the content-based, PageType-based, and link-based approaches treat each web page as an independent document, i.e., they perform single-page-based indexing and ranking, which means that a returned page must include all the keywords in a query. They ignore the fact that the internal content of a web page is often not self-contained. Since such solutions index web pages solely on the basis of their internal content, the filtering results generated from such limited content cannot be of satisfactory quality.
Typically, during a user's Web navigation, the contextual information of a specific web page (e.g., its domain, its directory, and the navigational hyperlinks from other pages to it) is also in the user's mind and provides an important indication of the content of the web page. However, in the prior art, this contextual information has not been sufficiently utilized.
The content-based approach handles the Web as a traditional document repository; the special characteristics of the Web and of web pages, such as the contextual information, are not exploited for web page filtering. The textual content of a web page alone is insufficient for highly accurate web page filtering.
In the PageType-based approach, although some structural characteristics of a web page are utilized for web page filtering, the hyperlink information of the Web is not considered. Since the link structure of a collection of hyperlinks reflects humans' implicit recommendations of the targeted web page, it should contribute to improving the quality of the web page filtering results.
The hyperlink information of the Web is utilized in the link-based and extended-anchor-based approaches, but it is not exploited to its full potential. In the link-based approach, the surfer is assumed to click on links at random, but real clicks may not be random: users also rely on anchor text to guide their browsing. Therefore, besides the number of in-links and their weighting, the anchor text appearing along the navigational path also provides an important indication of the destination web page. In the extended-anchor-based approach, however, only the anchor text information is considered for web page filtering; the text in the page title, the URL text, and even the domain or host also provide important indications of the content of the web page, but are not used.