The Internet has become indispensable in many aspects of our lives. Its importance lies in providing users access to an enormous amount of data related to almost any conceivable subject. The sheer volume of that data, however, has its own disadvantages, a major one being the difficulty of finding relevant information within it. Generally, tools like search engines assist users in locating information on the Internet. However, search engines generally return a large number of web pages in response to keywords provided by the user.
To find web pages that may be relevant to a given keyword, various techniques have been developed. One such technique is web page classification. Web page classification attempts to classify web pages available on the World Wide Web (the web) under appropriate categories. Conventionally, web page classification may be performed manually or automatically, for example by search engines. Once such classification is done, the web pages may be identified through their assigned categories. However, manual web page classification may be a difficult and time-consuming technique in cases where the volume of data available on the web is enormous.
Further, web pages are heterogeneous in nature, which makes the classification complex. For example, web pages may be unstructured documents like text documents, semi-structured documents like HyperText Markup Language (HTML) files, or fully structured documents like Extensible Markup Language (XML) files. Web pages may also contain files of various formats, such as image files, audio files, and video files. Thus, the distinct varieties of web pages may pose a challenge in web page classification.