The present invention relates to the classification of content resources, e.g., web pages or other documents, and, more specifically, to techniques which employ structured patterns embedded in or associated with the content resources to facilitate classification.
Current approaches to content classification focus on content analysis using natural language techniques, and/or on the analysis of meta-data (data about data), in which the meta-data associated with documents is used for the classification. The first type of approach requires a semantic analysis of the content. Because of the processing resources required for such analysis and/or the typical size of the corpus, such approaches are not scalable and are thus not suitable for the large volumes of information found in contexts such as the World Wide Web. The second type of approach is prone to data sparsity. Because the meta-data associated with many large corpora is minimal and sparse, it is generally only possible to classify small portions of a corpus using meta-data alone.