The World Wide Web contains millions of documents that provide a large pool of information covering many subjects. Web documents are accessed via Uniform Resource Locators (URLs). Most web documents are written in Hypertext Markup Language (HTML) and can contain text, links to images, or links to other web documents. Users access the World Wide Web with a browser, either by opening a specific web document URL or by using a search engine to query and find web documents with the information they are looking for. A search engine crawls the World Wide Web on a frequent basis to parse, analyze, index, and store the content of web documents, along with metadata or attributes that describe them, such as their creation dates, languages, and authors.
Many search engines provide multiple criteria that can be used to limit the search results and better zoom in on what the user is looking for. For example, a user can enter a query of one or more key words to search previously indexed web documents, where key words can be separated by Boolean operators (e.g. and, or, not). A user can also limit the search to a language, a document file format, or a document creation or crawl date. In addition, users can find which web documents link to other web documents and can limit the search to sections of the web document such as the document title or document URL. As the web grows, so does the chance that more search results are returned to the user, who then has to sift through them to find the closest match. Furthermore, users of different educational levels and ages could be searching for the same information for different purposes and want the search results to match their interests.
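The kind of Boolean key word query with a language limit described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not how any particular search engine works: the Doc record, the build_index and search helpers, and the example.com URLs are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical indexed web document (illustration only)."""
    url: str
    language: str
    text: str


def build_index(docs):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = {}
    for doc in docs:
        for word in set(doc.text.lower().split()):
            index.setdefault(word, set()).add(doc.url)
    return index


def search(docs, index, all_of=(), any_of=(), none_of=(), language=None):
    """Combine key words with AND / OR / NOT, then apply a language limit."""
    urls = {doc.url for doc in docs}
    for word in all_of:                      # AND: every word must appear
        urls &= index.get(word.lower(), set())
    if any_of:                               # OR: at least one word must appear
        matched = set()
        for word in any_of:
            matched |= index.get(word.lower(), set())
        urls &= matched
    for word in none_of:                     # NOT: none of the words may appear
        urls -= index.get(word.lower(), set())
    if language is not None:                 # limit the search to one language
        urls &= {doc.url for doc in docs if doc.language == language}
    return sorted(urls)


DOCS = [
    Doc("http://example.com/a", "en", "solar energy panels for homes"),
    Doc("http://example.com/b", "en", "wind energy turbines"),
    Doc("http://example.com/c", "fr", "energy solar production"),
]
INDEX = build_index(DOCS)

# "energy AND NOT wind", limited to English documents
print(search(DOCS, INDEX, all_of=["energy"], none_of=["wind"], language="en"))
```

Even with these limits, every document that matches is returned regardless of how it is written, which is the gap the remainder of this background is concerned with.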
Document classification or categorization is an information science concept that tries to assign one or more classifications to documents based on their content. This classification can be performed manually, or automatically with little or no user intervention, depending on the classification method. A classification can be hierarchical, where a document belongs to a branch in a hierarchical tree of categories, or faceted, where a document belongs to one or more defined classifications.
An ontology is a general knowledge representation of concepts and the relationships between them within a domain. In the field of document classification specifically, many methods, standards, and commercial systems exist to add categorization, contextual, and classification information to documents for the purpose of search. One published ontology is the Simple Dublin Core Metadata Element Set, a resource description standard used to find documents. The standard consists of 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.
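As a rough illustration of how such publisher-supplied metadata can be used to find documents, the sketch below stores records keyed by the fifteen Dublin Core element names and matches them against simple criteria. The make_record and find helpers and the sample values are hypothetical and are not part of the standard.

```python
# The fifteen element names of the Simple Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def make_record(**fields):
    """Build a metadata record, rejecting names outside the element set."""
    unknown = set(fields) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


def find(records, **criteria):
    """Return the records whose elements equal every given criterion."""
    return [
        record for record in records
        if all(record.get(name) == value for name, value in criteria.items())
    ]


RECORDS = [
    make_record(Title="Solar Basics", Language="en", Format="text/html"),
    make_record(Title="Energie solaire", Language="fr", Format="text/html"),
    make_record(Title="Solar Data Sheet", Language="en", Format="application/pdf"),
]

print([r["Title"] for r in find(RECORDS, Language="en", Format="text/html")])
```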
The Semantic Web is a Web 3.0 concept that transforms web documents into data. The W3C introduced the Resource Description Framework (RDF) language to represent information about resources in a structured, readable model. The Semantic Web can be used by search engines to better discover resources and categorize web site data instead of inferring it. There are emerging hyper-data web browsers that use RDF to semantically represent web site data.
In general, document classification can either be provided by the document publisher, as is the case for the Semantic Web RDF language and popular domain ontologies, or be discovered using inference rules based on a predefined knowledge base and user experiences, or using semantic modeling based on natural language processing.
A search of prior art did not disclose any patents that read directly on the claims of this invention. Some existing U.S. patents were considered to be related to the subject of using context search in addition to key word search to limit search results:
U.S. Pat. No.    INVENTOR                  ISSUED
7,283,998        Moon; Charles             Oct. 16, 2007
7,167,871        Farahat; Ayman O.         Jan. 23, 2007
7,194,471        Nagatsuka; Tetsuro        Mar. 20, 2007
7,158,983        Willse; Alan R.           Jan. 2, 2007
7,305,415        Vernau; Judi              Dec. 4, 2007
7,296,016        Farach-Colton; Martin     Nov. 13, 2007
7,305,380        Hoelzle; Urs              Dec. 4, 2007
7,152,064        Bourdoncle; Francois      Dec. 19, 2006
7,024,408        Dehlinger; Peter J.       Apr. 4, 2006
7,016,895        Dehlinger; Peter J.       Mar. 21, 2006
6,947,930        Anick; Peter G.           Sep. 20, 2005
6,859,797        Skopicki; Jakob           Feb. 22, 2005
6,778,986        Stern; Jonathan           Aug. 17, 2004
6,751,600        Wolin; Ben                Jun. 15, 2004
6,701,305        Holt; Fredrick Baden      Mar. 2, 2004
6,697,800        Jannink; Jan F.           Feb. 24, 2004
6,691,108        Li; Wen-Syan              Feb. 10, 2004
6,675,159        Lin; Albert Deirchow      Jan. 6, 2004
6,233,575        Agrawal; Rakesh           May 15, 2001
6,189,002        Roitblat; Herbert L.      Feb. 13, 2001
6,490,579        Gao; Yong                 Dec. 3, 2002
6,633,868        Min; Shermann Loyall      Oct. 14, 2003
6,505,195        Ikeda; Takahiro           Jan. 7, 2003
5,619,709        Caid; William R.          Apr. 8, 1997
7,257,530        Yin; Hongfeng             Aug. 14, 2007
5,424,947        Nagao; Katashi            Jun. 13, 1995
6,327,593        Goiffon; David A.         Dec. 4, 2001
5,761,631        Nasukawa; Tetsuya         Jun. 2, 1998
5,963,940        Liddy; Elizabeth D.       Oct. 5, 1999
2003/0101166     Uchino, Kanji             May 29, 2003
In U.S. Pat. No. 7,167,871, an authoritative document ranking method and system that can be used to re-rank search results was described. A subset of document content features (e.g. words with learned prefixes, words with learned suffixes, words in certain grammatical locations, HTML features) is extracted using one or more metric regression or boosted decision tree algorithms and provided to a trained document textual authority model. The resulting textual authoritativeness value indicates how reliable the document is on its subject. Our patent is different in these ways:
    (1) It is not an authoritative ranking method like the page ranking methods used by search engines, such as “Page-Rank”. Our patent is explicitly a classification method based on a document's written style and structure, similar to classifying a document by its language, domain type, or copyright content.
    (2) The document content features used by our method (complexity of document words, subjectivity of sentences, descriptive images) are not subject-specific, are not extracted by any regression process, and are not input to any model for further calculation. U.S. Pat. No. 7,167,871 describes a subject-related ranking method.
    (3) The classification metrics described by our method and system (complexity count/ratio, subjectivity count/ratio, descriptive-images count) can be used by users in more than one way to filter, sort, and tag search results, and can be combined with each other to describe a general pre-determined context of documents. For example, a “Plain and Simple” classification can mean documents with no images and low complexity.
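To make point (3) concrete, the sketch below shows one way precomputed classification metrics might be combined to filter and sort search results under a “Plain and Simple” context. It is a minimal sketch under assumptions, not the claimed implementation: the DocMetrics record, the helper names, the 0.10 complexity threshold, and the sample values are all hypothetical, and how the metrics themselves are computed is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocMetrics:
    """Hypothetical per-document classification metrics (illustration only)."""
    url: str
    complexity_ratio: float       # assumed: share of words counted as complex
    subjectivity_ratio: float     # assumed: share of sentences counted as subjective
    descriptive_image_count: int  # number of descriptive images in the document


def is_plain_and_simple(doc, max_complexity=0.10):
    """'Plain and Simple': no images and low complexity (threshold is assumed)."""
    return doc.descriptive_image_count == 0 and doc.complexity_ratio <= max_complexity


def filter_and_sort(results, predicate, key):
    """Keep the results that satisfy the predicate, ordered by the given metric."""
    return sorted((doc for doc in results if predicate(doc)), key=key)


RESULTS = [
    DocMetrics("http://example.com/primer", 0.05, 0.10, 0),
    DocMetrics("http://example.com/survey", 0.30, 0.05, 0),
    DocMetrics("http://example.com/gallery", 0.04, 0.40, 6),
    DocMetrics("http://example.com/notes", 0.08, 0.02, 0),
]

# Filter by the combined context, then sort by subjectivity (least subjective first).
plain = filter_and_sort(RESULTS, is_plain_and_simple, key=lambda d: d.subjectivity_ratio)
print([doc.url for doc in plain])
```

Because each metric is stored as a plain value per document, the same records can be filtered, sorted, or tagged by any single metric or by a combination of them without retraining or recomputing anything.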
In U.S. Pat. No. 6,778,986, a method was patented to determine a web site type by examining web site features such as external/internal links, site tree and distribution of multimedia elements where a Bayesian network is trained to use the combined test results and determine the subject web site type.
In U.S. Pat. No. 7,194,471, a document classification method for the purpose of comparing documents was described where an operator designates a classification by selecting appropriate items contained in the document, and the resulting feature vector is used to measure the similarity between classified documents.
In U.S. Pat. No. 7,296,016, a method was described for providing search results relating to a point-of-view (POV), where the POV may be defined from Uniform Resource Locators (URLs) or from key words of documents previously selected by the user, and where the POV URLs can be either on-topic or off-topic. The POV is used to filter and limit search results.
In U.S. Pat. No. 7,305,380, a method was described for limiting search results using context information that can be extracted from user access patterns, from a favorites list created by a user, or by presenting a hierarchical directory of category listings to the user, where the context information is used to filter search results in or out.
In U.S. Pat. No. 6,189,002, a process was described to transform the search query of a user into a semantic profile that would be compared with the semantic profiles of a cluster of previously processed documents and the documents with the closest weighted match are returned.
In U.S. Pat. No. 6,490,579, a search method was detailed where context fields, each being a subject area, information type, or problem type, are used to narrow the search to information resources matching at least one of the context fields.
In U.S. Pat. No. 6,633,868, a method was described where a word relationship matrix is constructed for each document from word frequencies, counts and proximities and a search matrix of the query vector is compared with the document matrix to produce a document rank.
In U.S. Pat. No. 5,619,709, a method was detailed for extracting and comparing document contexts by constructing context vectors of document key words based on the proximity between words, where the search query is converted into a context vector that is compared to stored context vectors.
In U.S. Pat. No. 7,257,530, a text mining method was described where sentences and phrases are extracted from document text and related phrases are used to construct a knowledge base. A user is presented with a knowledge base related to the search query and can use the knowledge base to refine search engine results.
In U.S. Pat. No. 5,424,947, a system for natural language analysis was described to analyze structural ambiguity of sentences and find dependencies between words using a background knowledge base where the system can be used in a question and answer system.
In U.S. Pat. No. 6,327,593, a method was provided to allow users to interactively modify the search index when performing concept-based searches by modifying the concepts hierarchy and associations.
In U.S. Pat. No. 5,761,631, a method was described to improve the accuracy of natural language processing by using dependency structures of well-formed sentences to analyze ill-formed grammatical sentences.