1. Field of Invention
The invention relates to a system and method for semantically classifying data and utilizing the semantically classified data. More specifically, the invention relates to utilization of a WorldModel to semantically classify data.
2. Description of Related Art
The Internet is a global network of connected computer networks. Over the last several years, the Internet has grown in significant measure. A large number of computers on the Internet provide information in various forms. Anyone with a computer connected to the Internet can potentially tap into this vast pool of information.
The most widespread method of providing information over the Internet is via the World Wide Web (the Web). The Web consists of a subset of the computers connected to the Internet; the computers in this subset run Hypertext Transfer Protocol (HTTP) servers (Web servers). The information available via the Internet also encompasses information available via other types of information servers such as GOPHER and FTP.
Information on the Internet can be accessed through the use of a Uniform Resource Locator (URL). A URL uniquely specifies the location of a particular piece of information on the Internet. A URL will typically be composed of several components. The first component typically designates the protocol by which the addressed piece of information is accessed (e.g., HTTP, GOPHER, etc.). This first component is separated from the remainder of the URL by a colon (`:`). The remainder of the URL will depend upon the protocol component. Typically, the remainder designates a computer on the Internet by name, or by IP number, as well as a more specific designation of the location of the resource on the designated computer. For instance, a typical URL for an HTTP resource might be: http://www.server.com/dir1/dir2/resource.htm
where http is the protocol, www.server.com is the designated computer and /dir1/dir2/resource.htm designates the location of the resource on the designated computer.
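The decomposition of a URL described above can be illustrated with a minimal sketch using Python's standard urllib.parse module. The function name describe_url and the dictionary keys are illustrative assumptions, not part of the original disclosure.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three components discussed in the text.

    The key names ("protocol", "computer", "location") are hypothetical
    labels chosen to mirror the wording of the description.
    """
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # component before the colon
        "computer": parts.hostname,  # designated computer, by name or IP number
        "location": parts.path,      # location of the resource on that computer
    }


info = describe_url("http://www.server.com/dir1/dir2/resource.htm")
for name, value in info.items():
    print(f"{name}: {value}")
```

Here the protocol component is everything before the first colon, and the remainder is interpreted according to that protocol, as the text describes for HTTP.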
Web servers host information in the form of Web pages; collectively the server and the information hosted are referred to as a Web site. A significant number of Web pages are encoded using the Hypertext Markup Language (HTML), although other encodings using the eXtensible Markup Language (XML) or the Standard Generalized Markup Language (SGML) are becoming increasingly common. The published specifications for these languages are incorporated by reference herein. Web pages in these formatting languages may include links to other Web pages on the same Web site or another. As known to those skilled in the art, Web pages may be generated dynamically by a server by integrating a variety of elements into a formatted page prior to transmission to a Web client. Web servers and information servers of other types await requests for information from Internet clients.
Client software has evolved that allows users of computers connected to the Internet to access this information. Advanced clients such as Netscape's Navigator and Microsoft's Internet Explorer allow users to access information provided via a variety of information servers in a unified client environment. Typically, such client software is referred to as browser software.
The Web has been organized using syntactic and structural methods and apparatus. Consequently, most major applications such as search, personalization, advertisements, and e-commerce, utilize syntactic and structural methods and apparatus. Directory services, such as those offered by Yahoo! and Looksmart, offer a limited form of semantics by organizing content by category or subjects, but the use of context and domain semantics is minimal. When semantics is applied, critical work is done by humans (also termed editors or catalogers), and very limited, if any, domain specific information is captured.
Current search engines rely on syntactic and structural methods. The use of keywords and corresponding search techniques that utilize indices and textual information without associated context or semantic information is an example of such a syntactic method. Keyword-based search using these syntactic methods is the most common way of searching today. Unfortunately, most search engines produce up to hundreds of thousands of results, and most of them bear little resemblance to what the user was originally looking for, mainly because the search context is not specified and ambiguities are hard to resolve, as discussed in Jimmy Guterman, "The Endless Search," The Industry Standard, Dec. 20, 1999, http://www.thestandard.com/article/display/0,1151,8340,00.html. One way of enhancing a search request is to use Boolean and other operators like "+/-" (word must/must not appear) or "NEAR", whereby the number of resulting pages can be drastically cut down. However, the results still may bear little resemblance to what the user is looking for.
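The "+/-" operators mentioned above can be illustrated with a minimal sketch. The documents, the identifiers, and the function names matches and search are made up for illustration; the sketch handles only the "+" and "-" prefixes and is not the implementation of any particular search engine.

```python
# Hypothetical miniature document collection, keyed by an identifier.
DOCUMENTS = {
    "recipe": "roast turkey recipe for thanksgiving dinner",
    "travel": "travel guide to turkey and istanbul",
    "stuffing": "thanksgiving stuffing recipe",
}


def matches(query, text):
    """Return True if text satisfies every +word / -word term in query."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("+") and term[1:] not in words:
            return False  # a required word is missing
        if term.startswith("-") and term[1:] in words:
            return False  # an excluded word is present
    return True


def search(query):
    """Return the identifiers of all documents matching the query."""
    return [doc_id for doc_id, text in DOCUMENTS.items() if matches(query, text)]


print(search("+turkey"))
print(search("+turkey -istanbul"))
```

The first query returns both the recipe and the travel guide, because a purely syntactic match cannot tell the two senses of the word apart. The second query narrows the result only because the user happened to know a word that co-occurs with the unwanted sense.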
Searches provided by companies like Snap.com and AltaVista currently allow users to query for non-textual assets including video or audio files. Searches of this kind are usually formed by specifying a number of keywords and, in some cases, a desired media type. Even if the results are restricted to be of a certain media type, those keywords are not put into a semantic context, and the consequence is poor precision of the results.
Most search engines and Web directories offer advanced searching techniques to reduce the number of results (recall) and improve the quality of the results (precision). Some search methods utilize structural information, including the location of a word or text within a document or site, the number of times users choose to view a specific result associated with a word, the number of links to a page or a site, and whether the text can be associated with a tag or attributes (such as title, media type, time) that are independent of subject matter or domain. In the few cases where domain-specific attributes are supported (as in the genre of music), the search is limited to one domain or one site (e.g., Amazon.com, CDNow.com). It may also be limited to one purpose, such as product price comparison. Also, the same set of attributes is provided for search across all assets (rather than domain-specific attributes for a certain collection of assets, context, or domain).
Grouping search results by Web site, as some search engines like Excite offer, can make it easier to browse through the often vast number of results. NorthernLight takes the idea of organizing the Web one step further by providing a way of organizing search results into so-called "buckets" of related information (such as "Thanksgiving", "Middle East & Turkey", and so on). Neither approach improves the search quality per se, but both facilitate navigation through the search results.
To further aid users in getting to the information they are looking for, some search engines provide "premium content" editorially collected and organized into directories that help put the search in the right context and resolve ambiguities. For example, when searching for "turkey" on Excite.com, the first results include links to premium content information on both turkey the poultry and Turkey the European country. Yahoo is a Web directory that lets the user browse its taxonomy and search only within certain domains to cut down on the number of results and improve precision.
Directory services support browsing and a combination of browsing with a limited set of attributes for the content managed or aggregated by the site. When domain information is captured, a host of people (over 1200 at one company providing directory services and over 350 at another) classify new and old Web pages to ensure the quality of those domain search results. This is an extremely human-intensive process. The human catalogers or editors use hundreds of classification or keyword terms that are mostly proprietary to that company. Considering the size and growth rate of the World Wide Web, it seems almost impossible to index a "reasonable" percentage of the available information by hand. NorthernLight uses a mostly automated classification apparatus that classifies newly found content based on comparison with more than 2000 subject terms.
Several Web sites have classified their assets into domains and attributes. Amazon.com visitors, for instance, can search classical music CDs by composer, conductor, performer, etc. Customers looking for videos can search mgm.com by title, director, cast, or year. Video indexing machines like Excalibur allow a company to segment its video assets and to enter and search by an arbitrary number of user-specified attributes. Unfortunately, this powerful search is restricted to one particular Web site only. No large-scale attribute search for all kinds of documents has been available for the whole Internet. While Web crawlers can reach and scan documents in the farthest locations, the classification of structurally very different documents has been the main obstacle to building a metabase that allows the desired comprehensive attribute search against heterogeneous data.
The search engine NorthernLight automatically extracts some content-descriptive and content-independent metadata (subject, type, source, and language) and maintains an extensive hierarchy of domains, but fails to further identify and extract domain-specific attributes such as "composer" or "cast".
The context of a search request is necessary to resolve ambiguities in the search terms that the user enters. For instance, a digital media search for "windows instructions" in the context of "computer technology" should find audio/video files about how to use windowing operating systems in general or Microsoft Windows in particular. However, the same search in the context of "home and garden" is expected to lead to instructional videos about how to mount windows in your own home.
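The role of context in the "windows instructions" example can be illustrated with a minimal sketch. The asset records, their titles, and the function name search_in_context are hypothetical and chosen only to mirror the example in the text; they do not represent the method of the invention.

```python
# Hypothetical media assets, each tagged with keywords and a context label.
ASSETS = [
    {
        "title": "Getting started with Microsoft Windows",
        "context": "computer technology",
        "keywords": {"windows", "instructions"},
    },
    {
        "title": "How to mount windows in your home",
        "context": "home and garden",
        "keywords": {"windows", "instructions"},
    },
]


def search_in_context(query, context=None):
    """Return titles whose keywords cover the query, optionally within a context."""
    terms = set(query.lower().split())
    hits = [asset for asset in ASSETS if terms <= asset["keywords"]]
    if context is not None:
        # The context label removes the ambiguity the keywords cannot.
        hits = [asset for asset in hits if asset["context"] == context]
    return [asset["title"] for asset in hits]


print(search_in_context("windows instructions"))
print(search_in_context("windows instructions", "home and garden"))
```

Without a context both assets match the same keywords; supplying the context selects the single asset the user intended.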
Due to the unstructured and heterogeneous nature of Web resources, every Web site uses a different terminology to describe similar things. A semantic mapping of terms is then necessary to ensure that the system serves documents within the same context in which the user searched. The Context Interchange Network (COIN), developed at MIT, presents a system that translates requests into different contexts as required by a search against disparate data sources. Its support for semantics is very limited, primarily dealing with unit differences and functions for mapping values. No domain modeling is supported. A better approach, and one achieved by the present invention, is to determine the context of digital media before metadata is inserted into the metabase, so that differences in terminology (like "cast" versus "starring") are dealt with at the source.
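Resolving terminology differences at the source, before metadata is stored, can be illustrated with a minimal sketch. Only the "cast" versus "starring" pair comes from the text; the other mapping entry, the sample records, and the function name normalize are assumptions made for illustration and are not the mechanism of the invention.

```python
# Hypothetical mapping from site-specific attribute names to canonical ones.
# Only "starring" -> "cast" is taken from the text; "directed by" is assumed.
CANONICAL_NAMES = {
    "starring": "cast",
    "directed by": "director",
}


def normalize(record):
    """Rename site-specific attribute names to canonical names.

    Names without a mapping are kept, lower-cased and stripped of
    surrounding whitespace.
    """
    normalized = {}
    for name, value in record.items():
        key = name.strip().lower()
        normalized[CANONICAL_NAMES.get(key, key)] = value
    return normalized


site_a = normalize({"Starring": "Actor A", "Year": 1999})
site_b = normalize({"cast": "Actor B", "Directed by": "Director C"})
print(site_a)
print(site_b)
```

After normalization, a single attribute query on "cast" can be answered from records that originated on sites using either term.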
Current manual or automated content acquisition may use metatags that are part of an HTML page, but these are proprietary and have no contextual meaning for general search applications. A newly proposed, but not ratified or adopted, Web standard mechanism called DAML (Defense Advanced Research Projects Agency Agent Markup Language) would be easily understandable to DAML-enabled user agents and programs. However, this would require widespread adoption of this possible future standard, and appropriate use of DAML by page and site creators, before appropriate agents can be written. Even then, existing content cannot be indexed, cataloged, or extracted to make it a part of what is called a "Semantic Web".
The concept of a Semantic Web is an important step forward in supporting higher precision, relevance and timeliness in using Web-accessible content. Some current uses of this term do not reflect the various components that support broad and important aspects of semantics, including context, domain modeling, and knowledge, and focus primarily on terminological and ontological components, as further described in R. Hellman, "A Semantic Approach Adds Meaning to the Web", Computer (IEEE Computer Society), December 1999, pp. 13-16.
Research in heterogeneous database management and information systems has addressed the issues of syntax, structure and semantics, and has developed techniques to integrate data from multiple databases and data sources. Large-scale operation and the associated automation have, however, not been achieved in the past. One key issue in supporting semantics is that of understanding and modeling context.
Currently, syntax and structure-based methods pervade the entire Web--both in its creation and the applications realized over it. The challenge has been to include semantics in creating physical or virtual organizations of the Web and its applications--all without imposing new standards and protocols as required by current proposals for the Semantic Web. These advantages and others are realized by the present invention.