The invention relates to computer database systems and more specifically to distributed computer database systems.
The World Wide Web (WWW) is much more than just a collection of Web pages. Each page contains references to other pages. Such references are called links, and one of the most important features of a Web browser is the ability to follow a link and display the page that is being referenced. A collection of documents linked together in this way is called a hypertext.
The link structure of a hypertext is a rich source of knowledge about the content of the hypertext. In the field of bibliometrics, links in the form of citations have been used for understanding documents by using citation analysis techniques. The link structure of the WWW is now being exploited as a means of categorization and knowledge extraction. This is being done in two ways:
1. General hypertext query languages.
2. Cluster analysis algorithms.
A Web query language, such as WebSQL, is a query language for extracting information from the Web, based on hypertext structure as well as content. For example, one might be interested in a job opportunity for a librarian. One can query the Web using WebSQL to find all pages containing the keywords "employment" or "job opportunities" and then list all the pages referenced by such a page and containing the keyword "librarian."
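The librarian example can be sketched in ordinary code over a toy hypertext. This is not WebSQL syntax. It is only an illustration of the two-step query, and the `pages` dictionary, its page names and the function `find_referenced` are hypothetical.

```python
# Toy hypertext (hypothetical data): page -> text and outgoing links.
pages = {
    "p1": {"text": "employment listings for the city", "links": ["p3", "p4"]},
    "p2": {"text": "job opportunities at the museum", "links": ["p4", "p5"]},
    "p3": {"text": "librarian wanted, apply by mail", "links": []},
    "p4": {"text": "cafeteria menu", "links": []},
    "p5": {"text": "senior librarian position", "links": []},
    "p6": {"text": "librarian biography", "links": []},
}


def find_referenced(pages, start_keywords, target_keyword):
    """Return pages that contain target_keyword and are referenced by a
    page containing at least one of start_keywords."""
    starts = [
        name for name, page in pages.items()
        if any(keyword in page["text"] for keyword in start_keywords)
    ]
    hits = set()
    for name in starts:
        for target in pages[name]["links"]:  # follow outgoing links only
            if target_keyword in pages[target]["text"]:
                hits.add(target)
    return sorted(hits)


matches = find_referenced(pages, ["employment", "job opportunities"], "librarian")
```

In this toy data, page `p6` mentions "librarian" but is not referenced by any employment page, so it is not returned.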
Cluster analysis algorithms make use of Web query languages to find specific patterns in the link structure of the WWW. The most common cluster analysis pattern is the authority/hub pattern. To compute this pattern, one first specifies a topic area using one or more keywords. For example, one might be interested in the topic "knowledge management". A page is potentially relevant if it contains one or more keywords of the topic. An authority page for a topic is a page that is referenced by a large number of pages potentially relevant to the topic. Note that an authority page need not contain any of the keywords of the topic. Authority is conferred on it by virtue of being referenced frequently by potentially relevant pages. A hub page for a topic is one that references a large number of pages potentially relevant to the topic. An authority page for knowledge management is one that is highly referenced by pages that mention knowledge management. If one is interested in knowledge management, then it seems natural to look first at the authority pages.
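The definitions of authority and hub pages given above can be illustrated with simple counts over a toy link graph. This is a minimal sketch of those definitions only, not the iterative algorithms of the cited publications. The `text` and `links` dictionaries and all function names are hypothetical.

```python
# Toy hypertext (hypothetical data): page -> text, page -> outgoing links.
text = {
    "a": "notes on knowledge management",
    "b": "knowledge management tools",
    "c": "a survey of gardening",
    "d": "reference handbook",
    "e": "knowledge management reading list",
}
links = {
    "a": ["d", "b"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["a", "b", "d"],
}


def relevant_pages(keywords):
    """Pages that contain at least one keyword of the topic."""
    return {p for p, t in text.items() if any(k in t for k in keywords)}


def authority_scores(keywords):
    """For each page, count the potentially relevant pages that reference it."""
    relevant = relevant_pages(keywords)
    scores = {p: 0 for p in text}
    for source in relevant:
        for target in set(links[source]):
            scores[target] += 1
    return scores


def hub_scores(keywords):
    """For each page, count the potentially relevant pages it references."""
    relevant = relevant_pages(keywords)
    return {p: len(set(links[p]) & relevant) for p in text}


authorities = authority_scores(["knowledge management"])
hubs = hub_scores(["knowledge management"])
```

In this toy data, page `d` receives the highest authority count although it does not contain the keywords, and page `e` receives the highest hub count.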
Web query languages in general, and Web cluster analysis algorithms in particular, are limited in an important respect. They can only evaluate outgoing links, not incoming links. This is due to the way that Web links are defined. A link within one page specifies the page to which it links, not the other way around. For example, suppose that one was interested in all the pages that refer to one's own home page. WebSQL cannot answer such a query.
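The asymmetry can be seen in a small sketch. Outgoing links are stored with each page, so the incoming links of a page become available only after every page has been examined and the link map inverted. The `outgoing` data and the function `invert_links` are hypothetical.

```python
from collections import defaultdict


def invert_links(outgoing):
    """Build a map from each target page to the set of pages linking to it."""
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            incoming[target].add(source)
    return dict(incoming)


# Each page records only the pages it links to (hypothetical data).
outgoing = {
    "home": ["projects"],
    "blog": ["home", "projects"],
    "friend": ["home"],
}

incoming = invert_links(outgoing)
referrers_of_home = incoming["home"]
```

Answering "which pages refer to my home page" therefore requires an index built over the whole collection, which a query that only follows links forward from a starting page does not have.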
The WWW is not just a hypertext. Pages can contain images, sound and video streams, and the structure of the WWW is continually changing. For these reasons, the WWW is called a hypermedia environment. Web resources are located by a Uniform Resource Locator (URL) which uniquely identifies the resource. More generally, a hypermedia environment consists of information objects that are uniquely identified by an object identifier (OID) and that can contain links to other information objects. A hypermedia environment is also called an object database.
To assist in finding information in an object database, special search structures called indexes are employed. Large databases require correspondingly large index structures to maintain pointers to the stored data. Such an index structure can be larger than the database itself. Current technology requires a separate index for each attribute or feature. This technology can be extended to allow for indexing a small number of attributes or features in a single index structure, but it does not function well when there are hundreds or thousands of attributes. Furthermore, there is considerable overhead associated with maintaining an index structure. This limits the number of attributes or features that can be indexed. Current systems are unable to scale up to support databases for which there are: many object types; millions of features; queries that involve many object types and features simultaneously; and new object types and features being continually added.
Further information can be had regarding the foregoing concepts with reference to the following publications:
1. L. Aiello, J. Doyle, and S. Shapiro, editors. Proc. Fifth Intern. Conf. on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann Publishers, San Mateo, Calif., 1996.
2. G. Arocena, A. Mendelzon, and G. Mihaila. Applications of a web query language. In Proc. 6 Intern. World Wide Web Conf., 1997.
3. K. Baclawski. Distributed computer database system and method, December 1997. U.S. Pat. No. 5,694,593. Assigned to Northeastern University, Boston, Mass.
4. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proc. 7 Intern. World Wide Web Conf., 1998.
5. A. Del Bimbo, editor. The Ninth International Conference on Image Analysis and Processing, volume 1311. Springer, September 1997.
6. N. Fridman Noy. Knowledge Representation for Intelligent Information Retrieval in Experimental Sciences. PhD thesis, College of Computer Science, Northeastern University, Boston, Mass., 1997.
7. D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In Proc. 9 ACM Conf. on Hypertext and Hypermedia, 1998.
8. R. Jain. Content-centric computing in visual systems. In The Ninth International Conference on Image Analysis and Processing, Volume II, pages 1-13, September 1997.
9. J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Sympos. on Discrete Algorithms, 1998.
10. Y. Ohta. Knowledge-Based Interpretation of Outdoor Natural Color Scenes. Pitman, Boston, Mass., 1985.
11. P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the web. In CHI'96 Proceedings: Conference on Human Factors in Computing Systems: Common Ground, pages 118-125, Vancouver, BC, 1996.
12. E. Rivlin, R. Botafogo, and B. Shneiderman. Navigating in hyperspace: Designing a structure-based toolbox. Comm. of the ACM, 37(2):87-96, February 1994.
13. G. Salton. Automatic Text Processing. Addison-Wesley, Reading, Mass., 1989.
14. G. Salton, J. Allen, and C. Buckley. Automatic structuring and retrieval of large text files. Comm. ACM, 37(2):97-108, February 1994.
15. E. Spertus. ParaSite: Mining structural information on the web. In Proc. 6 Intern. World Wide Web Conf., 1997.
16. A. Tversky. Features of similarity. Psychological Review, 84(4):327-352, July 1977.
17. R. Weiss, B. Velez, M. Sheldon, C. Namprempre, P. Szilagyi, and D. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proc. Seventh ACM Conf. on Hypertext, pages 180-193, 1996.
18. H. White and K. McCain. Bibliometrics. Ann. Rev. Info. Sci. and Technology, pages 119-186, 1989.
The disclosures of the publications referenced in this "Background of the Invention" are incorporated herein by reference.
It would be desirable to provide an information retrieval system that can retrieve link and other information from a unified database of word and non-word based information, including documents, images and other forms of multimedia, using a single indexing system, and otherwise overcome many of the performance and other problems and limitations of current systems. Such information retrieval systems preferably would be highly scalable, versatile, robust and economical.
The present invention resides in an indexing and search engine for extraction of information based on the content of information objects in a database as well as links between information objects. Unlike Web query languages such as WebSQL, the present invention supports queries directed at retrieving information with respect to either outgoing or incoming links, or both. For example, the present invention can be implemented to determine all the pages that refer to one's own home page.
Hypertext query languages and algorithms that make use of them, such as cluster algorithms, depend on the retrieval of three kinds of information in an object database: (1) Retrieval of objects relevant to a query, as typically provided in conventional information retrieval systems; (2) Retrieval of link information relevant to a query; and (3) Retrieval of all link information for a specific object relevant to a query, including both incoming and outgoing links.
For providing such retrieval, the invention refines the types of queries that users can submit. Accordingly, a general query is composed of a number of elementary queries each corresponding to a kind of information retrieval:
index query
An elementary query for retrieval of information objects relevant to a query.
link query
An elementary query for retrieval of link information relevant to a query.
object query
An elementary query for retrieval of link information relevant to a query for a specific object, including both incoming and outgoing links.
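The three kinds of elementary query can be pictured as three record types, with a general query being a collection of them. The following is a sketch under assumptions: the class names, field names and the helper `target_node_kind` are hypothetical, and the choice of node for each query type follows the distributed embodiment described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class IndexQuery:
    """Retrieve information objects relevant to the given features."""
    features: Tuple[str, ...]


@dataclass(frozen=True)
class LinkQuery:
    """Retrieve link information relevant to the given features."""
    features: Tuple[str, ...]


@dataclass(frozen=True)
class ObjectQuery:
    """Retrieve link information for one specific object."""
    oid: str
    incoming: bool = True   # include objects that reference this object
    outgoing: bool = True   # include objects referenced by this object


ElementaryQuery = Union[IndexQuery, LinkQuery, ObjectQuery]


def target_node_kind(query: ElementaryQuery) -> str:
    """Name the kind of node assumed to serve each elementary query."""
    if isinstance(query, (IndexQuery, LinkQuery)):
        return "index node"
    if isinstance(query, ObjectQuery):
        return "object node"
    raise TypeError(f"not an elementary query: {query!r}")


# A general query composed of several elementary queries.
general_query: List[ElementaryQuery] = [
    IndexQuery(("employment",)),
    LinkQuery(("librarian",)),
    ObjectQuery("oid:42"),
]
kinds = [target_node_kind(q) for q in general_query]
```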
In a first aspect of the invention, a computerized information retrieval system can be provided for processing a query for word based and non-word based retrieval of information from a database, which has a first mechanism for parsing a query into a plurality of elementary queries each including one of an index query and a link query; a second mechanism for extracting a number of features from each of the elementary queries; a third mechanism for fragmenting each of the features into feature fragments; a fourth mechanism for hashing each of the feature fragments into hashed feature fragments; and a fifth mechanism for using each of the hashed feature fragments in accessing a corresponding hash table for obtaining an object identifier therefrom for use in obtaining information from the database relevant to the elementary queries, including information relevant to the index queries and link information relevant to the link queries.
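The extract, fragment and hash steps of this aspect can be sketched as follows. The text above does not fix the extraction or fragmentation rules or the hash function, so the choices here are assumptions made for illustration only: features are comma-separated phrases, fragments are the individual words of a feature, and the hash is the first 32 bits of SHA-256. All function names are hypothetical.

```python
import hashlib


def extract_features(elementary_query: str):
    """Assumed extraction rule: each comma-separated phrase is one feature."""
    return [part.strip().lower() for part in elementary_query.split(",") if part.strip()]


def fragment(feature: str):
    """Assumed fragmentation rule: one fragment per word of the feature."""
    return feature.split()


def hash_fragment(frag: str, bits: int = 32) -> int:
    """Hash a fragment to a fixed-width integer (assumed: truncated SHA-256)."""
    digest = hashlib.sha256(frag.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def hashed_fragments(elementary_query: str):
    """Run extraction, fragmentation and hashing in sequence."""
    return [
        hash_fragment(frag)
        for feature in extract_features(elementary_query)
        for frag in fragment(feature)
    ]


keys = hashed_fragments("knowledge management, librarian")
```

Each resulting integer would then serve as the key for accessing the corresponding hash table.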
In another aspect of the invention, an information indexing system can be provided for indexing information for facilitated retrieval from a database, which has a first mechanism for extracting a number of features from an information object, each of the features comprising one of an index feature and a link feature; a second mechanism for fragmenting each of the features into feature fragments; a third mechanism for hashing each of the feature fragments into hashed feature fragments; and a fourth mechanism for using each of the hashed feature fragments in accessing a corresponding hash table that identifies a location at which data is to be stored, the data including (i) an object identifier if the feature of the information object comprises an index feature, and (ii) an object identifier of an object referenced by a link specified by the link feature if the feature of the information object comprises a link feature.
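The indexing aspect can be sketched with two in-memory hash tables, one for index features and one for link features. The sketch rests on several assumptions that are not taken from the text: fragments are the words of a feature, the hash is truncated SHA-256, a link feature is the descriptive text attached to a link, and link entries are stored as (source OID, target OID) pairs. The OID strings and all names are hypothetical.

```python
import hashlib
from collections import defaultdict


def h(fragment: str) -> int:
    """Assumed hash function: first 32 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(fragment.encode("utf-8")).digest()[:4], "big")


index_table = defaultdict(set)  # hashed fragment -> set of OIDs
link_table = defaultdict(set)   # hashed fragment -> set of (source OID, target OID)


def index_object(oid, index_features, link_features):
    """Store an object's features.

    index_features: list of feature strings.
    link_features: list of (feature string, OID of the referenced object).
    """
    for feature in index_features:
        for frag in feature.lower().split():
            index_table[h(frag)].add(oid)
    for feature, target_oid in link_features:
        for frag in feature.lower().split():
            link_table[h(frag)].add((oid, target_oid))


index_object("oid:1", ["job opportunities"], [("librarian posting", "oid:2")])
index_object("oid:2", ["librarian"], [])

objects_with_librarian = index_table[h("librarian")]
links_with_librarian = link_table[h("librarian")]
```

Because each link entry records both ends of the link, the same table can be read to find what an object references or what references it.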
More specifically, a distributed computer database system implementing the invention includes one or more front end computers, one or more home nodes, one or more index nodes and one or more object nodes interconnected by a network into a search engine for retrieval of hypertext documents. A query from a user is transmitted to one of the front end computers, which forwards the query to one of the home nodes of the search engine. The home node parses the query into one or more elementary queries and schedules the elementary queries for processing. Each elementary query can be one of a number of types, including an index query, a link query or an object query. To process an index query or link query, the home node extracts features from the index query or link query, fragments the extracted features into feature fragments, and hashes these feature fragments. Each hashed feature fragment is transmitted to one index node on the network. Each index node on the network that receives a hashed feature fragment uses the hashed feature fragment of the index query or link query to perform a search on its respective partition of the database. The results of the searches of the local databases are gathered by the home node. To process an object query, the home node transmits the object identifier contained in the object query to the object node on the network containing the information associated with the object. The object node that receives the object query uses the object identifier to perform a search on its respective partition of the database. The results of the search of the local database are transmitted to the home node. The home node processes the results for each elementary query according to the specifications in the query. The processing may include evaluation of additional elementary queries. When all processing is completed by the home node, the results are returned to the front end node which formats the results for presentation to the user.
In another embodiment of the invention, a distributed computer database system includes one or more front end computers, one or more home nodes, one or more index nodes and one or more object nodes interconnected by a network. A single computer processor can fulfill the functionality of one or more front end, home, index and object nodes. The combination of computer nodes interconnected by a network operates as a search engine. A user wishing to query the database transmits the query to one of the front end nodes, which in turn forwards the query to one of the home nodes of the network. The node receiving the query, termed the home node of this query, parses the query into elementary queries. For an index or link query, the home node extracts the features of the received query, fragments the features into feature fragments and then encodes the feature fragments using a hash function. A portion of each hashed feature fragment is used by the home node as an addressing index by which the home node transmits the hashed index or link query feature fragment to an index node on the network specified by that fragment portion. For an object query, the home node uses a portion of the OID as an addressing index by which the home node transmits the object query to an object node on the network. Each index node on the network that receives a hashed index or link query feature fragment uses the hashed index or link query feature fragment to perform a search on its respective database. Index nodes finding data corresponding to a hashed index query feature fragment return the set of OIDs of the information objects possessing this feature. Index nodes finding data corresponding to a hashed link query feature fragment return the set of pairs of OIDs of the links between information objects which possess this feature fragment. Each object node on the network that receives an object query uses the OID contained in the object query to perform a search on its respective database.
The object node returns the information associated with the OID as specified in the object query. Such information may include any or all of the following: the location of the object whose OID is contained in the object query, the set of OIDs that represent objects referenced by the object whose OID is contained in the object query, the set of OIDs that represent objects that reference the object whose OID is contained in the object query, and other auxiliary information associated with the object whose OID is contained in the object query. The OIDs or pairs of OIDs are then gathered by the home node. For an index or link query, a similarity function is computed based on the features that are in common with the index or link query. The similarity function is used to rank the objects or links between objects. The objects or links between objects that have the largest similarity value are used in subsequent processing of the query. For an object query, the information returned by the object node is used in subsequent processing of the query. The subsequent processing of the query by the home node may involve the construction of new elementary queries using the information returned from earlier elementary queries. Processing continues until no additional elementary queries are needed. The home node then performs any remaining processing required by the query, and the results are transmitted to the front end node. The front end node formats the response to the user based on the OIDs and any other information transmitted by the home node. For example, if the front end node is a World Wide Web server, then the front end node constructs a page in HTML format containing a reference to a URL and auxiliary information for each object. The front end node transmits the formatted response to the user.
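The routing, gathering and ranking steps of an index query in this embodiment can be sketched in a single process, with a list of dictionaries standing in for the index nodes. The following are assumptions made for illustration: the number of nodes, the use of the hash value modulo the node count as the "portion" that selects a node, truncated SHA-256 as the hash, and a similarity function equal to the count of query fragments an object shares. All names and OID strings are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

NUM_INDEX_NODES = 4  # hypothetical number of index nodes


def hash_fragment(frag: str) -> int:
    """Assumed hash function: first 32 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(frag.encode("utf-8")).digest()[:4], "big")


def node_for(hashed: int) -> int:
    """Use a portion of the hashed fragment as the addressing index."""
    return hashed % NUM_INDEX_NODES


# Each simulated index node holds its own partition:
# hashed fragment -> set of OIDs possessing that fragment.
nodes = [defaultdict(set) for _ in range(NUM_INDEX_NODES)]


def store(oid, fragments):
    """Place each hashed fragment of an object on the node it addresses."""
    for frag in fragments:
        hashed = hash_fragment(frag)
        nodes[node_for(hashed)][hashed].add(oid)


def index_query(fragments, top=2):
    """Home node side: route fragments, gather OIDs, rank by shared fragments."""
    counts = Counter()
    for frag in set(fragments):
        hashed = hash_fragment(frag)
        partition = nodes[node_for(hashed)]      # only one node is contacted
        for oid in partition.get(hashed, ()):
            counts[oid] += 1                     # assumed similarity: shared count
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


store("oid:1", ["knowledge", "management", "tools"])
store("oid:2", ["knowledge", "graphs"])
store("oid:3", ["gardening"])

result = index_query(["knowledge", "management"])
```

Since the node is chosen from the hash value itself, the home node contacts exactly one index node per fragment, and no node needs to hold the whole index.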