The invention relates to computer database systems and more specifically to distributed computer database systems.
As is generally recognized in the art, two of the most significant changes in the nature of information processing in the last decade are the transition from primarily alphanumeric text processing to multimedia processing and the connection of formerly isolated computers by networks, which have been connected in turn by intranets and the Internet. The first change has resulted in computer images becoming as common on computers as text. The second change has resulted in vast quantities of information, both text and multimedia, being accessible to individuals. This increase in information availability to individuals has come at the cost of increasing difficulty in finding relevant information.
a) Word Based Search Engines
Search engines have been developed to assist in information retrieval, but they are still primarily based on matching words in a query with words in text documents. In practice, this means that they cannot typically search effectively for features of images and other kinds of multimedia. Word based systems and non-word based systems presently employ separate and distinct approaches to extract relevant information.
One way of extracting information from a word based database is to submit an information request in the form of a query. Responsive to the query, a computer can extract information from the database that is related to that specified by the query. The extracted information can be used for determining the degree of "similarity" or "relevance" between a query and an object in the database. A variety of computer-implemented similarity measures have been developed for comparing a query with an object in the database when the query and database information are documents in a natural language. A commonly used measure of similarity is the cosine measure. The cosine measure is given by the formula COS(v, w), where the vector v denotes the query and the vector w denotes the document. These vectors are in a space in which each possible word (or set of synonymous words) represents one dimension of the space. Further information regarding the cosine measure can be had with reference to G. Salton, Automatic Text Processing, Addison-Wesley, Reading, Mass., 1989; and G. Salton, J. Allen, and C. Buckley, "Automatic structuring and retrieval of large text files," Comm. ACM, 37:97-108, 1994.
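By way of illustration only, the cosine measure described above can be sketched as follows. The sketch assumes that the query and the document are represented as sparse word-count vectors; the function name `cosine`, the whitespace tokenization and the sample strings are hypothetical and are not taken from the cited references. Grouping of synonymous words into a single dimension is omitted.

```python
import math
from collections import Counter

def cosine(v, w):
    """Cosine of the angle between two sparse vectors (word -> weight maps)."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * w.get(word, 0.0) for word, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0  # convention assumed here for an empty vector
    return dot / (norm_v * norm_w)

# Hypothetical query and document, one dimension per distinct word.
query = Counter("lung cancer screening".split())
document = Counter("screening for lung cancer with chest ct".split())
score = cosine(query, document)
```

A score near 1 indicates that the query and the document use words in similar proportions, and a score of 0 indicates that they share no words.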
b) Non-Word Based Search Engines
As noted above, non-word based techniques currently employ approaches to extracting relevant information that are different and distinct from those used in word based systems. Non-word based information retrieval techniques are utilized advantageously, for example, in the field of medicine for extracting diagnostic information from images of the human body. Lung cancer is one of the most difficult cancers to cure. Early detection is important to improve the recovery rate. Chest CT scans are more effective than conventional chest X-ray techniques, but CT scans result in many more images to be examined, making computer assistance essential for mass screening programs. Computer aided diagnosis of CT images requires the extraction of a large number of features such as the lung area, blood vessels, air clusters and tumors. These features are detected using a computer-implemented thresholding algorithm along with smoothing to remove artifacts of the CT scanner. These features, in turn, have a complex structure involving attributes such as their shape, area, thickness and position within the lung. In implementing such algorithms on a computer for detecting such features, it is useful to employ an object database. An object database is a collection of data or information objects organized and stored on a computer storage medium pursuant to a data model. Each information object has a type, such as image, sound or video stream, or a data object type such as text file or structured document. Each information object is identified uniquely by an object identifier (OID). An OID can be an Internet Uniform Resource Locator (URL) or some other form of identifier such as a local object identifier. Databases containing images, sound and/or video streams can include not only the information objects themselves but also features and metadata. The data model used for such a database can support the representation of information at many levels of abstraction, including:
1. The data representation level, which contains the actual data of the information object.
2. The data object level, which stores data objects (such as lines and regions) extracted from the information object. The objects on this level do not have a domain interpretation.
3. The domain object level, which associates a domain object with each object at the data object level.
4. The domain event level, which associates domain objects with each other, providing the semantic representation of spatial or temporal relationships.
A feature at the data object level (i.e., at Level 2, above) can be represented as a set of domain-independent data such as lines and regions. A feature at the domain level (i.e., Levels 3 and 4, above) can be represented as a set of domain objects related to one another by domain-dependent relationships.
Consider another example in medicine. Mammography is one of the most effective methods of early detection of breast cancer, one of the leading causes of cancer among women. Manual reading of mammograms is labor intensive, so computer assistance is essential. A very large number of features in mammograms have been identified as being important for proper diagnosis, such as clustered microcalcifications, stellate lesions and tumors. Each of these can be represented as a set of medical domain objects with a complex structure. For example, a stellate lesion has a complex structure, consisting of a central mass surrounded by spicules. The spicules, in turn, have a complex, star-shaped structure. Extracting these complex domain objects and their relationships with each other is important for effective detection of breast cancer.
Features of images, sound and video streams can be represented in a computer system as a set of data structures stored in a database. The features can be categorized into the following types:
Features, such as the name of the photographer or date taken, that cannot be directly extracted from an information object, and are often descriptive of other data regarding the information object. Features of this kind are called metadata.
Features that can be extracted directly from the information object at the time of insertion into the database.
Features that are not calculated until needed.
Features can be as simple as the value of an attribute such as brightness of an image, but many features are more complicated and are thus represented using a complex data structure. An example of such a complicated feature is a representation of the structure of a stellate lesion in a mammogram.
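The three kinds of features listed above can be illustrated with a minimal sketch. The class `IndexedImage`, its attribute names and the use of a flat list of pixel intensities are hypothetical simplifications chosen for illustration; a practical feature such as the structure of a stellate lesion would require a far richer data structure.

```python
from functools import cached_property

class IndexedImage:
    """Toy information object carrying one feature of each kind."""

    def __init__(self, pixels, photographer):
        self.pixels = list(pixels)
        # Metadata: supplied with the object, not extracted from its content.
        self.metadata = {"photographer": photographer}
        # Feature extracted directly at the time of insertion into the database.
        self.brightness = sum(self.pixels) / len(self.pixels)

    @cached_property
    def contrast(self):
        # Feature that is not calculated until needed; cached after first use.
        return max(self.pixels) - min(self.pixels)

image = IndexedImage([10, 20, 30], photographer="A. Example")
```

In this sketch, `image.contrast` is computed on first access and then stored on the object, so later accesses do not repeat the calculation.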
Typically, features can be extracted from structured documents by parsing the document to produce data structures, and can be extracted from unstructured documents by using one of the many feature extraction algorithms that have been developed for implementation on a computer. As in the case of structured documents, feature extraction from an unstructured document produces data structures. A large variety of feature extraction algorithms have been developed for media such as sound, images and video streams. For a discussion of such algorithms, reference should be had to A. Del Bimbo, editor. The Ninth International Conference on Image Analysis and Processing, volume 1311. Springer, September 1997. For example, medical images typically use edge detection algorithms to extract the data objects, while domain-specific knowledge is used to classify the data objects as medically significant objects, such as blood vessels, lesions and tumors. Fourier and wavelet transformations as well as many filtering algorithms are also used for feature extraction. For example, wavelet analysis has been used to characterize the texture of a region and to determine a shape (such as a letter) without regard to the location or the orientation of the shape within the image.
The data structures that represent features typically conform to a data model for the database that determines the kinds of components and attribute values that are allowed. Each feature can have one or more values associated with components of the data structure that represents the feature. In the simplest case, the data structure can have a single component with an associated value, and the feature can be represented by one attribute of the object. More complex features can be represented by several inter-related components, each of which may have attribute values. The data model for features at the domain level is often called an ontology. An ontology models knowledge within a particular domain, such as, for example, medicine. An ontology can include a concept network, specialized vocabulary, syntactic forms and inference rules. In particular, an ontology specifies the features that objects can possess as well as how to extract features from objects. Each feature of an object may have an associated weight, representing the "strength" of the feature or the degree with which the object has the feature.
Current systems for performing feature extraction from information objects use very simple ontologies. Furthermore, the ontologies are implicit in the design of the system rather than being a separate component of the system. As a result, current systems cannot be used for more than the single ontology for which they were designed. Using a different ontology or even adding new capabilities to the ontology are generally not possible without completely redesigning the system. Such systems are not suited for the large, complex, evolving ontologies that are typical of modern application domains.
When the information object is not a document written in a natural language, the information retrieval system cannot employ the cosine measure described above to measure relevance of the information, and so other measures (discussed below) have been developed for use in extracting features from images and other multimedia in those systems. This distinction further illustrates the differences between word-based and non-word based information retrieval systems, as recognized by those skilled in the art.
To assist in finding information in a database representing image features and the like, special search structures are employed called indexes. Current indexing techniques are very limited when it comes to addressing the problem of similarity indexing. Many search engines are limited to indexing the metadata attached to the information objects and do not index the content of the information objects. Other search engines that can directly index the content of information objects use indexing techniques that degrade significantly with increase in dimension or scale, and they generally just select some of the information objects rather than rank order them.
Current technology generally requires a separate index for each attribute or feature. Even the most sophisticated indexes in this technology are limited to a very small number of attributes. Since each index can be as large as the database itself, this technology does not function well when there are hundreds or thousands of attributes, as is often the case when objects such as images, sound and video streams are directly indexed. Furthermore, there is considerable overhead associated with maintaining each index structure. This limits the number of attributes that can be indexed. Current systems are unable to scale up to support databases for which there are many object types, including images, sound and video streams; millions of features; queries that involve simultaneously many object types and features; and new object types and features being continually added.
Another characteristic of current technology is that it treats each information object as an indivisible unit for the purposes of retrieval, i.e., an information object is retrieved as a whole or not at all. For example, World Wide Web browsers retrieve each document as a unit and present it only when the entire document has been downloaded and formatted. Individual data entries and even sections within an object are not individually indexed. Some search engines are even more extreme in this respect; i.e., they only categorize entire Web sites.
Current search engines commonly include index entries that are stale, i.e., the documents that produced the index entries have been updated or deleted since the time when the documents were indexed. This behavior is necessary because it is prohibitively expensive to monitor so many documents continuously. For many documents, this behavior is acceptable, but for certain time-sensitive documents, such as those containing commodity prices, it is important for the index to be current.
Further information can be had regarding the foregoing concepts with reference to the following publications:
1. L. Aiello, J. Doyle, and S. Shapiro, editors. Proc. Fifth Intern. Conf. on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann Publishers, San Mateo, Calif., 1996.
2. K. Baclawski. Distributed computer database system and method, December 1997. U.S. Pat. No. 5,694,593. Assigned to Northeastern University, Boston, Mass.
3. N. Fridman Noy. Knowledge Representation for Intelligent Information Retrieval in Experimental Sciences. PhD thesis, College of Computer Science, Northeastern University, Boston, Mass., 1997.
4. P. Hayes and J. Carbonell. Scout: automated query-relevant document summarization. Technical Report 1997 Project Summary, Carnegie Group, Pittsburgh, Pa., 1997.
5. Y. Ohta. Knowledge-Based Interpretation of Outdoor Natural Color Scenes. Pitman, Boston, Mass., 1985.
6. M. Zloof. Query-by-example: the invocation and definition of tables and forms. In Proc. Conf. on Very Large Databases, pages 1-24, 1975.
The disclosures of the publications referenced in this "Background of the Invention" are incorporated herein by reference.
It would be desirable to provide an information retrieval system that can retrieve information from a unified database of word and non-word based information, including documents, images and other forms of multimedia, using a single indexing system, and otherwise overcome many of the performance and other problems and limitations of current systems. Such information retrieval systems preferably would be highly scalable, versatile, robust and economical.
The invention resides in processing a query in an information retrieval apparatus for word based and non-word based retrieval of information from a database by extracting a number of features from the query, fragmenting each of the features into feature fragments, and hashing each of the feature fragments into hashed feature fragments. The hashed feature fragments can be used in accessing a hash table for obtaining object identifiers therefrom that can be used for obtaining information from the database relevant to the query. In another aspect, the invention resides in an information indexing system for indexing information for facilitated retrieval from a database, by extracting a number of features from the information, fragmenting each of the features into feature fragments, and hashing each of the feature fragments into hashed feature fragments. The hashed feature fragments are used in accessing a hash table for storing object identifiers specifying the locations determined by the hashed feature fragments at which information should be stored. The information retrieval apparatus can be implemented in a distributed computer database system.
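The extract, fragment, hash and look-up sequence summarized above can be sketched as follows. This is a minimal single-machine illustration under stated assumptions and is not the claimed implementation: features are assumed to have been extracted already and are represented as tuples of components, fragmentation is assumed to be simple chunking to a bounded size, and SHA-256 is an arbitrary choice of hashing algorithm. The names `FRAGMENT_SIZE`, `fragment`, `hash_fragment` and `FragmentIndex` are hypothetical.

```python
import hashlib
from collections import defaultdict

FRAGMENT_SIZE = 2  # assumed bound on the number of components per fragment

def fragment(feature):
    """Split a feature (a tuple of components) into bounded-size fragments."""
    return [tuple(feature[i:i + FRAGMENT_SIZE])
            for i in range(0, len(feature), FRAGMENT_SIZE)]

def hash_fragment(frag):
    """Hash a fragment to a 64-bit integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(repr(frag).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class FragmentIndex:
    """Hash table mapping hashed feature fragments to object identifiers."""

    def __init__(self):
        self.table = defaultdict(set)

    def index(self, oid, features):
        # Indexing: store the OID under every hashed fragment of every feature.
        for feature in features:
            for frag in fragment(feature):
                self.table[hash_fragment(frag)].add(oid)

    def lookup(self, features):
        # Query processing: same pipeline, but read OIDs instead of storing them.
        oids = set()
        for feature in features:
            for frag in fragment(feature):
                oids |= self.table.get(hash_fragment(frag), set())
        return oids

idx = FragmentIndex()
idx.index("oid-1", [("stellate", "lesion", "spicules")])
idx.index("oid-2", [("blood", "vessel")])
matches = idx.lookup([("stellate", "lesion")])
```

Because queries and indexed objects pass through the same fragmenting and hashing steps, a query fragment that also occurs in an indexed object maps to the same hash table entry and yields that object's identifier.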
In general, the term "feature" as used herein means any information or knowledge associated with an information object or derived from the content of the information object, regardless of whether the information object represents a document, image or other multimedia, which has meaning within the applicable domain and conforms to the applicable ontology. Thus, for example, where the information object represents a photographic image of a human face, e.g., for entry in a photography contest, the features of the image include the eyes, nose and mouth because they can be perceived when the image is viewed by the judges. When the same image is used for skin disease diagnosis, the domain and ontology shift, and the features can include even blemishes that are not noticeable with the unaided eye.
More specifically, the distributed computer database system in accordance with an aspect of the invention can include one or more front end computers and one or more computer nodes interconnected by a network into a search engine for retrieval of database objects including, e.g., images, sound and video streams, as well as plain and structured text. A query or query object, preferably in the same format as the database objects to be retrieved, is transmitted from a user to one of the front end computers, which forwards the query to one of the computer nodes, termed the home node, of the search engine. The home node extracts features from the query, generates fragments from the features, and hashes these feature fragments. Each hashed feature fragment is transmitted to a node on the network. Each node on the network that receives a hashed feature fragment uses the hashed feature fragment to perform a search on its respective partition of the database. Results of the searches of the local databases are gathered by the home node. If requested by the user, this process is repeated a second time by the home node to refine the results of the query.
The foregoing distributed computer database system can be implemented with a number of useful capabilities. For example, the system can be implemented to support indexing and retrieval of information objects such as images, sound and video streams, as well as objects such as text files and structured documents. Both the content of the information objects themselves as well as any metadata attached to the objects can be indexed. The retrieval of objects relevant to a query is based preferably on an ontology, which is regarded as a separate component of the system and can be large, complex and evolving. The information objects themselves need not be stored in the database system itself so long as their locations are available in the database system, for example, as long as the database stores pointers to the information objects stored at remote locations. For example, the database can store URLs of documents stored at remote servers connected to the Internet or an intranet. Moreover, responsive to an indication that an information object is time-sensitive, the system can download the object for processing only when (and not until) it is relevant to a query, thus eliminating stale data in the database.
The distributed computer database system of the invention can also have the capability of supporting the indexing of all three kinds of features: metadata, features computed when the object is indexed and features computed during query processing. The features may be complex data structures, and any suitable computer-implemented similarity measure, such as the Feature Contrast Model, may be used to compare a query with an information object. One or more than one similarity measure may be used within the same query or information object. The objects in the database can associate similarity functions with the feature types with which they are to be employed, or even specify those similarity functions.
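The text above names the Feature Contrast Model without stating its form. The following sketch uses the form commonly attributed to Tversky, in which similarity increases with the features shared by the query and the object and decreases with the features possessed by only one of them. The parameter values, the default unit weight per fragment and the function name `feature_contrast` are assumptions made for illustration.

```python
def feature_contrast(query, obj, theta=1.0, alpha=0.5, beta=0.5, weight=None):
    """Contrast-model similarity between two sets of feature fragments.

    theta scales the shared fragments, alpha the fragments found only in the
    query, and beta the fragments found only in the object.
    """
    if weight is None:
        weight = lambda frag: 1.0  # assumed: every fragment has unit strength

    def salience(frags):
        return sum(weight(frag) for frag in frags)

    return (theta * salience(query & obj)
            - alpha * salience(query - obj)
            - beta * salience(obj - query))

query_fragments = {"a", "b", "c"}
object_fragments = {"b", "c", "d", "e"}
similarity = feature_contrast(query_fragments, object_fragments)
```

A different `weight` function can be supplied per feature type, which is one way the association between feature types and similarity functions mentioned above might be realized.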
The distributed computer database system can use a high-performance distributed indexing methodology that scales to support the indexing of very large numbers of object types, including images, sound and video streams, millions of features, queries that involve many object types and features simultaneously, and new object types and features being continually added to the system. This avoids the limitations of current systems. The indexing methodology permits, e.g., indexing and retrieval of individual data entries within a single information object rather than only entire documents as in many current systems.
For presentation to a user, the distributed computer database system collects database entries from a number of relevant sources and, e.g., organizes them into a single table for presentation to the user. Furthermore, the user may specify that the requested information is time-sensitive, in which case the present invention will download the current state of the information object and process it to extract the relevant information. This avoids the limitation of current search engines that contain large numbers of stale index entries.
In another aspect of the invention, a distributed computer database system that includes one or more front end computers and one or more computer nodes interconnected by a network operates as a search engine. A user wishing to query the database transmits the query to one of the front end computers, which in turn forwards the query to one of the computer nodes of the network. The node receiving the query, termed the home node of the search engine, extracts the features of the received query using the feature extraction algorithms specified in the ontology. The features are fragmented into data structures having a bounded size. The fragments are then hashed using one of the many hashing algorithms that are available. A portion of each hashed fragment is used by the home node as an addressing index by which the home node transmits the hashed query fragment to a node on the network. Each node on the network that receives a hashed query fragment uses the hashed query fragment to perform a search on its respective database. Nodes finding data corresponding to the hashed query fragment return, e.g., the OIDs of the objects possessing this fragment. A computer-implemented matching function, e.g., specific to the type of the fragment, may be invoked to select, e.g., a subset of the OIDs to be returned. The home node gathers the extracted information objects, and a computer-implemented similarity function or algorithm is computed for each returned object based on the fragments that the object has in common with the query as well as the fragments that are in the query but not in the returned object. The similarity function is used to rank the objects, e.g., based on a computed strength of the match, i.e., degree of similarity or relevance. The function used for each fragment can be, e.g., specific to the type of the fragment. The results are, e.g., either a list of object identifiers in rank order or a table of data associated with or extracted from the objects.
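The routing and ranking steps described above can be sketched as follows. This is a minimal in-process simulation under stated assumptions: the nodes are ordinary objects rather than networked computers, the portion of the hash used as the addressing index is assumed to be its value modulo the number of nodes, and the similarity score is assumed to be the count of shared fragments less a penalty for query fragments missing from the object. The names `Node`, `HomeNode`, `hash_fragment` and `missing_penalty` are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

def hash_fragment(frag):
    """Hash a fragment to a 64-bit integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(repr(frag).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Node:
    """One computer node holding its own partition of the index."""

    def __init__(self):
        self.table = defaultdict(set)  # hashed fragment -> OIDs

    def store(self, hashed, oid):
        self.table[hashed].add(oid)

    def search(self, hashed):
        return set(self.table.get(hashed, ()))

class HomeNode:
    """Routes hashed fragments to nodes and ranks the gathered OIDs."""

    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, hashed):
        # A portion of the hash (assumed: modulo the node count) picks the node.
        return self.nodes[hashed % len(self.nodes)]

    def index(self, oid, fragments):
        for frag in fragments:
            hashed = hash_fragment(frag)
            self.route(hashed).store(hashed, oid)

    def query(self, fragments, missing_penalty=0.5):
        hashed_query = {hash_fragment(frag) for frag in fragments}
        common = Counter()
        for hashed in hashed_query:
            for oid in self.route(hashed).search(hashed):
                common[oid] += 1
        # Score: shared fragments minus a penalty for query-only fragments.
        scores = {oid: n - missing_penalty * (len(hashed_query) - n)
                  for oid, n in common.items()}
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

home = HomeNode([Node() for _ in range(4)])
home.index("A", ["x", "y", "z"])
home.index("B", ["x"])
ranked = home.query(["x", "y"])
```

Because indexing and querying use the same routing rule, each hashed fragment is searched only on the single node that could hold it, so no node needs to see the whole query.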
The home node can also reduce redundancy when the same information is contained in more than one document. In particular, the extracted information can be arranged, e.g., according to the Maximum Marginal Relevance (MMR) metric of Hayes and Carbonell, referenced above. The results, whether a list or a table, are transmitted to the front end node, which formats the response to the user. For example, if the front end node is a World Wide Web server, then the front end node constructs a page in HTML format containing a list of URLs or a table, each of whose entries has the extracted parts of a relevant document as well as a reference to the URL of the document. The front end computer transmits the formatted response to the user.
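The redundancy reduction described above can be illustrated with a greedy ordering in the commonly published form of Maximum Marginal Relevance, in which each step selects the item whose relevance to the query, less its similarity to items already selected, is greatest. The sketch is an assumption about how such an ordering might be computed and is not drawn from the cited report; the names `mmr_order`, `lam` and `pair_similarity` and the sample scores are hypothetical.

```python
def mmr_order(relevance, similarity, lam=0.7):
    """Order items greedily by marginal relevance.

    relevance:  dict mapping item -> relevance to the query
    similarity: function (item, item) -> similarity between two items
    lam:        trade-off between relevance (1.0) and novelty (0.0)
    """
    remaining = set(relevance)
    selected = []
    while remaining:
        def marginal(item):
            # Redundancy is the closest match among items already selected.
            redundancy = max((similarity(item, s) for s in selected),
                             default=0.0)
            return lam * relevance[item] - (1.0 - lam) * redundancy
        # Sorting first makes ties resolve deterministically.
        best = max(sorted(remaining), key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected

def pair_similarity(a, b):
    # Hypothetical data: d1 and d2 contain the same information.
    return 1.0 if {a, b} == {"d1", "d2"} else 0.0

scores = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
order = mmr_order(scores, pair_similarity, lam=0.5)
```

In this example the near-duplicate `d2` is moved behind the less relevant but novel `d3`, which is the effect sought when the same information appears in more than one document.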
The foregoing distributed computer database system can process information objects to be indexed in the same manner as queries, except that the query nodes simply store data in their respective databases and no information is returned to the home node.
In yet another aspect of the invention, the distributed computer database system can also provide, responsive to a user request, different levels of service: the basic Level 1 service described above, as well as Level 2 and Level 3 service. For Level 2 or Level 3 service, the OIDs obtained in the basic service above are transmitted to additional nodes on the network by using a portion of each OID as an addressing index. In addition, if Level 3 service is requested, the features each object has in common with the query are transmitted along with the OIDs to the same nodes on the network. Each node on the network that receives an OID uses the OID to perform a search on its respective database for the corresponding object information. In Level 2 service, auxiliary information is retrieved and transmitted to the front end node. The auxiliary information can include, e.g., the URL of the object or an object summary or both. For Level 3 service, a dissimilarity value is computed based on the fragments that the object possesses but the query does not. The dissimilarity value as well as the auxiliary information about the object is transmitted to the home node. The dissimilarity value can use functions specific to the types of the fragments. The dissimilarity values are gathered by the home node, which uses them to modify the similarity values of the objects obtained in the first level of processing. The modified similarity values are used to rank the objects. The OIDs and any auxiliary information about the objects that have the largest similarity value are transmitted to the front end node. Level 3 service has the additional capability of downloading and processing the original information object if this is specified. There are a number of ways that it can be specified, for example:
1. The ontology can specify that a type of fragment is time-sensitive.
2. The information object itself can specify that it is time-sensitive.
3. The query can specify that some or all of its fragments are time-sensitive.
In each case above, to avoid stale data, the information object is downloaded if it is requested and the most recent download is older than a specified length of time. The length of time can be specified by the user, can be a system parameter, or can be computed dynamically, e.g., based on the type of information object. Regardless of the level of service requested, the front end node formats the response to the user, e.g., based on the OIDs and any auxiliary information transmitted by the home node. For example, if the front end node is a World Wide Web server, then the front end node can construct a page in HTML format containing a reference to a URL and auxiliary information for each object. The front end node transmits the formatted response to the user.
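The staleness rule described above can be sketched as follows. The sketch assumes that download times are recorded as seconds since the epoch and that the three ways of specifying time-sensitivity are available as boolean flags; the function names and parameters are hypothetical.

```python
import time

def is_time_sensitive(ontology_flag=False, object_flag=False, query_flag=False):
    """True if the ontology, the object or the query marks the data time-sensitive."""
    return ontology_flag or object_flag or query_flag

def needs_download(last_download, max_age_seconds, now=None):
    """True if the object was never downloaded or the last copy is too old."""
    if now is None:
        now = time.time()
    if last_download is None:
        return True
    return (now - last_download) > max_age_seconds

def should_refresh(flags, last_download, max_age_seconds, now=None):
    """Combine both tests: refresh only time-sensitive objects that are stale."""
    return (is_time_sensitive(**flags)
            and needs_download(last_download, max_age_seconds, now))
```

For example, `should_refresh({"query_flag": True}, last_download=100.0, max_age_seconds=60.0, now=200.0)` evaluates to true, because the most recent download is 100 seconds old and the permitted age is 60 seconds.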
Accordingly, the invention can provide an information retrieval system that can retrieve information from a unified database of word and non-word based information, including documents, images and other forms of multimedia, using a single indexing system, and otherwise overcome many of the performance and other problems and limitations of current systems. The invention can also provide an information indexing system coordinated with the retrieval system for facilitated retrieval of the information. Such information indexing and retrieval systems can be based on a distributed model and, consequently, highly scalable, versatile, robust and economical.