The exponential growth in the amount and accessibility of data has raised many challenges in the field of information search and retrieval. These challenges are compounded by the heterogeneous nature of real-world data, which may exist in a structured, semi-structured, or unstructured state. The goal of much research has been the automatic or semiautomatic discovery of common entities and relationships across such disparate kinds of data. Another factor in the complexity of information search and retrieval is the multitude of ways in which data may be situationally integrated. One way to deal with these challenges is to use extensible data structures and creative techniques for retrieving data across disparate data sources. In the case of networks such as the internet, for example, this may be done by crawling thousands of data sources and using search engines to index the crawled web documents.
One approach to information retrieval is to model data as graphs of objects connected by relationships. However, it is not easy to formulate precise yet flexible queries that will find different meaningful connections between objects in such graphs. Standard database query languages, such as XQuery, are too rigid and require the user to have full knowledge of the database schema. Conventional search systems have very limited functionality and typically find only objects that contain all the keywords in a search.
An example of a query that illustrates the difficulties in dealing with relationships across disparate data is as follows. Consider a product manager looking for employees in a certain department who somehow (directly or indirectly) contributed to a shipped product. One approach may be to take the product plan data coming from a content repository and dynamically combine it with the company employee data to find such employees. The product manager expects to find employees who, for example, owned components of the product, developed components, or consulted employees on the development of components.
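The product manager's query can be illustrated with a minimal sketch, assuming the combined product plan and employee data have already been modeled as a graph of objects connected by relationships. All object names, relationship labels, the `find_connected_employees` helper, and the `max_hops` limit below are hypothetical and chosen only for illustration; the sketch uses a plain breadth-first traversal and is not a description of any particular system.

```python
from collections import deque

# Hypothetical toy data: (source object, relationship, target object) triples.
EDGES = [
    ("alice", "owns", "component_ui"),
    ("bob", "developed", "component_db"),
    ("carol", "consulted", "bob"),
    ("dave", "owns", "component_billing"),
    ("component_ui", "part_of", "product_x"),
    ("component_db", "part_of", "product_x"),
    ("component_billing", "part_of", "product_y"),
]

# Hypothetical employee data: employee -> department.
DEPARTMENTS = {
    "alice": "engineering",
    "bob": "engineering",
    "carol": "research",
    "dave": "engineering",
}


def build_adjacency(edges):
    """Index the triples so relationships can be followed in either direction."""
    adjacency = {}
    for source, relationship, target in edges:
        adjacency.setdefault(source, []).append((relationship, target))
        adjacency.setdefault(target, []).append((relationship, source))
    return adjacency


def find_connected_employees(edges, departments, product,
                             department=None, max_hops=3):
    """Breadth-first search outward from the product.

    Returns a dict mapping each reachable employee to the chain of
    (relationship, object) steps that connects the product to that employee.
    """
    adjacency = build_adjacency(edges)
    paths = {product: []}
    queue = deque([product])
    while queue:
        node = queue.popleft()
        if len(paths[node]) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relationship, neighbor in adjacency.get(node, []):
            if neighbor not in paths:
                paths[neighbor] = paths[node] + [(relationship, neighbor)]
                queue.append(neighbor)
    return {
        name: path
        for name, path in paths.items()
        if name in departments
        and (department is None or departments[name] == department)
    }


if __name__ == "__main__":
    found = find_connected_employees(
        EDGES, DEPARTMENTS, "product_x", department="engineering"
    )
    for employee, path in found.items():
        print(employee, path)
```

In this toy data, the employee who only consulted another employee is reached through an indirect, three-step connection, which is the kind of result a keyword match over individual objects would not surface.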
For the above-described search, the product manager is looking for data retrieval with “high recall” rather than the “high precision” that users of search engines usually seek. Since large amounts of data may be related to the query, it is important to be able to perform the search quickly and efficiently and to be able to summarize the results, for example, by identifying the highest ranking objects and relationships individually, and aggregating the less important ones.
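One way such a summary might look is sketched below, assuming each result already carries a numeric relevance score. The scores, names, the `summarize` helper, and the `top_k` parameter are hypothetical; how results would actually be ranked is not specified here.

```python
from collections import Counter

# Hypothetical search results: (object, relationship, relevance score).
RESULTS = [
    ("alice", "owns", 0.9),
    ("bob", "developed", 0.8),
    ("carol", "consulted", 0.3),
    ("erin", "consulted", 0.2),
    ("frank", "reviewed", 0.1),
]


def summarize(results, top_k=2):
    """List the top_k results individually; aggregate the rest by relationship."""
    ranked = sorted(results, key=lambda item: item[2], reverse=True)
    top = ranked[:top_k]
    # Less important results are reduced to a count per relationship type.
    aggregated = Counter(relationship for _, relationship, _ in ranked[top_k:])
    return top, dict(aggregated)


if __name__ == "__main__":
    top, aggregated = summarize(RESULTS)
    print("Top results:", top)
    print("Other results by relationship:", aggregated)
```

Keeping only a count for the lower-ranked results preserves recall, since nothing is discarded, while keeping the presented summary short.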
Another challenge is in finding efficient and user-friendly ways to represent the results of the search, where the results may be voluminous and complex.
Accordingly, there is a need for improved systems that can search across large volumes of heterogeneous, real-world data. There is also a need for ways to formulate precise yet flexible queries that will find meaningful connections between data objects. There is a further need for such techniques that are fast and do not require full knowledge of the database schema.