1. Technical Field
The present invention relates generally to an improved method for obtaining, managing, and providing complex, detailed information stored in electronic form in a plurality of sources. The invention may find particular use in organizations that have a need to discover relationships among various pieces of information in a given field.
2. Background Information
With the advent of the Internet, the Information Age is upon us. Today, one can find vast amounts of information about any given field or topic at the touch of a button. This information may be available from myriad sources in a variety of commonly recognized formats, such as XML, flat files, HTML, text, spreadsheets, presentations, diagrams, programming code, and databases. This information may also be kept in third-party proprietary formats.
Amid this apparent wealth of online information, people still have difficulty finding the information they need. Problems with online information retrieval include inappropriate user interface designs and poor or inappropriate organization and structure of the information. The storage of information online in the variety of formats described above also leads to retrieval problems.
The existence of a variety of information sources leads to many problems. First, there is a lack of a unified information space. An “information space” is the set of all sources of information that are available to a user at a given time or in a given setting. When information is stored in many formats and at many sources, a user is forced to spend excessive effort discovering and remembering where different information is located (e.g., web pages and online databases). The user also spends a large amount of time remembering how to find information in each delivery mechanism. Thus, it is difficult for the user to remember where potentially relevant information might be, and the user is forced to jump between multiple tools to find it.
The existence of a variety of information sources also leads to information discovery strategies that lack cohesion. Users must learn and remember a variety of metaphors, user interfaces, and searching techniques for each delivery mechanism and class of information. Other problems associated with large numbers of information sources include a lack of links between information sources, and poor delivery mechanisms that do not provide a global view of the information space.
To overcome these problems, knowledge discovery tools have been developed. These tools extract information from a plurality of data sources, integrate the information into a common data model, and provide a graphical user interface for viewing the information. While these types of systems have been useful for unifying the information space for a given domain, they still suffer from several limitations.
First, each of these data sources typically includes a large volume of files. Thus, collecting and integrating information from a particular data source consumes both time and resources. However, in order to truly represent the information space for a given domain, these tools must collect data from many data sources. Each data source added to the process places an additional strain on both resources and time. Moreover, this information must be processed repeatedly to ensure that the data model includes the most current information. Present systems process a data source in its entirety each and every time an extraction and integration cycle takes place. Accordingly, there is a need for a system that does not waste time and resources re-integrating information that has already been integrated into the data model.
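By way of illustration only, the following Python sketch shows one conventional way to avoid reprocessing unchanged files: a content digest recorded during the previous cycle is compared with the current digest, and only files whose digests differ are selected for integration. The function names, the in-memory `seen` mapping, and the choice of SHA-256 are assumptions made for this example and are not taken from the present disclosure.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest identifying this exact file content."""
    return hashlib.sha256(content).hexdigest()


def select_changed(files: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return names of files that are new or changed since the last cycle.

    `seen` maps file name -> digest from the previous cycle (hypothetical
    bookkeeping for this sketch); it is updated in place.
    """
    changed = []
    for name, content in files.items():
        digest = fingerprint(content)
        if seen.get(name) != digest:
            changed.append(name)
            seen[name] = digest
    return changed


# Two simulated extraction cycles over the same hypothetical data source.
seen: dict[str, str] = {}
first_cycle = select_changed({"a.txt": b"alpha", "b.txt": b"beta"}, seen)
second_cycle = select_changed({"a.txt": b"alpha", "b.txt": b"beta v2"}, seen)
```

In this sketch the first cycle selects both files, while the second cycle selects only `b.txt`, because the content of `a.txt` is unchanged.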
Second, integrating information from a plurality of data sources also leads to problems in the consistency of the information contained in the data model. Information in the data model may be overwritten by less reliable data. For example, a particular person's name may be found in both a structured database maintained by the IRS and the text of an email. In present systems, the name sourced from the email may be used to overwrite the name obtained from the IRS if the email is integrated later. Because the information maintained by the IRS is inherently more reliable than the text of an email (because of both the credibility of the source and the structured nature of the data), there is a need for a system that takes into account the reliability of the information maintained by the data sources before integrating that information into the data model.
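The overwrite problem described above may be illustrated with a minimal Python sketch in which each value carries a reliability score for its source, and an incoming value replaces the stored value only if its source is at least as reliable. The `Fact` type, the `integrate` function, the key format, and the numeric scores are hypothetical and are chosen solely for this example.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    value: str
    source: str
    reliability: int  # hypothetical score; higher means more trusted


def integrate(model: dict[str, Fact], key: str, incoming: Fact) -> bool:
    """Store `incoming` under `key` unless a more reliable value is present.

    Returns True if the data model was updated. On a tie, the newer value
    wins; that tie-breaking rule is an assumption of this sketch.
    """
    current = model.get(key)
    if current is None or incoming.reliability >= current.reliability:
        model[key] = incoming
        return True
    return False


model: dict[str, Fact] = {}
stored_irs = integrate(model, "person:42:name",
                       Fact("John Q. Smith", "irs_database", 90))
stored_email = integrate(model, "person:42:name",
                         Fact("Jon Smith", "email_text", 20))
```

Here the name sourced from the email is rejected even though it is integrated later, so the data model retains the value obtained from the structured database.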
Third, the information integrated into the data model is inherently related, as that information defines the information space for a given domain. Unfortunately, present systems do not fully realize these interrelationships. Typically, relationships between the data in the data model must be defined manually. Manually defining these relationships, however, is a time-consuming and expensive process. While present systems automatically incorporate those relationships maintained by a particular data source (for example, relationships defined by a database data source), these relationships represent only a fraction of the relationships present among the information contained in the data model. Accordingly, there is a need for a system that automatically discovers and generates various types of relationships.
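As a simplified illustration of what automatic relationship discovery might involve, the following Python sketch links any two entities in a data model that share a normalized attribute value. This is one elementary heuristic chosen for the example, not the technique of the present invention; the entity identifiers, attribute names, and the `shares_` relationship labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def discover_relationships(
    entities: dict[str, dict[str, str]],
) -> set[tuple[str, str, str]]:
    """Return (entity_a, entity_b, label) for entities sharing an attribute value."""
    # Index entities by (attribute, normalized value).
    index = defaultdict(list)
    for entity_id, attributes in entities.items():
        for attr, value in attributes.items():
            index[(attr, value.strip().lower())].append(entity_id)

    # Every pair of entities in the same index bucket is related.
    relationships = set()
    for (attr, _value), ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            relationships.add((a, b, f"shares_{attr}"))
    return relationships


entities = {
    "person:1": {"employer": "Acme Corp", "city": "Austin"},
    "person:2": {"employer": "acme corp", "city": "Dallas"},
    "email:7": {"city": "Austin"},
}
found = discover_relationships(entities)
```

With this input, `found` contains `("person:1", "person:2", "shares_employer")` and `("email:7", "person:1", "shares_city")`, neither of which was defined manually or supplied by a data source.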
The present invention provides a robust technique for integrating, from a plurality of data sources, only the necessary, most reliable data into a data model, and for automatically discovering interrelationships among the various elements of the data model.