This invention relates in general to the field of data processing. More specifically, this invention relates to automated systems and methods for analyzing collections of documents to extract important information from the collections.
An enormous amount of information is contained in data processing systems around the world. For example, a single large business organization typically has multiple banks of e-mail servers containing millions of e-mail messages for thousands of employees. In addition, organizations often have thousands of personnel records stored on one or more different systems, such as mini or mainframe computer systems. Additional kinds of information typically kept include marketing materials, technical reports, business memoranda, and so on, stored in various types of computer systems.
For instance, organizations typically use different programs to create and modify different kinds of information and typically use many different kinds of hardware, operating systems, file systems, and data formats to store the information. When stored, the information is typically organized into discrete records containing closely related data items. For example, a typical e-mail server stores each e-mail message as a separate row in a single database file, with multiple columns within the row holding the data that constitutes the message. Likewise, some personnel systems store each employee's personnel data as related records in one or more files, with multiple fields in each record containing information such as employee name, start date, etc. Similarly, a Web server may store each Web page as lines of text in a file or a group of related files. However, despite the differences in file format and the like among different types of information, each e-mail message, each Web page, each employee's personnel data, and each similar collection of information is referred to as a "document."
When organization databases grow to contain thousands or millions of documents, traditional tools for retrieving data, such as search and sort functions, lose much of their practical utility. For example, when millions of e-mail messages are available, searching for a particular message or for a message relating to a particular topic is like trying to find a needle in a haystack. In such a situation, the individual performing the search is faced with too much information (TMI), and the knowledge embedded within the stored information remains largely untapped.
In recent years, some businesses have attempted to utilize the large pools of information on their data processing systems to greater advantage by analyzing that information with techniques known generally as data mining. As defined by the Microsoft Press Computer Dictionary, data mining is "the process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools" (4th ed., p. 125).
As one example, a cluster tool organizes documents into groups based on the contents of the documents. For instance, a business with customer complaint e-mails could identify areas of concern by using a cluster tool to group related customer complaints together. By contrast, traditional search techniques require the user to know in advance what characteristics are important. For example, with a traditional search function, an automobile manufacturer specifies a specific term, such as "engine," to determine whether engine complaints are numerous. A cluster tool, on the other hand, groups complaints into subject areas, thereby highlighting areas of concern that the manufacturer might not otherwise think to explore.
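The contrast between searching for a known term and grouping by content can be illustrated with a minimal sketch. The text does not specify a clustering algorithm, so the greedy single-pass grouping by Jaccard similarity below is an assumption chosen for brevity, as are the names `terms`, `jaccard`, `cluster`, the stopword list, the threshold value, and the sample complaints.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "my", "and", "when", "i", "it", "to", "of", "on", "in"}

def terms(text):
    """Lowercase the text and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(documents, threshold=0.2):
    """Greedy single-pass grouping (an assumed technique, not the source's):
    each document joins the first group whose accumulated term set is
    similar enough, otherwise it starts a new group."""
    groups = []  # list of (term_set, [documents])
    for doc in documents:
        doc_terms = terms(doc)
        for group_terms, members in groups:
            if jaccard(doc_terms, group_terms) >= threshold:
                members.append(doc)
                group_terms |= doc_terms  # in-place update of the group's term set
                break
        else:
            groups.append((doc_terms, [doc]))
    return [members for _, members in groups]

# Made-up sample complaints for illustration only.
complaints = [
    "The engine stalls when I brake",
    "Engine stalls on cold mornings",
    "Paint is peeling on the hood",
    "Hood paint peeling after a month",
]
groups = cluster(complaints)
```

In this sketch no search term is supplied in advance; the paint complaints surface as their own group even if the manufacturer only thought to ask about engines.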
However, a number of disadvantages are associated with conventional data mining systems, including shortcomings relating to the amount of time required to produce results, the pertinence of the results to the organization using those results, and the ability to analyze documents from different time periods, particularly when the analysis involves documents that have been archived.
Embodiments of the present invention provide a system and method for extracting knowledge from documents. In one embodiment, a data mining system according to the present invention includes a data retrieving component, a data integrating component, and a query manager. The data retrieving component and the data integrating component cooperate to generate intermediate data, such as marked-up documents, key term vectors, and/or data cubes, based on raw documents, such as e-mail messages, associated with an organization. The query manager uses the intermediate data to respond to queries relating to the raw documents.
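The division of labor among the three components can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the function and class names (`retrieve`, `integrate`, `QueryManager`, `documents_mentioning`), the dictionary used as a marked-up document, and the use of plain term counts as a key term vector are all hypothetical.

```python
import re
from collections import Counter

def retrieve(raw_documents):
    """Data retrieving step: wrap each raw document in a simple marked-up record.
    (The record layout is an assumption for illustration.)"""
    return [{"id": i, "body": text} for i, text in enumerate(raw_documents)]

def integrate(marked_up):
    """Data integrating step: build one key term vector (here, plain term
    counts) per marked-up document."""
    return {
        doc["id"]: Counter(re.findall(r"[a-z]+", doc["body"].lower()))
        for doc in marked_up
    }

class QueryManager:
    """Answers queries from the stored intermediate data rather than by
    rescanning the raw documents."""

    def __init__(self, term_vectors):
        self.term_vectors = term_vectors

    def documents_mentioning(self, term):
        """Return the sorted ids of documents whose term vector contains the term."""
        return sorted(
            doc_id
            for doc_id, vector in self.term_vectors.items()
            if vector[term.lower()] > 0
        )

# Made-up raw documents for illustration only.
raw = ["Engine noise reported again", "Budget meeting moved", "New engine supplier"]
manager = QueryManager(integrate(retrieve(raw)))
result = manager.documents_mentioning("engine")
```

Because the query manager reads only the term vectors, the intermediate data could be produced ahead of time and queried repeatedly without touching the original e-mail store.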
In another embodiment, the data integrating component generates and stores the intermediate data automatically and substantially independently of the query manager. For instance, the intermediate data may be generated and stored according to a sampling period.
In another embodiment, the data retrieving component identifies which raw documents are pertinent to the organization, based on characteristic data for the organization (i.e., organization data), such as personnel records. In this embodiment, the data retrieving component filters the raw documents by generating marked-up documents only for the raw documents identified as pertinent. For example, when processing e-mail messages, the data retrieving component may generate marked-up documents only for e-mail messages that were both sent and received by members of the organization.
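The e-mail filtering example can be sketched as below. The message layout (`sender`, `recipients`), the function name `pertinent_messages`, and the sample addresses are hypothetical. The text does not say whether every recipient or only some recipient must belong to the organization; this sketch assumes that at least one member recipient is enough.

```python
def pertinent_messages(messages, member_addresses):
    """Keep only messages sent by a member and received by at least one member.

    `member_addresses` stands in for organization data such as personnel
    records. Treating one member recipient as sufficient is an assumption.
    """
    members = {address.lower() for address in member_addresses}
    return [
        message
        for message in messages
        if message["sender"].lower() in members
        and any(r.lower() in members for r in message["recipients"])
    ]

# Made-up organization data and messages for illustration only.
members = ["ana@example.com", "raj@example.com"]
messages = [
    {"sender": "ana@example.com", "recipients": ["raj@example.com"]},
    {"sender": "vendor@example.org", "recipients": ["ana@example.com"]},
    {"sender": "Raj@example.com", "recipients": ["client@example.org"]},
]
kept = pertinent_messages(messages, members)
```

Only the first message survives the filter: the second was sent from outside the organization, and the third had no member among its recipients.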
Additional embodiments provide other technological solutions which facilitate knowledge extraction.