The invention relates generally to the field of information searching and analysis, and more particularly to data for a personal computer.
For thousands of years, mankind has kept records. From wall paintings, to stone tablets, to papyrus scrolls, to the illuminated vellum of the Middle Ages, man has always written down those things important to him. These records have been preserved, copied, and elaborated upon through the ages. For many years, however, access to this information was relatively arduous. Public establishments such as libraries might not have the proper documents, or might not have enough copies to satisfy demand, and even when source material was available, finding particular information of value within sizable source documents required significant expenditures of time.
With the advent of the computer, information storage and proliferation underwent a quantum increase. Documents could be easily stored in an electronic medium, replicated quickly, and disseminated virtually at will. Information of value can be located in documents via key word searches but, as experience with internet search confirms, getting what you want is still very arduous. Classifying data has become more difficult than ever before; searching through documents for specific words or concepts may resemble searching for the proverbial needle in a haystack. Such searches may also yield little of value if the searcher doesn't already know the right words for which to search.
An application for which this problem is especially acute concerns information flows (e.g., document publications, emails) that require monitoring to assure that sensitive information (e.g., military information or business trade secrets) is not revealed. Manually examining large numbers of documents for sensitive information can be very expensive and prone to human error. State-of-the-art computer-implemented solutions typically employ key-word text searches that are neither very accurate nor fast, because the amount of search time depends on the number of key words. Often the present computer-implemented solutions require more time than the average person to classify (or "declassify") the electronic information.
In the information age, the productivity of knowledge acquisition is a fundamental parameter underlying success. The fundamental problem around which productivity in knowledge acquisition turns is "how to quickly and accurately find information of value." There is a general need in the art for a data classification system that analyzes and sorts data rapidly and accurately. In this context, "classification" refers to the discovery of "information of value" (IOV) in the data. This IOV may be text, bitmaps, sound files, or any other uniquely identifiable data.
Generally stated, the present invention is a computer-implemented data analysis and classification system. The data analysis and classification system reads and analyzes data for the presence of IOV. IOV are concepts or associations of concepts in the form of words, alphanumeric expressions, or any other data that represent knowledge that is determined to be valuable. This IOV and the network of concepts related to it form an enterprise concept network (ECN). The ECN consists of (1) a set of concepts that represent a particular enterprise (i.e., a particular knowledge domain) and (2) a set of relationships between these concepts that enable IOV to be identified. In various embodiments, the ECN may be constituted with entirely different concepts and interrelationships. For example, one ECN might represent the knowledge domain related to nuclear materials, while another may characterize the knowledge domain associated with toxic waste. In this latter case, regulations related to toxic waste management might constitute IOV. Once the data analysis is completed, the data analysis and classification system returns a report on the field of data analyzed.
The data analysis and classification system identifies IOV in electronic information through the use of a hash table and an object model. This object model, which embodies the ECN, is referred to as the "active concept network" (ACN). Initially, the system retrieves a dictionary file storing a set of concepts. Concepts may be alphanumeric characters, such as words or phrases, or any other uniquely identifiable data that the system has previously encountered and is able to process. The data analysis and classification system then creates a hash table, placing each concept into the table.
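By way of illustration only, the dictionary-loading step described above might be sketched as follows. The sketch assumes a dictionary supplied as one concept per line and uses Python's built-in dict as the hash table; the function name build_concept_table, the lower-case normalization, and the integer identifiers are hypothetical choices made for this example and are not taken from the described system.

```python
def build_concept_table(dictionary_lines):
    """Place each concept from a dictionary listing into a hash table.

    Assumption: one concept (word or phrase) per line. Python's dict is
    itself a hash table, so lookups take roughly constant time no matter
    how many concepts are stored.
    """
    table = {}
    for line in dictionary_lines:
        concept = line.strip().lower()  # hypothetical normalization step
        if concept and concept not in table:
            # Assign each new concept the next integer identifier.
            table[concept] = len(table)
    return table


# Hypothetical dictionary contents, including a blank line and a duplicate.
concept_table = build_concept_table(
    ["uranium", "enrichment", "", "Uranium", "centrifuge cascade"]
)
```

Because membership in such a table is tested by hashing rather than by scanning a key-word list, the cost of checking one item of input does not grow with the size of the dictionary.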
Once the data analysis and classification system generates a hash table and places the concepts comprising the dictionary list into the hash table, relationships are created that define how objects interact within the ACN. Two types of relationships may exist within the ACN: first, operational relationships, which refer to logical connections, such as equivalency or implication; and second, output relationships, which specify what type of output is appropriate inside a given application. After creating the ACN, the data analysis and classification system then reads data, breaks the data down into component concepts and traces the relationships for each concept through the ACN. The system activates all relationships, both operational and output, pertaining to a given concept. This process is referred to as "propagation." If any concept propagation contains an output relationship, then appropriate output occurs.
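The propagation process described above may be illustrated by the following minimal sketch. It is an assumption-laden example rather than the described implementation: the class ActiveConceptNetwork, its methods, the classify helper, the breadth-first traversal order, and the sample concepts and flag message are all hypothetical. Operational relationships are modeled as directed edges (an equivalency being two opposing implications), and output relationships as messages attached to a concept.

```python
from collections import deque


class ActiveConceptNetwork:
    """Hypothetical object model holding concepts and their relationships."""

    def __init__(self):
        self.operational = {}  # concept -> set of concepts it implies
        self.outputs = {}      # concept -> list of output messages

    def add_implication(self, source, target):
        self.operational.setdefault(source, set()).add(target)

    def add_equivalence(self, a, b):
        # Equivalency is modeled as implication in both directions.
        self.add_implication(a, b)
        self.add_implication(b, a)

    def add_output(self, concept, message):
        self.outputs.setdefault(concept, []).append(message)

    def propagate(self, concept):
        """Activate every relationship reachable from one concept.

        Returns the set of activated concepts and the list of outputs
        produced by any output relationships encountered on the way.
        """
        seen = {concept}
        queue = deque([concept])
        emitted = []
        while queue:
            current = queue.popleft()
            emitted.extend(self.outputs.get(current, []))
            for nxt in sorted(self.operational.get(current, ())):
                if nxt not in seen:  # guards against cycles
                    seen.add(nxt)
                    queue.append(nxt)
        return seen, emitted


def classify(text, acn, known_concepts):
    """Break text into single-word concepts and propagate each known one."""
    results = []
    for token in text.lower().split():
        word = token.strip(".,;:!?")
        if word in known_concepts:
            _, emitted = acn.propagate(word)
            results.extend(emitted)
    return results


acn = ActiveConceptNetwork()
acn.add_equivalence("u-235", "uranium")
acn.add_implication("uranium", "nuclear material")
acn.add_output("nuclear material", "FLAG: possible sensitive nuclear content")

known = {"u-235", "uranium"}
report = classify("Shipment of U-235 arrives Monday.", acn, known)
```

In this sketch the input concept "u-235" carries no output relationship of its own; output occurs only because propagation reaches "nuclear material" through the equivalency and implication relationships. For brevity the sketch tokenizes on whitespace and therefore matches single-word concepts only; handling multi-word phrases would require a different tokenization step.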
The invention improves over the drawbacks of prior document classification systems and accomplishes the advantages described above. This will become apparent from the following detailed description of the embodiments and the appended drawings and claims.