The present invention relates generally to a method for automatically analyzing and characterizing the content of electronically formatted natural language based documents.
As the global economy has become increasingly driven by the skillful synthesis of information across all disciplines, be they scientific, economic, or otherwise, the sheer volume of information available for use in such a synthesis has rapidly expanded. This has resulted in an ever increasing value for systems or methods which are able to analyze information and separate information relevant to a particular problem or useful in a particular inquiry from information that is not relevant or useful. The vast majority of information available for such synthesis, 95% according to estimates by the National Institute for Science and Technology (NIST), is in the form of written natural language. The traditional method of analyzing and characterizing information in the form of written natural language is to simply read it. However, this approach is increasingly unsatisfactory as the sheer volume of information outpaces the time available for manual review. Thus, several methodologies for automating the analysis and characterization of such information have arisen. Typical for such schemes is the requirement that the information is presented, or converted, to an electronic form or database, thereby allowing the database to be manipulated by a computer system according to a variety of algorithms designed to analyze and/or characterize the information available in the database. For example, vector based systems using first order statistics have been developed which attempt to define relationships between documents based upon simple characteristics of the documents, such as word counts.
The simplest of these methodologies is a simple search wherein a word or a word form is entered into the computer as a query and the computer compares the query to words contained in the documents in the database to determine if matches exist. If there are matches, the computer then returns a list of those documents within the database which contain a word or word form which matches the query. This simple search methodology may be expanded by the addition of other Boolean operators into the query. For example, the computer may be asked to search for documents which contain both a first query and a second query, or a second query within a predetermined number of words from the first query, or for documents containing a query which consist of a series of terms, of for documents which contain a particular query but not another query. Whatever the particular parameters, the computer searches the database for documents which fit the required parameters, and those documents are then returned to the user.
Among the drawbacks of such schemes is the possibility that in a large database, even a very specific query may match a number of documents that is too large to be effectively reviewed by the user. Additionally, given any particular query, there exists the possibility that documents which would be relevant to the user may be overlooked because the documents do not contain the specific query term identified by the user; in other words, these systems often ignore word to word relationships, and thus require exacting queries to insure meaningful search results. Because these systems tend to require exacting queries, these methods suffer from the drawback that the user must have some concept of the contents of the documents in order to draft a query which will generate the desired results. This presents the users of such systems with a fundamental paradox: In order to become familiar with a database, the user must ask the right questions or enter relevant queries; however, to ask the right questions or enter relevant queries, the user must already be familiar with the database.
To overcome these and other drawbacks, a number of methods have arisen which are intended to compare the contents of documents in an electronic database and thereby determine relationships between the documents. In this manner, documents that address similar subject matter but do not share common key words may be linked, and queries to the database are able to generate resulting relevant documents without requiring exacting specificity in the query parameters. For example, vector based systems using higher order statistics may be characterized by the generation of vectors which can be used to compare documents. By measuring conditional probabilities between and among words contained within the database, different terms may be linked together. However, these systems suffer from the drawback that they are unable to discern words which provide insight into the meaning of the documents which contain them. Other systems have sought to overcome this limitation by utilizing neural networks or other methods to capture the higher order statistics required to compress the vector space. These systems suffer from considerable computational lag due to the large amount of information that they are processing. Thus, there exists a need for an automated system which will analyze and characterize a database of electronically formatted natural language based documents in a manner wherein the system output correlates documents within the database according to the meaning of the documents and required system resources are minimized.
Accordingly, the present invention is a System for Information Discovery (SID). The overriding intent of the present invention is to provide a system for analyzing and characterizing a database of electronically formatted natural language based documents wherein the output is information concerning the content and structure of the underlying database in a form that correlates the meaning of the individual documents within the database. A sequence of word filters are used to eliminate terms in the database which do not discriminate document content, resulting in a filtered word set whose members are highly predictive of content. The filtered word set is then further reduced to determine a subset of topic words which are characterized as the set of filtered words which best discriminate the content of the documents which contain them. These two word sets, the filtered word set and the topic set, are then formed into a two dimensional matrix. Matrix entries are then calculated as the conditional probability that a document will contain a word in a row given that it contains the word in the column of the matrix. The number of word correlations which is computed is thus significantly reduced because each word in the filtered set is only related to the topic words, with the topic word set being smaller than the filtered word set. The matrix representation thus captures the context of the filtered words and allows the resultant vectors to be utilized to interpret document contents with a wide variety of querying schemes. Surprisingly, while computational efficiency gains are realized by utilizing the reduced topic word set (as compared with creating a matrix with only the filtered word set forming both the columns and the rows), the ability of the resultant vectors to predict content is comparable or superior to approaches which consider word sets which have not been reduced either in the number of terms considered or by the number of correlations between terms.
The first step of the system is to compress the vocabulary of the database through a series of filters. Three filters are employed, the frequency filter, the topicality filter and the overlap filter. The frequency filter first measures the absolute number of occurrences of each of the words in the database and eliminates those which fall outside of a predetermined upper and lower frequency range.
The topicality filter then compares the placement of each word within the database with the expected placement assuming the word was randomly distributed throughout the database. By expressing the ratio between a value representing the actual placement of a given word (A) and a value representing the expected placement of the word assuming random placement (E), a cutoff value may be established wherein words whose ratio A/E is above a certain predefined limit are discarded. In this manner, words which do not rise to a certain level of nonrandomness, and thus do not represent topics, are discarded.
The overlap filter then uses second order statistics to compare the remaining words to determine words whose placement in the database are highly correlated with one and another. Measures of joint distribution are calculated for word pairs remaining in the database using standard second order statistical methodologies, and for word pairs which exhibit correlation coefficients above a preset value, one of the words of the word pair is then discarded as its content is assumed to be captured by its remaining word pair member.
At the conclusion of these three filtering steps, the number of words in the database is typically reduced to approximately ten percent of the original number. In addition, the filters have discriminated and removed words which are not highly related to the topicality of the documents which contain them, or words which are redundant to words which reveal the topicality of the documents which contain them. The remaining words, which are thus highly indicative of topicality and non-redundant, are then ranked according to some predetermined criteria designed to weight them according to their inherent indicia of content. For example, they may be ranked in descending order of their frequency in the database, or according to ascending order according to their rank in the topicality filter.
The filtered words thus ranked are then cut off at either a predetermined limit or a limit generated by some parameter relevant to the database or its characteristics to create a reduced subset of the total population of filtered words. This subset is referred to as a topic set, and may be utilized as both an index and/or as a table of contents. Because the words contained in the topic set have been carefully screened to include those words which are the most representative of the contents of the documents contained within the database, the topic set allows the end user the ability to quickly surmise both the primary contents and the primary characteristics of the database.
This topic set is then utilized as rows and the filtered words are utilized as columns in a matrix wherein each of the elements of the columns and the rows are evaluated according to their conditional probability as word pairs. The resultant matrix evaluates the conditional probability of each member of the topic set being present in a document, or a predetermined segment of the database which can represent a document, given the presence of each member of the filtered word set. The resultant matrix can be manipulated to characterize documents within the database according to their context. For example, by summing the vectors of each word in a document also present in the topic set, a unique vector for each document which measures the relationships between the document and the remainder of the database across all the parameters expressed in the topic set may be generated. By comparing vectors so generated for any set of documents contained within the data set, the documents be compared for the similarity of the resultant vectors to determine the relationship between the contents of the documents. In this manner, all of the documents contained within the database may be compared to one and another based upon their content as measured across a wide spectrum of indicating words so that documents describing similar topics are correlated by their resultant vectors.
The subject matter of the present invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. However, both the organization and method of operation, together with further advantages and objects thereof, may best be understood by reference to the following description.