Globally, industries are experiencing a significant increase in the volume of unstructured information, especially textual data. This trend is supported by reports from various market research firms, such as the recent finding that the amount of text-based data alone will grow to over 800 terabytes by the year 2004. It has also been recently determined that between 80% and 90% of all information on the Internet and stored within corporate networks is unstructured. Finally, it has been noted that the amount of unstructured data in large corporations tends to double every two months.
Thus, the volume of extant textual data is growing exponentially. Because this data can contain a significant amount of business-critical information, there is a growing need for some type of automated means of processing this data, such that the value buried in it can be extracted. Ideally, such a method would involve natural language processing systems and methods. One such potential method is the ability to compare and contrast two or more documents at the semantic level, both to detect similarities of meaning or content between them and to identify specific concepts that may be of interest. While a group of human reviewers or researchers could in principle perform such a task, given the sheer volume and wide variety of subject matter created by even a medium-sized business, it would tend to be complex and time-consuming. In most large businesses, such a task would require a dedicated department whose cost would generally not be economically justifiable.
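To make the notion of comparing documents "at the semantic level" concrete, the following is a minimal sketch of one common lexical proxy for such a comparison: representing each document as a bag-of-words term vector and measuring the cosine similarity between vectors. This is an illustrative assumption only, not the method disclosed here; the function names, the stopword list, and the sample documents are all hypothetical, and a production system would employ far richer natural language processing.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (hypothetical; a real system would
# use a curated list or weighting scheme such as TF-IDF).
STOPWORDS = {"the", "to", "in", "of", "a", "for", "is", "will", "all", "every"}

def term_vector(text):
    """Build a term-frequency vector over non-stopword tokens.

    This bag-of-words representation is only a crude lexical stand-in
    for semantic content.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors.

    Returns 1.0 for identical word distributions and 0.0 when the
    documents share no (non-stopword) vocabulary.
    """
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical sample documents: two related, one unrelated.
doc1 = "The merger will increase quarterly revenue across all divisions."
doc2 = "Analysts expect the merger to boost revenue in every division."
doc3 = "The company picnic is scheduled for the first week of July."

sim_related = cosine_similarity(term_vector(doc1), term_vector(doc2))
sim_unrelated = cosine_similarity(term_vector(doc1), term_vector(doc3))
print(f"related:   {sim_related:.2f}")
print(f"unrelated: {sim_unrelated:.2f}")
```

In this sketch the related pair scores higher than the unrelated pair, illustrating how pairwise similarity scores could feed a report of mutual similarities or divergences among documents. Note that purely lexical overlap misses synonyms and paraphrase, which is precisely the gap a genuinely semantic method would address.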
What is thus needed in the art is an automated system and method for the natural language processing of textual data so as to discover (i) valuable content contained within such data; and (ii) mutual similarities or divergences between or among two or more documents. Such a system and method could then generate reports of the results of such processing in one or more forms readily usable by humans.