1. Field of the Invention
This invention relates generally to the field of data processing and, more particularly, to the automated analysis and mining of concepts from unstructured data.
2. Related Art
Structured data or objects generally refer to data existing in an organized form, such as a relational database, that can be accessed and analyzed by conventional techniques (e.g., Structured Query Language, SQL). By contrast, so-called unstructured data or objects refer to objects in a textual format (e.g., faxes, e-mails, documents, voice converted to text) that do not necessarily share a common organization. Unstructured information often remains hidden and unleveraged by an organization, primarily because its unstructured nature makes it hard to access the right information at the right time or to integrate, analyze, or compare multiple items of information. There exists a need for a system and method to provide structure for unstructured information such that the unstructured objects can be accessed with powerful conventional tools (such as, for example, SQL, or other information query and/or analysis tools) and analyzed for hidden trends and patterns across a corpus of unstructured objects.
Conventional systems and methods for accessing unstructured objects have focused on tactical searches that seek to match keywords. These conventional systems and methods have several shortcomings. For example, assume a tactical search engine accepts search text. For purposes of illustration, suppose information about insects is desired and the user-entered search text is ‘bug’. The search engine scans the available unstructured objects. In this example, one unstructured object concerns the Volkswagen bug, one is about insects at night, one is about creepy-crawlies, one is about software bugs, and one is about garden bugs. The tactical search engine performs keyword matching, looking for the search text to appear in at least one of the unstructured objects. In this ‘bug’ example, only those objects about the Volkswagen bug, software bugs, and garden bugs actually contain the word ‘bug’ and will be returned. The objects about insects at night and creepy-crawlies may have been relevant to the search but unfortunately were not identified by the conventional tactical search engine.
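The ‘bug’ example above can be sketched in a few lines of Python. This is only an illustrative sketch of plain keyword matching, not the implementation of any particular search engine; the five sample texts, the identifiers, and the function name keyword_search are assumptions made up for illustration.

```python
# Hypothetical corpus mirroring the five unstructured objects described above.
documents = {
    "vw": "The Volkswagen bug is a classic compact car.",
    "night": "Many insects are most active at night.",
    "creepy": "Creepy-crawlies hide under rocks and logs.",
    "software": "The release fixed a software bug in the parser.",
    "garden": "Garden bugs can damage tomato plants.",
}


def keyword_search(query, docs):
    """Return the ids of documents whose text contains the query string.

    Matching is a case-insensitive substring test, which is the simplest
    form of keyword matching (an assumption for this sketch).
    """
    needle = query.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if needle in text.lower())


hits = keyword_search("bug", documents)
print(hits)  # ['garden', 'software', 'vw']
```

The two objects about insects ("night" and "creepy") are never returned, because neither contains the literal text ‘bug’, even though they are the ones most relevant to a user interested in insects.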
One conventional method of addressing this problem allows a user to enter detailed searches utilizing phrases or Boolean logic, but successful detailed tactical searches can be extremely difficult to formulate. The user must be sophisticated enough to express the search criteria in terms of Boolean logic. Furthermore, the user needs to know precisely what he or she is searching for, in the exact language in which he or she expects to find it. Thus, there is a need for a search mechanism to more easily locate documents or other objects of interest, preferably searching with the user's own vocabulary. Further, such a mechanism should desirably enable automatically searching related words and phrases, without knowledge of advanced searching techniques.
In another conventional method, the search is done based on meaning, where each of the words or phrases typed is semantically analyzed, as if second-guessing the user (for example, a search on the term “juvenile” also picks up “teenager”). This increases the size of the result set and thus makes analysis of search results even more important. Also, this technique can be inadequate and quite inaccurate when the user is looking for a concept like “definition of terrorism” or “definition of knowledge management,” where the “concept” of the phrase is more important than the meaning of the individual words in the search term.
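The word-by-word behavior described above can be illustrated with a minimal Python sketch. It assumes a small hand-written synonym table standing in for semantic analysis; the table, the sample texts, and the names SYNONYMS, expand_query, and expanded_search are all hypothetical and are not drawn from any specific prior art system.

```python
import re

# Hypothetical synonym table standing in for per-word semantic analysis.
SYNONYMS = {"juvenile": {"teenager", "minor"}}

# Hypothetical sample objects.
docs = {
    "a": "A juvenile was cited for a curfew violation.",
    "b": "The teenager joined the chess club.",
    "c": "Quarterly revenue rose.",
    "d": "A definition of photosynthesis.",
}


def expand_query(query, synonyms):
    """Expand each query word independently with its listed synonyms."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms |= synonyms.get(word, set())
    return terms


def expanded_search(query, documents, synonyms):
    """Return ids of documents sharing at least one word with the expanded query."""
    terms = expand_query(query, synonyms)
    return sorted(
        doc_id for doc_id, text in documents.items()
        if terms & set(re.findall(r"[a-z]+", text.lower()))
    )


print(expanded_search("juvenile", docs, SYNONYMS))
print(expanded_search("definition of terrorism", docs, SYNONYMS))
```

Under these assumptions, the first query returns both the "juvenile" and the "teenager" objects, showing how expansion enlarges the result set. The second query returns only the object about photosynthesis, because the words "definition" and "of" are matched individually while the concept of the whole phrase is ignored.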
Even when tactical searches succeed in searching or finding information, the problem of analyzing unstructured information still remains. Analyzing unstructured information goes beyond the ability to locate information of interest. Analysis of unstructured information would allow a user to identify trends in unstructured objects as well as to quickly identify the meaning of an unstructured object, without first having to read or review the entire document. Thus, there further exists a need to provide a system and methodology for analyzing unstructured information.
Prior art classification systems exist that can organize unstructured objects in a hierarchical manner. However, utilizing these classification systems to locate an object of interest requires knowing in advance which high-level category is of interest, and following one path of inquiry often precludes looking at other options.
Some prior art technologies store data and information utilizing proprietary methods and/or data structures. This prevents widespread or open access or analysis by keeping objects in a native non-standard proprietary format. Thus, there is a need to store captured information about unstructured objects in an open architecture and preferably in a readily accessible standard storage format.