Structured information may be defined as information whose intended meaning is explicitly represented in the structure or format of the data. The canonical example of structured information is a relational database. Unstructured information may be characterized as information whose meaning requires interpretation in order to approximate and extract the intended meaning. Examples include natural language documents, speech, audio, images, and video. In other words, unstructured data is any data residing unorganized outside a database. Unstructured data can be text, audio, video, or graphics.
Unstructured information represents the largest, most current, and fastest growing source of information available to businesses and governments. By some estimates, unstructured data represents 80% of all corporate information. High-value information in these huge amounts of data is difficult to discover. Unstructured information is not in a format adapted to search techniques, so searching directly in unstructured sources is impractical. First, data must be analyzed to detect and locate items of interest. The result must then be structured so that powerful search engines and database engines can efficiently find what is requested, when it is requested. The bridge from the unstructured world to the structured world is called Information Extraction (IE).
An Unstructured Information Management (UIM) application is generally a software system that analyzes a large volume of unstructured information (text, audio, video, images, etc.) to discover, organize and deliver relevant knowledge to a client or to an end-user. An example is an application that processes millions of medical documents and reports to discover critical interactions between drugs, side effects and a disease history. Another example is an application that processes millions of documents to discover key evidence indicating probable terrorist threats.
The management of unstructured data is recognized as one of the major unsolved problems in the information technology (IT) industry. The main reason is that the tools and techniques that successfully transform structured data into business intelligence and usable information simply do not work when applied to unstructured data.
An Unstructured Information Management (UIM) system deploys Information Extraction (IE) techniques on large volumes of unstructured information in order to discover, organize and deliver relevant knowledge to a client.
Information Extraction (IE) is an important unsolved problem of Natural Language Processing (NLP). One of the most important problems in information extraction is the extraction of entities from text documents and the extraction of relations among these entities. Examples of entities are “people”, “organizations”, and “locations”. Examples of relations are “ORG-EMPLOY-EXECUTIVE”, “ORG-LOCATION”, and so on. For instance, the sentence “John Adams is the chief executive officer of XYZ Corporation” contains an “ORG-EMPLOY-EXECUTIVE” relation between the person “John Adams” and the organization “XYZ Corporation”.
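As an illustration only, the entities and the relation in the example sentence above could be held in a small data structure such as the following. The class names `Entity` and `Relation`, the type labels "PERSON" and "ORGANIZATION", and the order of the two arguments are assumptions made for this sketch; they are not taken from any particular extraction system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str         # surface string found in the document
    entity_type: str  # hypothetical type label, e.g. "PERSON"


@dataclass(frozen=True)
class Relation:
    label: str    # relation name, e.g. "ORG-EMPLOY-EXECUTIVE"
    arg1: Entity  # first argument (assumed here to be the organization)
    arg2: Entity  # second argument (assumed here to be the person)


def example_relation() -> Relation:
    """Build the relation expressed by the example sentence."""
    person = Entity("John Adams", "PERSON")
    organization = Entity("XYZ Corporation", "ORGANIZATION")
    return Relation("ORG-EMPLOY-EXECUTIVE", organization, person)
```

An extraction system would typically emit many such records per document, which can then be stored in a database and queried like any other structured data.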
Various techniques have been used to extract relations between related entities. In supervised approaches, human experts manually identify entities and relations in given examples. A classifier is trained on these examples and is used later to identify relations and entities at runtime. Semi-supervised approaches use seed samples provided by an expert and try to automatically obtain more examples similar to the seed samples. Then, the seed samples and the obtained examples are used to train a classifier, as in the supervised case.
Unstructured data comprises additional information other than entities and relations, such as the social network that represents the relations between the different entities, the period during which the entities have some relations, common factors shared between different entities, and so on. This complex and rich information is difficult to acquire and very difficult to represent in an informative way.
The HITS (“Hypertext Induced Topic Selection”) algorithm is an algorithm for rating, and therefore also ranking, Web pages. HITS uses two values for each page, the “authority value” and the “hub value”. The two values are defined in terms of one another in a mutual recursion. The authority value of a page is computed as the sum of the scaled hub values of the pages that point to it. The hub value of a page is the sum of the scaled authority values of the pages it points to. Relevance of the linked pages is also considered in some implementations. The HITS algorithm takes advantage of the following observation: when a page (hub) links to another page (authority), the former confers authority on the latter. The HITS approach is described in the publication entitled “Authoritative Sources in a Hyperlinked Environment”, J. Kleinberg, J. ACM (1999).
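The mutual recursion between authority values and hub values can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `hits`, the dictionary-based link representation, the fixed number of iterations, and the choice of Euclidean normalization as the "scaling" step are all assumptions of this sketch.

```python
import math


def hits(links, iterations=50):
    """Compute authority and hub values for a small link graph.

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format chosen for this sketch).
    Returns two dicts: (authority values, hub values).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub values of the pages pointing to each page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Scaling step (assumed here to be Euclidean normalization).
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority values of the pages each page points to.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return auth, hub
```

For example, calling `hits({"a": ["c"], "b": ["c"], "c": []})` gives page "c" all of the authority, while pages "a" and "b" share equal hub values because each links to the same authority.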
Extracting knowledge from unstructured data for some domains is a costly and often infeasible task, since many hand-crafted rules need to be generated to capture the various kinds of information. Although it is very difficult to extract such knowledge for any given domain, it is even more difficult to present and visualize the data in a clear and useful way to the user. There are several issues associated with extracting and presenting information from unstructured data including, for example: the automatic discovery of patterns for extracting relations between entities from any unstructured data and in any domain (application); extraction of knowledge characterizing each entity and relation from the unstructured data (such as the time during which the relation was valid and the location of this entity at that time); definition of Multi-Layered relations (relations with various constraints and conditions, for example relations in a given time frame or relations between two persons in a given organization); and visualization of the extracted knowledge (presenting the extracted knowledge in a way that enables the user to ingest and digest this knowledge).
Most prior art focuses only on the first issue, which consists in extracting relations between entities from unstructured text. Work in this field can be found in the article entitled “Extracting Patterns and Relations from the World Wide Web” (by Sergey Brin, Computer Science Department, Stanford University), published in “The proceedings of the 1998 International Workshop on the Web and Databases”. This publication is directed to the extraction of authorship information as found in book descriptions on the World Wide Web. This publication is based on dual iterative pattern-relation extraction, wherein a relation and pattern set is iteratively constructed. This approach has two major drawbacks: (1) it uses hand-crafted seed examples to extract more examples similar to these hand-crafted seed examples; and (2) it employs a lexicon as the main source for extracting information.
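The iterative construction of a relation set and a pattern set from seed examples can be illustrated by the following simplified sketch. It is not the method of the cited publication, which uses richer patterns; here a pattern is assumed to be nothing more than the text found between the two members of a known pair, and candidate entities are assumed to be runs of capitalized words. The function names `find_patterns`, `apply_patterns`, and `bootstrap`, and the sample sentences in the usage note, are hypothetical.

```python
import re

# Assumption of this sketch: an entity is a run of capitalized words.
NAME = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def find_patterns(corpus, pairs):
    """Collect the text lying between the two members of each known pair."""
    patterns = set()
    for sentence in corpus:
        for first, second in pairs:
            i = sentence.find(first)
            j = sentence.find(second)
            if i != -1 and j != -1 and i + len(first) <= j:
                middle = sentence[i + len(first):j]
                if middle.strip():  # ignore empty or whitespace-only patterns
                    patterns.add(middle)
    return patterns


def apply_patterns(corpus, patterns):
    """Use each pattern to extract new candidate pairs from the corpus."""
    pairs = set()
    for sentence in corpus:
        for middle in patterns:
            match = re.search(NAME + re.escape(middle) + NAME, sentence)
            if match:
                pairs.add((match.group(1), match.group(2)))
    return pairs


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between growing the pattern set and growing the pair set."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= find_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs, patterns
```

With the sentences "Isaac Asimov wrote Foundation in 1951.", "Frank Herbert wrote Dune in 1965.", "Frank Herbert is the author of Dune." and "Ursula Le Guin is the author of Earthsea.", and the single seed pair ("Isaac Asimov", "Foundation"), the first round learns the pattern " wrote " and finds the pair ("Frank Herbert", "Dune"); the second round uses that new pair to learn " is the author of " and finds ("Ursula Le Guin", "Earthsea"). The sketch also makes the first drawback visible: every extracted pair is reachable only through patterns that descend from the hand-crafted seed.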
The article entitled “Snowball: Extracting Relations from Large Plain-Text Collections” (Eugene Agichtein and Luis Gravano, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue NY), published in “Proceedings of the Fifth ACM International Conference on Digital Libraries”, 2000, discloses an approach similar to the previous work: seed examples are used to generate initial patterns and to iteratively obtain further patterns. Ad-hoc measures are then deployed to estimate the relevancy of the newly obtained patterns. The major drawbacks of this approach are: (1) its dependency on seed examples leads to a limited capability of generalization, (2) using hand-crafted examples leads to domain dependency, and (3) the estimation of the relevancy of patterns requires the deployment of ad-hoc measures.
US patent application publication 2004/0167907 entitled “Visualization of integrated structured data and extracted relational facts from free text” (Wakefield et al.) discloses a mechanism to extract simple relations from unstructured free text. However, this mechanism has several major drawbacks:
The mechanism to extract relations depends on a parse tree. This is a major drawback because accurate relations cannot be extracted.
It depends on human-made rules. The mechanism is designed for certain problems and must be changed for each problem, which is costly and difficult.
It deploys lexicons and other costly resources to extract information.
It is not general and cannot solve different problems in different domains.
It provides only simple relations and is not capable of providing highly detailed relations.
It is not fully automatic and needs human intervention.
U.S. Pat. No. 6,505,197 entitled “System and method for automatically and iteratively mining related terms in a document through relations and patterns of occurrences” (Sundaresan et al.) discloses an automatic and iterative data mining system for identifying a set of related information on the World Wide Web that defines a relationship, using the duality concept. Specifically, the mining system iteratively refines pairs of terms that are related in a specific way, and the patterns of their occurrences in web pages. The automatic mining system runs in an iterative fashion for continuously and incrementally refining the relations and their corresponding patterns. In one embodiment, the automatic mining system identifies relations in terms of the patterns of their occurrences in the web pages. The automatic mining system includes a relation identifier that derives new relations, and a pattern identifier that derives new patterns. The newly derived relations and patterns are stored in a database, which begins initially with small seed sets of relations and patterns that are continuously and iteratively broadened by the automatic mining system. However, this patent suffers from several drawbacks:
It depends on human work for providing seed patterns.
Resulting patterns are similar to the original seed patterns.
For each domain or application, new seed patterns must be provided by an expert; this is a tedious and costly process.
Extracting relations and patterns depends only on lexical (word) features, which is very limiting.
U.S. Pat. No. 6,606,625 entitled “Wrapper induction by hierarchical data analysis” (Muslea et al.) discloses an inductive algorithm generating extraction rules based on user-labeled training examples. The problem is that the labeling of the training data represents a serious bottleneck.
All previous solutions suffer from one or more of the following drawbacks:
They need hand-crafted rules or a large number of human-annotated examples for composing the patterns used to extract the relations.
They are domain-specific and designed to solve very specific problems.
They depend on seed examples. The resulting patterns are not general and are very similar to the seed examples.
They are not language independent.
They provide only simple relations and are not capable of providing highly detailed relations.
They do not extract detailed features for each entity and relation.
They do not allow sophisticated data mining on the extracted information.
They do not provide an efficient visualization for large amounts of data.