1. Field of the Invention
The present invention relates to a system, method, and computer program product for obtaining structured data, and more particularly to a system, method, and computer program product for transforming unstructured or semi-structured text into structured criterion-value pairs that can be processed by a structured information system.
2. Discussion of the Background
The power of advisory and information access systems largely depends on the level of structure of the data these systems operate on. Structured data systems process records formatted as a set of criteria and normalized values, whereas text-based systems generally operate on the full text of a document. Full-text information systems are sometimes extended with natural language search capabilities but are mostly geared toward general document retrieval tasks and are limited in their ability to adequately interpret numerical data. By contrast, fully structured information systems based on a controlled vocabulary can perform powerful data mining tasks such as integrating data from multiple sources, interacting with users when a query is incomplete or ambiguous, discussing tradeoffs and alternative solutions to complex problems, and learning how to extrapolate solutions from previous situations.
For these reasons, powerful customer advisory systems are generally based on structured data and domain models created by experts. For example, Case Based Reasoning (CBR) systems provide a mechanism and methodology for building intelligent computer systems that solve new problems based on previous experiences (cases) stored in memory. In post-sales diagnostic advisory applications, past experience in the form of structured troubleshooting records can be used to learn relationships between symptoms and solutions in order to extend the scope of problems that can be diagnosed. These structured advisory systems, however, have no means for exploiting textual content.
While some data in corporate databases, merchant catalogs, or troubleshooting records contains structured information, much of the data includes regions of semi-formatted or unstructured text. For instance, product descriptions in online catalogs might be associated with a small number of structured fields, such as price or product category, as well as with more detailed free-text descriptions of the product characteristics. Likewise, many sales or troubleshooting records in company databases are semi-formatted and include text that provides precious information on customer interactions (questions, problems, objections, comments, etc.). Therefore, much of the useful data stored by many organizations is text data that is unavailable for application to the powerful structure-based systems developed by the organization.
There are different ways of combining unstructured and structured data. One way is to combine full-text search with structured retrieval in two distinct phases. Examples of this method are described in Daniels, J. J. and Rissland, E. L. (1997) What you saw is what you want: Using cases to seed information retrieval, Proceedings of the Second International Conference on Case-Based Reasoning (ICCBR-97), Case-Based Reasoning Research and Development, Providence, R.I., July, pp. 325–336, Lecture Notes in AI Series No. 1266, Springer: Berlin; and Lenz, M. and Burkhard, H.-D. (1997) CBR for Document Retrieval: The FAllQ Project, Proceedings of the Second International Conference on Case-Based Reasoning (ICCBR-97), Case-Based Reasoning Research and Development, Providence, R.I., July, pp. 84–93, Lecture Notes in AI Series No. 1266, Springer: Berlin. The entire content of these publications is incorporated herein by reference. The approaches in these documents are limited, however, because useful information is left buried in the text, out of the reach of the structural data processing engine.
Another technique is to transform the input text into structured data before feeding these data to the structural data processing engine. Examples of text extraction are disclosed in Cowie, J. and Lehnert, W. (1996) Information Extraction, Communications of the ACM, 39 (1) pp. 80–91, the entire content of which is also incorporated herein by reference. However, text transformation methods are unique to the formatting characteristics of the text to be transformed. For example, high reliability transformation methods that rely on formatting characteristics of the text can be created to transform text having a high degree of contextual regularity. Other less reliable methods that do not rely on formatting characteristics must be created for text having less regularity.
More specifically, text extraction techniques based on pattern-based rules and grammars can use various syntaxes. For instance, regular expression syntax is powerful but rigid in the sense that every variation of a term must be explicitly mentioned in the rule; it is not tolerant of misspellings or syntactic variations not mentioned explicitly in the pattern. Pattern-based rules have been especially popular for parsing semi-structured sources such as web pages and the like. In general, however, pattern-based extractors are brittle and must be adapted whenever the format of the source changes. Methods based on finite-state automata and on lexical parsers are more robust and flexible than pure regular expressions but have difficulty finding correct relationships between nodes when information is spread over multiple or ill-formatted sentences. In any case, the present inventors have recognized that pattern-based extraction requires the presence of regularities around the entities of interest and is well suited to extracting numerical values or semi-structured data; in free text, however, such regularities might not always be explicit. An example of the finite-state automata method is disclosed in D. E. Appelt et al., FASTUS: A Finite-state Processor for Information Extraction from Real-world Text, in Proceedings of the International Joint Conference on Artificial Intelligence, 1993, the entire content of which is incorporated herein by reference. An example of a full lexical parser can be found in James Allan, Ron Papka, and Victor Lavrenko, On-line new event detection and tracking, in Proc. ACM SIGIR, pages 37–45, 1998, the entire content of which is also incorporated herein by reference.
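As a concrete illustration of the brittleness of pattern-based rules, the following sketch applies a single regular-expression rule to catalog-style text to produce a criterion-value pair. The pattern, the field name "price", and the sample strings are illustrative assumptions, not part of the disclosure:

```python
import re

# A pattern-based extraction rule; the pattern and criterion name are
# hypothetical examples, not taken from the original disclosure.
PRICE_RULE = re.compile(r"[Pp]rice:\s*\$(\d+(?:\.\d{2})?)")

def extract_price(text):
    """Return a (criterion, value) pair, or None when the pattern fails."""
    match = PRICE_RULE.search(text)
    return ("price", float(match.group(1))) if match else None

# Works while the source follows the expected format...
extract_price("Price: $19.99 per unit")              # ("price", 19.99)
# ...but silently misses a minor variation the rule never anticipated.
extract_price("the product costs about 20 dollars")  # None
```

The second call shows the limitation discussed above: the same fact expressed without the expected surrounding regularity is simply not extracted, and the rule must be rewritten whenever the source format changes.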
In contrast to pattern-based matching methods, dictionary/ontology-based entity extractors automatically identify and semantically tag regions of text by matching each group of words against a pre-defined vocabulary. For instance, people-name extractors use a dictionary of first names and heuristics to classify an entity; company-name extractors may use a dictionary of common name parts and extensions (corp., inc., etc.); countries and regions are usually matched directly against lists of country and region names. Examples of such methods can be found in Rau, Lisa (1991) Extracting company names from text, in Proceedings of the IEEE Conference on AI Applications, 1991, the entire content of which is incorporated herein by reference. The advantage of these identification methods is that they tend to be more resistant to syntactic variations such as misspellings and less reactive to formatting changes. The present inventors have recognized, however, that the disadvantage of ontology-based methods is that one must first acquire a pre-defined vocabulary or face a high degree of failure (low recall). In addition, when applied to the generation of structured records from text, these methods can lead to ambiguity because not enough context is provided to decide what value should be assigned to what particular attribute.
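The dictionary-matching approach can be sketched as follows; the vocabularies, tag names, and suffix heuristic are hypothetical miniatures of the dictionaries described above, not part of the disclosure:

```python
# Miniature illustrative vocabularies; a real system would first have to
# acquire large pre-defined dictionaries (the acquisition cost noted above).
FIRST_NAMES = {"john", "lisa", "maria"}
COMPANY_SUFFIXES = {"corp.", "inc.", "ltd."}
COUNTRIES = {"france", "japan", "brazil"}

def tag_tokens(text):
    """Tag each token by matching it against the pre-defined vocabularies."""
    tokens = text.split()
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower().strip(",;")
        if word in FIRST_NAMES:
            tags.append((token, "PERSON"))      # first-name dictionary match
        elif word in COUNTRIES:
            tags.append((token, "LOCATION"))    # direct list match
        elif word in COMPANY_SUFFIXES or (
            i + 1 < len(tokens) and tokens[i + 1].lower() in COMPANY_SUFFIXES
        ):
            tags.append((token, "COMPANY"))     # suffix heuristic (corp., inc.)
        else:
            tags.append((token, "O"))           # untagged
    return tags

tag_tokens("Lisa works at Acme Corp. in France")
# [('Lisa', 'PERSON'), ('works', 'O'), ('at', 'O'),
#  ('Acme', 'COMPANY'), ('Corp.', 'COMPANY'), ('in', 'O'), ('France', 'LOCATION')]
```

Note that the tagger identifies entity types but does not, by itself, decide which structured attribute each tagged value should fill; that is the ambiguity the inventors point out when such extractors are used to generate structured records.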