Unstructured text often contains information that becomes more useful when converted to a structured representation that supports effective querying and analysis. For example, addresses, bibliographic records, personalized web server logs, and personal media filenames are often created as unstructured strings that could be queried and analyzed more effectively if imported into a structured relational table. Building and maintaining large data warehouses by integrating data from several independent sources, some of which may supply unformatted strings, requires converting such strings into structured records before loading the data into relations.
The process involves segmenting unstructured strings into a target relational schema in order to populate a relation. Given a target schema consisting of N attributes, the goal is to partition the string into N contiguous sub-strings and to assign each of the sub-strings to a unique attribute of the schema. For instance, segmenting the input string “Segmenting text into structured records V. Borkar, Deshmukh and Sarawagi SIGMOD” into a bibliographic record with schema [Authors, Title, Conference, Year] requires assigning the sub-string “V. Borkar, Deshmukh and Sarawagi” to the Authors attribute, the sub-string “Segmenting text into structured records” to the Title attribute, “SIGMOD” to the Conference attribute, and a NULL value to the Year attribute.
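The partition-and-assign step above can be made concrete with a small sketch. The helper below is purely illustrative (the function and variable names are assumptions, not part of any described system): a segmentation is represented as a set of token breakpoints plus an ordered assignment of the resulting segments to schema attributes, with unassigned attributes left as NULL.

```python
# Illustrative sketch: representing one segmentation of a tokenized string.
# A segmentation = breakpoints in the token sequence + an ordered mapping of
# the resulting contiguous segments to schema attributes.

SCHEMA = ["Authors", "Title", "Conference", "Year"]

def build_record(tokens, breakpoints, attribute_order):
    """Split `tokens` at `breakpoints` and map each contiguous segment,
    in order, to the attributes in `attribute_order`. Attributes that
    receive no segment stay None (NULL)."""
    record = dict.fromkeys(SCHEMA)          # every attribute starts as NULL
    bounds = [0] + list(breakpoints) + [len(tokens)]
    for attr, start, end in zip(attribute_order, bounds, bounds[1:]):
        record[attr] = " ".join(tokens[start:end])
    return record

tokens = ("Segmenting text into structured records "
          "V. Borkar, Deshmukh and Sarawagi SIGMOD").split()
# Segments: tokens[0:5] -> Title, tokens[5:10] -> Authors, tokens[10:11] -> Conference
record = build_record(tokens, [5, 10], ["Title", "Authors", "Conference"])
```

Here `record["Year"]` remains NULL because the input string contains no year, matching the example in the text.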
Known techniques for automatically segmenting input strings into structured records can be classified into rule-based and supervised model-based approaches. Rule-based approaches, mostly adopted by commercial systems, require a domain expert to design and deploy a set of rules. This approach does not scale: each new domain requires designing and deploying a new rule set, and it is hard for a human expert to be comprehensive. Supervised approaches alleviate this problem by automatically learning segmentation models from training data consisting of input strings paired with their correctly segmented tuples. However, it is often hard to obtain training data, especially data comprehensive enough to illustrate all features of the test data. This problem is further exacerbated when the input test data is error-prone, since it is much harder to obtain comprehensive training data that effectively illustrates all kinds of errors. These factors limit both the applicability and the accuracy of supervised approaches. Ideally, a segmentation technique should require as little “manual training” effort as possible, because good, comprehensive training data is hard to collect.
Properties of semi-structured text have been exploited in recent work on wrapper induction, allowing these systems to automatically induce wrappers for web pages. Other work seeks to extract names of entities from natural language text (e.g., people, locations, organizations). Detecting entities in natural language text typically involves disambiguating phrases based on the actual words in the phrase and on the text context surrounding the candidate entity. Explored approaches include hand-crafted pattern matchers as well as machine learning methods.
Information extraction and named entity recognition research focuses on natural language text. In contrast, the strings stored in database attributes are short and typically not grammatical, so the techniques used in named entity tagging and wrapper induction do not carry over directly.
Hidden Markov Models (HMMs) are popular machine learning models and have been used extensively in information extraction and speech recognition. Since the structure of an HMM is crucial for effective learning, optimizing HMM structure has been studied in the context of information extraction and speech recognition. Specifically, the nested HMM structure chosen by Borkar et al. (“Automatic segmentation of text into structured records.” SIGMOD conference 2001) has been shown to be effective for such tasks when enough training data is available. As discussed earlier, obtaining comprehensive training data that illustrates all characteristics and varieties of errors that would be observed in input strings is difficult.
Robustness to input errors has long been studied in speech recognition. Approaches include filtering out noise during pre-processing and training the system under artificially noisy conditions (error injection). Noise-filtering techniques from speech recognition cannot be adapted directly to text segmentation, since input errors in text are not separable from the actual content.
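The error-injection idea mentioned above can be sketched as follows. This is a hypothetical illustration (the function name, edit operations, and error rate are all assumptions): clean training strings are perturbed with random character-level edits so that a supervised segmenter sees noise patterns resembling those in real, error-prone input.

```python
# Illustrative sketch of "error injection": perturb clean training strings
# with random character-level edits (delete / substitute / duplicate) to
# simulate the kinds of errors found in real input strings.
import random

def inject_errors(text, error_rate=0.1, rng=None):
    """Return `text` with each character independently corrupted
    with probability `error_rate` (fixed seed for reproducibility)."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        if rng.random() < error_rate:
            op = rng.choice(["delete", "substitute", "duplicate"])
            if op == "delete":
                continue                     # drop the character
            if op == "substitute":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                continue                     # replace with a random letter
            out.append(ch + ch)              # duplicate the character
        else:
            out.append(ch)
    return "".join(out)

noisy = inject_errors("V. Borkar, Deshmukh and Sarawagi SIGMOD")
```

A training set augmented this way pairs each noisy string with the segmentation of its clean original, so the learned model observes error patterns without any extra manual labeling.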