The present invention relates to processing data. In particular, the present invention is related to converting text into data records.
Data extraction is the process of converting digital text to digital data records. For example, the text of a web page found on a web site that sells cars may be converted into a set of records, one record for each car that is offered for sale. Each car may be associated with “values” for its attributes of make, model, year, color and price. The set of attributes for a particular car makes up the record associated with that car. Some of these attributes may have no value; that is, an attribute is left unassigned merely to indicate that no value was extracted for it from the web page text for that car.
A record for a particular car may include the value “Alpha Romeo” for the attribute of “make”; “red” for the value of the attribute “color”; and “$1900” for the value of the attribute of price. The other attributes, “model” and “year,” are left blank.
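The example record above can be sketched as a small data structure. This is a minimal illustration only; the class name `CarRecord` and the use of `None` to stand for an attribute with no extracted value are assumptions made for this sketch and are not taken from the text.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CarRecord:
    """Hypothetical record type: one field per attribute, None means 'no value extracted'."""
    make: Optional[str] = None
    model: Optional[str] = None
    year: Optional[str] = None
    color: Optional[str] = None
    price: Optional[str] = None


# The record described in the text: make, color and price have values.
record = CarRecord(make="Alpha Romeo", color="red", price="$1900")

# Attributes that were left blank for this car.
blank_attributes = [name for name, value in asdict(record).items() if value is None]
```

Keeping blank attributes as explicit empty fields, rather than omitting them, lets every record share the same set of attributes for later searching and sorting.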
Converting text information to records is useful because it allows searching, sorting and presenting of the data based on the values of the different attributes. However, not all records come from the same text or are presented in the same text format. Therefore, it is desirable to extract records from a variety of different texts in different formats. Usually, data extraction can be done by using software called a “data extractor” that is tailored to the text format of interest, one extractor for each type of text format. Alternatively, it is possible to develop a data extractor that deduces the format of the source text and then uses that format to guide it in extracting records. These data extractors are referred to as “automatic data extractors.” Automatic data extractors can be used on texts from different sources or on texts from the same source but where that text format may change from time to time.
In order to deduce the format of a newly encountered text, the automatic data extractor may use its knowledge of the attribute values and various other formats. Knowledge of attribute values, referred to as “domain knowledge,” may be contained in a vocabulary list stored in the memory of a computer or digital storage device. To use the example given above, domain knowledge of color values of cars may include not only colors such as “red”, “blue” and “green” but a list of color labels such as “color” and “coloring.”
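A vocabulary list of this kind, holding both values and labels for an attribute, might be sketched as follows. The dictionary layout and the function name `recognize` are assumptions made for illustration; the text does not prescribe how the list is stored or queried.

```python
# Hypothetical vocabulary list: for each attribute, known values and known labels.
VOCABULARY = {
    "color": {
        "values": {"red", "blue", "green"},
        "labels": {"color", "coloring"},
    },
}


def recognize(token):
    """Return (attribute, kind) pairs for a token, where kind is 'value' or 'label'."""
    word = token.strip().lower()
    hits = []
    for attribute, entry in VOCABULARY.items():
        if word in entry["values"]:
            hits.append((attribute, "value"))
        if word in entry["labels"]:
            hits.append((attribute, "label"))
    return hits
```

For instance, `recognize("Red")` reports a color value, `recognize("Coloring")` reports a color label, and a word absent from the list yields no hits.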
Referring now to FIG. 1, which depicts a schematic view of a prior art automatic data extractor, automatic data extractors may use focuser procedures to identify regions of interest in text that is read in. These procedures include vocabularies of “recognizers” to identify attribute values and labels in texts. To identify various formats, automatic data extractors use “focus parsers” or “parsers”. Parsers identify regions in a source text where the text may contain attribute values. Recognizers can be used to evaluate regions of text identified by parsers. Because different parsers may be more appropriate for a given region of text known to have attribute values, the results provided by different parsers must be evaluated or “graded” for their appropriateness. The focuser procedures that grade parser results in this way are called “focus graders” or “graders.” Thus, focuser procedures include three components: recognizers, parsers and graders. Recognizers are used by parsers to identify attribute values in a source text. Then, graders are used to determine which parser produced results that best fit the text.
For example, suppose the source text is a web page that contains a banner advertisement, several pages of free text, and a table that contains record data. The goal of the focuser procedure is to identify the region of interest; here, the table. First, each focus parser is applied to the text. One focus parser may identify the free text and another may identify the table. The first parser returns the free text region; and the second one, the table region. The focus grader is applied to both regions returned. The graders determine which region contains the most attribute values, or the most attribute values and labels, or the most attribute values and labels per number of words in the region, depending on the grading algorithm. The region that achieves the highest grade becomes the “region of interest.”
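One of the grading algorithms mentioned above, recognized values and labels per number of words in the region, can be sketched as follows. The term set, the sample regions and the function names are hypothetical and serve only to illustrate the idea; the parsers themselves are omitted and their outputs are supplied directly.

```python
import re

# Hypothetical domain knowledge: recognized attribute values and labels.
KNOWN_TERMS = {"red", "blue", "green", "color", "price", "make"}


def grade_region(text):
    """Grade a region as recognized terms per word (0.0 for an empty region)."""
    words = re.findall(r"[a-z0-9$]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in KNOWN_TERMS)
    return hits / len(words)


def region_of_interest(candidate_regions):
    """Return the candidate region that achieves the highest grade."""
    return max(candidate_regions, key=grade_region)


# Regions as two different focus parsers might return them.
free_text = "Welcome to our dealership where we have served happy customers for years"
table = "make color price\nFiat red $1900\nFord blue $2500"

best = region_of_interest([free_text, table])
```

Here the free text contains no recognized terms, while five of the nine words in the table are recognized, so the table is selected.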
Automatic data extractors may also contain segmenter procedures that include segment parser and segment grader components. Segmenter procedures are designed to identify a series of “record regions” in the text that each contain data for a single record. If the region of interest is a table, each row of the table may include the attribute values of a record and thus be a record region.
After a region of interest has been identified, segment parsers are applied to it. The first parser may return each cell as a record region; the second, each row; and the third, each column. Then the segment grader is applied. Recognizers are again used by the graders to evaluate the different series of record regions returned. The graders apply the programmed algorithm, which may penalize record regions that have fewer than one or more than one value per attribute. As before, the series of record regions that receives the best grade becomes the series of interest.
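The cell, row and column example can be sketched as follows, with a grader that penalizes any record region holding fewer than one or more than one recognized value per attribute. The vocabulary, the table contents and the exact penalty (the absolute difference from one) are assumptions chosen for this illustration.

```python
# Hypothetical vocabulary of recognized values for two attributes.
VOCAB = {
    "make": {"fiat", "ford", "saab"},
    "color": {"red", "blue", "green"},
}

# Region of interest, already split into table cells.
table = [["Fiat", "red"], ["Ford", "blue"], ["Saab", "green"]]


def by_cell(t):
    return [[cell] for row in t for cell in row]


def by_row(t):
    return [list(row) for row in t]


def by_column(t):
    return [list(column) for column in zip(*t)]


def penalty(region):
    """Sum over attributes of how far the recognized-value count is from exactly one."""
    total = 0
    for values in VOCAB.values():
        count = sum(1 for field in region if field.lower() in values)
        total += abs(count - 1)
    return total


def grade(series):
    """Higher is better: the negated total penalty of all record regions."""
    return -sum(penalty(region) for region in series)


candidates = {"cell": by_cell(table), "row": by_row(table), "column": by_column(table)}
best_name = max(candidates, key=lambda name: grade(candidates[name]))
```

With this table, each cell lacks one attribute and each column repeats one attribute while lacking the other, so only the row segmentation is free of penalty.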
Once the record regions are identified, automatic data extractors produce records as follows. For each record region, a record is formed with all the attributes initially having no values. Then for each attribute, if there is at least one recognized value in the record region, the first such value becomes the record value. That first value is “extracted” from the text and entered into the data record.
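The record-forming step can be sketched as follows. The vocabulary, the attribute list and the representation of a record region as a list of fields are assumptions for illustration.

```python
# Hypothetical vocabulary; "model", "year" and "price" have no recognizers here.
VOCAB = {
    "make": {"fiat", "ford"},
    "color": {"red", "blue"},
}
ATTRIBUTES = ["make", "model", "year", "color", "price"]


def extract_record(region):
    """Form a record from one record region, taking the first recognized value per attribute."""
    record = {attribute: None for attribute in ATTRIBUTES}  # all attributes start empty
    for attribute, values in VOCAB.items():
        for field in region:
            if field.lower() in values:
                record[attribute] = field
                break  # only the first recognized value is extracted
    return record


record = extract_record(["Fiat", "Panda", "red", "blue"])
```

Note that "Panda" is not in the vocabulary, so the "model" attribute stays empty even though the value is present in the text; this is the limitation discussed next.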
This kind of automatic data extractor relies heavily on the domain knowledge in its vocabulary lists. The more comprehensive the list of recognizers, the better will be the deduction of source text information and the more complete the data records. Therefore, having a larger vocabulary list is better. However, building a large vocabulary list is labor intensive. Furthermore, vocabulary changes over time. New values come into existence and old values become out of date. Thus not only is building a large list labor intensive, so also is maintaining an up-to-date list. Thus there remains a need for a better way to develop and maintain vocabulary lists in automatic data extractors.
The present invention is a method for increasing the vocabulary of an automatic data extractor and it is also an automatic data extractor that automatically learns new vocabulary. The present automatic data extractor increases its vocabulary by learning as it is applied to extract data records from text. An automatic data extractor that learns new vocabulary can extract more data records from text. The present automatic data extractor uses domain knowledge to deduce data structure, then uses both the new structure and domain knowledge to extract new values not previously in its vocabulary and adds them to the records and to its vocabulary.
The method includes procedural components in addition to those in prior art data extractors, namely, field parsers and field graders. Each field parser is applied to a series of record regions to create a candidate series of field lists. Then the field grader uses recognizers to choose a single best series of field lists from the various candidates created by the field parsers. Next, an attribute mapper is applied to the selected series of field lists to determine the positions of the attributes in the list. Once it is known that a particular attribute corresponds to a particular position in the field lists, the fields in that position of the field lists are written as the attribute values to the corresponding record whether they are in the vocabulary list or not. In this way, new values of attributes are deduced or “gleaned” from a text source. If a field is not in the vocabulary list, it is added to the vocabulary list. Thus, the data extractor learns new vocabulary values through use.
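The attribute mapping and gleaning steps can be sketched as follows, starting from a series of field lists that is assumed to have already been selected by the field grader. The mapping rule used here (assign to each position the attribute whose vocabulary recognizes the most fields in that position) is an assumption for illustration; the text does not specify how the attribute mapper works. All names and data are hypothetical.

```python
# Hypothetical starting vocabulary; "saab" and "teal" are not yet known.
vocabulary = {"make": {"fiat", "ford"}, "color": {"red", "blue"}}

# One field list per record region, as selected by the field grader.
field_lists = [["Fiat", "red"], ["Ford", "blue"], ["Saab", "teal"]]


def map_attributes(field_lists, vocabulary):
    """Map each field position to the attribute recognizing the most fields there."""
    width = min(len(fields) for fields in field_lists)
    mapping = {}
    for position in range(width):
        column = [fields[position].lower() for fields in field_lists]
        scores = {
            attribute: sum(1 for value in column if value in values)
            for attribute, values in vocabulary.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:  # leave the position unmapped if nothing is recognized
            mapping[position] = best
    return mapping


def glean(field_lists, vocabulary):
    """Write fields to records by position and add unseen values to the vocabulary."""
    mapping = map_attributes(field_lists, vocabulary)
    records = []
    for fields in field_lists:
        record = {attribute: None for attribute in vocabulary}
        for position, attribute in mapping.items():
            value = fields[position]
            record[attribute] = value  # written whether or not it was recognized
            vocabulary[attribute].add(value.lower())  # new values are learned
        records.append(record)
    return records


records = glean(field_lists, vocabulary)
```

In this sketch the third record receives "Saab" and "teal" purely because of their positions, and both values are then available as recognizers the next time the extractor is applied.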
An important advantage of the present invention is that it is able to produce a more complete record than prior art automatic data extractors because it deduces new attribute values not previously in its vocabulary. Just as importantly, it is able to add these new attribute values to that vocabulary to increase the size of its vocabulary automatically.
These and other features and advantages of the present invention will become apparent to those skilled in the art of data extraction from a careful reading of the Detailed Description of Preferred Embodiments accompanied by the following drawings.