This description relates to methods and apparatus for extracting information contained in a list of items in a document into a relational database table.
Much information exists on the World Wide Web, and much of that information exists in the form of structured data. Structured data is data that is presented in such a way that the presentation itself provides information about the elements of the data and how those elements relate to one another. One common example of structured data is a list. A list is a data structure that contains items of interrelated data elements. Items of a list are often organized on separate rows or lines of the list. For example, a shopping list can contain rows of data elements (items) that are currently needed from a shopping center. Each item of a list can have multiple data elements that are segregated into distinct fields, where each field contains information that is related to the information provided in the other data element fields. For example, each row of a shopping list can have data in two data element fields, one containing the name of the item that is needed from the shopping center as explained above, and the other containing the quantity of that item that is needed. Other, more elaborate lists are of course possible.
Another common example of structured data is a relational database table. A relational database table is a data structure that contains rows of data arranged in one or more columns. Each column of the database table defines an attribute of the data that is contained in the rows. Given the structural similarity between lists and relational database tables, structured data in the form of lists in a document can be converted into relational database tables. Once created, the tables can be used to easily extract the information content of the lists using conventional database manipulation techniques. This information content can then be used, for example, as a source of information for synonym discovery, to perform sophisticated web page searching, or to supply missing information in auto-complete schema.
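By way of illustration only, the correspondence between a well-formed list and a relational database table can be sketched as follows. The sketch assumes a hypothetical two-field shopping list in which every line uses a comma as its delimiter; the list contents, the table name `shopping`, and the helper `list_to_table` are illustrative assumptions and are not taken from this description.

```python
import sqlite3

# Hypothetical shopping list: each line holds an item name and a quantity,
# separated by a comma (an assumed, perfectly consistent delimiter).
shopping_list = ["milk, 2", "eggs, 12", "bread, 1"]

def list_to_table(lines):
    """Load well-formed two-field list lines into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Each column defines one attribute of the data held in the rows.
    conn.execute("CREATE TABLE shopping (item TEXT, quantity INTEGER)")
    rows = []
    for line in lines:
        item, quantity = (part.strip() for part in line.split(","))
        rows.append((item, int(quantity)))
    conn.executemany("INSERT INTO shopping VALUES (?, ?)", rows)
    return conn

conn = list_to_table(shopping_list)

# Once the table exists, conventional queries extract the list's content.
total = conn.execute("SELECT SUM(quantity) FROM shopping").fetchone()[0]
multiples = conn.execute(
    "SELECT item FROM shopping WHERE quantity > 1 ORDER BY item"
).fetchall()
```

In this sketch the conversion is trivial only because every line is assumed to contain exactly two fields and one delimiter.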
Converting lists into relational database tables is not always a straightforward task, however. First, lists generally are not clearly delineated into columns or cells or fields. Rather, each item or line in a list can consist of largely unstructured text. Moreover, even when delimiters are used to separate the items of a list into fields, the delimiters can be missing in some lines of the list or inconsistently applied in others. Furthermore, information can be missing from an item, and the item can lack any indication that the information is missing or where it should have been provided. Consider, for example, “The 50 Greatest Cartoons” list 700 shown in FIG. 7A. Visual inspection of the list indicates that it contains the following fields: a ranking or identifier (e.g., 1, 2, 3 . . . ), the name of the cartoon, the production company, and the production year. However, some information fields are missing from some of the items of the list. For example, the “Gertie the Dinosaur” line (item 6) is missing the production year. And while many of the lines of the list appear to be well delineated into fields using delimiters such as the period (“.”) and slash (“/”), some of the delimiters are missing in some of the lines or are used for other purposes. For example, while a period is generally used to delineate the ranking from the name of the cartoon, it is also used to abbreviate the name of the “Warner Bros.” production company in some of the lines. Similarly, while a slash (“/”) is generally used to delineate the name of the production company from the production year, it is part of the cartoon name in the line for the “Duck Dodgers in the 24½ Century” cartoon.
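The difficulties described above can be illustrated with a minimal sketch of a naive approach that simply splits each line on every period and slash. The lines below are hypothetical placeholders that mimic the kinds of irregularity described; they are not reproduced from list 700 of FIG. 7A, and the helper `naive_split` is an illustrative assumption rather than any method of this description.

```python
import re

# Hypothetical lines; intended fields are rank, name, company, and year.
lines = [
    "1. Cartoon A. Studio X / 1950",       # consistently delimited
    "2. Cartoon B. Warner Bros. / 1951",   # period also marks an abbreviation
    "3. Cartoon C. Studio Y",              # production year is missing
    "4. Cartoon 24 1/2. Studio Z / 1953",  # slash appears inside the name
]

def naive_split(line):
    """Split on every period or slash, treating each as a field delimiter."""
    return [part.strip() for part in re.split(r"[./]", line)]

# Only the first line yields the four intended fields.
field_counts = [len(naive_split(line)) for line in lines]
```

Under these assumptions the naive splitter produces four fields for the first line but five, three, and five fields for the remaining lines, so the resulting rows cannot be placed directly into a four-column table.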