The present invention deals with processing of unstructured data. More specifically, the present invention deals with processing unstructured data into structured data, such as by populating a predefined schema.
Most data that people work with today is authored, in the first instance, digitally. For example, rather than beginning to write an article on a piece of paper, an author today typically begins by writing it using a laptop computer, a desktop computer, or another type of digital text processing system. Similarly, instead of being written as letters on paper, communications are often authored, in the first instance, on a computer, as electronic mail transmissions, as electronic telefacsimiles, or as instant messaging texts. In addition, rather than marking appointments on a paper calendar, many people now enter appointments electronically into a personal information manager that contains a calendar. In fact, even voicemail messages and multimedia presentations are often created and stored electronically.
According to one source, in the year 2002, over 5 million terabytes (5 exabytes) of new information was created. Approximately 92 percent of that information was stored on magnetic media, mostly hard disk drives. Also, over 400,000 terabytes of electronic mail was sent and stored electronically.
In addition to the creation of electronic data, much text is gathered electronically. For instance, present day hardware and software components provide the ability for computers to connect, download, process and store much more electronic information than has ever been possible before. While this can greatly enhance productivity, it can also create problems.
Much of the information that is authored, accessed, downloaded, or stored in electronic form is in unstructured form. For instance, one domain of information deals with the storage of personal contact information, such as a contact name, address and telephone number. This information is generally created as unstructured data, meaning that it is generated in the form of pure, unannotated text.
This information is then imported into a usable form, such as into a contact list in a personal information manager, or into a contact list in an electronic mail system. In the past, a relatively naive form of automatic mapping between the unstructured data and structured data has been used to import the information. For instance, handwritten rules have been used to map the portions of a telephone number entered as unstructured text into the structured fields of “area code”, “access code”, and “number”. Such handwritten rules can be thought of as a grammar that maps from input data to an output form that has more structure than the input data. However, such handwritten rules have many disadvantages.
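By way of illustration only, handwritten rules of the kind described above can be sketched as a small set of regular expressions. The patterns, the names RULES and parse_phone, and the assumed North American number layout are hypothetical and are not taken from any particular prior system; the sketch merely shows how each rule maps one way of typing a telephone number into the structured fields.

```python
import re
from typing import Optional

# Hypothetical handwritten rules. Each pattern covers one way a user
# might type a North American style telephone number.
RULES = [
    # "(425) 555-0123", "425-555-0123", "425.555.0123"
    re.compile(
        r"^\(?(?P<area>\d{3})\)?[\s.-]*(?P<prefix>\d{3})[\s.-]*(?P<line>\d{4})$"
    ),
    # "1-425-555-0123": a one-digit access code before the area code
    re.compile(
        r"^(?P<access>\d)[\s.-]+\(?(?P<area>\d{3})\)?[\s.-]*"
        r"(?P<prefix>\d{3})[\s.-]*(?P<line>\d{4})$"
    ),
]


def parse_phone(text: str) -> Optional[dict]:
    """Map unstructured text to structured fields, or None if no rule applies."""
    candidate = text.strip()
    for rule in RULES:
        match = rule.match(candidate)
        if match:
            groups = match.groupdict()
            return {
                "access code": groups.get("access") or "",
                "area code": groups["area"],
                "number": groups["prefix"] + "-" + groups["line"],
            }
    # No handwritten rule anticipated this form of input.
    return None


if __name__ == "__main__":
    for sample in ["(425) 555-0123", "1-425-555-0123", "425 555 0123 ext. 7"]:
        print(repr(sample), "->", parse_phone(sample))
```

In this sketch, an input that no rule anticipates, such as a number followed by an extension, simply yields no structured output.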
The handwritten rules are very expensive to produce and maintain. For instance, to produce the rules, an author must generally take the time to attempt to think of every possible way that a user may enter a phone number, and write a rule to map each such way of entering a phone number into a structured format. Of course, in order to maintain these rules, the author may be required to subsequently write additional rules that handle extensions, country codes, or various telephone system complexities that are added later.
Another disadvantage associated with handwritten rules is that they often do not cover the full range of possible inputs produced by real users. In other words, the author of the rules can almost never think of every possible way that a user may enter the unstructured data. If the author has not thought of a way that is used by a real user, then when such an input is encountered, the system breaks down because there is no rule to handle that specific form of input.
Yet another disadvantage involves localization. For instance, each time the handwritten rules are to be applied in a new geographic or cultural location, they must be localized. Different countries, for instance, represent addresses or postal codes in drastically different ways. A set of rules written to handle addresses and postal codes in one country may very well not adequately handle addresses and postal codes written in a different country. Therefore, each time the system is expanded to a different cultural or geographic location, a new set of rules, or at least additional rules, must be written to handle that particular location's representations of data.
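The localization burden can be illustrated with a brief sketch. The rule tables US_ONLY and WITH_GB, the function matching_locale, and the simplified postal code patterns below are hypothetical and are not complete postal specifications; they only show that a rule written for one country fails on another country's format until a further rule is written by hand.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical rule set written for one location only (simplified pattern).
US_ONLY: Dict[str, Pattern[str]] = {
    "US": re.compile(r"^\d{5}(-\d{4})?$"),  # e.g. "98052" or "98052-6399"
}

# Expanding to a new location requires a new handwritten rule.
# This United Kingdom pattern is a deliberate simplification.
WITH_GB: Dict[str, Pattern[str]] = dict(
    US_ONLY,
    GB=re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"),  # e.g. "SW1A 1AA"
)


def matching_locale(text: str, rules: Dict[str, Pattern[str]]) -> Optional[str]:
    """Return the locale whose rule matches the text, or None if none does."""
    candidate = text.strip().upper()
    for locale, rule in rules.items():
        if rule.match(candidate):
            return locale
    return None


if __name__ == "__main__":
    print(matching_locale("98052", US_ONLY))     # handled by the original rule
    print(matching_locale("SW1A 1AA", US_ONLY))  # not handled before localization
    print(matching_locale("SW1A 1AA", WITH_GB))  # handled once a new rule is added
```

Each additional location would require at least one further entry of this kind, written and maintained by hand.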