There is a vast wealth of information in documents on the Internet, such as telephone directories, stock tables, and product catalogs. This information becomes far more useful once it is extracted and then aggregated, managed, filtered, and redistributed in response to a particular search query. Extraction is difficult, however, because the information is formatted primarily for a human web user to read rather than for a program to process. To address this problem, a procedure called a wrapper was developed. A wrapper translates the information from the format found on a specific web document into a form that can be collated, managed, and redistributed. However, wrappers are often written manually, which is time-consuming and error-prone.
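As an illustration only, a hand-written wrapper for a single page layout might look like the following sketch. The sample markup, the class names `name` and `phone`, and the function `wrap_directory` are all hypothetical and chosen for the example; real pages would need their own patterns, which is exactly why manual wrappers are costly to maintain.

```python
import re

# Hypothetical markup for one telephone-directory page (an assumption,
# not taken from any real site).
PAGE = """
<table>
<tr><td class="name">Ada Lovelace</td><td class="phone">555-0100</td></tr>
<tr><td class="name">Alan Turing</td><td class="phone">555-0101</td></tr>
</table>
"""

# The pattern is tied to this exact layout; a different site, or a
# redesign of this one, would require a new hand-written pattern.
ROW = re.compile(r'<td class="name">(.*?)</td><td class="phone">(.*?)</td>')


def wrap_directory(html):
    """Translate the human-oriented markup into a list of plain records."""
    return [
        {"name": name.strip(), "phone": phone.strip()}
        for name, phone in ROW.findall(html)
    ]


records = wrap_directory(PAGE)
for record in records:
    print(record)
```

Once the data is in this record form it can be filtered or merged with records from other sources using ordinary list and dictionary operations.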
As a result, a process called wrapper induction was created to automate the creation of wrappers. The process begins by learning the structure of a group of structurally similar documents. Next, the mappings of labels to the various sections of the documents are learned through human annotation or another automated process. Once the mappings are learned, wrappers are generated, and these wrappers allow the information to be extracted. Wrapper induction allows information to be extracted from structurally similar documents with high precision. However, for wrapper induction to perform efficiently, structurally similar documents must first be grouped together.
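The induction steps described above can be sketched roughly as follows. This is one simple delimiter-based approach chosen for illustration, not necessarily the method any particular system uses: for each label it keeps the markup tag shared immediately before and after the annotated value across the example pages. It assumes the pages are already grouped by structure, that each annotated value occurs once per page, and that the delimiters are HTML tags. The template, labels, and function names (`induce`, `make_wrapper`) are hypothetical.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce(examples):
    """Learn (left, right) delimiters per label.

    examples: list of (html, {label: value}) pairs, i.e. annotated pages
    from one group of structurally similar documents.
    """
    contexts = {}
    for html, labels in examples:
        for label, value in labels.items():
            start = html.index(value)  # assumes the value is unique in the page
            before, after = html[:start], html[start + len(value):]
            contexts.setdefault(label, []).append((before, after))

    rules = {}
    for label, pairs in contexts.items():
        left = common_suffix([before for before, _ in pairs])
        right = os.path.commonprefix([after for _, after in pairs])
        # Keep only the nearest tag on each side so that text shared by
        # chance between the examples does not leak into the delimiters.
        if "<" in left:
            left = left[left.rfind("<"):]
        if ">" in right:
            right = right[:right.find(">") + 1]
        rules[label] = (left, right)
    return rules


def make_wrapper(rules):
    """Generate a wrapper (a plain function) from the learned rules."""
    patterns = {
        label: re.compile(re.escape(left) + "(.*?)" + re.escape(right), re.S)
        for label, (left, right) in rules.items()
    }

    def wrapper(html):
        record = {}
        for label, pattern in patterns.items():
            match = pattern.search(html)
            record[label] = match.group(1) if match else None
        return record

    return wrapper


# Hypothetical page template standing in for a group of similar documents.
TEMPLATE = (
    '<html><body>\n<h1 class="name">{name}</h1>\n'
    '<p class="phone">{phone}</p>\n</body></html>'
)

annotated = [
    (TEMPLATE.format(name="Ada Lovelace", phone="555-0100"),
     {"name": "Ada Lovelace", "phone": "555-0100"}),
    (TEMPLATE.format(name="Alan Turing", phone="555-0101"),
     {"name": "Alan Turing", "phone": "555-0101"}),
]

rules = induce(annotated)
wrapper = make_wrapper(rules)
unseen = TEMPLATE.format(name="Grace Hopper", phone="555-0199")
result = wrapper(unseen)
print(rules)
print(result)
```

The sketch also shows why grouping matters: if pages with different layouts were mixed into `annotated`, the shared delimiters would shrink or vanish and the generated wrapper would extract little or nothing.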