This specification relates to extracting attribute-value pairs from structured documents. In general, structured documents are documents with an underlying structure that defines how the data in the document is interpreted or displayed (e.g., Hypertext Markup Language (HTML) or Extensible Markup Language (XML) documents). The structure of a document is typically defined by various structural elements (e.g., headings, paragraphs, and tables). For example, a structured document can simply define the layout of data. Structured documents need not have structure associated with external sources (e.g., a database schema that defines the meaning of the data in the document).
Attribute-value pairs are made up of an attribute and a value for the attribute. An attribute is a descriptor for a property of an entity, for example, the “population” of a city, the “birthday” of a person, or the “price” of a piece of chocolate cake. Each attribute-value pair associates an attribute for an entity with a value for the property described by the attribute, for example, the population of Mountain View, Calif. is 70,708, Abraham Lincoln's birthday is February 12, and the price of a piece of chocolate cake is $3.29.
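For illustration only, the attribute-value pairs described above might be represented as in the following minimal Python sketch. The class name `AttributeValuePair`, its field names, and the helper `lookup` are hypothetical and are not taken from this specification; the sketch shows one possible in-memory representation, not the extraction technique itself.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AttributeValuePair:
    """Hypothetical record: one attribute of one entity and its value."""
    entity: str     # the thing being described, e.g. a city or a person
    attribute: str  # descriptor for a property of the entity
    value: str      # value of the property named by the attribute


# The three examples given in the text, stored as plain strings.
pairs = [
    AttributeValuePair("Mountain View, Calif.", "population", "70,708"),
    AttributeValuePair("Abraham Lincoln", "birthday", "February 12"),
    AttributeValuePair("piece of chocolate cake", "price", "$3.29"),
]


def lookup(items: Iterable[AttributeValuePair],
           entity: str, attribute: str) -> Optional[str]:
    """Return the value for (entity, attribute), or None if no pair matches."""
    for pair in items:
        if pair.entity == entity and pair.attribute == attribute:
            return pair.value
    return None
```

In this sketch, `lookup(pairs, "Abraham Lincoln", "birthday")` returns the stored value for that entity and attribute, and a query for an attribute that was never recorded returns `None`.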