This specification relates to the addition of new attributes to a structured presentation by retrieving and displaying information from an unstructured electronic document collection.
An electronic document is a collection of machine-readable data. Electronic documents are generally individual files and are formatted in accordance with a defined format (e.g., PDF, TIFF, HTML, ASCII, MS Word, PCL, PostScript, or the like). Electronic documents can be electronically stored and disseminated. In some cases, electronic documents include audio content, visual content, and other information, as well as text and links to other electronic documents.
Electronic documents can be collected into electronic document collections. Electronic document collections can be either unstructured or structured. The formatting of the documents in an unstructured electronic document collection is not constrained to conform with a predetermined structure and can evolve in often unforeseen ways. In other words, the formatting of individual documents in an unstructured electronic document collection is neither restrictive nor permanent across the entire document collection. Further, in an unstructured electronic document collection, there are no mechanisms for ensuring that new documents adhere to a format or that changes to a format are applied to previously existing documents. Thus, the documents in an unstructured electronic document collection cannot be expected to share a common structure that can be exploited in the extraction of information. Examples of unstructured electronic document collections include the documents available on the Internet, collections of resumes, collections of journal articles, and collections of news articles. Documents in some unstructured electronic document collections are not prohibited from including links to other documents inside and outside of the collection.
In contrast, the documents in structured electronic document collections generally conform with formats that can be both restrictive and permanent. The formats imposed on documents in structured electronic document collections can be restrictive in that common formats are applied to all of the documents in the collections, even when the applied formats are not completely appropriate. The formats can be permanent in that an upfront commitment to a particular format by the party who assembles the structured electronic document collection is generally required. Further, users of the collections—in particular, programs that use the documents in the collection—rely on the documents' having the expected format. As a result, format changes can be difficult to implement. Structured electronic document collections are best suited to applications where the information content lends itself to simple and stable categorizations. Thus, the documents in a structured electronic document collection generally share a common structure that can be exploited in the extraction of information. Examples of structured electronic document collections include databases that are organized and viewed through a database management system (DBMS) in accordance with hierarchical and relational data models, as well as a collection of electronic documents that are created by a single entity for presenting information consistently. For example, a collection of web pages that are provided by an online bookseller to present information about individual books can form a structured electronic document collection. As another example, a collection of web pages that is created by server-side scripts and viewed through an application server can form a structured electronic document collection. Thus, one or more structured electronic document collections can each be a subset of an unstructured electronic document collection.
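As an illustration of how a shared structure can be exploited in the extraction of information, the following minimal sketch applies a single pattern to every page of a small structured collection and then to a document from an unstructured collection. The page contents, the markup template, and the names STRUCTURED_PAGES, UNSTRUCTURED_PAGE, TEMPLATE, and extract_book are all hypothetical and are assumed only for the purpose of the example; they are not taken from any actual bookseller or from this specification.

```python
import re

# Hypothetical pages from a single online bookseller. Every page follows
# the same template, so they form a structured collection.
STRUCTURED_PAGES = [
    '<h1 class="title">Dune</h1><span class="price">$9.99</span>',
    '<h1 class="title">Emma</h1><span class="price">$4.50</span>',
]

# Hypothetical document from an unstructured collection. It mentions a
# book and a price, but follows no template.
UNSTRUCTURED_PAGE = "<p>I picked up Dune for about ten dollars last week.</p>"

# One pattern written once against the shared (assumed) template.
TEMPLATE = re.compile(
    r'<h1 class="title">(?P<title>.*?)</h1>'
    r'<span class="price">\$(?P<price>\d+\.\d{2})</span>'
)


def extract_book(page):
    """Return a dict with 'title' and 'price' if the page matches the
    template, or None if the page does not share the expected structure."""
    match = TEMPLATE.search(page)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "price": float(match.group("price")),
    }


# The same pattern works on every page of the structured collection.
records = [extract_book(page) for page in STRUCTURED_PAGES]

# The pattern finds nothing in the unstructured document, even though
# the document contains related information.
missed = extract_book(UNSTRUCTURED_PAGE)
```

In this sketch the pattern succeeds on each structured page because the pages share a format, and it returns nothing for the unstructured document. The sketch also reflects the fragility noted above: if the bookseller changed the template, the pattern would have to be changed with it.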