The present invention relates to data processing by digital computer, and more particularly to information extraction.
The field of information extraction relates to processes that extract information of interest from data stores that typically include information that is not of interest. Information extraction technology can be implemented to facilitate various applications of computing, for example, applications relating to Web pages. The term “Web” refers to the World-Wide Web, the collection of Internet sites that offer text, graphics, animation, and sound resources through the HyperText Transfer Protocol (“HTTP”). The term “Web page” refers to a block of data identified by a URL that is available on the Web. In the stereotypical case, a Web page is a HyperText Markup Language (“HTML”) file stored on a server; however, the file may refer to, rather than contain, content that appears as part of the page when it is displayed by a Web browser, and it may be generated dynamically in response to a request.
Some Web pages include one or more lists. A list usually includes multiple listings, each of which is a meaningful grouping of information. Examples of Web page listings include information about an apartment available for rent, information describing a product for sale, a headline or summary of a news article, and information describing an event.
Web pages are generally defined by source code written in a markup language, for example, HTML or Extensible Markup Language (“XML”). The source code defining a Web page is usually stored as one or more documents; when the markup language is HTML, these are commonly referred to as HTML documents.
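By way of illustration only, the relationship between an HTML document and the listings it contains can be sketched as follows. The sketch assumes the simplest possible case, in which each listing is marked up as its own `<li>` element. Many real Web pages do not mark listings this cleanly. The sample markup and the names `SAMPLE_HTML`, `ListingParser`, and `extract_listings` are hypothetical and are not taken from any particular implementation.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the markup and listings are invented for illustration.
SAMPLE_HTML = """
<html><body>
<h1>Apartments for rent</h1>
<ul>
  <li>2 BR apartment, Elm Street, $1200/month</li>
  <li>Studio, Oak Avenue, $800/month</li>
</ul>
<p>Contact us for details.</p>
</body></html>
"""


class ListingParser(HTMLParser):
    """Collects the text of each <li> element as one listing.

    Assumption: listings are not nested and each one is a single <li>.
    """

    def __init__(self):
        super().__init__()
        self.listings = []
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Start collecting text for a new listing.
            self._in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            # Collapse runs of whitespace and store the finished listing.
            text = " ".join("".join(self._buffer).split())
            self.listings.append(text)
            self._in_item = False

    def handle_data(self, data):
        # Text outside <li> elements is information that is not of interest here.
        if self._in_item:
            self._buffer.append(data)


def extract_listings(html):
    """Return the text of every <li> element in the given HTML source."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


if __name__ == "__main__":
    for listing in extract_listings(SAMPLE_HTML):
        print(listing)
```

In this sketch the heading and the closing paragraph play the role of information that is not of interest, while the two list items are the listings to be extracted.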