The present invention relates to generating a parser for, and to parsing, a web page to generate a list of one-to-one relationships, such as parsing a publicly available Twitter page to generate a list of followers and a list of photographs, or parsing a publicly available list of what is “happening now at the Nation” on the music web site ReverbNation. More generally, it relates to a parser that takes parsing instructions from a declaratory template. The declaratory template used to generate lists can be as simple as a specification of pattern matches for a subject, a predicate and an object. In alternative implementations, specification of a predicate could be omitted if only one type of list were being generated. In other alternative implementations, a string of user text can be specified. Another option is to specify annotations to relations, which can be given literally or extracted using a pattern match specification. Cardinality of the subject and object can be specified. Multiple declaratory templates can be used to extract multiple lists from the same web page. Query group statements can be used to set a scope within which subject and object pattern matches must occur in order for a relationship to be emitted.
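By way of illustration only, the following minimal sketch shows how a declaratory template that names a subject pattern, a predicate and an object pattern might be applied to a page to emit a list of relationships. It is not the disclosed parser. The names `Template` and `extract`, the sample markup, the use of regular expressions as the pattern match mechanism, and the use of a literal predicate label rather than a predicate pattern match are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical declaratory template: two patterns and a predicate label."""
    subject_pattern: str  # regex with one capture group; first match is used
    predicate: str        # literal label (assumption; the text also allows a pattern)
    object_pattern: str   # regex with one capture group; every match is used


def extract(template, page):
    """Emit one (subject, predicate, object) tuple per object match."""
    subject = re.search(template.subject_pattern, page)
    if subject is None:
        return []  # no subject found, so no relationship can be emitted
    return [
        (subject.group(1), template.predicate, match.group(1))
        for match in re.finditer(template.object_pattern, page)
    ]


# Invented sample page; real markup would differ.
PAGE = """
<h1 class="profile">alice</h1>
<ul>
<li class="follower">bob</li>
<li class="follower">carol</li>
</ul>
<img class="photo" src="a.jpg">
"""

PROFILE = r'<h1 class="profile">([^<]+)</h1>'

# Two templates applied to the same page yield two separate lists.
FOLLOWERS = Template(PROFILE, "followed_by", r'<li class="follower">([^<]+)</li>')
PHOTOS = Template(PROFILE, "has_photo", r'<img class="photo" src="([^"]+)"')

followers = extract(FOLLOWERS, PAGE)
photos = extract(PHOTOS, PAGE)
```

In this sketch the subject is treated as having a cardinality of one and the object a cardinality of many, so that a single subject is paired with each object match. Annotations and query group scoping are omitted for brevity.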
Users of web sites on the Internet and of databases containing hierarchically structured documents in a markup language frequently use scrapers to extract data from web pages and structured documents. For web pages, scraper code generators are often cumbersome, and making a scraper work requires fairly extensive knowledge of the language in which the scraper code is generated. For XML and similar documents, complex navigation of DOM trees is often required to extract data from structured documents.
An opportunity arises to simplify extraction of data from hierarchically structured documents using a better, more easily configured and controlled parsing tool.