It is sometimes desirable to process and analyze large volumes of documents. As an illustrative example, construction projects are typically described by plans and specifications (herein, “spec documents”). While the plans give a visual representation of the project, the spec documents give all of the details in textual form. A typical spec document is approximately 500 pages in length and covers everything from the bidding procedures that contractors or subcontractors are to follow before being selected, through the types of products, materials, and methods used during construction, to how the site will be cleaned up when completed. Such comprehensive information about active and planned projects makes these spec documents a valuable source of marketing intelligence and sales leads for businesses serving the construction industry.
As a result, a number of publication services collect plans and spec documents from various sources. To the extent necessary, the publishers may also digitize hard copies and process them with optical character recognition (OCR) software. Some publishers also annotate the spec documents at a project level with metadata (such as the estimated size and cost of the project, key contacts, the type of construction, and so on). Finally, the publishers aggregate the spec documents in a database and disseminate subsets of the spec documents to subscribers. The subscribers to such services may be, for example, building products manufacturers that use the spec documents for marketing intelligence and sales leads.
A national feed from one of the larger publishers is approximately fifty million pages per year, which is too much information for a single person (or even a reasonably sized team) to analyze in order to find actionable information or to synthesize new information. The problem is further compounded for manufacturers that subscribe to feeds from more than one publisher.
Various attempts have been made to process spec documents in a computer-assisted fashion. One technique that has been employed is to run a text search over the documents and provide the user with a list of matching documents. For example, a user may be interested in searching for a cleaning product named “409”. In basic searching systems, every document containing that three-digit sequence will be returned to the user as a match, although many of those matches will not refer to the cleaning product. In some instances the match may instead be a page number, a section number, an area code in southeast Texas, or another unrelated reference. In an attempt to alleviate this problem, some systems have been built that use a hand-labeled table of contents so that searches can be limited to specific sections of documents.
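By way of illustration only, the search behavior described above may be sketched as follows. The sketch is not taken from any existing system: the `Page` record, the sample page contents, the section titles, and the function names `naive_search` and `section_search` are all hypothetical and are assumed solely for the purpose of the example. It shows a basic text search returning unrelated matches for “409”, and the same search limited to a hand-labeled section.

```python
from dataclasses import dataclass


@dataclass
class Page:
    doc_id: str
    page_no: int
    section: str  # hypothetical hand-labeled table-of-contents section
    text: str     # page text, e.g. as produced by OCR


# Invented sample pages; the contents are assumptions for illustration only.
PAGES = [
    Page("spec-A", 12, "Bidding Procedures",
         "Questions regarding bid forms: call (409) 555-0100."),
    Page("spec-A", 409, "Cast-in-Place Concrete",
         "Concrete shall cure for seven days. Page 409"),
    Page("spec-A", 487, "Final Cleaning",
         "Clean glass surfaces with 409 or an approved equal."),
    Page("spec-B", 33, "Final Cleaning",
         "Remove debris and sweep all floors."),
]


def naive_search(pages, term):
    """Basic text search: return every page whose text contains the term."""
    return [p for p in pages if term in p.text]


def section_search(pages, term, section):
    """Text search limited to pages hand-labeled with the given section."""
    return [p for p in naive_search(pages, term) if p.section == section]


if __name__ == "__main__":
    for p in naive_search(PAGES, "409"):
        print("naive match:", p.doc_id, p.page_no, p.section)
    for p in section_search(PAGES, "409", "Final Cleaning"):
        print("section match:", p.doc_id, p.page_no, p.section)
```

In this sketch the basic search matches a telephone area code and a page number in addition to the product reference, whereas the section-limited search returns only the page labeled “Final Cleaning”. The latter result depends entirely on the sections having been labeled by hand beforehand.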
While existing systems for processing and analyzing large volumes of documents have proved useful, further enhancements are needed.