1. Technical Field
The present disclosure generally relates to computerized searching, and more particularly, to methods for multiple pass symbol and components-based visual object searching of electronic documents.
2. Related Art
The creation, distribution, and management of information are core functions of business. Information or content can be presented in a variety of different ways, including word processing documents, spreadsheets, graphics, photographs, engineering drawings, architectural plans, and so forth. In electronic form, these are generally referred to as documents, and may be generated and manipulated by computer software applications that are specific thereto. A typical workflow in the enterprise involves various personnel, oftentimes across disparate geographic locations, collaborating to create, review, and/or edit such documents.
Due to the existence of many different computing platforms having a wide variety of operating systems, application programs, and processing and graphic display capabilities, it has been recognized by those in the art that a device-independent, resolution-independent file format was necessary to facilitate such exchange. In response to this need, the Portable Document Format (PDF), amongst other competing formats, has been developed.
The PDF standard is a combination of a number of technologies, including a simplified PostScript interpreter subsystem, a font embedding subsystem, and a storage subsystem. As those in the art will recognize, PostScript is a page description language for generating the layout and the graphics of a document. Further, per the requirements of the PDF storage subsystem, all elements of the document, including text, vector graphics, and raster (bitmap) graphics, collectively referred to herein as graphic elements, are encapsulated into a single file. The graphic elements are not encoded to a specific operating system, software application, or hardware, but are designed to be rendered in the same manner regardless of the specificities relating to the system writing or reading such data. The cross-platform capability of PDF aided in its widespread adoption, and PDF is now a de facto document exchange standard. Although originally proprietary, PDF has been released as an open standard published by the International Organization for Standardization (ISO) as ISO 32000-1:2008. Currently, PDF is utilized to encode a wide variety of document types, including those composed largely of text, and those composed largely of vector and raster graphics. Due to its versatility and universality, files in the PDF format are often preferred over more particularized file formats of specific applications. As such, documents are frequently converted to the PDF format.
One of the significant advantages of working with electronic documents such as those in the PDF format is the ability to search a large volume of information in a short period of time. With non-electronic or paper documents, searching for an item of information, even with the best of cataloging and other indexing tools, proved to be an arduous and painstaking process. In general, the searching of conventional electronic documents has been limited to text-based methods, where the user enters a simple word query and the locations where that queried word or words are found are identified. Additional search parameters such as formatting can also be specified. Boolean and natural language searching techniques are also known, though typically utilized for searching across databases of documents, web pages on the World Wide Web, and so forth. Ultimately, however, these involve text-based queries.
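By way of illustration only, the conventional text-based search described above can be sketched as follows. The sketch assumes the text of each page has already been extracted into a string; the function name find_word and the sample page contents are hypothetical and are not part of any particular PDF application.

```python
import re
from typing import List, Tuple


def find_word(pages: List[str], query: str) -> List[Tuple[int, int]]:
    """Return (page_index, character_offset) for each whole-word match.

    Matching is case-insensitive. `pages` is assumed to hold the
    already-extracted text of each page (a hypothetical input format).
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    hits: List[Tuple[int, int]] = []
    for page_index, text in enumerate(pages):
        for match in pattern.finditer(text):
            hits.append((page_index, match.start()))
    return hits


# Hypothetical sample pages, for illustration only.
pages = [
    "Install the valve near the pump.",
    "Valve schedule: see sheet A-2.",
]
print(find_word(pages, "valve"))  # [(0, 12), (1, 0)]
```

Such a query locates only the characters of the word; a graphic symbol on the same page that has no associated text is not found.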
The information/subject matter stored in and exchanged as PDF files is becoming increasingly complex, and a wide range of documents are being digitized as part of the trend toward paperless offices. Indeed, engineering diagrams, construction plans, wiring diagrams, and so forth are oftentimes saved in, and shared via, PDF documents. With the increasing use of graphics in documents, particularly in those types listed above, querying for such elements is a desirable feature. For example, construction drawings contain various symbols that provide pertinent reference information to the viewer not immediately apparent from the drawings, link to other parts of the drawing or the document, and so forth. It may be necessary to make the links associated with such symbols active, or to count the occurrences of a particular symbol. Presently, these tasks are performed manually, which is extremely time-consuming.
Rather than searching the contents of the graphics themselves, another conventional technique involves associating metadata with each graphic and performing a text-based search of that metadata. A variety of information can be specified in the metadata, such as subject matter or content keywords, category keywords, location keywords, and so forth. In a catalog of different images or graphics, such text metadata searching may be adequate. But cataloging every graphic in a large document may not be possible, particularly if the document data structure does not accommodate such metadata.
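The metadata-based technique and its limitation can be illustrated with a minimal sketch. The Graphic class, the search_by_keyword function, and the catalog entries below are hypothetical and are assumed solely for purposes of illustration; no particular document data structure is implied.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Graphic:
    """A hypothetical catalog entry: a graphic and its keyword metadata."""
    name: str
    keywords: Set[str] = field(default_factory=set)


def search_by_keyword(catalog: List[Graphic], keyword: str) -> List[str]:
    """Return the names of graphics whose metadata contains the keyword.

    Only the keyword metadata is compared (case-insensitively); the
    visual content of the graphic is never examined.
    """
    needle = keyword.lower()
    return [
        g.name
        for g in catalog
        if needle in {k.lower() for k in g.keywords}
    ]


# Hypothetical catalog, for illustration only.
catalog = [
    Graphic("door_symbol", {"door", "architectural"}),
    Graphic("valve_symbol", {"Valve", "plumbing"}),
    Graphic("untagged_symbol"),  # no metadata was ever assigned
]
print(search_by_keyword(catalog, "valve"))  # ['valve_symbol']
```

In this sketch, the entry without keywords is never returned by any query, regardless of what it depicts, which reflects the cataloging burden noted above.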
Accordingly, there is a need in the art for multiple pass symbol and components-based visual object searching of electronic documents.