1. Technical Field
The present disclosure generally relates to computerized searching, and more particularly, to methods for color and size based pre-filtering for visual object searching of documents.
2. Related Art
The creation, distribution, and management of information are core functions of business. Information, or content, can be presented in a variety of different ways, including word processing documents, spreadsheets, graphics, photographs, engineering drawings, architectural plans, and so forth. In electronic form, these are generally referred to as documents, and may be generated and manipulated by computer software applications that are specific thereto. A typical workflow in the enterprise involves various personnel, oftentimes across disparate geographic locations, collaborating to create, review, and/or edit such documents.
Due to the existence of many different computing platforms having a wide variety of operating systems, application programs, and processing and graphic display capabilities, it has been recognized by those in the art that a device-independent, resolution-independent file format was necessary to facilitate such exchange. In response to this need, the Portable Document Format (PDF), amongst other competing formats, has been developed.
The PDF standard is a combination of a number of technologies, including a simplified PostScript interpreter subsystem, a font embedding subsystem, and a storage subsystem. As those in the art will recognize, PostScript is a page description language for generating the layout and the graphics of a document. Further, per the requirements of the PDF storage subsystem, all elements of the document, including text, vector graphics, and raster (bitmap) graphics, collectively referred to herein as graphic elements, are encapsulated into a single file. The graphic elements are not encoded to a specific operating system, software application, or hardware, but are designed to be rendered in the same manner regardless of the specificities relating to the system writing or reading such data. The cross-platform capability of PDF aided in its widespread adoption, and PDF is now a de facto document exchange standard. Although originally proprietary, PDF has been released as an open standard published by the International Organization for Standardization (ISO) as ISO 32000-1:2008. Currently, PDF is utilized to encode a wide variety of document types, including those composed largely of text, and those composed largely of vector and raster graphics. Due to its versatility and universality, files in the PDF format are often preferred over more particularized file formats of specific applications. As such, documents are frequently converted to the PDF format.
One of the significant advantages of working with electronic documents such as those in the PDF format is the ability to search a large volume of information in a short period of time. With non-electronic or paper documents, searching for an item of information, even with the best of cataloging and other indexing tools, proved to be an arduous and painstaking process. In general, the searching of conventional electronic documents has been limited to text-based methods, where the user enters a simple word query and the locations where that queried word or words are found are identified. Additional search parameters such as formatting can also be specified. Boolean and natural language searching techniques are also known, though typically utilized for searching across databases of documents, web pages on the World Wide Web, and so forth. Ultimately, however, these involve text-based queries.
The information/subject matter stored in and exchanged as PDF files is becoming more complex, and a wide range of documents are being digitized as part of the trend toward paperless offices. Indeed, engineering diagrams, construction plans, wiring diagrams, and so forth are oftentimes saved in, and shared via, PDF documents. With the increasing use of graphics in documents, particularly in those types listed above, querying for such elements is a desirable feature. For example, construction drawings contain various symbols that variously provide pertinent reference information to the viewer not immediately apparent from the drawings, link to other parts of the drawing or the document, and so forth. Such links associated with the symbols may be made active, or a count of a particular symbol may be necessary. Presently, this is performed manually, which is extremely time-consuming.
Rather than searching the contents of the graphics itself, another conventional technique involves associating metadata with the graphic and using a text-based search thereof. A variety of information can be specified in the metadata, such as subject matter or content keywords, category keywords, location keywords, and so forth. In a catalog of different images or graphics, such text metadata searching may be adequate. But cataloging every graphic in a large document may not be possible, particularly if the document data structure is not accommodating thereof.
When human beings search for occurrences of specific graphical information on a document, a description based on the set of features of that graphic is intuitively formulated. These features are typically the size, shape, and color of the object, as well as the relationship between such object and the other graphics contained within a document. That description of features, which is generally referred to as a template, is compared against different segments of the document to identify match candidates. There are significant challenges associated with implementing such seemingly intuitive but complex mental processes as discrete steps that can be executed by a data processor. Various techniques and algorithms have been developed, but they tend to involve mathematically intensive operations on a large amount of data. A significant factor in improved accuracy and speed is therefore attributable to improvements in raw data processing capabilities.
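By way of illustration only, the comparison of a template against different segments of a document may be sketched as a sliding-window search over a small binary raster. This is a minimal sketch under stated assumptions and is not the method of any particular prior technique: the grid representation, the function name find_match_candidates, the min_score parameter, and the pixel-agreement score are all hypothetical and chosen for brevity.

```python
from typing import List, Tuple

# Hypothetical raster representation: 0 = background pixel, 1 = ink pixel.
Grid = List[List[int]]


def find_match_candidates(
    page: Grid, template: Grid, min_score: float = 1.0
) -> List[Tuple[int, int, float]]:
    """Slide the template over the page and score every window.

    The score is the fraction of pixels on which the window and the
    template agree. Windows scoring at least min_score are returned as
    (top, left, score) tuples in row-major order.
    """
    th, tw = len(template), len(template[0])
    ph, pw = len(page), len(page[0])
    total = th * tw
    candidates = []
    for top in range(ph - th + 1):
        for left in range(pw - tw + 1):
            agree = sum(
                1
                for r in range(th)
                for c in range(tw)
                if page[top + r][left + c] == template[r][c]
            )
            score = agree / total
            if score >= min_score:
                candidates.append((top, left, score))
    return candidates


# Illustrative data: the same 2x2 corner symbol appears twice on the page.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
]
template = [
    [1, 1],
    [1, 0],
]

exact = find_match_candidates(page, template)
loose = find_match_candidates(page, template, min_score=0.75)
```

Every pixel of the template is compared at every window position, so the work grows with the product of the page area and the template area, which is consistent with the observation above that such techniques involve mathematically intensive operations on a large amount of data.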
One technique for visual searching is contemplated in co-pending U.S. patent application Ser. No. 13/018,299 entitled “A method for multiple pass symbol and components-based visual object searching for documents,” also assigned to the present assignee and the entirety of the disclosure of which is hereby wholly incorporated by reference herein. This involves the selection and definition of a raster template for which the document is searched. Raster image representations of the document are generated, and match candidates are generated and narrowed at successively detailed levels.
The human mind can fill in certain omitted or obstructed details, so it is possible to identify graphic elements even when partially hidden. However, in some use cases of the aforementioned raster image based searching, these partially hidden graphic elements may not be identified. In the architecture, engineering, and construction industries, the typical PDF document generated may contain several overlapping layers of information. Furthermore, these industries tend to involve highly collaborative workflow processes where multiple users comment and place various annotations on the document. A search of a rasterized image of the document may not successfully identify such obstructed content. Additionally, these complex documents tend to yield data-intensive raster images that tend to slow down the aforementioned visual search modality.
Accordingly, there is a need in the art for methods of color and size based pre-filtering for visual object searching of documents with improved speed and accuracy.