Organizations frequently maintain repositories for storing, accessing, and managing digital legacy documents. Each repository is typically equipped with a search engine for parsing the repository and returning a relevant set of documents based on a given query. However, it is often the case that users are not searching the repository for an entire document, but rather for a relevant piece of content contained within a document. Furthermore, this relevant piece of content could be either textual or non-textual. Non-textual content, referred to as rich media, often includes items such as graphics (e.g., images, charts, graphs, diagrams, and maps), video, and audio. Increasingly, users are searching document repositories in an effort to locate rich media contained within a document in order to repurpose that rich media for use in a new application.
For example, in the context of a commercial business setting, an employee charged with completing a new project will likely search their organization's document repository to locate documents that were previously created for similar projects. Oftentimes, the employee will perform the search with the goal of locating a particular type of rich media (e.g., a graphic) suitable for reuse in their current project. However, existing document management systems typically force the employee to engage in a tedious and inefficient process in order to obtain the desired rich media for reuse.
For example, using existing document management systems, an employee would first have to locate a document containing the desired rich media. This exercise is often difficult in itself because most document repository search engines operate by comparing the queried term(s) against the text of a document and/or textual metadata appended to a document. Because the sought-after rich media is often non-textual by its very nature, the rich media embedded within a document is rarely even considered during the search, leading to less relevant search results. This creates a scenario in which the employee must vet a potentially voluminous set of returned documents in order to identify the particular document(s) actually containing the desired rich media. In the event that the employee is fortunate enough to locate a relevant document, they next have to manually parse that document in order to locate the desired rich media contained therein. As documents can be quite expansive in size, this is often a time-consuming and tedious task.
One example of an existing document management system is MediaBin from Interwoven. MediaBin is a system capable of, among other things, presenting multiple Microsoft PowerPoint presentations in a window to help users assemble new presentations from existing presentation elements. However, this system suffers from the drawback that users must first identify which documents (e.g., PowerPoint presentations) will contain the desired rich media for reuse. Also, systems such as MediaBin require users to manually parse documents in order to locate the desired rich media content contained therein. Further still, systems such as MediaBin do not classify the rich media content contained within documents into semantically meaningful taxonomies, thereby forcing users to repeat the aforementioned process each time they want to locate a reusable piece of rich media.
Another existing document management system is Documill Visual Search. Documill provides a system for visualizing document content (Microsoft Office and PDF files) in document repositories. Documill operates by comparing the text entered as a search query against the textual content of documents stored within a repository. Only those pages of a document that contain text corresponding to the search query are displayed as results. Each page that is returned following the search is represented as a thumbnail (i.e., a reduced-size depiction of the actual page) on the display screen. Within each thumbnail, the text matching the search terms is highlighted, permitting a user to make a prompt visual relevancy determination.
However, Documill also suffers from a number of drawbacks. First, the determination of which pages to display on the results screen is based on keyword matching. Consequently, non-textual rich media residing within a document is not considered during the search, leading to less relevant search results. Furthermore, systems such as Documill display thumbnails of entire pages of a document, even if only a small portion of the content on a given page is actually relevant to the search. This can result in an information-overload situation in which a user is required to parse through each individual page that is displayed to find the desired content. Further still, as with the MediaBin system, Documill does not classify the rich media contained within documents into semantically meaningful taxonomies, thereby forcing users to repeat the aforementioned process each time they want to locate a reusable piece of rich media.
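The keyword-based page selection described above, and the way it overlooks non-textual content, can be illustrated with a minimal sketch. This is not Documill's actual implementation; the `Page` structure, the `matching_pages` function, and the sample data are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical page record: its text plus any embedded rich media."""
    number: int
    text: str
    media: list = field(default_factory=list)  # never consulted by the search


def matching_pages(pages, query):
    """Return the numbers of pages whose text contains any query term.

    Only page.text is compared against the query, so a page whose
    relevant content is purely non-textual can never match.
    """
    terms = query.lower().split()
    hits = []
    for page in pages:
        words = page.text.lower().split()
        if any(term in words for term in terms):
            hits.append(page.number)
    return hits


pages = [
    Page(1, "Quarterly revenue summary"),
    Page(2, "", media=["revenue_chart.png"]),  # chart only, no text
    Page(3, "Revenue by region"),
]

print(matching_pages(pages, "revenue"))  # page 2 is absent from the result
```

In this sketch, the page holding only a chart is not returned, even though the chart may be exactly the rich media the user is seeking.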
Yet another problem facing existing document management systems is their inability to generate a reusable piece of rich media (e.g., a graphic) from a document by assembling the reusable piece of rich media from its discrete components. For example, a particular graphic in a document may consist of a combination of natural-type graphics (e.g., identified/extracted graphics, graphical construct elements, and/or candidate reusable graphic components) and/or synthetic-type graphics (e.g., identified/extracted graphics, graphical construct elements, and/or candidate reusable graphic components).
A natural-type graphic refers to a graphic that exists as a unified whole without any particular conscious assembly of individual graphic elements. That is to say, the largest sub-component of a natural-type graphic is a single pixel. For example, a natural-type graphic could be saved in .bmp format, .jpg format, .tiff format, .png format, or any other suitable image format where the largest sub-component of the image is a single pixel. A natural-type graphic might include, for example, a digital photograph of Mt. Everest, a bitmap image created using a freehand drawing tool, or a digital reproduction of a hand-drawn cartoon.
Conversely, a synthetic-type graphic refers to a graphic exhibiting order and/or symmetry that is created entirely by digital means, such as an icon, map, figure, chart, diagram, stencil-shape in Microsoft Visio, etc. That is to say, the largest sub-component of a synthetic-type graphic is more than a single pixel. For instance, a bar-graph located on a slide of a Microsoft PowerPoint presentation is exemplary of a synthetic-type graphic. Similarly, a stencil-shape located on a Microsoft Visio drawing is representative of a synthetic-type graphic. In one example, a synthetic-type graphic may be recognized, and thus extracted, through the use of the Microsoft Office API (application programming interface).
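The distinction drawn above between natural-type and synthetic-type graphics turns on the size of the largest sub-component. A minimal sketch of that rule follows; the `Graphic` record, its `largest_subcomponent_pixels` field, and the sample values are assumptions made for illustration, and how that size would actually be measured for a given file is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Graphic:
    """Hypothetical record describing a graphic found in a document."""
    name: str
    largest_subcomponent_pixels: int  # size of the largest sub-component


def graphic_type(graphic):
    """Classify a graphic using the largest-sub-component rule.

    A largest sub-component of exactly one pixel indicates a
    natural-type graphic; anything larger indicates a synthetic-type
    graphic.
    """
    size = graphic.largest_subcomponent_pixels
    if size < 1:
        raise ValueError("a graphic must contain at least one pixel")
    return "natural" if size == 1 else "synthetic"


# Illustrative values only.
photo = Graphic("photograph of Mt. Everest", 1)
bar_graph = Graphic("PowerPoint bar-graph", 4800)

print(graphic_type(photo))
print(graphic_type(bar_graph))
```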
Many reusable graphics are composed of both natural and synthetic graphics. While existing systems are capable of recognizing either natural or synthetic graphics individually, they are incapable of determining that a given synthetic graphic and a given natural graphic should be combined to generate a single reusable graphic.
It is therefore desirable to provide techniques for searching, retrieving, synthesizing, storing, and classifying graphics contained within documents.