1. Field of the Invention
The inventive concept relates generally to the transformation of content stored in a repository. More particularly, it relates to searching repositories for specific content, performing an analysis to determine what content should be transformed, and managing this process.
2. Background
Organizations have invested heavily in file systems, Document Management Systems (DMS), Content Management Systems (CMS), and Enterprise Content Management (ECM) systems. These may be understood, in a non-limiting manner, to be examples of what is meant by the term “repository.” The rationale behind that investment includes, but is not limited to, storing content, allowing controlled access to that content, and allowing for quick and easy retrieval of that content.
Content stored in these repositories can be of any datatype. Examples of documents with different datatypes include Word format documents, PDF documents, and image documents. This short list of datatypes is not meant to be exhaustive, however, and repositories can store files of any datatype. For files of datatypes in which the content includes data in human-understandable form (e.g., words, numbers, images, etc.), text-based search is commonly used to find particular content for subsequent retrieval. For this to be successful, the text in the content has to be discoverable by some means. That is to say, the content has to be text searchable.
FIG. 7 illustrates this situation. Repository content of a wide variety of types enters a repository, but many of these types are not text searchable.
Many business processes result in non-text-searchable documents being stored in a repository. This can occur with scanned images saved as TIFF or image-based PDF, emails having TIFF or image-based PDF attachments, electronic faxes saved as TIFF or PDF, legacy documents retained over many years, and documents from business acquisitions or other file ingestion.
One problem facing organizations is the risk associated with storing these non-text-searchable documents. Such risks include the possible failure to find a critical document required to comply with e-discovery orders or litigation, time and effort wasted recreating content because a document could not be located, misfiling a document and never finding it again, and repository users losing confidence in the ability of their systems to find and retrieve content. These risks, if not mitigated, can result in, but are not limited to, monetary, productivity, and/or reputational impact.
Ad hoc approaches to avoiding the risks of non-text-searchable content being stored in a repository focus on performing Optical Character Recognition (OCR) on these documents during the creation workflow. A creation workflow may be understood as a process of getting documents into the repository. Examples include carrying out an OCR process as documents are scanned, or carrying out an OCR process when receiving documents that were previously created elsewhere. These ad hoc approaches attempt to ensure all documents are text searchable at the time they enter the repository.
These ad hoc approaches have a number of disadvantages.
One such disadvantage is that some documents will inevitably make it to the repository without having first been made text searchable. The focus in an ad hoc approach is on processing the documents before they enter the repository, but human and machine factors often intervene to frustrate this goal, so that OCR is often not performed on documents when it should be. One example of such a factor is that users are able to skip OCR at the scan workflow step, and are motivated to do so because they feel that the OCR process takes too long or is too complicated to carry out.
Another example involves emails with attachments. These are increasingly stored directly in repositories, with attachments escaping assessment for text searchability. Mobile devices such as iPads, iPhones and Blackberries collect, create and then store content in repositories, and such mobile devices are often outside the normal workflow processes that are capable of screening for text searchability.
Yet another example involves bulk import of data from third parties where text searchability is not guaranteed. Such scenarios can result in the importation of substantial numbers of files that lack text searchability.
Another disadvantage of ad hoc approaches arises when OCR processes are unnecessarily run on predominantly text searchable documents, for example when there is a bulk import of existing PDFs into a repository. Ad hoc approaches sometimes make no assessment of the text searchable state of a PDF. The result is that documents are subjected to OCR processing even though they are already text searchable. This wastes time on OCR processing and file management, and it can also degrade the document: it is not unusual for a subsequent OCR process on an already text searchable document to reduce the quality of its text searchability.
Yet another disadvantage of the ad hoc approaches is the impact OCR processes can have when applied to image-based documents, including changing the content in such a way as to reduce its value for intended future use. OCR processes frequently include operations meant to improve the quality of an image with respect to text searchability. Such operations may include de-skewing, de-speckling, image enhancement, and resolution adjustment on the source document. These manipulations often result in a new document being created that is a close facsimile of the original but is not the same in every way. Differences may include some substantial, visually perceptible changes to the image. Moreover, there is the potential loss of annotations such as comments on a PDF, attachments to a PDF, and form fields and their values. In addition, metadata stored in the original file as custom properties can be lost. All of this may affect the future interpretation of a document or reduce its value as a valid record of the original.
One more disadvantage of ad hoc approaches is the additional processing time required when OCR is performed at document creation time or at document reception time. An OCR process is CPU and RAM intensive and, depending on the length and nature of the input document, may take many minutes or even hours to complete. Performing OCR at the time of creation will be visibly slow to the person involved in the workflow, especially if the OCR process has to complete before the workflow can progress (i.e., if OCR must complete before the document can be saved into a document repository).