In computerized search systems, when objects are given to a search engine for indexing, they often need to be converted, analyzed and transformed as part of the indexing process. There may be many different tasks that need to be performed in order to prepare the objects to be indexed. For instance: files within an archive (such as a “ZIP” file) may need to be extracted and decompressed; document files may need to be filtered to extract their full text content; encrypted information may require decryption; encoded data may need to be converted to another encoding format; image files may need to have OCR (Optical Character Recognition) applied to convert the image into text; text strings may need to be broken into words (or tokens); audio files may need to have voice identification tags added; and so forth. There may be a large number of these steps, depending on the type of object that is being presented for indexing.
It is possible for one or more of these steps to fail. For example, a task may be only partially completed, or it may fail entirely. If one of the preparation tasks is only partially completed, it may nevertheless be sufficiently complete to allow the indexing of the object to proceed. In this case, the object may be searchable, but perhaps not perfectly so. For instance, some portion of an object's text, but not all of the text, may have been derived for the object, so that a search engine may be able to find some, but not all, of the terms contained in the object.
While it may, in some cases, be desirable to be able to perform searches based on this incomplete information, this may not always be true. For example, in a litigation discovery process, a user may employ a search system to find all objects which are related to a specific patent application. If the objects were not perfectly indexed, the user may not know that the search results contain errors or are incomplete. As a result, the user may miss important documents.
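The partial-indexing scenario described above can be illustrated with a minimal sketch. The function names, the byte limit, and the sample object are all hypothetical and are not taken from any particular search system. The sketch assumes a text-extraction step that stops after a fixed number of bytes and an indexer that proceeds with whatever text it received.

```python
def extract_text(raw: bytes, limit: int) -> tuple[str, bool]:
    """Hypothetical preparation step: extract at most `limit` bytes of text.

    Returns the extracted text and a flag that is True only when the
    whole object was converted.
    """
    complete = len(raw) <= limit
    text = raw[:limit].decode("ascii", errors="ignore")
    return text, complete


def tokenize(text: str) -> set[str]:
    """Break a text string into lowercase words (tokens)."""
    return set(text.lower().split())


def build_index(objects: dict[str, bytes], limit: int) -> dict[str, set[str]]:
    """Map each term to the identifiers of the objects that contain it."""
    index: dict[str, set[str]] = {}
    for object_id, raw in objects.items():
        # The completeness flag is discarded here, so nothing in the
        # index records that the object was only partially converted.
        text, _complete = extract_text(raw, limit)
        for term in tokenize(text):
            index.setdefault(term, set()).add(object_id)
    return index


# Assumed sample data: only the first 18 bytes are extracted.
objects = {"doc1": b"patent application filed by the inventor"}
index = build_index(objects, limit=18)

found = index.get("patent", set())      # the object is found for this term
missed = index.get("inventor", set())   # empty: this term was never indexed
```

In this sketch a search for "patent" returns the object while a search for "inventor" returns nothing, and the search results themselves give the user no indication that the object was incompletely indexed.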
Conventionally, errors and other problems in the process of preparing objects to be indexed are addressed in one of two ways. One approach is to simply stop the process when an error condition arises. Because this all-or-nothing approach produces the desired output (e.g., text conversions of object files) without user intervention only under perfect conditions, it is inefficient. A second conventional approach is to provide dedicated error fields in which indications of error conditions can be stored. For instance, a truncated data field may be provided to indicate whether the text for an object had to be truncated during the process, a junk data field may be provided to indicate whether the data in the object appears to be random (as opposed to meaningful text), and so on. The use of dedicated fields to store indications of error conditions is problematic, however, for a number of reasons. For example, because there may be many different fields in which the error conditions are stored, a user may have to look in many different places to find the indicators of these conditions. Further, because each of the dedicated error condition fields has associated overhead (e.g., storage space), there may be a high resource cost associated with these fields.
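The dedicated-field approach may be pictured with the following sketch. The record layout is an assumption made for illustration: the truncated and junk data fields correspond to the examples given above, while the decryption and OCR fields, and all of the identifiers, are hypothetical additions that show how the number of fields grows with the number of error conditions.

```python
from dataclasses import dataclass


@dataclass
class IndexedObject:
    """Hypothetical index record with one dedicated field per error condition."""
    object_id: str
    text: str
    truncated: bool = False          # text had to be truncated
    junk_data: bool = False          # data appears random, not meaningful text
    decryption_failed: bool = False  # assumed additional condition
    ocr_failed: bool = False         # assumed additional condition


# Every error field must be listed and checked separately.
ERROR_FIELDS = ("truncated", "junk_data", "decryption_failed", "ocr_failed")


def error_conditions(obj: IndexedObject) -> list[str]:
    """Collect the names of the error fields that are set for an object."""
    return [name for name in ERROR_FIELDS if getattr(obj, name)]


record = IndexedObject("doc1", "patent application", truncated=True)
problems = error_conditions(record)
```

Under these assumptions, every record carries all four fields whether or not any error occurred, and finding the problems with a single object requires consulting each field in turn.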
In addition to the problem of text conversion errors, search systems may encounter errors in handling metadata associated with the objects. For instance, a search system may be configured to receive a numerical value representing a date, and to store this value in a numerical field. If, however, the system receives the date in a different format (e.g., with the month spelled out), it may not be able to process this information. Conventionally, this causes the system to simply stop, or at least stop processing the corresponding object.
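The metadata problem can likewise be sketched under stated assumptions. The numeric date encoding (year, month and day as a single integer), the function name, and the sample records are hypothetical. The sketch shows only the conventional stop-on-error behavior, not a remedy.

```python
def index_dates(records: list[tuple[str, str]], date_field: dict[str, int]) -> None:
    """Store each object's date in a numerical field.

    Assumes dates arrive as digit strings such as "20240131". Any other
    format makes int() raise ValueError, which halts the whole run.
    """
    for object_id, raw_date in records:
        date_field[object_id] = int(raw_date)


records = [
    ("doc1", "20240131"),
    ("doc2", "January 31, 2024"),  # month spelled out: cannot be converted
    ("doc3", "20240201"),
]

date_field: dict[str, int] = {}
stopped = False
try:
    index_dates(records, date_field)
except ValueError:
    # Conventional behavior: processing simply stops at the bad value.
    stopped = True
```

In this sketch the first object is stored, the second raises an error, and the third is never processed even though its date is valid.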
It would therefore be desirable to provide a means of processing objects for indexing by a search engine that overcomes these problems.