The present invention relates to data processing and in particular to the science of electronic records management and file management including the process of automatically capturing and classifying a record within a records file plan as evidence of the conduct of business processes.
To file or transform an electronic document properly into an official record has traditionally required an end user to decide that the document should become an official record. Following that decision, the user must then associate or file the official record in a particular records subject category within a records file plan or organization. This association is based on the meaning and understanding of the document content, relative to the meaning and understanding of the particular records subject category to which the document should be associated once it has been declared an official record. The chosen category must be sufficiently distinguishable from the other potential records subject categories in the file plan to leave the user with only one choice.
Increasingly, documentation and written communications forming official corporate records and working documents originate in or are reduced to electronic form. For example, businesses that receive and exchange inquiries and conduct business by telephone and mail now, with increasing frequency, receive and exchange electronic communications and conduct business in the electronic forum including electronic mail or the capture of existing paper records into electronic form through imaging. Typically, these electronic communications, or captured documents, are organized into document and database filing systems for subsequent document or record retrieval to permit review and reproduction of the document when required at some later point in the future.
These computer readable forms of documents are stored in document collections on computer systems for easy access by the users of the computer system on which the document collections are stored. Such document collections, which are managed as official records, are unique in that they combine the official record electronic document with some very specific key data elements that adequately describe the record. The specific key data elements that describe an official record can be termed metadata and, typically, the metadata is stored in one or more databases. With each official record, there is an associated records subject category to specify the formal business rules relating to how the record should be maintained. Computer systems that provide access to such record collections, including computer network based systems that permit authorized users to access the records collection and database over the enterprise or corporate network, are typically termed records management systems. Where the records collection is available over an enterprise or corporate network, authorized users frequently also have the ability to obtain access to the records collection and database from a remote location. Remote location access is effected by establishing communications between the user desiring access to the data and the computer system which makes the stored records collection or database data available.
It is inherent in enterprise records systems, whether electronic or paper based, that a particular document may become lost or unavailable within the organization or corporate entity due to reorganizations and the ongoing reassignment of functions and responsibilities within the organization or corporation. Consequently, the need to reorganize document collections to reflect new organizational structures and functions and to ensure that documents can be made available for future retrieval has resulted in increasing reliance on automated systems which can adapt to the volume of documents or records maintained by an organization. One approach is to formulate a file plan as part of an electronic record keeping system or ERS. In ERS systems, a file plan specifies the framework for maintaining the organizational documents and electronic records and determines how long the records are maintained.
Under a file plan, organizational documents and document collections in the ERS are assigned attributes to meet organizational and legal requirements. For example, one of the attributes is a retention time specifying how long particular types of records are to be maintained. In a file plan, documents are frequently classified according to the functional unit of the organizational structure to which they relate. For example, human resources related records include such documents as those that provide employee and job applicant information. Unsolicited résumés, job performance evaluations and the like are the types of documents that will be maintained by a human resources department. Similarly, documents that relate to the design and production of services or goods offered by the organization are kept by the appropriate organizational unit responsible for the specific functions of the operational unit of the organization.
Even with an ERS file plan, there is risk that important documents will be lost for reasons other than the disappearance of the document itself. A document may become misplaced in the enterprise filing system or misclassified. Such misclassified documents present a liability to an organization because the appropriate records management rules to meet organizational and legal requirements will not be accurately applied to the documents. Also, with increasing frequency, important documents originate in a wider variety of different forms beyond traditional sources within an enterprise. For example, paper based mail systems, facsimile correspondence, electronic mail and electronic data exchange all can form sources of important corporate or enterprise records. Naturally, the selection or mix of record sources will vary with each different organizational unit within the enterprise. Consequently, electronic forms of documents or records occur with increasing frequency within an enterprise organization. This trend, coupled with increasing diversity in the sources of records and changing systems and departmental requirements, makes maintaining a file plan or a current and reliable classification system for electronic records keeping systems increasingly vital.
In the past, automated document classification systems have been proposed, but they do not provide a boundary between what can be classified reliably by a machine and what requires human intervention and review. For example, U.S. Pat. No. 5,463,773 to Sakakibara et al. provides a document classifying system that is based on a recursive keyword selection algorithm that is used to build a document classification tree. The system of Sakakibara builds a classification tree which may or may not relate to the functional organizational units of an enterprise which has established systems and pre-existing classification categories for existing documents into which like documents created in the future are to be classified or filed. Automated classification tree structure creation and maintenance is not beneficial to an enterprise that seeks to classify large volumes of documents, such as received e-mail, into existing enterprise classifications for record handling and storage.
Other prior art document classification systems and methods include those described in U.S. Pat. No. 5,727,199 to Chen and U.S. Pat. No. 5,251,131 to Masand, which develops a set of document classification rules based on a training set. In Masand, probability weighting is used to classify natural language. In U.S. Pat. No. 6,026,399, Kohavi teaches the production of a numeric discrimination or purity factor to discriminate between relevant and non-relevant records. In U.S. Pat. No. 6,044,375 to Shmueli, a neural network is used to extract metadata from computer readable documents.
It is an object of the present invention to provide for the automatic classification or categorization of computer readable electronic records or forms of documents. Consequently, the inventive system eliminates the need for the end user to identify data as a record and to associate the record accurately to a particular record subject category. The inventive system does this through the use of software defining a boundary between automated classification or association and when such classification or association requires the intelligence of human understanding of the meaning or context of the candidate electronic record. Preferably the process to implement the automated classification or association of a record to a particular record subject category within a file plan can itself exhibit features of the intelligence of human understanding of the meaning or context of the candidate electronic record.
The classification or record subject category assigned to a record is taken from a pre-defined or pre-existing classification assignment. The inventive system assigns a particular instance of a pre-existing classification or category to a record presented to the system for classification. In one embodiment of the invention, the computer readable records or documents to be classified are text based. The records presented to the system to be classified include text (TXT) format records or records in hypertext mark-up language (HTML) format. Other computer readable text based document formats can be used.
The inventive system operates in two basic modes, a training mode and a classification mode. The first mode, the training mode, entails processing a pre-defined classification list and a training set of several documents for each instance or entry in the classification list: at least three to five, and preferably twenty to twenty-five or more. The training mode processing involves a classifier or classification agent that processes the records already stored or organized within the classification list and training set to establish an association or correlation between the content of the training documents and each pre-specified associated classification. Once the training mode processing is complete, the second mode of operation, termed the automatic classification mode, is available. In the automatic classification mode, further documents are provided to the classification agent for classification. For each document presented after training, the classification agent will produce or output a corresponding classification instance, or a group of classification instances, each with an associated confidence factor. In the preferred embodiment, the confidence factor ranges between 0 and 100% and represents the level of confidence that the classification agent has found an exact match (in the case of 100%) or a close match (in the case of a value less than 100%) to a predefined category.
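The two modes can be illustrated with a minimal sketch. The text above does not specify how the classification agent correlates document content with classifications, so the bag-of-words profiles and cosine similarity used here are purely an assumption chosen for brevity, and the names ClassificationAgent, train, classify, tokenize and cosine are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ClassificationAgent:
    """Hypothetical agent: one word-count profile per classification instance."""

    def __init__(self):
        self.profiles = {}

    def train(self, training_set):
        """Training mode: training_set maps each classification instance
        to the list of training documents filed under it."""
        sums = defaultdict(Counter)
        for category, documents in training_set.items():
            for document in documents:
                sums[category].update(tokenize(document))
        self.profiles = dict(sums)

    def classify(self, document):
        """Classification mode: return (classification instance,
        confidence factor in percent) pairs, highest confidence first."""
        words = Counter(tokenize(document))
        results = [
            (category, 100.0 * cosine(words, profile))
            for category, profile in self.profiles.items()
        ]
        return sorted(results, key=lambda pair: pair[1], reverse=True)


# Usage: three training documents per classification instance (the stated minimum).
agent = ClassificationAgent()
agent.train({
    "Human Resources": [
        "resume of job applicant",
        "employee performance evaluation",
        "job applicant interview notes",
    ],
    "Engineering": [
        "product design drawing",
        "production schedule for product",
        "design review of product",
    ],
})
ranked = agent.classify("resume from a job applicant")
```

Here ranked lists every classification instance with its confidence factor, so the caller can act on the top entry alone or on several candidates.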
The classification instance and confidence factor output of the classification agent for the document to be classified is provided to a confidence factor decision control table, which specifies the action to be taken for a given confidence factor. The confidence factor decision control table has a plurality of actions or cases for classification of the document. The action or processing of the document is controlled or decided by user provided settings contained in the confidence factor decision control table, based on the classification instance and confidence factor returned by the classification agent. The action or processing of the document includes either further processing by computer or requesting input from an operator or user of the system to classify the document. The confidence factor output from the classification agent is compared to a user configurable list of ranges provided in the confidence factor decision control table, and the entry whose range contains the confidence factor determines the processing option or action taken in respect of the document. Preferably, the ranges specified in the confidence factor decision control table are discrete contiguous segments. That is, the ranges are non-overlapping and without gaps.
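A confidence factor decision control table of this kind can be sketched as an ordered list of ranges. The range boundaries, the action names and the functions validate and select_action below are hypothetical illustrations, not values taken from the specification; the sketch only shows one way to enforce the preferred property that the ranges are contiguous, non-overlapping and without gaps.

```python
from bisect import bisect_right

# Hypothetical user-configured table.
# Each entry: (lower bound inclusive, upper bound exclusive, action name).
# The final range is treated as closed at the top so that 100% is covered.
DECISION_TABLE = [
    (0.0, 40.0, "assign_null"),
    (40.0, 70.0, "request_user_review"),
    (70.0, 90.0, "offer_pick_list"),
    (90.0, 100.0, "auto_assign"),
]


def validate(table, low=0.0, high=100.0):
    """Raise ValueError unless the ranges tile [low, high] with no gap or overlap."""
    if not table:
        raise ValueError("decision table is empty")
    expected = low
    for lower, upper, action in table:
        if lower != expected:
            raise ValueError(f"gap or overlap before range of action {action!r}")
        if upper <= lower:
            raise ValueError(f"empty or inverted range for action {action!r}")
        expected = upper
    if expected != high:
        raise ValueError("ranges do not reach the upper limit")


def select_action(table, confidence, low=0.0, high=100.0):
    """Return the action whose range contains the confidence factor."""
    if not low <= confidence <= high:
        raise ValueError("confidence factor out of bounds")
    lowers = [lower for lower, _, _ in table]
    # Index of the last range whose lower bound is <= confidence.
    index = bisect_right(lowers, confidence) - 1
    return table[index][2]


validate(DECISION_TABLE)
action = select_action(DECISION_TABLE, 83.5)
```

Because validation rejects any gap or overlap, every confidence factor between 0 and 100% maps to exactly one action, which is the point of requiring discrete contiguous segments.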
The invention also provides a mode of operation to retrain the classification agent by using the classification agent to process a “retraining set” of records in conjunction with a classification group containing all instances of all possible classifications. The retraining set preferably provides more document instances per classification instance than the minimum of three to five documents per classification instance required for initial classification agent training, and can include the entire document collection and associated classifications. Retraining mode is beneficial for periodically adapting the classification agent to current document collections, both to improve classification agent performance and to provide a basis for the user to set confidence factor table ranges.
In one of its aspects, the invention provides a computer based system for automated classification of electronic document records comprising a source of electronic records and an electronic document server operably connected to at least one electronic document database and including means to communicate an electronic document and means to receive user control input. The system further includes a classification agent in communication with the electronic document server, and the classification agent is operable in a training mode and a classification mode and includes: means to receive an electronic document; means to receive a classification instance; and output means to provide a result. The system has decision control means accommodating at least two processing actions each processing action having a user configurable activation criteria responsive to said classification agent result.
In another of its aspects, the invention provides a computer based system for automated classification of electronic document records comprising an electronic document server operably connected to at least one electronic document database and including means to store an electronic document and means to receive user control input and a source of electronic records operably connected to the electronic document server. The system further includes a classification agent in communication with said electronic document server, the classification agent is operable in a training mode and a classification mode and includes: means to receive an electronic document; means to receive a classification instance; and output means to provide a result. A decision control means is included to accommodate at least two processing handlers selected from the group comprising: means to assign a classification instance to an electronic document; means to produce a list of at least two classification instances for an electronic document; means to assign a review classification instance to an electronic document; and means to assign a null classification instance to an electronic document.
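The four processing handlers named in this aspect can be sketched as plain functions selected by the top confidence factor. Everything concrete here is an assumption made for illustration: the thresholds, the REVIEW placeholder category, the dictionary layout of the result and the function names are hypothetical, and the input is assumed to be a list of (classification instance, confidence factor) pairs sorted with the highest confidence first.

```python
REVIEW = "REVIEW"  # hypothetical placeholder for a review classification instance


def assign_top(ranked):
    """Assign the best-matching classification instance to the document."""
    return {"action": "assign", "classification": ranked[0][0], "candidates": []}


def pick_list(ranked, size=3):
    """Offer the user a short list of candidates. The list holds at least two
    entries only when the agent returned at least two classification instances."""
    names = [name for name, _ in ranked[:size]]
    return {"action": "pick_list", "classification": None, "candidates": names}


def assign_review(ranked):
    """File the document under a review classification for later human attention."""
    return {"action": "assign_review", "classification": REVIEW, "candidates": []}


def assign_null(ranked):
    """Leave the document without a classification instance."""
    return {"action": "assign_null", "classification": None, "candidates": []}


# Hypothetical thresholds on the top confidence factor, checked from highest down.
HANDLERS = [
    (90.0, assign_top),
    (70.0, pick_list),
    (40.0, assign_review),
    (0.0, assign_null),
]


def dispatch(ranked):
    """Run the first handler whose threshold the top confidence factor meets."""
    top_confidence = ranked[0][1] if ranked else 0.0
    for threshold, handler in HANDLERS:
        if top_confidence >= threshold:
            return handler(ranked)
    raise ValueError("confidence factor is below every threshold")


result = dispatch([("Human Resources", 75.0), ("Engineering", 72.0)])
```

Keeping each handler as a separate function means a system that accommodates only two of the four, as the aspect permits, simply registers fewer entries in HANDLERS.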
The invention will now be described with reference to the drawings, in which like reference numerals have been used to depict like features of the invention throughout.