1. Field of the Invention
The present invention relates to computer-implemented document management systems and more particularly to a method for analyzing externally generated documents for use in such a system.
Business and government processes are based on agreements—between buyer and seller; producer, middleman, and user; prime contractors and subcontractors; employer and employee, etc. Those agreements are usually memorialized as written contracts. Such contracts tend to be written in a specialized language of legal terms developed over many decades of legal practice and terms of art that have become specially defined over time during the development of particular business environments. In recent years, businesses have begun to move from paper-based contracts to agreements that are negotiated, created, e-signed, administered and archived using computers (“digitized contracts”).
Unlike some other types of documents, where a text search using standard text searching algorithms provides sufficiently accurate access to elements of the document, the information required of a contract is usually a finely granular data point required for a specific use. A single contract data element is narrowly and uniquely determined by physical location and descriptive text modifiers within the contract.
For example, searching for “royalty rate” in a sound recording agreement is not sufficient: the correct royalty rate to be found might be for a specific song recorded in a specific option term, sold in a specific territory, incorporating a specific sales volume escalation, for one specific payee among many, depending on the date of the use, etc. Accessing such information requires that each significant piece of data, and often information about its links to other data, be identified. That can be done using “metadata tags,” which specifically define each data element.
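By way of illustration only, the following minimal Python sketch shows how a tagged data element might carry the qualifiers that narrow it. The class name `DataElement`, the function `find_elements`, the qualifier names, and all sample values are hypothetical and are not taken from any actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One tagged value in a contract, plus the qualifiers that narrow it."""
    tag: str                 # metadata tag, e.g. "royalty_rate" (assumed name)
    value: str               # text of the value as found in the contract
    qualifiers: tuple = ()   # (name, value) pairs such as ("territory", "US")


def find_elements(elements, tag, **required):
    """Return elements with the given tag whose qualifiers include every required pair."""
    matches = []
    for element in elements:
        if element.tag != tag:
            continue
        quals = dict(element.qualifiers)
        if all(quals.get(name) == wanted for name, wanted in required.items()):
            matches.append(element)
    return matches


# Hypothetical tagged elements from a sound recording agreement.
elements = [
    DataElement("royalty_rate", "9%", (("territory", "US"), ("term", "initial"))),
    DataElement("royalty_rate", "7%", (("territory", "EU"), ("term", "initial"))),
    DataElement("royalty_rate", "11%", (("territory", "US"), ("term", "option_1"))),
]

# A bare search on the tag returns all three rates; the qualifiers select one.
us_option = find_elements(elements, "royalty_rate", territory="US", term="option_1")
print([e.value for e in us_option])
```

A search on the tag alone returns every royalty rate in the agreement, which is why the qualifiers are needed to isolate a single data point.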
2. Description of Prior Art Including Information Disclosed Under 37 CFR 1.97 and 1.98
Although contracts can be one-of-a-kind documents created for a special situation (e.g., building a piece of a space shuttle), most contracts address recurring problems, for example real estate leases or non-disclosure agreements. Because all instances of contracts within a contract “type” are addressing the same problem, they tend to be made up of similar data elements, i.e., the metadata tags are the same. The variability between any two instances of contract types is found in the order in which the elements are presented, the words used to name, describe or modify the various data elements, and in the values assigned to any data element.
The collection of “metadata tags” for a certain “type” of agreement forms a “template” for that type of agreement. That is, the template will contain the data elements usually found in the particular type of agreement, although perhaps in a different order from the original document. Many computer-based contract creation processes rely on a template-based system, from which actual agreements can be generated and/or interpreted from a pre-existing template mapping to the particular type of agreement.
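As a rough sketch of the template concept, and under the assumption that a template can be modeled as an unordered set of tag names, the Python fragment below compares the tags found in a document against a template without regard to the order in which they appear. The tag names and the function `compare_to_template` are hypothetical.

```python
# Hypothetical native template for a non-disclosure agreement, modeled as a set of tags.
NDA_TEMPLATE = frozenset(
    {"disclosing_party", "receiving_party", "effective_date", "term", "governing_law"}
)


def compare_to_template(found_tags, template):
    """Compare tags found in a document (in any order) against a template."""
    found = set(found_tags)
    return {
        "matched": sorted(found & template),      # elements the template expects
        "missing": sorted(template - found),      # expected but not found
        "unexpected": sorted(found - template),   # found but not in the template
    }


# Tags in the order they happened to appear in one hypothetical document.
result = compare_to_template(
    ["term", "receiving_party", "disclosing_party", "governing_law", "arbitration"],
    NDA_TEMPLATE,
)
print(result)
```

Using a set reflects the point made above that the order of elements may differ from one instance of a contract type to another.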
In those systems, a practical difficulty arises in analyzing and/or administering contracts when a user encounters a contract that was generated without reference to his system's own template (“native template”) for that type of agreement. Commonly this situation occurs in the form of a contract that another party to an agreement has created using his own template (“foreign template”).
We are aware of several patents and patent publications that relate to the same general subject matter as the present invention. However, none of those references teach a system with the capabilities of the present invention. In particular, none teach a method for analyzing documents, such as contracts, generated using a template foreign to the system that is managing the documents.
U.S. Pat. No. 5,909,570 entitled “Template Mapping System for Data Translation,” issued to Webber on Jun. 1, 1999, describes a method for translating (“mapping”) from an incoming dataset into a second format. It is demonstrated in terms of an EDI message format (a well defined data protocol, which uses segment markers to define the meaning and boundary of each data element), although it claims that it would work for non-EDI formats as well. In either case, however, the incoming data is highly structured. The incoming format structure, in terms of fixed data location, is generally pre-known, although possibly incompletely known.
However, and key to the system disclosed in the Webber patent, the incoming format structure (logical layout) must be completely known via accompanying components, which take the form of 1) mapping template rules, and 2) table field layouts (i.e., field names and sizes in bytes, etc.). The system builds and uses SQL queries to process the records of the incoming data and extract the data based on the given table field layouts. Then, using a user-defined mapping, a second file is built in a second format, and the data is placed into the second format.
A second aspect of the system disclosed in the Webber patent is a method for allowing a non-technical user to develop a mapping from the known format of the incoming data to an arbitrary second format using (so-called) non-programming, “macro” style commands.
By way of contrast, the present invention addresses specifically incoming data files whose structure is previously unknown to the system, either by physical layout or by logical table layout (i.e., database field names). Further, to the extent the incoming data set of the present invention has any structure, it is not a table structure specifically developed with data processing in mind, as in the Webber patent, but instead is a text document intended for human processing.
In addition, the Webber patent does not have the “learning” capabilities of the present invention, which enable the system to become increasingly accurate as more documents of a particular type are processed. Further, the system of the present invention is not limited to SQL or other database-style lookup commands in working with the second document. Instead, it is based on a full-text search lookup.
U.S. Pat. No. 5,557,780 entitled “Electronic Data Interchange System for Managing Non-Standard Data,” issued to Edwards et al. on Sep. 17, 1996, relates to EDI data exchange. That patent describes a system that attempts to handle an incoming dataset that is in an unknown format. It does so by trying various predefined formats, which it has stored in a database and which might apply to that type of EDI document, to see if any fit. Such formats are defined by reading EDI-protocol defined data segment identifiers, unique alphanumeric code starting characters, and a segment terminator.
This Edwards et al. process is not really comparable at all to the present invention, except that, at a very basic level, both try to parse a new document of unknown format and both employ a pattern matching of characters (i.e., a most basic computer operation).
Edwards et al. assumes that the incoming dataset does have a template/pattern, and that template/pattern is in accordance with pre-established EDI-protocols, and the pattern has been previously mapped out and entered into the pattern database. The process simply needs to find the correct pattern by trying each in sequence and evaluating which one works.
An important feature of the present invention is that it assumes that the foreign document has no previously mapped out pattern that itself requires accordance with a predefined protocol. The template/pattern in Edwards et al. is defined in terms of computer-specific codes and a limited, strictly defined format, not natural/contract language in an unlimited prose format, as in our invention.
U.S. Pat. No. 6,067,531 entitled “Automated Contract Negotiator/Generation System and Method,” issued to Hoyt et al. on May 23, 2000, describes a process for negotiating and generating a contract between multiple parties. It requires access to a database of contract components distributed among several users, who are able to assemble a contract based on their respectively assigned access levels. It discloses using a point system to control access to optional contract components. Also, it employs various web technology and Graphical User Interfaces (GUIs) for accomplishing the processes.
The system described in the Hoyt et al. patent bears no relevance to the system of the present invention other than the fact that both involve manipulating contracts via computer.
U.S. Pat. No. 6,304,892 entitled “Management System for Selective Data Exchanges Across Federated Environments,” issued to Bhoj et al. on Oct. 16, 2001, describes a process for managing a computer network and services via referral to a database of Service Level Agreement terms.
The Bhoj et al. system is not relevant to the present invention. Although the process does parse a contract, put the data into a computer-readable form, and then act on that data to manage computer networks, the parsing process is not a focus of the Bhoj et al. patent.
Patent Application Publication No. 2003/0018481, entitled “Method and Apparatus for Generating Configurable Documents,” Zhou et al. applicants, describes a process for generating a contract from a template. It distinguishes between text components and “compensation” components, which may be combined in a generated document.
Except for the fact that the Zhou et al. system represents a contract via a “template”, it is not related to the present invention.
Patent Application Publication No. 2003/0074633, entitled “Apparatus and Methods for Generating a Contract,” Boulmakoul et al. applicants, describes a process for generating variations of a contract for different markets (i.e., different countries) that, via a system of rules, automatically makes sure that the variant contract conforms to the rules of the new country. For example, if a certain clause is required in a contract in a particular country, the system makes sure it is included. It has nothing to do with the present invention, although it does discuss automated contract elements.
Patent Application Publication No. 2004/0060005 entitled “Systems, Methods and Computer Programs For Analysis, Clarification Reporting On and Generation of Master Documents For Use in Automated Document Generation,” Vasey applicant, describes a method dealing generally with legal documents, including contracts.
The Vasey process works with legal documents that are “semantically identical although syntactically different,” which is the focus of the present invention. The objective of the Vasey disclosure is to ensure that no errors are present in the computer representation of a type of document so that it is suitable for further processing not closely supervised by human expert assistance. The objective of the present invention is to provide just a “rough cut” of completely unformatted data, with many errors a given. The difference is that Vasey starts with a document already highly formatted (in a style meaningful to a computer) while the present invention starts with a document with no such formatting; in other words, well upstream from the Vasey starting place.
The Vasey system specifically includes (1) “data representing a first mark-up notation or style,” (2) “data representing a second mark-up notation or style,” and (3) “data representing a . . . document written in the first mark-up notation of style.” In other words, in Vasey, the format of the documents is already well known before the document can be used by the system and is translated into a pre-defined format meaningful to a computer.
That is a fundamental difference between Vasey and the present invention. The focus of the present invention is to provide the ability to address specifically pure text (i.e., no mark-up notation) documents whose format is completely unknown.
To the extent that a “mapping” exists in Vasey, Vasey envisions “meta-level definition of types” as a pre-requisite. Since this meta-level description is precisely what is missing in the document to be analyzed in the system of the present invention, no such mapping is possible.
Further, Vasey uses as its basic analysis technology text characters (meta-level tags “outside” the non-technical use of the document) whose main purpose is functional regarding the form of the documents. In the present invention, the actual words of the document are employed as operators, although the system, being a computer, matches those words on an exact character-to-character (non-denotational) basis.
It is, therefore, a prime object of the present invention to provide a method for analyzing externally generated documents in a document management system that is capable of parsing plain text documents, that is, documents that have not previously been “marked-up” or set out in database or machine friendly format, are not accompanied by descriptions of the physical location of data, and have not otherwise been formatted or pre-analyzed with computer interaction or machine processing in mind.
It is another object of the present invention to provide a method for analyzing externally generated documents in a document management system that employs recognition of text coincidence, which acts as a form of connotative “meaning” of actual text.
It is another object of the present invention to provide a method for analyzing externally generated documents in a document management system that recognizes that certain special use documents (e.g., business contracts) can often be sorted into types or families of documents with a common purpose, each of which uses a relatively limited and specialized vocabulary, and that a text translation process can take advantage of that circumstance.
It is still another object of the present invention to provide a method for analyzing externally generated documents in a document management system that includes a technique for using a document's structure and section topic headers as elements in determining the meaning of the document.
It is still another object of the present invention to provide a method for analyzing externally generated documents in a document management system that weighs multiple elements of meaningful text inherent in the types of documents targeted, including various synonymic descriptors, variety of proximal domains, and evaluation of expected range or allowed data elements, to develop a weighted probability of the most likely location of document elements.
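Purely as an illustrative sketch of how such weighting might be arranged, the Python fragment below combines three kinds of evidence (a synonymic descriptor, a section header standing in for a proximal domain, and an expected value range) into a single score per passage. Every term, weight, range, and function name here is an assumption made for the example; none is specified by the text above.

```python
import re

# Hypothetical descriptors for one data element ("royalty rate"); all weights are assumed.
SYNONYMS = {"royalty rate": 1.0, "royalty": 0.6, "percentage payable": 0.4}
HEADER_TERMS = ("royalties", "compensation")   # headers suggesting the right domain
EXPECTED_RANGE = (0.0, 50.0)                   # plausible percentage values
W_SYNONYM, W_HEADER, W_RANGE = 0.5, 0.2, 0.3   # weights sum to 1.0


def score_passage(header, text):
    """Combine three kinds of textual evidence into one weighted score in [0, 1]."""
    lowered = text.lower()
    # Strongest synonymic descriptor present in the passage, if any.
    synonym = max((w for term, w in SYNONYMS.items() if term in lowered), default=0.0)
    # Whether the enclosing section header points to the expected domain.
    in_domain = 1.0 if any(term in header.lower() for term in HEADER_TERMS) else 0.0
    # Whether any percentage in the passage falls inside the expected range.
    values = [float(v) for v in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)]
    low, high = EXPECTED_RANGE
    in_range = 1.0 if any(low <= v <= high for v in values) else 0.0
    return W_SYNONYM * synonym + W_HEADER * in_domain + W_RANGE * in_range


def most_likely_location(passages):
    """Return (index, score) of the highest-scoring (header, text) passage."""
    scores = [score_passage(header, text) for header, text in passages]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical passages, each paired with its section header.
passages = [
    ("Term", "The initial term shall be 3 years from the effective date."),
    ("Royalties", "Company shall pay Artist a royalty rate of 12% of net sales."),
    ("Taxes", "Withholding of 30% may apply to any royalty paid abroad."),
]
best_index, best_score = most_likely_location(passages)
print(best_index, round(best_score, 2))
```

In this example the third passage also mentions a royalty and an in-range percentage, but it scores lower because its descriptor is weaker and its header falls outside the expected domain.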
It is still another object of the present invention to provide a method for analyzing externally generated documents in a document management system that has the capacity to “learn” over multiple exposures to additional similar documents to increase its ability to recognize a previously unseen variation.
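One simple way such a “learning” capacity could be approximated, offered only as a sketch and not as the method of the invention, is to count how often each phrase has been confirmed as naming a given data element and to derive a weight from those counts. The class `DescriptorMemory`, its methods, and the sample phrases are hypothetical.

```python
from collections import Counter, defaultdict


class DescriptorMemory:
    """Remembers which phrases have been confirmed as naming each data element."""

    def __init__(self):
        self._counts = defaultdict(Counter)   # tag -> Counter of confirmed phrases

    def confirm(self, tag, phrase):
        """Record that `phrase` was confirmed (e.g., by a reviewer) to name `tag`."""
        self._counts[tag][phrase.lower()] += 1

    def weight(self, tag, phrase):
        """Share of confirmations for `tag` that used `phrase`; 0.0 if none recorded."""
        counts = self._counts.get(tag)
        if not counts:
            return 0.0
        return counts[phrase.lower()] / sum(counts.values())


# Hypothetical confirmations gathered while processing three similar agreements.
memory = DescriptorMemory()
memory.confirm("royalty_rate", "royalty rate")
memory.confirm("royalty_rate", "Royalty Rate")
memory.confirm("royalty_rate", "artist percentage")

# A variation seen only once already carries some weight for the next document.
print(memory.weight("royalty_rate", "artist percentage"))
```

Each additional confirmed document shifts the weights, so a descriptor that was unknown at first gains weight as it recurs in documents of the same type.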