1. Field of the Invention
The present invention relates to systems and methods for automated text processing, and for automated content and context analysis. In particular, the present invention relates to automated systems and methods of identifying sentences near a document citation (such as a court case citation) that suggest the reason(s) for citing (RFC).
2. Related Art
In professional writing, people cite other published work to provide background information, to position the current work in the established knowledge web, to introduce methodologies, and to compare results. For example, in the area of scientific research, a researcher has to cite to demonstrate his contribution to new knowledge. As another example, in writing court decisions, a judge has to cite precedent legal doctrine to comply with the common law tradition of stare decisis. However, the citing in the legal profession is more precise than that in the scientific research community.
Courts deal with legal issues such as points of law or facts in dispute. Issues arise over differences of opinion as to definition, interpretation, applicability of specific facts and acts, prior decisions, legal principles or rules of law. Every court decision or case involves one or more issues (the reason a lawsuit was brought). In addition, most cases involve several sub-issues that arise from the detailed analysis and consideration of the issues. Thus, almost every case discusses multiple issues.
However, these multiple issues are often not intrinsically related as one might expect in scientific literature. Rather, the issues only occur together in a given case because they have a bearing on the specific factual situation dealt with in that case. Discussion of each issue or sub-issue is usually supported by citing relevant legal authorities, which may not be related to one another.
For example, People v. Surplice, 203 Cal. App.2d 784, is frequently cited for the general issue of how the court should exercise its judicial discretion when the law allows it. But, it is also frequently cited for the more specific issue that says that it is reversible error when a judge fails to read and consider a probation officer's pre-sentence report.
As a result, when a citing case criticizes a cited case, the citing case is usually not criticizing the whole case. Most of the time, the criticism is on a specific legal issue. Similarly, a citing case may reference a cited case for a specific, supportive point of law.
It is not unusual to read a citing case that both agrees with the cited case on one issue and disagrees with it on a different issue. Traditional content analysis techniques that apply statistical models to whole documents run into difficulty in pinpointing the exact reason a case is cited.
Thus, there is a need in the art to provide a technique that can extract the reason for citing (RFC) at a local region where the citing instance occurs. However, there do not appear to be any conventional systems for performing the required task of finding text near a citing instance that indicates the reason a document is cited. It is to fulfill this need, among others, that the present invention is directed. In fulfilling this need, the invention provides new applications of techniques that are known in the art, such as word stemming, informetrics and vector space information retrieval, which are now briefly discussed.
Porter in [Porter 1980] describes a word stemming algorithm that strips suffixes from words. This conventional word stemming algorithm handles many types of suffixes and is not limited by the length of a word. However, this approach is not computationally very fast and does not perform well on document sets containing many long words, such as court opinions and medical journal articles. Nevertheless, Applicants have recognized that it is desirable to use stemming to find morphological variations of words, that is, words that have different suffixes. Applicants have further recognized that, because many input documents (especially court opinions) contain many long words, it is valuable to provide a stemming method that simply shortens words to their first N letters (where N is a positive integer such as six). Such an inventive stemming method is described in the Detailed Description.
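By way of illustration only, the first-N-letters stemming idea described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method of the Detailed Description: the function names prefix_stem and stem_text, the lowercasing step, and the alphabetic-run tokenizer are all hypothetical choices made for the example.

```python
import re


def prefix_stem(word: str, n: int = 6) -> str:
    """Shorten a word to its first n letters (hypothetical helper).

    Lowercasing is an assumption of this sketch, not something the
    text above specifies.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return word.lower()[:n]


def stem_text(text: str, n: int = 6) -> list[str]:
    """Split text into alphabetic runs and prefix-stem each one.

    The regular-expression tokenizer is an assumption of this sketch.
    """
    return [prefix_stem(w, n) for w in re.findall(r"[A-Za-z]+", text)]


# "discretion" and "discretionary" collapse to the same stem.
stems = stem_text("judicial discretion and discretionary rulings")
print(stems)
```

With N equal to six, words that share their first six letters are treated as variants of one another, whatever suffixes follow. Words shorter than N letters are left unchanged.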
Informetrics is a term whose definition is somewhat ambiguous in the literature. It appears to have been first introduced in 1979 as a general term covering both bibliometrics and scientometrics [Brookes, 1991]. All three terms have been used loosely to mean more or less the same thing. Informetrics can be perceived in its broadest sense as “the study of the quantitative aspects of information in any form” [Brookes, 1991, p. 1991], or as “the search for regularities in data associated with the production and use of recorded information” [Bookstein et al., 1992].
Small [Small 1978], a bibliometrics researcher, found that if one examines the text around citing instances of a given scientific document, one can determine the ‘particular idea the citing author is associating with the cited document’. He goes on to say that the citation of a cited scientific document becomes a symbol for the ideas expressed in the text of the citing instance. However, court case opinion citation differs from that of the scientific community in two fundamental ways.
First, in the legal profession, a citing instance is normally for a single point of law, definition, or fact pattern that is precisely stated near the citing instance. In contrast, in the scientific community, a citing instance is often for very general principles or ideas that are normally not precisely stated near the citing instance.
Second, in the legal profession, two citing instances of a particular case are often for different points of law, definitions, or fact patterns [Morse 1998]. In contrast, in the scientific community, two citing instances are generally for the same principles or ideas, which are either not clearly stated or imprecisely stated near the citing instance.
Therefore, bibliometrics methods that use just the frequency of citation of documents do not generally work as well when applied to legal citations as they do when applied to scientific citations. As an example, take co-citation analysis [Small 1973], which is the analysis of the frequency with which two citations appear in the same document. One conclusion that co-citation analysis produces is that two documents citing the same two other documents have a high probability of being about closely related topics. But in the legal profession, this is not true as often as it is in the scientific community.
For example, if both of two case law documents D1 and D2 cite People v. Surplice, and both documents cite another case for an issue related to “a probation officer's pre-sentence report”, then co-citation analysis would conclude that these two cases have similar topics. But, if D1 cites People v. Surplice for the first, very general reason (how the court should exercise its judicial discretion), and D2 cites it for the second, very specific reason (dealing with a probation officer's pre-sentence report), then D1 and D2 could be about very different topics.
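The frequency counting that underlies co-citation analysis can be sketched in Python as shown below. This is only an illustrative sketch: the function name cocitation_counts, the dictionary layout, and the case names "Case A" and "Case B" are hypothetical placeholders, not data from any real collection.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations: dict[str, set[str]]) -> Counter:
    """Count how many citing documents cite each pair of cases together.

    `citations` maps a citing document to the set of cases it cites.
    """
    counts: Counter = Counter()
    for cited in citations.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: D1 and D2 cite the same two cases.
citations = {
    "D1": {"People v. Surplice", "Case A"},
    "D2": {"People v. Surplice", "Case A"},
    "D3": {"People v. Surplice", "Case B"},
}
counts = cocitation_counts(citations)
print(counts[("Case A", "People v. Surplice")])
```

Nothing in these counts records why D1 or D2 cited People v. Surplice. D1 and D2 look equally related whether they cite it for the same issue or for different ones, which is the limitation the example above describes.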
Accordingly, something more than mere co-citation frequency counts is needed to determine if two cases are similar in topic. It is to fulfill this need, among others, that the present invention is directed.
Concerning vector space information retrieval, the “Smart” system [Salton 1989] is an example of an information retrieval system based on the vector processing model. The goal of the Smart system is to find the documents that are similar to a “query” (a list of words). Both queries and documents are represented as word vectors. In the simple case, each element of a word vector is the frequency with which a specific word appears in the document collection.
A simple method of determining the similarity of a document to a query is to compute the dot product of the document's and query's word vectors. The dot product is the sum of the products of corresponding elements from the two word vectors, where corresponding elements contain the frequency counts of a given word, either in the document set or the query. Normally this similarity metric is normalized by taking into account the lengths of the document and query. The present invention provides, among other advantages, a new application of the vector processing model and of a similarity metric like the one described above.
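The dot-product similarity described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the Smart system's actual implementation: the function names, the tokenizer, and the use of per-text word counts are hypothetical, and cosine normalization (dividing by the Euclidean lengths of the two vectors) is assumed here as one common way of taking the lengths of the document and query into account.

```python
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    """Map each lowercase word in text to its frequency count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def dot(a: Counter, b: Counter) -> int:
    """Sum the products of corresponding word counts."""
    return sum(a[w] * b[w] for w in a.keys() & b.keys())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product normalized by the lengths of both vectors.

    Cosine normalization is an assumption of this sketch.
    """
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


query = word_vector("probation report")
document = word_vector("The judge failed to read the probation report.")
print(dot(query, document), cosine_similarity(query, document))
```

Only words that occur in both vectors contribute to the sum, so a document sharing no words with the query scores zero. Without the normalization, longer documents would tend to score higher simply because they contain more words.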
U.S. Pat. No. 5,918,236 (Wical; hereinafter “the '236 patent”) may be considered relevant. The '236 patent discloses a system that generates and displays “point of view gists” and “generic gists” for use in a document browsing system. Each “point of view gist” provides a synopsis or abstract that reflects the content of a document from a predetermined point of view or slant. A content processing system analyzes documents to generate a thematic profile for use by the point of view gist processing.
The point of view gist processing generates point of view gists based on the different themes or topics contained in a document. It accomplishes this task by identifying paragraphs from the document that include content relating to a theme on which the point of view gist is based. The '236 patent's Summary of the Invention discloses that the point of view gist processing generates point of view gists for different document themes by relevance-ranking paragraphs that contain a paragraph theme corresponding to the document theme that was determined by analyzing document paragraphs and the whole document.
However, the '236 patent's relevance-ranking does not solve the problem solved by the present invention: determining which sentences near a citing instance are the best ones to represent the reason for citing (RFC). Thus, there is a need in the art to provide a system that relevance-ranks sentences near a citing instance based on the similarity of each such sentence to the typical context of many citing instances for a given document. Furthermore, there is a need to provide a system to determine typical context by analyzing the context of many citing instances for the same case. It is to fulfill these various needs, among others, that the present invention is directed.