It is often useful to perform a trend analysis on one or more documents within a given domain in order to discover current trends and challenges within that domain. For example, one may wish to partition a collection of documents into a taxonomy, that is, a set of mutually disjoint classes of documents. One could, for instance, use a method called intuitive clustering, which breaks out classes corresponding to the most frequently occurring key terms, taken in order, and then rebalances the clusters with k-means clustering.
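The following is a minimal sketch of one plausible reading of intuitive clustering, not a definitive implementation. The text above does not specify the details, so the following are assumptions made for illustration: whitespace tokenization, seeding each class from the unassigned documents that contain the next most frequent key term, alphabetical tie-breaking, and a plain k-means pass over raw term-count vectors. All function names and the toy documents are hypothetical.

```python
from collections import Counter

def term_counts(text):
    """Map a document to its term-occurrence counts (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse term-count vectors."""
    return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in set(a) | set(b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def intuitive_clusters(docs, k, stopwords=frozenset(), iterations=10):
    """Seed up to k classes from frequent key terms, then rebalance with k-means."""
    vecs = [term_counts(d) for d in docs]
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(t for v in vecs for t in v if t not in stopwords)

    # Seed pass (assumed): visit key terms from most to least frequent, ties
    # broken alphabetically; each term claims the still-unassigned documents
    # that contain it and becomes the name of a new class.
    labels = [None] * len(docs)
    seeds = []
    for term, _ in sorted(df.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(seeds) == k:
            break
        members = [i for i, v in enumerate(vecs) if labels[i] is None and term in v]
        if members:
            for i in members:
                labels[i] = len(seeds)
            seeds.append(term)

    # Initial centroids come from the seeded classes, so no random seed is used.
    centers = [centroid([vecs[i] for i, l in enumerate(labels) if l == c])
               for c in range(len(seeds))]

    # Rebalancing pass: standard k-means reassignment and centroid update.
    for _ in range(iterations):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
                      for v in vecs]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [vecs[i] for i, l in enumerate(labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return seeds, labels

# Hypothetical toy collection, for illustration only.
docs = [
    "solar cell efficiency coating",
    "solar cell panel mounting",
    "image sensor pixel array",
    "image sensor pixel readout",
]
seeds, labels = intuitive_clusters(docs, k=2)
print(seeds, labels)
```

Because the initial centroids are derived from term frequencies rather than from a random draw, repeated runs of this sketch on the same collection give the same classes.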
After constructing a taxonomy, one could examine the most typical and least typical examples in each class and/or perform a qualitative trend analysis of each class relative to the trend of the domain. One could also construct a landscape graph that represents a trend analysis of named classes of similar documents.
However, this classification of documents for trend analysis depends on at least one similarity measure. A typical similarity measure is distance in a vector space in which each dimension corresponds to a vocabulary term and a document's projection on that dimension corresponds to the number of occurrences of the term in the document. Ambiguity of term usage, however, makes this a poor measure of similarity. More specifically, due to various ambiguities, one must either edit a taxonomy generated by a random seed process or an intuitive clustering process, or build the taxonomy up laboriously, one class at a time, based on the detailed knowledge of subject matter experts. Randomization removes robustness, that is, the repeatability of results. Ad hoc editing to remove ambiguity also removes robustness. Moreover, robustness is also removed by the ad hoc building of classes from subject matter expertise, because each expert is likely to classify slightly differently from every other expert.
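The term-count similarity measure described above can be sketched as follows. This is an illustrative sketch only: the whitespace tokenization, the function names, and the three toy sentences are assumptions, not taken from any particular system. The toy sentences are chosen so that two documents with different subjects share most of their vocabulary.

```python
from collections import Counter
import math

def term_vector(text):
    """Represent a document as term -> number of occurrences."""
    return Counter(text.lower().split())

def euclidean(a, b):
    """Euclidean distance between two term-count vectors.

    A Counter returns 0 for a missing term, so every vocabulary term that
    appears in either document contributes one dimension.
    """
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))

# Hypothetical documents: A and C are both about a battery, B is about a sensor.
doc_a = term_vector("a battery that powers a sensor")
doc_b = term_vector("a sensor that monitors a battery")
doc_c = term_vector("a battery that stores more energy")

d_ab = euclidean(doc_a, doc_b)
d_ac = euclidean(doc_a, doc_c)
print(f"distance A-B: {d_ab:.3f}")
print(f"distance A-C: {d_ac:.3f}")
```

In this sketch, document A lies closer to document B than to document C, even though A and C share a subject, because the count vectors record only which terms occur and not the role each term plays.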
For example, some of the terms used for determining the classes may represent stylistic choices rather than meaningful technical distinctions, thus resulting in style-based rather than content-based clustering. Even when each key term used is technical, the distinctions may be a hodgepodge of unrelated criteria, presenting a confusing final trend analysis to the user.
Thus, current classifications tend to be based on a hodgepodge of unrelated criteria because they are based on the ambiguous occurrence of vocabulary terms. As such, it is difficult to extract robust, useful features that provide classification based on consistent (unambiguous) term usage.
As a specific example, the structured information in patent documents provides a number of useful features: assignee, issued patent versus patent application, United States Patent and Trademark Office (USPTO) classification, etc. However, the USPTO classification is, itself, more of a hodgepodge than an optimal classification based on a small dimension. It also depends on user selection with little apparent consistency of such selections among closely related patents.
In addition to the need for a consistent, repeatable classification based on unambiguous usage or small dimension, at least in the patent domain, it is often preferable to obtain classifications based on the purpose of the invention as opposed to the technical details of the method steps (i.e., the means). Although one may also use a classification based on the means, it is often desirable to avoid mixing purpose and means in one classification.
For example, one patent document might describe the use of Complementary Metal Oxide Semiconductor (CMOS) technology to produce an image sensor. Another patent document (with a very different purpose) might describe the use of an image sensor in an inventive process for CMOS device manufacturing. In the former case, the image sensor is the purpose; in the latter case, the image sensor is a means. Likewise, the phrases “provide an image sensor in a manufacturing process” and “provide a new solar cell” both contain the term “provide,” but only the latter instance indicates the purpose of the invention.
Thus, there exists a need for extracting from a document a feature of unambiguous usage, which may serve as a summary of the document. In particular, for patent document trend analysis, it would be highly desirable to be able to extract a feature that represents the purpose of the invention as opposed to the means of achieving that purpose.