The amount of information generated and made publicly available in the private, governmental and business sectors has increased tremendously over recent decades. In most spheres of business, it is no longer possible to stay up to date by reading all the documents available on a particular subject. This problem, known as “information overload”, has led to the development of several computer-aided methods that facilitate and accelerate the retrieval, organization and evaluation of the available and relevant data on a particular subject.
The optimal method for determining the relevance of documents for a particular question depends heavily on the structure of the data objects comprising the information of interest. Data objects span a continuum of structuredness, ranging from highly unstructured data such as natural language text, stored for example in the form of web pages, to highly organized data such as entries in relational databases, wherein data is stored in tables according to a particular, structured database schema.
Data organized in highly structured data sources such as databases can be interpreted and processed by computers, e.g. by applying appropriate retrieval requests such as SQL queries. However, it is a time-consuming task for humans to develop a database schema suitable for the data to be represented and stored in the database and to construct appropriate queries for each subject field a user may be interested in. For this and other reasons, many documents which may be relevant to a particular subject are never stored in a structured way and are stored as plain text instead, e.g. as HTML pages available via the World Wide Web. In addition, not all relevant information on data objects may be explicitly present in the database; some of it may only be implicitly derivable from the connectivity of document data objects relative to each other.
Plain text documents represent the other end of the continuum: natural language text is, although semantically rich, highly unstructured. It requires sophisticated natural language processing methods to enable a computer to extract meaningful information from plain text and to efficiently rank the relevance of text documents based on the plain text information. Due to these difficulties, methods trying to rank such highly unstructured documents often abstain from analyzing the documents syntactically or semantically and rather rely on evaluating topological properties of the network of documents. The topological information consists of links, e.g. citations. Such links are usually directed. Commonly, links are established by a document, the ‘source document’, citing one or multiple other documents, here referred to as ‘destination documents’.
A data object representing a document may comprise additional meta-information. The meta-information comprises additional information on the document and may include pointers connecting the document data object to other document data objects, the pointers thereby acting as links.
In the following, the term ‘linkage information’ will be used to denote information on which document data object is linked to which other document data object. Links may be stored separately from the linked data objects, or may be contained in the plain-text section or the meta-information section of the source document data object, the destination document data object or both. Well-known examples of links within plain-text sections of documents are hyperlinks, e.g. URL hyperlinks. A hyperlink is a reference to a document or a text section that the user can directly follow, e.g. by clicking on an icon or a text phrase providing the hyperlink functionality (the hypertext).
Linkage information has been used to determine the relevance of documents, in particular of documents that have little meta-information and lack a common, semantically rich data structure that would allow a more advanced way of quantifying the relevance of the documents represented by the data objects examined.
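As a minimal sketch of linkage information stored separately from the linked data objects, the following Python example keeps directed links as (source, destination) pairs and derives two lookup tables from them. The function name `build_link_index` and the document identifiers are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def build_link_index(links):
    """Build forward and backward lookup tables from (source, destination) pairs."""
    cites = defaultdict(set)      # source document -> destination documents it cites
    cited_by = defaultdict(set)   # destination document -> source documents citing it
    for source, destination in links:
        cites[source].add(destination)
        cited_by[destination].add(source)
    return cites, cited_by


# Hypothetical linkage information: each pair is one directed link.
links = [("doc_a", "doc_b"), ("doc_a", "doc_c"), ("doc_b", "doc_c")]
cites, cited_by = build_link_index(links)

print(sorted(cites["doc_a"]))     # destination documents of doc_a
print(sorted(cited_by["doc_c"]))  # source documents citing doc_c
```

Keeping both directions makes it cheap to ask which documents a source document cites and which documents cite a given destination document.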
A method described in U.S. Pat. No. 7,058,628, also known as Google's ‘PageRank algorithm’, assigns importance ranks to nodes in a linked database, such as any database of documents containing citations or the World Wide Web. The rank assigned to a document is calculated from the ranks of the documents citing it. The calculation also includes a constant representing the probability that a browser through the database will randomly jump to the document.
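The rank calculation outlined above can be illustrated with a short power-iteration sketch. It is not taken from the cited patent: the damping value of 0.85, the fixed number of iterations, the even redistribution of rank from documents without outgoing links, and the function name `page_rank` are all assumptions made for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks from directed (source, destination) links."""
    nodes = sorted({node for link in links for node in link})
    out_links = {node: [] for node in nodes}
    for source, destination in links:
        out_links[source].append(destination)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: rank held by documents without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        # Constant term: probability of randomly jumping to any given document.
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source in nodes:
            for destination in out_links[source]:
                # Each source passes an equal share of its rank to every document it cites.
                new_rank[destination] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical citation network: a and b cite c, and c cites a.
ranks = page_rank([("a", "c"), ("b", "c"), ("c", "a")])
for doc, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(doc, round(value, 3))
```

In this example document c is cited twice and ends up with the highest rank, while document b is cited by no other document and retains only the constant random-jump share.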
A further technique to retrieve, rank and display data objects is described in U.S. Pat. No. 7,376,649. Here, a global ranking value is assigned to a data object based on a combination of the object's link-based and text-based ranks. The ‘link-based’ rank is derived from a vector-space cluster analysis, whereas the ‘text-based’ rank is derived from text features such as word frequency.
US2008243813 describes a method and system for calculating the importance of documents based on transition probabilities from a source document to a target document.
One type of document of particular relevance for many companies and corporate consultants is the intellectual property document, e.g. patent documents, patent applications, utility patents and utility patent applications.
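How a link-based rank and a text-based rank are combined into one global ranking value is not specified here. Purely as an illustration, the following sketch assumes a weighted sum of two scores that are already normalized to the range 0 to 1. The function name `combined_rank`, the parameter `link_weight` and the example scores are hypothetical.

```python
def combined_rank(link_rank, text_rank, link_weight=0.5):
    """Combine two normalized scores into one value (assumed weighted sum)."""
    if not 0.0 <= link_weight <= 1.0:
        raise ValueError("link_weight must lie between 0 and 1")
    return link_weight * link_rank + (1.0 - link_weight) * text_rank


# Hypothetical (link-based, text-based) scores for three documents.
scores = {"doc_a": (0.8, 0.4), "doc_b": (0.3, 0.9), "doc_c": (0.5, 0.5)}
global_ranks = {doc: combined_rank(link, text, link_weight=0.7)
                for doc, (link, text) in scores.items()}

# Documents ordered by descending global ranking value.
ordering = sorted(global_ranks, key=global_ranks.get, reverse=True)
print(ordering)
```

A higher `link_weight` favors documents that are well connected, while a lower value favors documents whose text features score well.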
Various methods for evaluating the relevance of intellectual property documents are known. However, they have severe methodological shortcomings and lead to wrong or incomplete results. For example, Trajtenberg, M., 1990, describes in “A penny for your quotes: patent citations and the value of innovations”, published in the RAND Journal of Economics 21(1), obstacles arising from the use of patents in economic research. The obstacles are caused by the fact that patents vary enormously in their importance or value. Hence, simple patent counts cannot be informative about the innovative output of a company. Trajtenberg proposes to weight the patent counts by citations as indicators of the value of innovations, thereby overcoming the limitations of simple counts.
Hall, B. H., A. Jaffe, et al., 2005, explore in “Market Value and Patent Citations”, published in the RAND Journal of Economics 36(1): 16-38, the usefulness of patent citations as a measure of the “importance” of a firm's patents. Hall et al. come to the conclusion that each extra citation per patent boosts the firm's market value by 3%.
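The difference between a simple patent count and a citation-weighted count can be made concrete with a small sketch. The linear weighting used here, in which each patent counts once plus once per citation received, is one simple scheme assumed for illustration. The function names and the citation figures are hypothetical.

```python
def simple_patent_count(citations_per_patent):
    """Count patents without regard to their importance."""
    return len(citations_per_patent)


def weighted_patent_count(citations_per_patent):
    """Count each patent once plus once per citation received (assumed weighting)."""
    return sum(1 + citations for citations in citations_per_patent)


# Hypothetical portfolios: citations received by each patent of a company.
company_x = [0, 0, 0, 0]   # four patents, never cited
company_y = [12, 3]        # two patents, cited 15 times in total

print(simple_patent_count(company_x), weighted_patent_count(company_x))
print(simple_patent_count(company_y), weighted_patent_count(company_y))
```

By simple count, company X appears more innovative with four patents against two. Under the citation weighting, company Y scores 17 against 4, which reflects the greater attention its patents have received.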
Harhoff, D., F. M. Scherer, et al., 2003, describe in “Citations, family size, opposition and the value of patent rights” published in Research Policy 32(8), 1343-1363 that the number of citations a patent receives is positively related to its value. References to the non-patent literature are informative only in some particular technology fields. Patents which are upheld in opposition and annulment procedures and patents representing large international patent families are particularly valuable.
US 20070073748 describes a method for probabilistically quantifying a degree of relevance between two or more citationally or contextually related data objects, such as patent documents, non-patent documents or web pages. The relevance between the data objects is visualized by using iterative self-organizing maps (“SOM”), which generate a visual map of relevant patents to be explored, searched or analyzed.
U.S. Pat. No. 5,991,751 describes a data processing system maintaining first databases of patents and second databases of non-patent information of interest to a corporate entity. The system also maintains one or more groups comprising any number of the patents from the first databases. The system processes the patents in one of the groups in conjunction with non-patent information. Accordingly, the system performs patent-centric and group-oriented processing of data. A group can also include any number of non-patent documents. The groups may be product based, person based, corporate entity based, or user-defined. Other types of groups are also covered, such as temporary groups.
U.S. Pat. No. 6,556,992 provides a statistical patent rating method and system for independently assessing the relative breadth, defensibility and commercial relevance of individual patent assets and other intangible intellectual property assets. The rating method provides means for patent valuation by experts, investment advisors, economists and others to help guide future patent investment decisions. The document describes a statistically based patent rating method and system whereby relative rankings are generated using a database of patent information by identifying and comparing various characteristics of each individual patent to a statistically determined distribution of the same characteristics within a given patent population.
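Comparing a characteristic of one patent to the distribution of the same characteristic within a patent population can be illustrated with a percentile rank. This is a minimal sketch, not the rating model of the cited patent. The chosen characteristic (number of claims), the function name `percentile_rank` and the sample values are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(value, population):
    """Return the percentage of population values that are less than or equal to value."""
    ordered = sorted(population)
    if not ordered:
        raise ValueError("population must not be empty")
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical characteristic: number of claims per patent in a population.
claims_in_population = [5, 8, 10, 12, 15, 20, 22, 30]

# Relative position of a patent with 20 claims within that population.
rank_of_patent = percentile_rank(20, claims_in_population)
print(rank_of_patent)
```

A relative ranking of this kind depends entirely on the population chosen for comparison, so the same patent can rank differently against a different population.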