There is an old adage that states, “Knowledge is power.” It is perhaps not surprising, then, that records of varying kinds have been kept for thousands of years. From writings or paintings on walls in caves to widely disseminated paper encyclopedias to the vast library of resources available over global networks like the Internet, there has always been a desire to capture information in some form, for both instrumental reasons and because knowledge is an intrinsic good that has value in itself.
As more and more information is recorded, however, the challenge of how to organize and search through it also grows. In many instances, it is desirable to search through documents that may or may not be related to one another, e.g., to uncover related aspects that may be of some relevance. Literature searches are quite common, for example, in science, social science, and other endeavors.
Another area where information retrieval comes into play is in the intellectual property domain. For instance, prior art searches for freedom to operate, invalidity, clearance, and/or other reasons have become a part of everyday business for IP departments at private companies, law firms, and other firms that offer such services to others. Prior art searches sometimes come up during the processing/handling of patent ideas generated in a company, when a company has been accused of infringement, when a company is considering whether to undertake a new endeavor, etc.
Some IP professionals use commercially available search systems like the Derwent Patent Research Database by Thomson Innovation, or publicly available search sites like the offerings by USPTO or the European Patent Office. Other tools have come online in recent years including Google Patent Search.
These sources deal almost exclusively with the texts of patents and patent applications. The search engines can therefore tailor their search facilities to the well-known structures of those documents, e.g., to search through sections or headers like “classification”, “inventor”, “claims”, etc. Recently, the Derwent engine has added less-structured texts from major regular scientific publications of larger scientific organizations, like the IEEE.
It may be strategically unwise to rely exclusively on prior art searches through either patent texts or limited scientific texts. Thus, for a mixture of well-structured patent texts and less-structured scientific texts, search techniques have been developed that are based on working with texts on a semantic basis. At first blush, the Latent Semantic Indexing (LSI) technique seems to be well suited to performing such searches. See, for example, S. Deerwester, et al., “Improving Information Retrieval with Latent Semantic Indexing”, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40, the entire contents of which are hereby incorporated herein by reference. See also U.S. Pat. No. 4,839,853, the entire contents of which are hereby incorporated herein by reference. LSI is increasingly being used for electronic document discovery (e-discovery) to help enterprises prepare for litigation. LSI's ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis is of value in e-discovery.
Searching by currently available mechanisms typically is limited to either well structured data (like patents) or at least to scientific documents that have been pre-classified a priori, perhaps by applying LSI for semi-automated semantic categorization.
For the institutions that operate professionally and/or scientifically in this domain, the thousands of pages of non-scientific, non-structured, non-legal texts typically published by companies seem uninteresting, as the texts appearing in manuals, marketing brochures, and company web pages typically use neither “patentese” nor scientific jargon, but rather plain English or company lingo, especially when it comes to marketing slogans. Owners of databases of well-maintained bodies of patent texts or classified scientific texts may be afraid that their assets would be “spoiled” by including poorly classified text bodies whose words vary in meaning over the years. In addition, the expected data volume would take up precious space needed for more “reasonable” texts, and could also cause problems in accessing company-specific texts like brochures, manuals, and the like. Thus, oftentimes there is little to no motivation for professional or scientific institutions to provide mechanisms to add the “poor relatives” of scientific or legal texts to the general body of searchable assets.
Companies, however, and especially their legal departments and legal counsel, might have different demands. Due diligence (e.g. in cases of litigation or freedom to operate analysis) may require a cross-domain search, involving the commercially uncharted territory of one's own or one's competitors' marketing brochures, manuals, conference speeches, etc. As indicated above, however, those texts typically are outside of the scope of the above-mentioned approaches, as the company-specific requirements cannot normally be generalized to be of use for the general public.
Thus, it will be appreciated by those skilled in the art that there is a need for improved techniques of searching through potentially large collections of structured and unstructured documents to retrieve information of potential interest.
One aspect of certain example embodiments relates to techniques for comparing and clustering text documents in a database with a text retrieval component.
Another aspect of certain example embodiments relates to a system for the automatic analysis of structured and/or unstructured documents. Structured and/or unstructured documents may be input into a database, and a text retrieval module may be in communication therewith. Analysis rules may be input and stored. An analysis module may be provided for performing text comparisons on the documents residing in the database, e.g., using the stored rules. The results of such analysis may be displayed to a user according to user parameters.
In certain example embodiments, a method for analyzing documents is provided. A plurality of documents and/or document portions is imported into a database, with at least some of the documents and/or document portions being structured and with at least some of the documents and/or document portions being unstructured. The imported documents and/or document portions are organized into one or more collections. A selection of at least one of said one or more collections is received (e.g., via a user interface). An index of words and/or groups of words is built based on each said document or document portion in each said selection. A document-word matrix, including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection, is built. One or more clusters of documents is generated, via at least one processor, using the document-word matrix.
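By way of illustration only, the index-building and matrix-building steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names (`tokenize`, `build_index`, `build_document_word_matrix`), the simple alphanumeric tokenizer, and the two sample documents are all hypothetical and do not appear in the description.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Return a sorted list of the distinct words found across the selected documents."""
    return sorted({word for text in documents.values() for word in tokenize(text)})


def build_document_word_matrix(documents, index):
    """Map each document id to a row of counts, one count per word in the index."""
    matrix = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        matrix[doc_id] = [counts[word] for word in index]
    return matrix


# Hypothetical selection mixing a structured and an unstructured asset.
documents = {
    "patent_1": "A clustering engine clusters documents.",
    "brochure_1": "Our engine finds documents fast.",
}
index = build_index(documents)
matrix = build_document_word_matrix(documents, index)
```

Each row of `matrix` holds the number of times each indexed word appears in the corresponding document, which is the raw input that a subsequent clustering step could consume.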
According to certain example embodiments, a weighted document-word matrix, including a weighted value indicative of a number of times each said word in the index of words appears in each said document or document portion in each said selection, is built, with the weighting being performed in accordance with a semantics-based algorithm. In such a case the one or more clusters of documents may be generated using the weighted document-word matrix. According to certain example embodiments, the semantics-based algorithm involves Latent Semantic Indexing (LSI).
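The description names LSI but does not fix a particular weighting formula. As one hedged illustration, the sketch below applies log-entropy weighting, a scheme commonly paired with LSI; its use here is an assumption, as is the function name `log_entropy_weight`. A full LSI pipeline would typically also apply a truncated singular value decomposition to the weighted matrix, which is not shown.

```python
import math


def log_entropy_weight(matrix):
    """Weight a documents-by-words count matrix with log-entropy weights.

    Assumed scheme: local weight log(1 + count), global weight
    1 + sum_j(p_ij * log(p_ij)) / log(n_docs) for each word.
    """
    n_docs = len(matrix)
    n_words = len(matrix[0]) if matrix else 0
    global_weights = []
    for j in range(n_words):
        total = sum(row[j] for row in matrix)
        entropy = 0.0
        for row in matrix:
            if row[j] > 0:
                p = row[j] / total
                entropy += p * math.log(p)
        # With a single document the entropy term is undefined; fall back to 1.0.
        if n_docs > 1:
            global_weights.append(1.0 + entropy / math.log(n_docs))
        else:
            global_weights.append(1.0)
    return [
        [global_weights[j] * math.log(1 + row[j]) for j in range(n_words)]
        for row in matrix
    ]


# Word 0 is spread evenly over both documents; word 1 appears in only one.
counts = [[2, 1], [2, 0]]
weighted = log_entropy_weight(counts)
```

Under this scheme, a word distributed evenly across all documents receives a global weight of zero, whereas a word concentrated in a single document keeps its full local weight, so discriminating words dominate the later clustering.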
According to certain example embodiments, (a) the index of words and/or groups of words prior to the building of the document-word matrix and/or (b) the document-word matrix, is/are refined, with the refining being performed in accordance with one or more predefined rules that may be stored in a database of rules. The one or more predefined rules may include rules, for example, defining semantic and/or non-semantic stopwords and specifying how the defined semantic and/or non-semantic stopwords are to be handled; for standardizing transliterations or suspected transliterations; and/or for applying a stemming algorithm to reduce the size of the index of words and/or groups of words prior to the building of the document-word matrix and/or the document-word matrix.
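The three kinds of rules mentioned above might, purely as an example, be applied to a word index as sketched below. The stopword set, the transliteration table, and the crude suffix stripper are hypothetical stand-ins; in practice the rules would be read from the database of rules, and an established stemming algorithm would likely replace `crude_stem`.

```python
import unicodedata

# Hypothetical rule data; real rules would come from the rules database.
STOPWORDS = {"the", "a", "of", "and"}
TRANSLITERATIONS = {"mueller": "muller"}


def normalize_transliteration(word):
    """Strip diacritics, then map known spelling variants onto one standard form."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return TRANSLITERATIONS.get(stripped, stripped)


def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def refine_index(index):
    """Apply transliteration, stopword, and stemming rules to a word index."""
    refined = set()
    for word in index:
        word = normalize_transliteration(word.lower())
        if word in STOPWORDS:
            continue  # this sketch simply drops stopwords
        refined.add(crude_stem(word))
    return sorted(refined)


refined = refine_index(["the", "Müller", "mueller", "clusters", "clustering", "cluster"])
```

In this sketch six index entries collapse to two, which shows how refinement can shrink the index before the document-word matrix is built.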
According to certain example embodiments, the same index or separate indexes may be built for words and for groups of words (e.g., for word pairs). The word pairs may include a given word and one word immediately to the left of the given word.
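As a non-limiting illustration, word pairs formed from a given word and the one word immediately to its left could be collected as in the sketch below. The function name `left_neighbor_pairs`, the tokenizer, and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def left_neighbor_pairs(text):
    """Pair every word (except the first) with the word immediately to its left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(words[i - 1], words[i]) for i in range(1, len(words))]


# Counting pairs yields entries for a separate word-pair index.
pairs = left_neighbor_pairs("semantic indexing and semantic indexing tools")
pair_counts = Counter(pairs)
```

The resulting counts could be kept in their own index or merged into the single-word index, consistent with either option described above.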
According to certain example embodiments, linkages between documents in a given cluster may be removed when threshold values indicative of degrees of similarity between two documents are not met. Clustering may be based on a cosine similarity calculation in certain example instances. In some cases, the threshold value may be user specified (e.g., at 80%).
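One way the cosine similarity calculation and threshold could interact is sketched below, with the caveat that it rests on assumptions: documents are linked only when their similarity meets the threshold, and clusters are taken to be the connected groups of linked documents. The description does not specify the grouping strategy, and the names `cosine_similarity` and `cluster_documents` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_documents(vectors, threshold=0.8):
    """Group documents whose pairwise similarity meets the threshold (assumed strategy)."""
    links = {doc: set() for doc in vectors}
    for a, b in combinations(vectors, 2):
        # Pairs below the threshold get no linkage at all.
        if cosine_similarity(vectors[a], vectors[b]) >= threshold:
            links[a].add(b)
            links[b].add(a)
    clusters, seen = [], set()
    for doc in vectors:
        if doc in seen:
            continue
        stack, component = [doc], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(links[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical rows of a (weighted) document-word matrix.
vectors = {"a": [1, 1, 0], "b": [2, 2, 0], "c": [0, 0, 3], "d": [1, 0, 0]}
strict = cluster_documents(vectors, threshold=0.8)
loose = cluster_documents(vectors, threshold=0.7)
```

Lowering the threshold from 80% to 70% lets document `d` join the cluster containing `a` and `b`, illustrating how a user-specified threshold controls which linkages survive.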
In certain example embodiments, a method for analyzing documents is provided, with the method comprising: organizing a plurality of assets including structured and unstructured documents and/or document portions into a plurality of user-defined collections; enabling a user to select one or more of said user-defined collections for subsequent analysis; building an index of words and/or groups of words based on the documents and document portions in each said selected collection; building a document-word matrix including a value indicative of a number of times each said word and/or groups of words in the index of words and/or groups of words appears in each said document or document portion in each said selection; refining the index of words and/or groups of words, and/or the document-word matrix based on predefined rules stored in a database of rules; weighting entries in the document-word matrix based on a semantic indexing approach; and clustering together, via at least one processor, documents in dependence on the weighted entries.
In certain example embodiments, an asset analysis system is provided. A database is configured to store a plurality of imported assets in one or more collections, with the assets being documents and/or document portions, where at least some of the documents and/or document portions are structured and at least some are unstructured. An asset splitting module, executable via at least one processor, is configured to split the documents and/or document portions into sub-documents and/or sub-document portions automatically and/or based on user input, and store generated sub-documents and/or sub-document portions as assets in the database. A user interface is configured to enable a user to select one or more collections of assets for analysis. An index builder, under control of the at least one processor, is configured to access from the database assets belonging to the one or more selected collections and generate a word and/or group of words index, with the word and/or group of words index including a listing of words appearing in the accessed assets. A rules database is configured to store a plurality of user-defined rules for refining the word and/or group of words index and/or the document-word matrix. A matrix builder, under control of the at least one processor, is configured to build a document-word matrix including a value indicative of a number of times each said word and/or word pair in the index of words and/or groups of words appears in each said accessed asset. An index refining module, under control of the at least one processor, is configured to refine the word and/or group of words index and/or the document-word matrix, based on rules stored in the rules database. A weighting engine, under control of the at least one processor, is configured to weight entries in the document-word matrix based on a semantic indexing approach. A clustering engine, under control of the at least one processor, is configured to cluster together documents in dependence on the weighted entries.
Non-transitory computer readable storage mediums tangibly storing instructions for performing the above-summarized and/or other methods also are provided by certain example embodiments.
These aspects and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.