This invention relates to a system for enumerating local alignments, which are pairs of character strings that are similar between two documents, and more particularly, to a system for enumerating local alignments using the Smith-Waterman method.
Long documents are rarely similar in their entirety, but often have partially similar parts. Take similarity between books as an example. Two books often have more than one similar part. When matches as short as words consisting of several characters are also counted, the number of similar parts between the books may be very large. The similar parts between two documents (pairs of similar character strings) are called local alignments. When the local alignments can be enumerated, the grounds for the similarity between the two documents can be grasped just by reading the portions around the local alignments, as opposed to reading the two entire documents.
As an example, in patent examination or the like, identity and similarity of contents need to be judged between the application to be examined and a patent document or a non-patent document. When local alignments between the documents to be judged can be enumerated, the identity and similarity between the target documents can be judged just by reading the portions around the local alignments, as opposed to reading the entire documents, which facilitates the examination process.
In similarity search, when a character string is input, documents similar to the input character string are ranked and presented in order of similarity. In this case, a user can examine the documents that are likely to be relevant to the input character string in order from the top of the ranking. However, it is often the case that the grounds for the ranking are hard to understand, and hence the user needs to read the presented document itself in order to judge the relevance between the input character string and the presented document. When the document is long, the time needed to read the document is also long.
On the other hand, in full text search (based on exact string match), the labor of reading the entire document is reduced by presenting the portion around the character string that matches the input character string.
Therefore, also in the similarity search, by enumerating similar parts (local alignments) between the input character string and the document relevant to the input character string and presenting the enumerated local alignments, the relevance of the document can be judged without reading the entire document.
Further, when the local alignments are enumerated between claims and the specification of the patent application, an embodiment corresponding to a claim can be found at once.
A relevant art for enumerating the local alignments is the Smith-Waterman method (“Algorithms on Strings, Trees, and Sequences” (pp. 232-234), Gusfield, D., Cambridge University Press, 1997). The Smith-Waterman method efficiently searches for the local alignment having the maximum score by dynamic programming. As used herein, the term “score” refers to the similarity between partial character strings.
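The dynamic programming described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited reference: the function names and the scoring values (match 2, mismatch -1, linear gap penalty -1) are assumptions chosen for the example.

```python
def score_matrix(a, b, match=2, mismatch=-1, gap=-1):
    """Build the local-alignment score matrix H for strings a and b.

    H[i][j] is the best score of any local alignment ending at
    a[i-1] and b[j-1]; scores are clamped at 0 so that an alignment
    can start anywhere. Scoring values are illustrative assumptions.
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H


def best_local_alignment(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, substring of a, substring of b) for the top-scoring cell."""
    H = score_matrix(a, b, match, mismatch, gap)
    best, end_i, end_j = 0, 0, 0
    for i, row in enumerate(H):
        for j, value in enumerate(row):
            if value > best:
                best, end_i, end_j = value, i, j
    # Trace back from the maximum cell until a zero score is reached.
    i, j = end_i, end_j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]
```

For example, `best_local_alignment("xxabcdyy", "zzabcdww")` finds the shared part `"abcd"` in both strings, scored as four matches under the assumed values.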
By enumerating the local alignments having scores that are equal to or larger than a predetermined value from a score matrix generated in the Smith-Waterman method, more local alignments can be enumerated exhaustively. However, in this method, whether or not a portion is a local alignment is judged based solely on the score, and hence a large number of similar local alignments are enumerated around the local alignments that have already been enumerated. Therefore, it is necessary to select only representative ones of the local alignments. In other words, both the representativeness and the exhaustiveness need to be satisfied.
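The redundancy caused by score-only enumeration can be observed with a small sketch. The code below is an illustration under assumed scoring values (match 2, mismatch -1, gap -1) and hypothetical function names. It lists every cell of the score matrix whose score is equal to or larger than a threshold, each of which would be reported as the end point of a local alignment.

```python
def score_matrix(a, b, match=2, mismatch=-1, gap=-1):
    """Local-alignment score matrix (scoring values are assumptions)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H


def cells_at_or_above(H, threshold):
    """Return (i, j, score) for every cell with score >= threshold."""
    return [(i, j, value)
            for i, row in enumerate(H)
            for j, value in enumerate(row)
            if value >= threshold]


if __name__ == "__main__":
    H = score_matrix("xxabcdyy", "zzabcdww")
    for i, j, value in cells_at_or_above(H, 6):
        print(i, j, value)
```

Although the two example strings share only one similar part, a threshold of 6 selects the peak cell together with several neighbouring cells whose scores have decayed only slightly from the peak. Each of these corresponds to nearly the same alignment, which is why a selection of representative alignments is needed.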
JP 2004-038329A describes a method for improving the efficiency of enumerating the local alignments in the Smith-Waterman method while suppressing, as much as possible, the reduction in accuracy of the enumeration. Specifically, pairs of exactly matching character strings are enumerated, and character strings within a predetermined gap in the enumerated pairs of character strings are connected.
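The general idea of enumerating exact matches and connecting those within a predetermined gap can be sketched as follows. This is only a rough illustration of the idea as summarized above, not the procedure of JP 2004-038329A: the function names, the minimum match length `min_len`, the gap limit `max_gap`, and the greedy merge with the most recent match are all assumptions made for the example.

```python
def exact_matches(a, b, min_len=3):
    """Enumerate maximal exact matches as (a_start, b_start, length).

    A brute-force scan is used for clarity; min_len is an assumed
    parameter that filters out very short matches.
    """
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Skip positions that only continue a match started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


def connect(matches, max_gap=2):
    """Greedily connect matches separated by at most max_gap characters.

    Returns (a_start, b_start, a_length, b_length) tuples. Each match is
    compared only with the most recently connected region, which is a
    simplification made for this sketch.
    """
    merged = []
    for i, j, k in sorted(matches):
        if merged:
            pi, pj, a_len, b_len = merged[-1]
            gap_a = i - (pi + a_len)
            gap_b = j - (pj + b_len)
            if 0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap:
                merged[-1] = (pi, pj, i + k - pi, j + k - pj)
                continue
        merged.append((i, j, k, k))
    return merged
```

With the inputs `"the quick brown fox"` and `"the quick brawn fox"`, the two exact matches on either side of the single differing character are connected into one region covering both strings.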