Over the past decade "database" has transitioned from an application used by a relatively small number of users in highly structured corporate data processing environments, to one at the center of mainstream computing. This occurred in large part because of the decade's striking advances in connectivity. The mid-1980s emphasis on local area networks has given way to the worldwide Internet. At the same time the set of computer users accessing databases has grown from a somewhat homogeneous and geographically localized collection to a highly diverse group spanning the globe and speaking many languages.
The present invention is a software method, performed by software instructing a computer, which addresses a central problem that is emerging as a result of these changes. That is, the problem of "robust semistructured text retrieval" for small and medium size databases. The crux of the invention resides in a function that compares two text strings returning a numerical indication of their similarity. Typically one of these strings is the user's query and the other is a string from the database. Because this function is very fast it is possible to compare the query with thousands or even hundreds of thousands of database fields while still delivering acceptable response time. Also included is a high-speed heap data structure used to track the "best matches" encountered during a search.
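The "best matches" bookkeeping described above can be illustrated with a bounded min-heap. The following is only a minimal sketch, not the heap structure of the preferred embodiment: the names best_matches and shared_letters are hypothetical, and shared_letters is a toy stand-in for the comparison function, which is not reproduced here.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def best_matches(
    query: str,
    fields: Iterable[str],
    similarity: Callable[[str, str], float],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Return the k highest-scoring fields, best first (hypothetical helper)."""
    if k <= 0:
        return []
    # Min-heap of (score, index, field): the weakest kept match sits at heap[0].
    heap: List[Tuple[float, int, str]] = []
    for index, field in enumerate(fields):
        entry = (similarity(query, field), index, field)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new field beats the weakest kept match, so evict that match.
            heapq.heapreplace(heap, entry)
    ranked = sorted(heap, key=lambda e: (-e[0], e[1]))
    return [(score, field) for score, _, field in ranked]


def shared_letters(a: str, b: str) -> float:
    """Toy stand-in similarity (an assumption): distinct characters in common."""
    return float(len(set(a) & set(b)))
```

Because the heap never holds more than k entries, each database field costs at most O(log k) beyond the comparison itself, so the tracking overhead stays small even when hundreds of thousands of fields are scanned.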
Semistructured text lies between fully structured databases and unconstrained text streams. A fully structured database might, for example, represent a person's name as many fields corresponding to its parts. The semistructured approach represents the name in a less formal way using a single text field. Other examples of semistructured text fields are: addresses, item names or descriptions (as in an online catalog), book or paper titles, company or institution names. Several such fields might be combined into one. For example, the query might be "problmoptimldictionry" and the record is Anderson, Optimal Bounds on the Dictionary Problem LNCS, 401, 1989.
All three of the words in the query above are misspelled and occur in the wrong order, and no spaces separate them. Nevertheless, the desired record is identified, using a preferred embodiment of the invention, from a listing of 50,360 paper descriptions in the field of theoretical computer science found in J. Seiferas' "A large Bibliography on Theory/Foundations of Computer Science," at ftp://ftp.cs.rochester.edu, 1996-7. Author names, paper title, and related information are combined into a single database text field.
Considerable variation is possible in the description of an item using a semistructured field. A person's last name might be listed first or last. The middle name might be excluded or abbreviated. The ordering of a complex name's parts is not always well determined. In principle, a policy might be established to regularize representation but, in practice, such policies rapidly become complex and confusing. Instead, the problem of directly dealing with these variations is handled by increasing the sophistication of the software that is used to compare queries with semistructured fields. Similar variations occur in user queries where the problem is perhaps greater.
An important benefit of the invention is that the queries are simple free-form expressions of what the user is looking for. There is no query language, and the comparison function is rather robust with respect to typical errors, missing or extra information, and overall ordering. Also, a preferred embodiment of the invention includes no natural-language specific considerations. It operates on byte strings and as such may be used across languages and perhaps for applications that have nothing to do with language (such as DNA comparison).
Using a 200 MHz Pentium-Pro processor and the preferred embodiment of the invention, processing one byte of database information typically requires roughly 0.5 µs. So 100,000 fields of 30 characters (3,000,000 bytes) can be processed in about 1.5 seconds. The invention is in some sense a fourth generation implementation of this general approach.
Algorithms of the general type used in the present invention were introduced in the master's thesis of P. N. Yianilos entitled "The definition, computation and application of symbol string similarity functions," Emory University, Department of Mathematics, 1978, and were later used in the commercial spelling correctors of Proximity Technology Inc., and Franklin Electronic Publishers. The linguistic software components of these companies were ultimately used under license in word processing programs from hundreds of publishers, in typewriters, and in tens of millions of hand-held spelling devices.
The PF474 VLSI chip was a special purpose pipelined processor that implemented such an algorithm. The chip is described in an article entitled "A dedicated comparator matches symbol strings fast and intelligently," in Electronics Magazine, McGraw-Hill, December 1983, in an article by S. Rosenthal entitled "The PF474--a coprocessor for string comparison," in Byte Magazine, 1984, and in U.S. Pat. No. 4,490,811 by Yianilos and Buss entitled "String Comparator Device Systems Circuit and Method." Today's software matches and even exceeds the performance of this device, although the comparison is not entirely fair since the PF474 was clocked at only 4 MHz. The same design implemented today would still result in a hardware advantage of one to two orders of magnitude.
The Friendly Finder software utility, first introduced in 1987 by Proximity Technology, Inc. and described by M. J. Miller in an article entitled "First look--friendly program doesn't need exact match to find database search objects," in InfoWorld, 1987, implemented the algorithm together with software accelerations and special treatment for bigrams. As a result, small databases could be searched on early personal computers without using the PF474 chip. The computational heart of Friendly Finder was also made available under license and called "P2."
A transition to the bipartite matching viewpoint took place with two works by Buss and Yianilos: an article entitled "Linear and O(n log n) time minimum-cost matching algorithms for quasi-convex tours," in the Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, 1994, pp. 65-76, and a report entitled "A bipartite matching approach to approximate string comparison and search," technical report no. 95-193, from NEC Research Institute, Inc., Princeton, N.J. In this work the algorithms were both improved and in some cases simplified. The result is entirely new algorithms that are still of the same family.
The present invention is the first implementation based on these new developments. The algorithms of Yianilos and Buss lead to linear time algorithms for a large class of graph cost functions, including the simple linear costs used by LIKEIT, a software system implementing the multistage method forming the present invention. Linear time matching algorithms for this particularly simple special case were first presented in an article by R. M. Karp and S.-Y. R. Li entitled "Two special cases of the Assignment Problem" in Discrete Mathematics, 13 (1975), pp. 129-142.
A portion of the disclosure of this patent document (LIKEIT SOFTWARE) contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
An alternative approach to string comparison computes "edit distance," that is, the minimum-cost transformation of one string into the other via some elementary set of operations. The approach is described in an article by Hall and Dowling entitled "Approximate String Matching," in Computing Surveys, 12, 1980, pp. 381-402, and in a work by Sankoff and Kruskal entitled "Macromolecules: The Theory and Practice of Sequence Comparison," Addison-Wesley, 1983. The most common form of the approach uses weighted insertion, deletion, and substitution operations, and the distance computation is a straightforward dynamic program. Two problems with this approach led to the present invention. First, the algorithm runs in O(m·n) time, where m and n are the string lengths, whereas the present method runs in O(m+n). Second, the edit distance approach is highly sensitive to global permutation, e.g. changing word order. Humans frequently are not so sensitive, and the invention deals well with this issue.
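For reference, the dynamic program mentioned above may be sketched as follows. This is the textbook weighted edit distance, shown only to make the O(m·n) cost and the word-order sensitivity concrete; it is not the method of the invention, and the function name and default unit weights are assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str, insert: int = 1, delete: int = 1,
                  substitute: int = 1) -> int:
    """Minimum total cost of turning string a into string b.

    Fills an (m+1) x (n+1) table one row at a time, so the running time is
    O(m*n) while only two rows are held in memory.
    """
    # Row 0: building the first j characters of b from the empty string.
    prev = [j * insert for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * delete]  # column 0: deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + delete,                               # drop ca
                curr[j - 1] + insert,                           # add cb
                prev[j - 1] + (0 if ca == cb else substitute),  # match or replace
            ))
        prev = curr
    return prev[-1]
```

With unit weights, edit_distance("kitten", "sitting") is 3. Swapping the two words of "dictionary problem" to give "problem dictionary" costs considerably more than that, even though a human reader would regard the two phrases as nearly equivalent.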
A similar approach, also used in effect by Friendly Finder, is to build an optimal weighted matching of the letters and multigraphs in the query, and those in each database record. Words receive no special treatment. In this sense it is related to the document retrieval approach of M. Damashek, in an article entitled "Gauging similarity with n-grams: Language-independent categorization of text," in Science, 267, 1995, pp. 843-848, and of S. Huffman and M. Damashek, in an article entitled "Acquaintance: a novel vector-space n-gram Technique for Document Categorization," in Proc. Text Retrieval Conference (TREC-3), Washington, D.C., 1995, NIST, pp. 305-310.
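The idea of comparing letters and multigraphs without regard to words can be illustrated with a much simpler stand-in. The sketch below is an assumption made for illustration: it counts shared unigrams and bigrams and reports a Dice-style overlap, ignoring positions entirely. It does not compute an optimal weighted matching, and the names ngram_profile and ngram_similarity are hypothetical.

```python
from collections import Counter


def ngram_profile(text: str, n_max: int = 2) -> Counter:
    """Count every substring of length 1..n_max (letters and multigraphs)."""
    s = text.lower()
    profile: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(s) - n + 1):
            profile[s[i:i + n]] += 1
    return profile


def ngram_similarity(query: str, field: str, n_max: int = 2) -> float:
    """Dice-style overlap of two n-gram profiles, in the range 0.0 to 1.0."""
    q = ngram_profile(query, n_max)
    f = ngram_profile(field, n_max)
    shared = sum((q & f).values())  # multiset intersection keeps minimum counts
    total = sum(q.values()) + sum(f.values())
    return 2.0 * shared / total if total else 0.0
```

Because words receive no special treatment, reordering them disturbs only the few n-grams that straddle a word boundary: "dictionary problem" and "problem dictionary" score above 0.9 under this measure.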
The automaton based approach to fast string matching introduced in an article by Knuth, Morris and Pratt entitled "Fast pattern matching in strings," in SIAM Journal on Computing, 6, 1977, pp. 323-350, deals with exact matches only. A natural generalization relaxes the requirement of exact equality and allows a bounded (and in practice small) number of errors. Each such error is typically restricted to be either an insertion, deletion, substitution, or sometimes a transposition of adjacent symbols. Given a query string, it is then possible to build an automaton to detect it, or any match within the error bounds, within a second string. The recent work of Manber and Wu, in an article entitled "GLIMPSE: A tool to search through entire file systems," in Proceedings of the Winter 1994 USENIX Conference, 1994, pp. 23-32 and in an article entitled "Fast text searching allowing errors," in Communications of the ACM, 35, 1993, pp. 83-91, demonstrates that text can be scanned at very high speeds within this framework for comparison. The present invention's framework can satisfy queries that do not fall within the practical capabilities of the automaton approach because they are too different from the desired database record. The invention and related approaches are important and effective tools for medium-size textual databases that are still small enough to scan in their entirety for each query.
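To make the exact-match limitation concrete, the following is a minimal sketch of the Knuth, Morris and Pratt technique in its usual textbook form. The function names are chosen here for illustration, and the error-tolerant generalizations discussed above are not shown.

```python
from typing import List


def failure_table(pattern: str) -> List[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(text: str, pattern: str) -> int:
    """Index of the first exact occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

The scan never backs up in the text, so it runs in time linear in the combined lengths of text and pattern. A single misspelling defeats it, however: searching the title "Optimal Bounds on the Dictionary Problem" for "Dictionry" returns -1.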