The present invention concerns a search system for information retrieval, particularly information stored in the form of text, wherein a text T comprises words and/or symbols s and sequences S thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises words and/or symbols q and sequences P thereof, and retrieved information R comprising words and/or symbols and sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query Q and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw; and a method in a search system for information retrieval, particularly information stored in the form of text T, wherein a text T comprises words and symbols s and sequences S thereof, wherein the information retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises words and/or symbols q and sequences P thereof, and retrieved information R comprising words and/or symbols and sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and a metric M which measures the degree of matching between the query Q and retrieved information R, and wherein the search system implements search algorithms for executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the text T is divided into words and word sequences S, the words being substrings of the entire text separated by word boundary terms and forming a sequence of symbols, and wherein each word is structured as a sequence of symbols in the word forming sequence; and the use of the search system.
A tremendous amount of information in various fields of human knowledge is collected and stored in computer memory systems. As computer memory systems are increasingly linked in publicly available data communication networks, there has been an increasing effort to develop systems and methods for searching and retrieving information for public or personal use. Present search methods for data have, however, limitations that seriously reduce the possibility of efficiently retrieving and using information stored in this manner.
Information may be stored in the form of different data types, and in the context of information search and retrieval it is useful to distinguish between dynamic data and static data. Dynamic data is data that changes often and continuously, so that the set of valid data varies all the time, while static data changes only very seldom or never at all. For instance, economic data, such as stock values, or meteorological data are subject to very quick changes and are hence dynamic. On the other hand, books and documents in archival storage are usually permanent and hence static data. The concept of the volatility of data relates to how long the information is valid. The volatility of data has some bearing upon how the information should be searched and retrieved. Large volumes of data require some structure in order to facilitate searching, but the time cost of building such structures must not be higher than the time the data is valid. The cost of building a structure is dependent on the data volume, and hence the building of data structures for searching the information should take both the data volume and the volatility into consideration. The information collected is stored in databases, and these may be structured or unstructured. Moreover, the databases may contain several types of documents, including compound documents which contain images, video, sound and formatted or annotated text. Structured databases in particular are usually furnished with indexes in order to facilitate searching and retrieving the data. The growth of the World Wide Web (WWW) offers a steadily growing collection of compound and hyperlinked documents. A great many of these are not collected in structured databases, and no indexes facilitating rapid searching are available. However, the need for searching documents in the World Wide Web is obvious, and as a result a number of so-called search engines have been developed, enabling searching of at least parts of the information in the World Wide Web.
A search engine is commonly understood to be one or more tools for searching and retrieving information. In addition to the search system proper, a search engine also contains an index, for instance comprising text from a large number of uniform resource locators (URLs). Examples of such search engines are Alta Vista, HotBot with Inktomi technology, Infoseek, Excite and Yahoo. All these offer facilities for performing search and retrieval of information in the World Wide Web. However, their speed and efficiency by no means match the huge amount of information available on the World Wide Web, and hence the search and retrieval efficiency of these search engines leaves much to be desired.
Searching a large collection of text documents can usually be done with several query types. The most common query type is matching and variants of this. By specifying a keyword or set of keywords that has to be present in the queried information, the search system retrieves all documents that fulfil this requirement. The basic search method is based on so-called single keyword matching. The keyword p is searched for and all documents containing this word shall be retrieved. It is also possible to search for a keyword prefix pj, in which case all documents where this prefix is present in any keyword in the documents will be retrieved. Instead of searching with keywords, the search is sometimes based on so-called exact phrase matching, where the search uses several single keywords in a particular sequence. As is well known to persons skilled in the art, the exact matching of keyword phrases in many search systems may be done with the use of Boolean operators, for instance operators such as AND, OR, and NOT, which allow a filtering of the information; e.g. using an AND phrase results in all documents containing the two keywords linked by the AND operator being returned. A NEAR operator has also been used for returning just the documents with the keywords matching and located "near" to each other in the document text. In many structured databases the documents contained in the database have been annotated, e.g. provided with fields which denote certain parts or types of information in the document. This allows the search for matches in only parts of the documents and is useful when the type of queried information is known in advance.
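The matching query types described above can be illustrated with a minimal sketch. It assumes a simple in-memory inverted index over whitespace-separated, lowercased words; the function names `build_index`, `match_keyword` and `match_prefix` and the sample documents are hypothetical and are not taken from any of the search engines named in this text. The Boolean operators AND, OR and NOT are modelled here as set intersection, union and difference.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def match_keyword(index, keyword):
    """Single keyword matching: documents containing exactly this word."""
    return set(index.get(keyword.lower(), set()))


def match_prefix(index, prefix):
    """Prefix matching: documents containing any word starting with prefix."""
    prefix = prefix.lower()
    hits = set()
    for word, doc_ids in index.items():
        if word.startswith(prefix):
            hits |= doc_ids
    return hits


# Hypothetical sample collection.
docs = {
    1: "fast text search",
    2: "searching large text collections",
    3: "image retrieval",
}
index = build_index(docs)

both = match_keyword(index, "text") & match_keyword(index, "search")    # AND
either = match_keyword(index, "text") | match_keyword(index, "image")   # OR
without = match_keyword(index, "text") - match_keyword(index, "fast")   # NOT
prefixed = match_prefix(index, "search")
```

In this sketch the prefix query scans every indexed word, which is adequate for illustration only; a tree-structured index would avoid the full scan.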
When searching in text documents the data are structured and most likely present in some natural language, like English, Norwegian etc. When searching for documents with a certain context it is possible to apply proximity metrics for matching keywords or phrases that match the query approximately. Allowing errors in keywords and phrases is a common method for proximity; using a thesaurus is another common method. A proximity search requires only that there shall be a partial match between the information retrieved and the query. International published application WO96/00945, titled "Variable length data sequence matching method and apparatus" (Döringer et al.), which has been assigned to International Business Machines Corp., discloses the building, maintenance and use of a database with a trie-like structure for storing entries and retrieving at least a partial match, preferably the longest partial match or all partial matches of a search argument (input key) from the entries.
The main object of the present invention is to provide a search system and a method for fast and efficient search and retrieval of information in large volumes of data. Particularly, it is an object of the present invention to provide a search system suited for implementing search engines for searching information systems with distributed large-volume data storage, for instance the Internet. It is to be understood that the search system according to the invention is by no means limited to searching and retrieving information stored in the form of alphanumeric symbols, but may equally well be applied to searching and retrieving information stored in the form of digitized images and graphic symbols, as the word text used herein may also be interpreted as images when these are represented wholly or partly as sets of symbols. It is also to be understood that the search system according to the invention can be implemented as software written in a suitable high-level language on commercially available computer systems, but it may also be implemented in the form of a dedicated processor device for searching and retrieving information of the aforementioned kind.
The above-mentioned objects and advantages are realized according to the invention with a search system which is characterized in that the data structure comprises a tree structure in the form of a non-evenly spaced sparse suffix tree ST(T) for storing suffixes of words and/or symbols s and sequences S thereof in the text T, that the metric M comprises a combination of an edit distance metric D(s,q) for an approximate degree of matching between words and/or symbols s;q in respectively the text T and a query Q and an edit distance metric Dws(S,P) for an approximate degree of matching between sequences S of words and/or symbols s in the text T and a query sequence P of words and/or symbols q in the query Q, the latter edit distance metric including weighting cost functions for edit operations which transform a sequence S of words and/or symbols s in the text T into the sequence P of words and/or symbols q in the query Q, the weighting taking place with a value proportional to a change in the length of sequence S upon a transformation or dependent on the size of the words and/or symbols in sequences S;P to be matched, that the implemented search algorithms comprise a first algorithm for determining the degree of matching between words and/or symbols s;q in the suffix tree representation of respectively the text T and a query Q, and a second algorithm for determining the degree of matching between sequences S;P of words and/or symbols s;q in the suffix tree representation of respectively the text T and the query Q, said first and/or second algorithms searching the data structure with queries Q in the form of either words, symbols, sequences of words or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of query Q with a specified degree of matching between the former and the latter, and that the search algorithms optionally also comprise a third algorithm for determining exact matching between words and/or symbols s;q in the suffix tree representation of respectively the text T and the query Q and/or a fourth algorithm for determining exact matching between sequences S;P of words and/or symbols s;q in the suffix tree representation of respectively the text T and the query Q, said third and/or fourth algorithms searching the data structure with queries Q in the form of either words, symbols, sequences of words or sequences of symbols or combinations thereof, such that information R is retrieved on the basis of the query Q with an exact matching between the former and the latter.
In an advantageous embodiment of the search system according to the invention the suffix tree ST(T) is a word-spaced sparse suffix tree SSTws(T), comprising only a subset of the suffixes in the text T.
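As an illustration of what storing only a subset of the suffixes means, the following minimal sketch indexes only those suffixes of the text that begin at the start of a word. It is an assumption-laden simplification: it builds an uncompressed character trie rather than a compact suffix tree, treats whitespace as the only word boundary, and uses the hypothetical names `Node`, `word_starts`, `build_sparse_suffix_trie` and `find`. It is not the data structure of the invention itself.

```python
class Node:
    """Trie node holding child links and the word-start offsets passing through it."""
    __slots__ = ("children", "starts")

    def __init__(self):
        self.children = {}
        self.starts = []


def word_starts(text):
    """Yield the offsets at which a word begins (assumes whitespace boundaries)."""
    prev_is_space = True
    for i, ch in enumerate(text):
        if prev_is_space and not ch.isspace():
            yield i
        prev_is_space = ch.isspace()


def build_sparse_suffix_trie(text):
    """Insert only the suffixes that start at a word boundary."""
    root = Node()
    for start in word_starts(text):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find(root, pattern):
    """Return word-start offsets whose suffix begins with pattern (exact match)."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.starts)


text = "the cat sat on the mat"
trie = build_sparse_suffix_trie(text)
```

Because only word-aligned suffixes are stored, a pattern that occurs solely in the middle of words, such as `at` in this sample text, is not found. That loss is the price of the smaller structure.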
The above-mentioned objects and advantages are also realized according to the invention with a method which is characterized by generating the data structure as a word-spaced sparse suffix tree SSTws(T) of a text T for representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words s in the text T in the word-spaced sparse suffix tree SSTws(T), generating a combined edit distance metric D(s,q) for words s in the text T and a query word q in a query Q and a word-size dependent edit distance metric Dws(S,P) for sequences S of words s in the text T and a sequence P of words q in the query Q, the edit distance metric Dws(S,P) being the minimum sum of costs for edit operations transforming a sequence S into the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted by a value proportional to the change in the total length of the sequence S or by the ratio of the current word length and average word length in the sequences S;P, and determining the degree of matching between words s,q by calculating the edit distance D(s,q) between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q are more than k errors from each other, determining the degree of matching between the word sequences SR; PQ of retrieved information R and a query Q respectively by calculating the edit distance Dws(SR,PQ) for all matches.
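A word-size dependent sequence distance of the kind described above can be sketched as follows. The text states only that each edit operation is weighted by a value proportional to the change in sequence length or by the ratio of the current word length to the average word length; the concrete cost functions below are assumptions made for illustration. Inserting or deleting a word costs its length divided by the average word length over both sequences, and substituting one word for a different word costs the larger of the two weights. The name `sequence_distance` is hypothetical.

```python
def sequence_distance(S, P):
    """Size-weighted edit distance between word sequences S (text) and P (query).

    Assumed costs: insert/delete a word w -> len(w) / average word length;
    substitute s by a different q -> the larger of the two weights.
    """
    words = list(S) + list(P)
    if not words:
        return 0.0
    # Fall back to 1.0 if every word is empty, to avoid dividing by zero.
    avg = (sum(len(w) for w in words) / len(words)) or 1.0

    def weight(word):
        return len(word) / avg

    m, n = len(S), len(P)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + weight(S[i - 1])      # delete all of S
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + weight(P[j - 1])      # insert all of P
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s, q = S[i - 1], P[j - 1]
            substitute = 0.0 if s == q else max(weight(s), weight(q))
            d[i][j] = min(
                d[i - 1][j] + weight(s),              # delete s
                d[i][j - 1] + weight(q),              # insert q
                d[i - 1][j - 1] + substitute,         # keep or substitute
            )
    return d[m][n]
```

With these assumed costs, a long extra word in the text moves it further from the query than a short one does. A flat cost of one per word operation would treat both cases alike.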
Advantageously the method according to the invention additionally comprises weighting an edit operation which changes a word s into a word q with a parameter for the proximity between the characters of the words s;q, thus taking the similarity of the words s;q into account when determining the cost of the edit operation in question.
In the method according to the invention the number of matches is limited by calculating the edit distance Dws(SR,PQ) for a restricted number of words in the query word sequence PQ.
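One way to picture such a proximity parameter is to scale the cost of changing word s into word q by how different their characters are. The sketch below is an assumption, not the formula of the invention: it takes the character-level edit distance with unit costs and normalizes it by the length of the longer word, so that near-identical words are cheap to substitute. The names `char_edit_distance`, `substitution_cost` and `base_cost` are hypothetical.

```python
def char_edit_distance(s, q):
    """Unit-cost character edit distance, computed one row at a time."""
    prev = list(range(len(q) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, cq in enumerate(q, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete cs
                curr[j - 1] + 1,             # insert cq
                prev[j - 1] + (cs != cq),    # match or replace
            ))
        prev = curr
    return prev[-1]


def substitution_cost(s, q, base_cost=1.0):
    """Cost of changing word s into word q, scaled by character proximity.

    Assumed scaling: base_cost * D(s, q) / max(len(s), len(q)), which lies
    between 0 (identical words) and base_cost (nothing in common).
    """
    longest = max(len(s), len(q))
    if longest == 0:
        return 0.0
    return base_cost * char_edit_distance(s, q) / longest
```

Under this assumed scaling, replacing `search` with the misspelling `serch` costs far less than replacing it with an unrelated word such as `image`.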
In another advantageous embodiment of the method according to the invention the edit distance D(s,q) between a word s and a word q is defined recursively and calculated by means of a dynamic programming procedure; and the edit distance Dws(S,P) between a sequence S and a sequence P is correspondingly recursively defined and calculated by means of a dynamic programming procedure.
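A recursive definition of this kind can be written out as follows for the word-level distance. The recurrence shown is the standard unit-cost edit distance, given as an assumption since the cost values are not stated at this point in the text. Here s = s_1 ... s_m and q = q_1 ... q_n are the symbols of the two words, D_{i,j} is the distance between the first i symbols of s and the first j symbols of q, and [s_i ≠ q_j] equals 1 when the symbols differ and 0 otherwise.

```latex
\begin{aligned}
D_{i,0} &= i, \qquad D_{0,j} = j, \\
D_{i,j} &= \min\bigl\{\, D_{i-1,j} + 1,\;\; D_{i,j-1} + 1,\;\; D_{i-1,j-1} + [\,s_i \neq q_j\,] \,\bigr\},
  \qquad 1 \le i \le m,\; 1 \le j \le n, \\
D(s,q) &= D_{m,n}.
\end{aligned}
```

A dynamic programming procedure fills the table of D_{i,j} values row by row, so each entry is computed once from its three neighbours. The sequence distance Dws(S,P) would follow the same scheme with words in place of symbols and weighted costs in place of the unit costs.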
The above-mentioned objects and advantages are also realized with the use of the search system according to the invention in an approximate search engine.