The present invention concerns a search engine with a two-dimensional, linearly scalable parallel architecture for searching a collection of text documents D, wherein the documents can be divided into a number of partitions d1, d2, . . . dn, wherein the collection of documents D is preprocessed in a text filtration system such that a preprocessed document collection Dp and corresponding preprocessed partitions dp1, dp2, . . . dpn are obtained, wherein an index I can be generated from the document collection D such that for each preprocessed partition dp1, dp2, . . . dpn a corresponding index i1, i2, . . . in is obtained, wherein searching a partition d of the document collection D takes place with a partition-dependent data set dp,k comprising both the preprocessed partition dpk and the corresponding index ik, with 1 ≤ k ≤ n, and wherein the search engine comprises data processing units which form sets of nodes connected in a network.
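By way of illustration only, the relationship between the partitions, the preprocessed partitions, the indexes and the partition-dependent data sets dp,k may be sketched as follows. The sketch assumes a trivial text filter (lowercasing and whitespace splitting), a round-robin assignment of documents to partitions and a plain inverted index; the names `preprocess`, `PartitionDataSet` and `build_data_sets` are hypothetical and none of these choices is prescribed by the description above.

```python
from collections import defaultdict
from dataclasses import dataclass


def preprocess(text: str) -> list[str]:
    """Stand-in for the text filtration system (assumed: lowercase, split)."""
    return text.lower().split()


@dataclass
class PartitionDataSet:
    """Hypothetical container for dp,k: preprocessed partition dpk plus index ik."""
    docs: dict[int, list[str]]   # dpk: document id -> preprocessed tokens
    index: dict[str, set[int]]   # ik: token -> ids of documents containing it

    def search(self, term: str) -> set[int]:
        """Look the term up in this partition's own index only."""
        return set(self.index.get(term.lower(), set()))


def build_data_sets(collection: dict[int, str], n: int) -> list[PartitionDataSet]:
    """Split D into n partitions, preprocess each, and index each separately."""
    parts: list[dict[int, list[str]]] = [dict() for _ in range(n)]
    for doc_id, text in collection.items():
        # Assumed round-robin partitioning; the source does not fix a scheme.
        parts[doc_id % n][doc_id] = preprocess(text)
    data_sets = []
    for docs in parts:
        index: dict[str, set[int]] = defaultdict(set)
        for doc_id, tokens in docs.items():
            for tok in tokens:
                index[tok].add(doc_id)
        data_sets.append(PartitionDataSet(docs, dict(index)))
    return data_sets


D = {
    0: "Parallel search engine",
    1: "Binary tree network",
    2: "Scalable search architecture",
}
data_sets = build_data_sets(D, n=2)
# Searching the whole collection means searching every dp,k and combining hits.
hits = set().union(*(ds.search("search") for ds in data_sets))
```

Each data set is self-contained, so a node holding dp,k can answer a query for its partition without consulting any other node.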
Most prior art search engines work with large data sets and employ powerful computers to perform the search. However, searching is a partitionable data processing problem, and this fact can be used to partition a search problem into a large number of specific queries and let each query be processed simultaneously on a commensurate number of processors connected in parallel in a network. In particular, searching can be regarded as a binary partitionable data processing problem, and hence a binary tree network is used for establishing a multiprocessor architecture such as disclosed for instance in U.S. Pat. No. 4,860,201 (Stolfo et al.) and international patent application PCT/NO99/00308, which belongs to the applicant and is hereby incorporated by reference. The present applicant has developed proprietary technologies for searching within regular text documents. These technologies are based, inter alia, on a search system and a method for searching as described in international patent application PCT/NO99/00233, which belongs to the applicant and is hereby incorporated by reference. The search system is based on efficient core search algorithms which may be used in the search engine according to the invention.
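The partitionable nature of searching may be illustrated by the following minimal sketch, in which a query is evaluated on all partitions concurrently and the partial results are then merged pairwise, level by level, as the internal nodes of a binary tree would do. The sketch is not the architecture of the cited references; the functions `search_partition`, `merge`, `tree_reduce` and `parallel_search` are hypothetical, and a thread pool merely stands in for processors connected in a network.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(docs: dict[int, str], term: str) -> list[int]:
    """Hypothetical per-node search: sorted ids of documents containing the term."""
    return sorted(i for i, text in docs.items() if term in text.lower().split())


def merge(left: list[int], right: list[int]) -> list[int]:
    """Combine two partial result lists, as an internal tree node might."""
    return sorted(set(left) | set(right))


def tree_reduce(results: list[list[int]]) -> list[int]:
    """Merge partial results pairwise, one tree level per loop iteration."""
    if not results:
        return []
    while len(results) > 1:
        paired = [
            merge(results[i], results[i + 1])
            for i in range(0, len(results) - 1, 2)
        ]
        if len(results) % 2:  # an unpaired result is passed up unchanged
            paired.append(results[-1])
        results = paired
    return results[0]


def parallel_search(partitions: list[dict[int, str]], term: str) -> list[int]:
    """Query every partition at once, then reduce the answers up the tree."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(lambda p: search_partition(p, term), partitions))
    return tree_reduce(partial)


partitions = [
    {0: "parallel search engine"},
    {1: "binary tree network"},
    {2: "scalable search architecture"},
]
result = parallel_search(partitions, "search")
```

With n partitions the merge completes in roughly log2(n) levels, which is one reason a binary tree is a natural fit for a binary partitionable problem.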
However, it has become increasingly important to cater for a growing number of documents to be searched and also to be able to handle an increased traffic load, i.e. the number of queries per second that must be processed by the search system. Apart from the ability to handle a large number of queries simultaneously at the processor level, this implies that a search engine should be implemented with an architecture that allows for preferably linear scalability in two dimensions, viz. with regard to both the data volume and the performance, i.e. the ability to handle a very large number of queries per second. Considering the development of the World Wide Web, the scalability problem in search engine architecture will be extremely important, as there is presently an enormous growth rate in both the number of documents and the number of users on the Internet.
Prior art search engine solutions for the Internet are able to scale to a certain level, but this is almost always achieved in a manner that requires a steep increase in the cost of the search engine system relative to the growth in data volume or data traffic. Very often the system costs scale as the square of the data volume or the traffic, a doubling of the data volume thus leading to quadrupled system costs. Furthermore, all the major Internet search engines are presently based on very expensive server technology, often coupled with brute-force computing approaches and accompanied by disadvantages such as slow server turnaround, requirements for special hardware to provide fault tolerance, etc. The system costs can be measured, for example, as the amount of hardware required to implement a search engine solution or as the actual aggregated price of the system.
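The difference between quadratic and linear cost scaling can be made concrete with a short calculation. The sketch below assumes idealized cost models with an arbitrary unit cost of 1.0; the functions `quadratic_cost` and `linear_cost` are hypothetical and are not measurements of any actual system.

```python
def quadratic_cost(volume: float, unit_cost: float = 1.0) -> float:
    """Idealized prior-art model: cost grows as the square of the data volume."""
    return unit_cost * volume ** 2


def linear_cost(volume: float, unit_cost: float = 1.0) -> float:
    """Idealized linearly scalable model: cost grows in proportion to volume."""
    return unit_cost * volume


# Cost factor incurred when the data volume is doubled under each model.
quadratic_factor = quadratic_cost(2.0) / quadratic_cost(1.0)
linear_factor = linear_cost(2.0) / linear_cost(1.0)

for volume in (1, 2, 4, 8):
    print(volume, quadratic_cost(volume), linear_cost(volume))
```

Under these assumptions, doubling the data volume quadruples the cost in the quadratic model but only doubles it in the linear model, and the gap widens with each further doubling.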