A search engine is an information retrieval system that allows users of a computer system to specify criteria about an item of interest (“search terms”) and to have the search engine find the matching items. In the case of a text search engine, e.g. Google, the search query is typically expressed as a set of words.
In order to speed up the search process, a search engine will typically collect metadata about the group of items beforehand in a process referred to as indexing. An index typically requires less computer storage than the items themselves and provides a basis for the search engine to calculate the relevance of each item.
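As an illustration of the indexing step described above, the following is a minimal sketch of an inverted index, assuming items are plain-text documents and search terms are whitespace-separated words. The function names `build_index` and `search` and the sample documents are hypothetical and chosen only for this example; a real engine would also store the relevance information mentioned above.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's documents and narrow down with each later word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, indexed once before any query is run.
documents = {
    1: "desktop search tools index local files",
    2: "web search engines crawl pages",
    3: "an index speeds up search",
}
index = build_index(documents)
matches = search(index, "search index")
```

Here the query is answered from the index alone, without rescanning the documents, which is the speed-up that indexing is intended to provide.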
Desktop search is the name given to search tools which search the content of a user's hard drive rather than the Internet. Such tools may find information including web browser histories, email archives, text documents, sound files etc. Such search tools can be extremely fast but may not search the entire hard drive. For example, only operating system specific applications may be searched (e.g. Microsoft documents, folders) and information contained in email or contact databases may not be included.
Since a significant proportion of company data can be stored within unstructured data (e.g. user-created directory structures), it is important that a desktop search engine be able to search within all areas of the computer.
Desktop search engines build and maintain an index database to optimise search performance. Indexing takes place when the computer is idle, and the search engine generally collects information relating to file/directory names; metadata such as titles and authors; and the content of the supported data items/documents. An example of a desktop search tool is “Windows Search”, an indexed desktop search platform released by Microsoft for the Windows operating system.
Web search engines provide an interface to search for information on the Internet. A web search engine works by storing information about a large number of web pages which are retrieved by a web crawler, an automated web browser that follows every link it sees. The contents of each page are then indexed and stored in an index database for use in later queries. When a user enters a query into a search engine, e.g. by the use of key words, the web search engine examines its index and provides a listing of best matching web pages according to its criteria. Most search engines support Boolean operators AND, OR and NOT to further specify the search and some engines provide a proximity search which allows users to specify the distance between keywords.
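The Boolean operators and the proximity search mentioned above can be sketched over a positional index, i.e. an index that records where each word occurs on each page. This is an illustrative sketch only: the page names, the helper names `build_positional_index`, `pages_with` and `within`, and the use of word positions as the distance measure are all assumptions made for the example, not a description of any particular engine.

```python
from collections import defaultdict


def build_positional_index(pages):
    """Map word -> page -> list of word positions on that page."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][url].append(position)
    return index


def pages_with(index, word):
    """Return the set of pages on which the word occurs."""
    word = word.lower()
    return set(index[word]) if word in index else set()


def within(index, first, second, max_distance):
    """Return pages where the two words occur at most max_distance words apart."""
    matches = set()
    for url in pages_with(index, first) & pages_with(index, second):
        positions_a = index[first.lower()][url]
        positions_b = index[second.lower()][url]
        if any(abs(a - b) <= max_distance for a in positions_a for b in positions_b):
            matches.add(url)
    return matches


# Hypothetical crawled pages.
pages = {
    "page_a": "search engines index web pages",
    "page_b": "desktop search tools index files on a hard drive",
    "page_c": "web crawlers follow links between pages",
}
index = build_positional_index(pages)

and_result = pages_with(index, "search") & pages_with(index, "web")  # search AND web
or_result = pages_with(index, "search") | pages_with(index, "web")   # search OR web
not_result = pages_with(index, "search") - pages_with(index, "web")  # search AND NOT web
near_result = within(index, "search", "index", 2)                    # proximity search
```

In this sketch each Boolean operator reduces to a set operation on the pages returned for each keyword, while the proximity search additionally compares the recorded word positions.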
Given the current size and speed of growth of the Internet, it is important that the initial search query is relevant in order that relevant search results are returned. The usefulness of a search engine also depends on the relevance of the result set it gives back, and one of the main problems with current search engines is the tendency of the result set to contain duplicate search results.
De-duplication of search results is currently handled by means of hash algorithms in which each chunk of data is processed by a hash algorithm, thereby generating a number, intended to be unique, which is stored in an index. When a piece of data receives a hash number, that number is compared with the index of existing hash numbers. If the hash number is already in the index then the piece of data is considered a duplicate and is not stored. Otherwise the new hash number is added to the index and the new data is stored. In some cases, however, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system will not store the new data because it sees that its hash number already exists in the index. Such false positives can result in lost data. It is also noted that hash algorithms are complex.
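The de-duplication scheme described above, and the way a hash collision leads to lost data, can be illustrated with the following sketch. The class `DedupStore` and the function `weak_hash` are hypothetical names introduced for this example. The hash used here (sum of bytes modulo 256) is deliberately weak so that a collision is easy to reproduce; practical systems typically use much stronger hash functions for which collisions are far less likely.

```python
def weak_hash(chunk: bytes) -> int:
    """Deliberately weak hash (sum of bytes mod 256) so collisions are easy to show."""
    return sum(chunk) % 256


class DedupStore:
    """Stores a chunk only if its hash number is not already in the index."""

    def __init__(self, hash_function):
        self.hash_function = hash_function
        self.chunks_by_hash = {}  # the index: hash number -> stored chunk

    def add(self, chunk: bytes) -> bool:
        """Return True if the chunk was stored, False if treated as a duplicate."""
        key = self.hash_function(chunk)
        if key in self.chunks_by_hash:
            # Only the hash number is compared, so different data with the
            # same hash number is also rejected here.
            return False
        self.chunks_by_hash[key] = chunk
        return True


store = DedupStore(weak_hash)
stored_first = store.add(b"ab")      # new data, stored
stored_repeat = store.add(b"ab")     # genuine duplicate, not stored
stored_collision = store.add(b"ba")  # different data with the same hash, not stored
```

The third call shows the false positive: `b"ba"` differs from `b"ab"` but produces the same hash number, so it is discarded as though it were a duplicate.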
A further drawback of known search engines is a limitation in the types of data source they can search. Traditionally, search engines index and search unstructured data resources. This leaves large amounts of data, tied into structured data stores such as databases, that cannot be accessed by a traditional search engine. If the structured data is indexed separately, this index may be made available to the search engine, but this creates a further data store for data which is already indexed within its own structure.
It is therefore an object of the present invention to provide a search engine that overcomes or substantially mitigates the above problems with the prior art.