The World Wide Web (WWW) provides access to a large body of information. Compared with traditional databases, Web information is dynamic and structured with hyperlinks. It can also be represented in different forms and is shared globally across multiple sites and platforms. Hence, querying the WWW differs significantly from querying traditional databases, e.g. relational databases, which are structured, centralized and static. Traditional databases can cope with a small number of information sources, but they are ineffective for thousands.
Most Web documents are text-oriented. The relevant information is usually embedded in the text and cannot be explicitly or easily specified in a user query. To facilitate Web searching, many search engines and similar programs have been developed. Most of these programs are database-based: the system maintains a database, and a user searches the Web by specifying a set of keywords and formulating a query to that database. Web search aids are variously referred to as catalogs, directories, indexes, search engines, or Web databases.
A search engine is a Web site that a user may consult to find desired Web pages and sites. A search engine generally returns the results of a search ranked by relevance.
A competent Web search engine must include the fundamental search facilities that Internet users are familiar with: Boolean logic, phrase searching, truncation, and limiting facilities (e.g. limiting by field). Most services try, to a greater or lesser extent, to index the full text of the original documents, which allows the user to find quite specialized information. Most services use best-match retrieval systems; some use a Boolean system only.
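The search facilities named above can be illustrated with a minimal sketch over a tiny in-memory inverted index. This is an illustration only, not the method of any particular search service: the function names (`build_index`, `search_and`, `search_or`, `search_not`, `search_truncated`, `search_phrase`) are hypothetical, tokenization is plain whitespace splitting, and limiting by field and relevance ranking are not covered.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Boolean AND: documents containing every term."""
    sets = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Boolean OR: documents containing at least one term."""
    result = set()
    for t in terms:
        result |= set(index.get(t.lower(), {}))
    return result


def search_not(index, all_ids, term):
    """Boolean NOT: documents that do not contain the term."""
    return set(all_ids) - set(index.get(term.lower(), {}))


def search_truncated(index, prefix):
    """Truncation: documents containing any word starting with prefix."""
    prefix = prefix.lower()
    result = set()
    for word, postings in index.items():
        if word.startswith(prefix):
            result |= set(postings)
    return result


def search_phrase(index, phrase):
    """Phrase search: documents where the words occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    result = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for start in positions:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                result.add(doc_id)
                break
    return result
```

Storing word positions in the index is what makes phrase searching possible without rescanning the documents; an index that stored only document identifiers would support the Boolean and truncation queries but not the phrase query.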
Web search engines execute algorithms whose internal processes are repetitive tasks operating on independent input data. Classical step-by-step processing, in which all processes and decisions are completed on one input item before the next item is processed, is inefficient because it takes too much time to process all the data. For example, it is common to search for a pattern within each file of a disk. The main repetitive processes to perform are: load file, open file, scan each word and compare it with the pattern, append the result to a temporary file, close file.
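The repetitive steps listed above can be sketched as a plain sequential loop. This is a minimal illustration under stated assumptions: the function name `scan_files_sequentially` and its parameters are hypothetical, words are separated by whitespace, and a match means exact equality with the pattern.

```python
def scan_files_sequentially(paths, pattern, result_path):
    """Scan each file in turn and append matches to a result file.

    Every file is fully processed before the next one is opened,
    which is the step-by-step behaviour described in the text.
    Returns the total number of matching words.
    """
    matches = 0
    # Result file opened in append mode (the "temporary file").
    with open(result_path, "a", encoding="utf-8") as results:
        for path in paths:
            # Load/open the file; the with-block closes it afterwards.
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                for line_no, line in enumerate(f, start=1):
                    # Scan each word and compare it with the pattern.
                    for word in line.split():
                        if word == pattern:
                            results.write(f"{path}:{line_no}\n")
                            matches += 1
    return matches
```

Because the files are independent of one another, nothing in the loop body for one file depends on the result for another, which is why this workload is a natural candidate for parallel execution.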
One way to improve performance, and in particular search response time, is to parallelize the search mechanism in the database or index table. Such software parallelization is more efficient than sequential processing but remains limited, because software processing, even when parallelized, requires a minimum amount of time that cannot be reduced.
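As a rough sketch of software parallelization, the per-file pattern search can be distributed over a pool of workers using Python's standard `concurrent.futures` module. The names `count_matches` and `parallel_search` and the `max_workers` default are hypothetical choices for illustration; a thread pool is used here for simplicity, and a process pool may suit CPU-bound scanning better in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(path, pattern):
    """Count whitespace-separated words in one file equal to pattern."""
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            count += sum(1 for word in line.split() if word == pattern)
    return count


def parallel_search(paths, pattern, max_workers=4):
    """Search independent files concurrently; return {path: match count}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each file is an independent task, so tasks can run in any order.
        counts = pool.map(lambda p: count_matches(p, pattern), paths)
        return dict(zip(paths, counts))
```

Even in this form, each task still has to open its file and compare every word, so the response time cannot fall below the cost of the slowest single task plus the scheduling overhead, which is the kind of floor the paragraph above refers to.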