Today, the gathering of information from publicly available sources, such as those available over the Internet, is of principal importance in the performance of everyday tasks. In addition, similar activities involving the identification of desired information resident in nonpublic sources, such as LEXIS-NEXIS, private corporate libraries, consulting firm databases, law firm document databases and so forth, are at the center of many educational, commercial and other tasks.
Much effort has been devoted to search engines with the objective of enabling the identification and retrieval of content relevant to user-specified topics of breadth ranging from the very specific to the relatively broad.
Typically, text searching on the Web is done by keyword searching. The keywords may be input by the user and specified in connection with a Boolean algorithm, or they may be extracted by the search engine. Once the keywords have been established, the search engine performs a text query and retrieval using them. Generally, the search engine looks for the keywords and tries to find documents containing them. However, in addition to searching for words, various algorithms are used to implement what are referred to as “ranking,” “relevancy,” and so forth. The occurrence of keywords in a document among a collection of documents being searched can be used as an indication of the content of that document.
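By way of illustration only, Boolean keyword retrieval of the kind described above can be sketched with an inverted index that maps each keyword to the documents containing it. The function names (`tokenize`, `build_index`, `boolean_and`) and the sample documents are hypothetical and are not drawn from any particular search engine.

```python
from collections import defaultdict

PUNCT = '.,;:!?"()'

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def boolean_and(index, keywords):
    """Return the ids of documents that contain every keyword (Boolean AND)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample collection, for illustration only.
docs = {
    "d1": "Patent search engines index legal documents.",
    "d2": "Search engines rank web documents by keywords.",
    "d3": "Corporate libraries store private documents.",
}
index = build_index(docs)
print(sorted(boolean_and(index, ["search", "documents"])))
```

A query of this kind establishes only that the keywords are present in a document; it says nothing about whether the document is actually about the topic of interest.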
However, the mere appearance of one or more keywords in a document does not necessarily indicate that the document is relevant to the inquiry. For this reason, search engines typically give weight to document publisher indications of keywords, for example metatags. Because search engines operate according to algorithms, and are not rational human beings, they do not deal with meaning. In an attempt to overcome this, particular weight may be given to certain parts of a document. For example, headings or the title of a document may be given more weight. Likewise, words which appear near the beginning of a document may also be given more weight by a search engine. Repetition of the same keywords may also result in additional weight being given to the document.
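The weighting heuristics described above (title, headings, position near the beginning, and repetition) can be sketched as a simple scoring function. This is a minimal sketch under stated assumptions: the weights, the `early_n` cutoff, and the document layout (`title`, `headings`, `body`) are all hypothetical values chosen for illustration, not values used by any actual search engine.

```python
def score(doc, keywords, title_w=3.0, heading_w=2.0, body_w=1.0,
          early_bonus=0.5, early_n=20):
    """Score a document by weighted keyword occurrences.

    All weights are assumed values. Each occurrence counts, so repeated
    keywords raise the score, as the text describes.
    """
    kws = {k.lower() for k in keywords}
    title = doc["title"].lower().split()
    headings = " ".join(doc["headings"]).lower().split()
    body = doc["body"].lower().split()

    total = 0.0
    # Title and heading matches receive more weight than body matches.
    total += title_w * sum(1 for w in title if w in kws)
    total += heading_w * sum(1 for w in headings if w in kws)
    for pos, word in enumerate(body):
        if word in kws:
            total += body_w
            # Words near the beginning of the body receive a small bonus.
            if pos < early_n:
                total += early_bonus
    return total

# Hypothetical documents, for illustration only.
doc_a = {"title": "search engines", "headings": ["ranking"],
         "body": "search engines rank documents"}
doc_b = {"title": "cooking", "headings": [],
         "body": "i search for recipes"}

print(score(doc_a, ["search"]), score(doc_b, ["search"]))
```

Because the score depends only on where and how often a keyword appears, a document can score highly without being meaningfully related to the inquiry.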
The shortcomings of such a system are apparent. Moreover, even “full-text” indexing systems generally do not process commonly occurring stop words such as “a,” “an,” “the,” “is,” “but,” “and,” “or,” and “www.” Search engines may also differentiate between capitalized and lowercase words in an attempt to infer meaning.
Still another problem arises when keywords are relatively common, resulting in the retrieval of large numbers of documents. Likewise, if a keyword has multiple meanings, many of the search results may be of little or no value.
Another expedient is so-called Page ranking, under which links to a particular document improve its rank. If there are many links, or if the links are from recognized quality sites, the links add still more weight. Effectively, links are endorsements that attest to importance, reliability, popularity, or the like, all of which would suggest a higher likelihood that the linked document would be relevant to the inquiry.
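The link-based ranking described above can be illustrated with a simplified iterative link-analysis computation in the style of PageRank. The text does not specify a formula, so this is only a sketch: the damping factor of 0.85, the iteration count, the function name `link_rank`, and the sample link graph are all assumptions.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Estimate page ranks from a mapping of page -> list of linked pages.

    A page's rank grows with the number of pages linking to it and with
    the rank of those linking pages. The damping value is an assumption.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this sample graph, page "c" receives the most links and ends up with the highest rank, while "a" ranks above "b" and "d" because its single incoming link comes from the highly ranked "c".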
On the other side of the process, organizations wishing to drive traffic to their websites can take advantage of the limited nature of the search algorithms to increase the rank of their websites in the search engine's results.