Comparing two texts to determine whether they have words in common is a relatively simple problem using today's technology: one extracts the words (or word stems) from each text and finds the intersection of the two sets. A harder problem is comparing two texts to determine whether they have similar meanings. In theory, one could determine how similar two texts are in substance by extracting the meaning from each text and comparing the meanings. However, it is difficult to extract the meaning from a text algorithmically.
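By way of illustration only, the word-level comparison described above might be sketched as follows. The function names `words` and `common_words`, the tokenization rule, and the sample texts are assumptions made for this example; stemming is omitted for brevity.

```python
import re

def words(text):
    """Return the set of lowercase words found in a text (no stemming)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def common_words(text_a, text_b):
    """Return the words that the two texts share (set intersection)."""
    return words(text_a) & words(text_b)

# Hypothetical sample texts chosen for this sketch.
text_a = "Recipes for cooking lima beans"
text_b = "Cooking classes in Lima, Peru"

print(sorted(common_words(text_a, text_b)))  # ['cooking', 'lima']
```

The two sample texts share two words even though they concern different subjects, which is the limitation of word comparison that the remainder of this section discusses.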
Because it is hard to extract the meaning from a text, many applications that compare texts use word comparison as a proxy for meaning comparison. A search engine is the canonical example of a text comparison application: a search engine compares one text (the query) with another text (each document in a corpus of indexed documents). Documents that contain the query words appear in the search results. However, the words alone might not indicate what the searcher is looking for, since the same words can refer to several different concepts. For example, the word “lima” refers to a vegetable, and also to the capital city of Peru (although the vegetable is actually named after the city). Thus, a query such as “cooking lima” might refer to recipes for lima beans or to cooking classes in Peru. If one enters this query into a search engine, the search engine is likely to return results in which the terms “cooking” and “lima” appear frequently, but the search engine may not be able to differentiate between the sites that are about bean recipes and those that are about Peruvian culinary schools. The same issues arise in other text-comparison applications, e.g., finding articles that are similar to each other, comparing students' term papers to see which ones are similar enough to suggest plagiarism, etc.
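The ambiguity described above can be illustrated with a minimal keyword-matching sketch. The three-document corpus, the document identifiers, and the functions `matches` and `search` are hypothetical and are not drawn from any actual search engine.

```python
def matches(query, document):
    """Return True if every query word occurs in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

# Hypothetical corpus: two different subjects that share the query words,
# plus one document that does not contain them.
corpus = {
    "bean_recipes": "slow cooking lima beans with garlic and herbs",
    "peru_school": "a cooking school in lima teaches peruvian cuisine",
    "gardening": "growing beans in a small garden",
}

def search(query):
    """Return the identifiers of all documents containing the query words."""
    return [doc_id for doc_id, text in corpus.items() if matches(query, text)]

print(search("cooking lima"))  # ['bean_recipes', 'peru_school']
```

Both the bean-recipe document and the culinary-school document satisfy the query, and nothing in the word match distinguishes one meaning of “lima” from the other.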
Real-world search engines employ some form of relevance ranking. Thus, among the documents that contain the query terms, a document may be given a higher or lower score based on its number of inbound links, the percentage of the document that is devoted to the query terms, the provenance of the document, etc. However, these types of relevance rankings generally try to place documents from well-regarded sources near the top of the results, without regard to the underlying meaning of the text. Thus, there are circumstances in which a mere comparison of words does not produce the results that are sought.
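A relevance ranking of the general kind described above might be sketched as follows. The scoring formula, the 0.1 weight, the logarithmic damping of the link count, and the sample documents are all assumptions made for this example; actual search engines use rankings that are considerably more elaborate and are not described here.

```python
import math

def term_fraction(query, text):
    """Fraction of the document's words that are query words."""
    doc_words = text.lower().split()
    query_words = set(query.lower().split())
    if not doc_words:
        return 0.0
    return sum(1 for w in doc_words if w in query_words) / len(doc_words)

def contains_all(query, text):
    """Return True if every query word occurs in the document."""
    doc_words = set(text.lower().split())
    return all(w in doc_words for w in query.lower().split())

def score(query, doc):
    """Hypothetical score: term fraction plus a damped inbound-link count."""
    return term_fraction(query, doc["text"]) + 0.1 * math.log1p(doc["inbound_links"])

def rank(query, docs):
    """Keep documents containing the query terms, highest score first."""
    hits = [d for d in docs if contains_all(query, d["text"])]
    return sorted(hits, key=lambda d: score(query, d), reverse=True)

# Hypothetical documents with made-up inbound-link counts.
docs = [
    {"id": "bean_recipes", "text": "cooking lima beans at home", "inbound_links": 5},
    {"id": "peru_school", "text": "cooking classes in lima peru", "inbound_links": 500},
]

print([d["id"] for d in rank("cooking lima", docs)])  # ['peru_school', 'bean_recipes']
```

In this sketch both documents devote the same fraction of their words to the query terms, so the order is decided entirely by the inbound-link count. The score reflects how well linked a document is, not which meaning of the query the document addresses.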