1. Field of the Invention
The present invention relates to a method and system for calculating a relevance between words based on a document set, and more particularly, to a method for measuring a frequency of words of the document set according to various characteristics, obtaining statistical information based on the measured frequency, standardizing the obtained statistical information, and calculating a relevance between words based on the standardized statistical information, and thereby expressing the relevance as a numerical value, and a system for implementing the method.
2. Description of Related Art
Generally, people can understand the common relation between words by intuition. For example, people know that there is a very close relation between ‘soccer shoes’ and ‘a soccer ball’, but no particular relation between ‘soccer shoes’ and ‘vehicle’. Therefore, while reading a document, people know that the document is associated with certain words. Even when those words do not appear in the document, people can infer them as related words.
However, computer systems, such as search engines and the like, cannot understand the common meaning between words. Thus, an operation of classifying words or documents associated with a predetermined document set must be performed through manual processes. Moreover, when documents are retrieved from the document set in response to a query, a document that is not actually associated with the query, i.e., a document that merely includes the contents of the query, may be retrieved and provided as a search result.
If the relevance between words can be indicated as a numerical value, computer systems may classify words or documents based on the relevance between the words. The relevance may also be used for document searching. For example, the relevance between ‘soccer shoes’ and ‘a soccer ball’ can be set to 0.95, the relevance between ‘soccer shoes’ and ‘nike’ can be set to 0.3, and the relevance between ‘soccer shoes’ and ‘a vehicle’ can be set to 0.001.
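As a minimal sketch of how numerical relevance values might be used by a computer system, the following Python fragment stores the three example values given above and returns the words related to a given word in descending order of relevance. The table layout, the function name `related_words`, and the threshold of 0.1 are assumptions made purely for illustration and are not taken from the described method.

```python
# Example relevance values from the text; the dictionary layout is an assumption.
RELEVANCE = {
    ("soccer shoes", "a soccer ball"): 0.95,
    ("soccer shoes", "nike"): 0.3,
    ("soccer shoes", "a vehicle"): 0.001,
}


def related_words(word, threshold=0.1):
    """Return (other_word, relevance) pairs for `word`, highest relevance first.

    Pairs whose relevance is below `threshold` (a hypothetical cutoff) are dropped.
    """
    hits = [
        (other, score)
        for (base, other), score in RELEVANCE.items()
        if base == word and score >= threshold
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


ranked = related_words("soccer shoes")
print(ranked)
```

With the assumed threshold, ‘a vehicle’ is filtered out because its relevance of 0.001 falls below the cutoff, while ‘a soccer ball’ is ranked ahead of ‘nike’.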
In this instance, if people were to decide the relevance between words directly, a great amount of time and effort would be required. For example, for 200,000 words, word relevance must be calculated 40 billion times (200,000 × 200,000). Therefore, even if one word relevance per second could be determined through a manual operation, a great amount of time would be required, since 40 billion seconds is approximately 1,268 years. Also, the relevance between words may not be objective, since people may introduce their own subjective concepts in the course of decision making. For example, it is difficult to decide objectively how many points to assign to the relevance between ‘a vehicle’ and ‘hyundai motors’. Consequently, manually determined relevance between words may not be totally reliable.
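The estimate above can be checked with a short calculation. The following Python sketch assumes, as the text does, that every ordered pair of words is evaluated (200,000 × 200,000) at one second per pair, and that a year has 365 days; the function name `manual_effort_years` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, leap years ignored


def manual_effort_years(vocabulary_size, seconds_per_pair=1.0):
    """Years needed to score every ordered word pair by hand."""
    ordered_pairs = vocabulary_size ** 2
    return ordered_pairs * seconds_per_pair / SECONDS_PER_YEAR


pairs = 200_000 ** 2                   # 40,000,000,000 word pairs
years = manual_effort_years(200_000)   # roughly 1,268.4 years
print(pairs, int(years))
```

Counting each unordered pair only once would roughly halve the figure, which would still leave a manual effort of several hundred years.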
Accordingly, there is a need for a method and system capable of quickly and objectively calculating the relevance between words.