1. Field of the Invention
The present invention relates to an information retrieval apparatus and an information retrieval method that enable more accurate and faster retrieval of information matching users' preferences.
2. Description of the Related Art
Information devices that retrieve broadcast programs matching users' preferences, so that the users can easily view or record those programs, have conventionally been proposed in the form of personal computers (PCs) capable of displaying television programs and video images and in the form of personal video recorders (PVRs: recording devices having an HDD or a DVD drive).
The functions of these types of information devices are realized in a manner in which the information device searches an electronic program guide (EPG) in order to retrieve programs that a user favors by using as a retrieval key the preferences of the user. The information device suggests the retrieved programs as recommendations to the user, or records the retrieved programs automatically.
The above user preferences are extracted by the information device through an analysis of the behavior of the user. For example, information that is common among the programs that a user often views or records is extracted, and the extracted information is used as a retrieval key that corresponds to the user's preferences.
As techniques for retrieving broadcast programs by using retrieval keys, the Boolean method and the vector space method have been proposed.
The Boolean method is a method of retrieving information in which information including a retrieval key is handled as “True”, and information not including a retrieval key is handled as “False”.
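As a non-limiting illustration, the Boolean method described above may be sketched as follows. The function name, the program titles, and the term sets are hypothetical and are used only for explanation; they do not appear in any particular conventional device.

```python
def boolean_match(document_terms, retrieval_key):
    """Return True when the information includes the retrieval key, else False."""
    return retrieval_key in document_terms


# Hypothetical EPG entries, each reduced to a set of terms.
programs = {
    "Evening News": {"news", "weather", "politics"},
    "Soccer Live": {"sports", "soccer"},
}

# Only the entries handled as "True" are retrieved; no ranking is produced.
hits = [title for title, terms in programs.items() if boolean_match(terms, "news")]
print(hits)
```

In this sketch each piece of information is either retrieved or not retrieved, and no degree of similarity is obtained.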
The vector space method is a method in which target information and a retrieval condition consisting of at least one retrieval key are arranged over a vector space, and retrieval is performed by using the degrees of similarity between their vectors. The respective axes on the vector space correspond to the retrieval keys (retrieval key information) such as respective key words, date and time, and the like. In other words, the respective elements in a vector correspond to the retrieval keys included in the retrieval target, and the element values (weights) correspond to the frequencies with which the retrieval keys are included in information. It is generally thought that the vector space method allows for highly accurate retrieval.
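As a non-limiting illustration, the vector space method may be sketched as follows, assuming that the element values are simple occurrence frequencies and that the degree of similarity is the inner product of the two vectors. Both choices, as well as all names and sample data, are assumptions made for explanation only.

```python
from collections import Counter


def vectorize(terms):
    """Map each retrieval key (axis) to its frequency (element value)."""
    return Counter(terms)


def similarity(condition, target):
    """Inner product over the retrieval keys shared by both vectors."""
    return sum(weight * target.get(key, 0) for key, weight in condition.items())


# Hypothetical retrieval targets and retrieval condition.
targets = {
    "A": vectorize(["news", "news", "weather", "politics"]),
    "B": vectorize(["sports", "news"]),
    "C": vectorize(["drama"]),
}
condition = vectorize(["news", "weather"])

scores = {name: similarity(condition, vec) for name, vec in targets.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Unlike the Boolean method, this sketch yields a graded score for each retrieval target, so that the targets can be ranked.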
However, in the vector space method, “n×m” number of retrieval processes are required, where the number of pieces of information functioning as the retrieval targets is n and the number of retrieval keys included in each piece of information is m; this causes the retrieval time to increase geometrically with the information amount, which is problematic. Accordingly, in the vector space method, the number of retrieval keys included in the retrieval target information has to be reduced before the retrieval process.
Methods of reducing the amount of retrieval key information include removing retrieval key information having small element values in a vector, cluster analysis, and principal component analysis.
The method of removing retrieval key information having small element values determines, when a retrieval index is created, which retrieval key information is to be removed on the basis of its element value. However, when a removed retrieval key is one of the keys included in the retrieval condition, retrieval using that key cannot be performed, and the retrieval accuracy therefore decreases, which is problematic.
Cluster analysis and principal component analysis are similar to each other; each is a method in which a plurality of pieces of retrieval key information that are all included in one piece of information and that are similar to each other in meaning and concept are put together into one piece of retrieval key information. For example, when there are terms (retrieval key information) such as “news”, “press”, and the like that are similar to each other, these terms are put together into one piece of retrieval key information (for example, “press”). However, cluster analysis and principal component analysis have a problem in that immense processing time is required for the calculation of putting similar terms together into one.
In addition to the above drawbacks of cluster analysis and principal component analysis, the vector space method has a further drawback: the statistical characteristics of the amount of retrieval key information included in the retrieval target information affect the retrieval accuracy.
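As a non-limiting illustration, the removal of retrieval key information having small element values, and the resulting loss of retrievability, may be sketched as follows. The threshold value, the function name, and the sample terms are hypothetical.

```python
def prune(vector, threshold):
    """Keep only the retrieval keys whose element value is at least the threshold."""
    return {key: weight for key, weight in vector.items() if weight >= threshold}


# Hypothetical vector of one program before index creation.
program = {"news": 5, "weather": 3, "volcano": 1}
index_entry = prune(program, threshold=2)

# A retrieval condition that uses the removed key can no longer match.
condition = {"volcano": 1}
score = sum(weight * index_entry.get(key, 0) for key, weight in condition.items())
print(index_entry, score)
```

In this sketch the program does contain the key "volcano", yet the score against the pruned index entry is zero, so the program is not retrieved.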
Generally, the amounts of retrieval key information in retrieval target information differ from each other, and information including a large amount of retrieval key information and information including a small amount of retrieval key information are included in the same group. In the vector space method, the larger the amount of retrieval key information included in retrieval targets, the more the retrieval targets tend to be ranked highly in the list of retrieval results, and the smaller the amount of retrieval key information included in retrieval targets, the more rarely the retrieval targets are retrieved.
However, a retrieval target that includes much retrieval key information is not always important information obtained as a retrieval result. When a user tries to retrieve information, it is only the information that the user wants that is “important information”, and the retrieval target including a large amount of retrieval key information in the vector space is not always information that is important to the user.
Actually, respective pieces of information on EPGs contain different amounts of information, and some programs have large amounts of information consisting of program names or detailed contents of the programs, while other programs have small amounts of information consisting only of program names. When a search is performed on a group including these programs, the programs having large amounts of information are ranked highly in the list of the retrieval result, and the programs having small amounts of information are not retrieved.
However, even a program whose EPG entry contains only the program name and no description of its contents, and which therefore has a small amount of information, can be a program that the user wants to have retrieved. Failing to retrieve such programs is a factor that decreases the retrieval accuracy.
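As a non-limiting illustration, the tendency described above may be sketched as follows, assuming un-normalized frequency vectors scored by an inner product. The program titles and terms are hypothetical.

```python
from collections import Counter


def score(condition, target):
    """Un-normalized inner product of two frequency vectors."""
    return sum(weight * target.get(key, 0) for key, weight in condition.items())


epg = {
    # Entry consisting only of the program name.
    "Sports News": Counter(["sports", "news"]),
    # Entry with a detailed description that repeats one key several times.
    "World Report": Counter(["news", "news", "news", "politics", "economy", "weather"]),
}
condition = Counter(["sports", "news"])

scores = {title: score(condition, vec) for title, vec in epg.items()}
print(scores)
```

In this sketch the name-only entry matches both retrieval keys but receives the lower score, because the detailed entry contains one of the keys more often.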
In order to solve this problem, some methods have been suggested such as cosine normalization in which variations in information are leveled by normalizing vectors (as is seen in, for example, “Information retrieval and language processing” (Patent Document 1) written by Kensin Tokunaga and published in 1999 by University of Tokyo Press) and pivoted normalization (as is seen in, for example, “Pivoted Document Length Normalization” (Patent Document 2) written by Amit Singhal, Chris Buckley, and Mandar Mitra, SIGIR 1996).
FIG. 1 is a block diagram showing a system for creating retrieval indexes for the above-described conventional techniques. As shown in FIG. 1, a retrieval-information acquisition unit 1 acquires, from an EPG 2 that is the information source, retrieval information that is the retrieval target.
Next, a retrieval-information vectorization unit 3 arranges the above-acquired retrieval information on a vector space 4 formed on an area in a memory unit, and vectorizes the retrieval information.
Then, a number-of-effective-elements reduction unit 5 determines retrieval keys to be removed by using the element value (weight as the retrieval key) of the retrieval information vectorized on the vector space 4. Thereafter, the number-of-effective-elements reduction unit 5 reduces the number of effective elements included in the retrieval information.
A normalization unit 6 normalizes, by using the cosine normalization or the pivoted normalization, the vector of the retrieval information whose number of effective elements has been adjusted. Thereby, the retrieval information is arranged on the vector space 4 as a normalized vector, and the retrieval index is obtained.
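As a non-limiting illustration, the flow of FIG. 1 may be sketched as follows. The function names, the whitespace tokenization, the minimum weight of 2, and the choice of cosine normalization are all assumptions made for explanation; the actual units 1, 3, 5, and 6 are not limited to this form.

```python
import math
from collections import Counter


def acquire(epg):
    """Unit 1 (sketch): obtain the terms of each EPG entry."""
    return {title: text.lower().split() for title, text in epg.items()}


def vectorize(terms):
    """Unit 3 (sketch): frequency of each retrieval key as its element value."""
    return dict(Counter(terms))


def reduce_elements(vector, min_weight):
    """Unit 5 (sketch): remove retrieval keys whose element value is small."""
    return {key: w for key, w in vector.items() if w >= min_weight}


def cosine_normalize(vector):
    """Unit 6 (sketch): scale the vector to unit length."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {key: w / length for key, w in vector.items()} if length else {}


def build_index(epg, min_weight=2):
    """Create the retrieval index from the information source."""
    return {
        title: cosine_normalize(reduce_elements(vectorize(terms), min_weight))
        for title, terms in acquire(epg).items()
    }


# Hypothetical EPG text for one program.
index = build_index({"Evening News": "news news weather weather sports"})
print(index)
```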
A function of retrieving broadcast programs that match users' preferences has to fulfill at least the three requirements described below.
The first requirement is that the function has to retrieve programs matching users' preferences with high accuracy. High accuracy as used herein means a high probability that the retrieval result includes the information that the user wants; in other words, a high relevance factor with respect to the user's preferences.
The second requirement is that the function has to speedily retrieve programs that users want from among the programs that are about to begin being broadcast. This function is carried out by understanding the current preferences of the user.
The third requirement is that the function must not burden users when retrieving programs. To avoid burdening users, the function has to retrieve programs automatically, without requiring users to make preparations for retrieval (such as the creation of indexes) or to set retrieval conditions.
However, as described above, there is a problem in which, when a system that retrieves information on the vector space employs a conventional method for reducing the amounts of retrieval key information included in the retrieval target, the retrieval accuracy decreases and immense processing time is required for the calculation.
Also, the above cosine normalization has a characteristic in which the smaller the amount of retrieval key information included in the information, the larger the weight (element value) of the retrieval key information becomes via the normalization. Accordingly, the smaller the amount of retrieval key information included in the information, the more that information tends to be ranked highly in the list of retrieval results regardless of whether or not that information is important for the user. This is also a factor causing a decrease in the retrieval accuracy.
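As a non-limiting illustration, this characteristic of cosine normalization may be sketched as follows, assuming that every retrieval key occurs once in each piece of information. The function name and sample terms are hypothetical.

```python
import math


def cosine_normalize(vector):
    """Divide each element value by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {key: w / length for key, w in vector.items()} if length else {}


# Information containing a single retrieval key (for example, a name-only entry).
title_only = cosine_normalize({"news": 1})
# Information containing four retrieval keys, each occurring once.
detailed = cosine_normalize({"news": 1, "weather": 1, "politics": 1, "economy": 1})

print(title_only["news"], detailed["news"])
```

In this sketch the same key "news", occurring once in both pieces of information, receives a weight of 1.0 in the single-key vector and 0.5 in the four-key vector.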
Pivoted normalization allows an appropriate leveling; however, it requires users to perform preliminary evaluation tests in order to adjust the parameters of slope and pivot from the set of parameters including slope, pivot, and old-normalization. This greatly burdens users, and is problematic.
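As a non-limiting illustration, the normalization factor of pivoted normalization, as formulated by Singhal et al., namely (1 - slope) x pivot + slope x old normalization, may be sketched as follows. The values of pivot and slope below are hypothetical; in practice they have to be tuned, which is the burden described above.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Pivoted normalization factor: (1 - slope) * pivot + slope * old_norm."""
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical parameter values, chosen only for illustration.
PIVOT = 2.0
SLOPE = 0.5

short_factor = pivoted_norm(1.0, PIVOT, SLOPE)  # old normalization below the pivot
long_factor = pivoted_norm(4.0, PIVOT, SLOPE)   # old normalization above the pivot
print(short_factor, long_factor)
```

In this sketch a piece of information whose old normalization factor is below the pivot is divided by a larger factor than before (1.5 instead of 1.0), and one above the pivot by a smaller factor (3.0 instead of 4.0); with a slope of 1 the old normalization is obtained unchanged.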
On the basis of the above discussions, it is concluded that none of the conventional techniques disclosed in Patent Documents 1 and 2 fulfill the above three requirements.