1. Field of the Invention
The present invention relates to a learning system used to learn learning data.
2. Description of the Related Art
Both active learning and knowledge discovery are known as methods of learning from learning data using a learning algorithm executed as a computer program. Active learning is a learning method in which effective data is actively shown, in addition to passive learning of the learning data, to improve precision. Methods using conventional active learning or selective sampling are disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 11-316754), in “Query by Committee” by Seung et al. (Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 1992, pp. 287-294), and in “Support Vector Machines for Active Learning in the Drug Discovery Process” by Warmuth et al. (Journal of Chemical Information and Computer Sciences, Vol. 43, No. 2, 2003, pp. 667-673). In the conventional active learning technique, the candidate data with the most data, that is, the largest number of data values or data points, is predicted to be data on a boundary and is selected. Knowledge discovery, in turn, is a technique in which re-learning is carried out based on given learning data. A method using conventional knowledge discovery is disclosed in Japanese Laid Open Patent Application (JP-P2001-229026). This conventional knowledge discovery also selects and uses the candidate data with the most data points.
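The conventional selection criterion described above can be illustrated with a minimal sketch. It assumes that the “data quantity” of a candidate is measured by the disagreement (vote entropy) among a committee of hypotheses, in the spirit of the cited “Query by Committee”. That reading, the function names, and the three threshold hypotheses are illustrative assumptions and are not taken from the cited references.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy (in bits) of the committee's label votes for one candidate."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_most_informative(candidates, committee):
    """Return the candidate on which the committee disagrees most."""
    return max(candidates, key=lambda x: vote_entropy([h(x) for h in committee]))

# Hypothetical committee: three threshold classifiers on a 1-D input.
committee = [lambda x: x > 0.3, lambda x: x > 0.5, lambda x: x > 0.7]
candidates = [0.1, 0.4, 0.9]
chosen = select_most_informative(candidates, committee)
```

In this sketch the candidate 0.4 is chosen, because it is the only candidate on which the committee members disagree; candidates far from every threshold are never selected.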
In the learning methods (active learning and knowledge discovery), a label consisting of a class or a function value is used. The class shows the existence or non-existence of a state of an event, and the function value shows a state of an event as a numerical value. With active learning, the label is set to or associated with the learning data and the selected candidate data, and with knowledge discovery, the label is set to or associated with the learning data and the candidate data.
When the learning data is learned, ensemble learning is sometimes carried out. Ensemble learning is a technique in which a plurality of hypotheses are generated from given learning data and then integrated, in order to improve the precision of the passive learning. “Bagging” and “boosting” are known as typical ensemble learning techniques, and are disclosed in “Bagging Predictors” by Breiman (Machine Learning, Vol. 24, No. 2, 1996, pp. 123-140) and in “A decision-theoretic generalization of on-line learning and an application to boosting” by Freund and Schapire (Proceedings of the Second European Conference on Computational Learning Theory, 1995, pp. 23-37).
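As a rough illustration of ensemble learning, the following sketch generates a plurality of hypotheses from bootstrap resamples of the learning data and integrates them by majority vote, which is the general idea of bagging. The one-dimensional threshold learner, the function names, and the sample data are assumptions made only for this example.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold classifier at the midpoint of the two class means."""
    pos = [x for x, y in sample if y]
    neg = [x for x, y in sample if not y]
    if not pos or not neg:
        # Degenerate resample with a single class: always predict that class.
        only = bool(pos)
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > threshold

def bagging(data, n_hypotheses, rng):
    """Generate hypotheses from bootstrap resamples of the learning data."""
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_hypotheses)]

def predict(hypotheses, x):
    """Integrate the hypotheses by majority vote."""
    return Counter(h(x) for h in hypotheses).most_common(1)[0][0]

# Hypothetical learning data: (input value, class).
data = [(0.1, False), (0.2, False), (0.3, False), (0.7, True), (0.8, True), (0.9, True)]
ensemble = bagging(data, n_hypotheses=11, rng=random.Random(0))
```

An odd number of hypotheses is used here so that a two-class majority vote cannot tie.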
When the learning data is learned, an attribute selection is sometimes carried out. The attribute selection is a technique in which the attribute of the learning data is selected to reduce the dimension of the learning data so that precision is improved. The attribute selection is disclosed in “The Random Subspace Method for Constructing Decision Forests” by Ho, (IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, No. 8, 1998, pp. 832-844). In this conventional technique, whether each attribute should be selected is determined based on a predetermined probability so that the number of attributes is reduced.
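A minimal sketch of attribute selection with a predetermined probability follows. Each attribute is kept independently with the same fixed probability, and each row of the learning data is then reduced to the kept attributes. The function names and the attribute count of 1000 are hypothetical.

```python
import random

def select_attributes(n_attributes, probability, rng):
    """Keep each attribute index independently with the given fixed probability."""
    return [i for i in range(n_attributes) if rng.random() < probability]

def project(row, selected):
    """Reduce one row of learning data to the selected attributes."""
    return [row[i] for i in selected]

rng = random.Random(0)
selected = select_attributes(n_attributes=1000, probability=0.5, rng=rng)
```

For example, `project([10, 20, 30], [0, 2])` returns `[10, 30]`. Note that the probability is fixed before learning, so the sketch has no way to prefer useful attributes over unnecessary ones.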
However, there are the following problems in the conventional techniques in which the learning data is learned.
The first problem is that a desired result cannot be predicted. For example, the learning (active learning and knowledge discovery) cannot be carried out efficiently when the candidate data is unbalanced data or data of different values. Unbalanced data is data whose distribution deviates from a uniform distribution with respect to the label (class or function value); for example, the number of data differs between the classes. Data of different values is data for which the cost of determining the label (class or function value) differs, or for which data with a specific label (class or function value) is valuable. Data of different values is used for extracting high-value data.
The reason for the first problem is as follows. When the candidate data with the most data quantity is selected, candidate data with a lesser data quantity is not selected. Therefore, the various types of candidate data belonging to each label (class or function value) of the learning data cannot be obtained, and it becomes difficult to learn based on the learning algorithm. To prevent this situation, a cost could be set in the learning algorithm, or the candidate data could be increased or decreased. However, it is difficult to determine an appropriate cost, or an appropriate increase or decrease of the candidate data, before the learning. When the values of the candidate data differ, a learning system is needed that can predict a desired result by selecting the candidate data of the highest value.
The second problem is that the prediction precision is not stable. The reason for the second problem is that the criterion for selecting the candidate data is determined in advance (i.e., the candidate data with the most data values is selected). Therefore, the selected candidate data is biased, and no hypothesis is generated that reflects the candidate data as a whole.
The third problem is that, when attribute selection is carried out for the above-mentioned ensemble learning, the prediction precision is not improved unless an appropriate selection probability is determined in advance. The reason for the third problem is as follows. When the probability of the attribute selection is predetermined to be 0.5, approximately half of the attributes are selected. In actuality, however, when there are many unnecessary attributes, the appropriate selection probability is not 0.5. It is difficult to determine the appropriate number of attributes before the learning. Also, if the number of attributes (the selection probability) is determined by calculation during learning, the calculation takes a long time and the calculation cost becomes enormous.
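The 0.5 case above can be made concrete with a short calculation. It assumes, as an illustration only, that each of d attributes is selected independently with the same probability p, that K is the number of selected attributes, and that m of the d attributes are actually necessary; the symbols d, p, K, and m are introduced here and do not appear in the cited reference.

```latex
K \sim \mathrm{Binomial}(d, p), \qquad
\mathbb{E}[K] = p\,d, \qquad
\mathrm{Var}[K] = d\,p\,(1 - p)

\mathbb{E}[K_{\mathrm{necessary}}] = p\,m, \qquad
\mathbb{E}[K_{\mathrm{unnecessary}}] = p\,(d - m)
```

With p = 0.5 the expected number of selected attributes is d/2. Under this assumption the expected proportion of unnecessary attributes among the selected ones remains (d - m)/d whatever the value of p, so a fixed probability reduces the dimension without favoring the necessary attributes.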
The above-mentioned problems will be described using the screening of a new medicine as an example. In the screening of the new medicine, for example, a class shows the existence or non-existence of the combination of some compound, as the existence or non-existence of a state of some phenomenon. A function value shows the combination ability of a compound as a numerical value, as the size of a state of some phenomenon.
In the screening of the new medicine, the number of active compounds (positive examples) is very small compared with the number of inactive compounds (negative examples). Therefore, if the candidate data with the most data quantity is selected, many inactive compounds are selected and various active compounds are not selected, as illustrated in the above second problem.
Also, in order to discover the active compound as a medicine candidate through experimentation, it is important to discover the active compound in the fewest possible experiments. With the criterion of the most data quantity, it is not possible to learn efficiently (for example, using either active learning or knowledge discovery), as indicated in the above first problem. Therefore, there are cases in which a better hypothesis (one that predicts a desired result) cannot be generated.
In conjunction with the above description, a fuzzy knowledge acquiring apparatus is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 5-257694). In the fuzzy knowledge acquiring apparatus of this conventional example, classification rules with fuzziness are acquired as determination trees expressed with nodes and links from a training case group in which many training cases containing assigned attributes and classification results are collected and a fuzzy set which expresses the attributes is calculated. An initial node producing section adds a conviction degree to each of the training cases given from the training case group and outputs all of them as one node. A process stack stores the node from the initial node producing section. A node taking-out section takes out, from the nodes stored in the process stack, a non-processed node that can still be divided, as the node to be processed. A node evaluation section determines whether the removed node should be divided, and separates this removed node into a part of the node not to be divided and a part of the node to be divided. An end stack stores the node part not to be divided from the node evaluation section.
A division attribute selection section takes the node part to be divided from the node evaluation section and determines, based on the attribute values written in the training cases contained in the removed node, whether each attribute contains a numerical value and a fuzzy value. If the attribute does not contain a numerical value and fuzziness as the attribute value, the section calculates the mutual data quantity obtained when the node part is divided with the attribute, based on a summation of the conviction degrees of the training cases which are coincident with the attribute values as items. If the attribute contains a numerical value and fuzziness as the attribute value, the section calculates the mutual data quantity obtained when the node part is classified with this attribute, based on a belonging degree to the item. The section then selects the attribute giving the maximum mutual data quantity. A node dividing section stores the removed node which is selected by the division attribute selection section in the end stack, generates a new node as a fuzzy subset based on the item of the selected attribute, and stores it in the process stack. A determination tree generating section generates a diagram showing classification codes based on the node stored in the end stack.
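The attribute selection step for attributes without numerical or fuzzy values can be sketched as follows. The sketch assumes that the “mutual data quantity” corresponds to the usual information gain of a decision tree, computed with each training case counted by its conviction degree. This reading, the function names, and the sample cases are assumptions, and the branch that uses belonging degrees for fuzzy values is not covered.

```python
import math
from collections import defaultdict

def weighted_entropy(cases):
    """Entropy of the class, counting each case by its conviction degree."""
    totals = defaultdict(float)
    for _, label, weight in cases:
        totals[label] += weight
    total = sum(totals.values())
    return -sum((w / total) * math.log2(w / total) for w in totals.values() if w > 0)

def information_gain(cases, attribute):
    """Entropy reduction obtained by dividing the node on one crisp attribute."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[0][attribute]].append(case)
    total = sum(weight for _, _, weight in cases)
    remainder = sum(
        sum(w for _, _, w in group) / total * weighted_entropy(group)
        for group in groups.values()
    )
    return weighted_entropy(cases) - remainder

def best_attribute(cases, attributes):
    """Select the attribute giving the maximum gain."""
    return max(attributes, key=lambda a: information_gain(cases, a))

# Hypothetical training cases: (attribute values, class, conviction degree).
cases = [
    ({"colour": "red", "size": "big"}, "yes", 1.0),
    ({"colour": "red", "size": "small"}, "yes", 1.0),
    ({"colour": "blue", "size": "big"}, "no", 1.0),
    ({"colour": "blue", "size": "small"}, "no", 1.0),
]
chosen = best_attribute(cases, ["colour", "size"])
```

Here the attribute `colour` separates the two classes completely while `size` does not separate them at all, so `colour` is selected.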
A pattern recognition apparatus is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 6-180773). In this conventional example, the pattern recognition apparatus recognizes a character pattern or an image pattern inputted to an input section. A pattern analysis section analyzes the similarity of a learning pattern to the input pattern inputted to the input section. A hypothesis pattern generating section generates a pattern near the input pattern inputted to the input section as a hypothesis. A comparing section compares the input pattern and the hypothesis pattern generated by the hypothesis pattern generating section. The comparison result determined by the comparing section and an analysis result determined by the pattern analysis section are described in an attribute description area. An estimating section refers to the description of the input pattern in the attribute description area and, based on the referenced description, carries out control to change an operation of the pattern analysis section, to start the hypothesis pattern generating section, and to produce a final determination result from the final contents of the attribute description area. The estimating section then sends the result to another apparatus.
An inference method is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 8-129488). In this conventional example, the inference method is executed using a set of rules, each having a condition section, in which a condition necessary for inference is generally described, and a conclusion section, in which a conclusion corresponding to the condition is described. In a rule extraction process, the rule having the condition matching given input data is extracted. The case corresponding to the extracted rule is extracted in a case extraction process. The similarity with the input data is evaluated based on the extracted rule and the extracted case in an evaluation process. Whichever of the rule and the case has the higher degree of similarity with the input data is selected in a selection process. An inferring operation is executed based on the result section of the selected case or the conclusion section of the selected rule.
A data mining apparatus is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 9-297686). In this conventional example, the data mining apparatus contains a correlation rule generation section which generates a correlation rule between attributes of data in a database based on selection criterion data in which a support degree and a conviction degree as criterion for selection of correlation rules are stored. A job knowledge base is a set of correlation rules whose effectiveness is previously confirmed. A hypothesis correlation rule generation section generates hypothesis correlation rules as hypotheses of the correlation rules from the job knowledge base and the correlation rules generated by the correlation rule generation section. A hypothesis correlation rule verification section checks a probability that the data of the database matches the hypothesis correlation rule generated by the hypothesis correlation rule generation section, and adopts the hypothesis correlation rule, if its probability is above the conviction degree of the selection criterion data, as a supplemental correlation rule.
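The verification step described above can be sketched with the usual support and confidence measures of correlation (association) rules. The sketch assumes that the checked “probability” is the fraction of matching records in which the conclusion also holds; the function names, the record format, and the sample data are hypothetical.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(r) for r in records) / len(records)

def confidence(records, antecedent, consequent):
    """Fraction of records matching the antecedent that also match the consequent."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)

def adopt_rules(records, hypotheses, min_confidence):
    """Adopt the hypothesis rules whose confidence is above the criterion."""
    return [
        (a, c) for a, c in hypotheses
        if support(records, a) > 0 and confidence(records, a, c) > min_confidence
    ]

# Hypothetical database records and hypothesis rules (antecedent, consequent).
records = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
hypotheses = [(("a",), ("b",)), (("c",), ("b",))]
adopted = adopt_rules(records, hypotheses, min_confidence=0.8)
```

In this example the rule from `c` to `b` holds in every matching record and is adopted, while the rule from `a` to `b` holds in only two of three matching records and is rejected.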
A combination optimizing method is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 10-293756). In this conventional example, when a search is carried out based on a simulated annealing method, various evaluation values to each adjacent solution generating function which is used for the search are stored in an adjacent solution selection function evaluation value storage section provided on a storage device. The evaluation values are updated one after another, according to the progression of the search. Thus, the evaluation values of each function can be dynamically provided from the initial step of the search to the final step. By using these values, the selection probability of each function is determined, and a search method can be selected according to the state of progress of the search.
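The dynamic selection of adjacent solution generating functions can be sketched as follows. The sketch assumes that each function has one positive evaluation value, that the value is updated as a running average of a reward observed during the search, and that the selection probability is proportional to the value. These choices, the function names, and the two neighbor function names are assumptions and are not taken from the cited application.

```python
import random

def selection_probabilities(scores):
    """Turn positive evaluation values into selection probabilities."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}

def choose_function(scores, rng):
    """Pick one neighbor-generating function in proportion to its evaluation value."""
    names = list(scores)
    return rng.choices(names, weights=[scores[n] for n in names], k=1)[0]

def update_evaluation(scores, name, reward, rate=0.1):
    """Move the stored evaluation value toward the latest observed reward."""
    scores[name] = (1 - rate) * scores[name] + rate * reward

# Hypothetical evaluation value storage for two neighbor-generating functions.
scores = {"swap": 1.0, "shift": 1.0}
initial = selection_probabilities(scores)
update_evaluation(scores, "swap", reward=3.0)
picked = choose_function(scores, random.Random(0))
```

After the update, `swap` has a larger evaluation value than `shift` and is therefore selected more often as the search progresses.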
An attribute selector is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 11-259447). In this conventional example, an object is expressed with a string of attribute values to attributes expressed by symbols. A plurality of cases are given in such a way that a special attribute, referred to as a class, and a discrete value as a label which is an attribute value of the special attribute, are allocated to the string of attribute values. A part of the string of attribute/attribute values is selected to generate a new case in a recursive concept learning apparatus which requires the description to define a necessary and sufficient condition that the class is a specific label. A separating section separates, into two subsets, a plurality of cases to be given to the recursive concept learning apparatus. A first section calculates a conditioned probability that the class is the specific label when an attribute takes a specific attribute value in each subset. A second section calculates a difference between the probabilities through a weighted average based on respective generation frequency for each of the label and the attribute value. A determination section determines the usefulness of the attribute based on the weight-averaged values of the attribute values of the attributes.
An experiment planning method is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 11-316754). In this conventional example, an inference is carried out based on input/output data obtained through experimentation to determine a correct or approximate “function relation” between input and output. At the time of selection of input/output points of the next experiment, input/output data is learned by a lower level algorithm as training data by using a predetermined learning algorithm to infer a function relation from past input/output data using a predetermined expression form as the lower level algorithm, in a learning process. The learning precision in the learning process is improved by a boosting technique in a boosting process. Function values for input candidate points randomly generated are predicted by using a final hypothesis obtained as an average with weighting of a plurality of hypotheses in the boosting process. An input candidate point that has the smallest difference between a weighted summation of output values in which a summation of weights is maximum, and a weighted summation of output values in which a summation of weights is next maximum, is selected as the input point.
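The selection of the next input point can be sketched as a margin computation over weighted hypotheses. For each candidate, the weights of the hypotheses are summed per predicted output value, and the candidate with the smallest gap between the largest and the second largest sums is chosen. The function names, the weights, and the threshold hypotheses are hypothetical, and the sketch simplifies the cited method to discrete output values.

```python
from collections import defaultdict

def vote_margin(x, hypotheses):
    """Gap between the largest and second-largest weighted vote totals for x."""
    totals = defaultdict(float)
    for predict, weight in hypotheses:
        totals[predict(x)] += weight
    ranked = sorted(totals.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up

def select_input_point(candidates, hypotheses):
    """Choose the candidate whose weighted vote is closest to a tie."""
    return min(candidates, key=lambda x: vote_margin(x, hypotheses))

# Hypothetical boosted hypotheses: (prediction function, weight).
hypotheses = [
    (lambda x: x > 0.3, 0.5),
    (lambda x: x > 0.5, 0.3),
    (lambda x: x > 0.7, 0.2),
]
candidates = [0.1, 0.4, 0.6, 0.9]
chosen = select_input_point(candidates, hypotheses)
```

In this example the candidate 0.4 is chosen, because the weighted votes for the two outputs are equal at that point.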
A method of knowledge discovery is disclosed in Japanese Laid Open Patent Application (JP-2000-353095A). In this conventional example, knowledge having a condition section, which consists of a series of tests, and a conclusion section is extracted from a set of cases. The precision of each piece of knowledge is predicted based on the precision of the knowledge whose condition section has a part of the series of tests removed, and on the precision of the knowledge whose condition section has only that part of the series of tests. The unexpected situation level of the knowledge is evaluated based on the difference between the actual precision and the predicted precision. Only the knowledge whose unexpected situation level is higher than a predetermined level is shown.
A knowledge discovery system is disclosed in Japanese Laid Open Patent Application (JP-2001-229026A). In this conventional example, high level knowledge is extracted by loading data sampled from a database, stored in large-scale data storage, into a main memory of a computer. An input section inputs a learning algorithm and data to be learned by the learning algorithm. A learning section controls the learning algorithm to learn, as training data, a plurality of partial data sets produced through sampling from the data stored in the main memory, to obtain a plurality of hypotheses. A data selection section predicts function values for a plurality of sample candidate points read out from the large-scale data storage by using the plurality of hypotheses, estimates data quantities of the sample candidate points based on the predicted values, selects one or more candidate points with larger data quantities, and adds the selected points to the main memory. A control section controls the learning section and the data selection section to repeat the selective data storage and knowledge discovery until a predetermined stop condition is met, and stores a plurality of resultant hypotheses in the main memory as final hypotheses. A predicting section predicts a label value of data with an unknown label inputted to the input section through an average, or an average with weighting, of the plurality of hypotheses.
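The data selection and prediction steps can be sketched as follows. The sketch assumes that the “data quantity” of a candidate point is estimated by the variance of the function values predicted by the hypotheses, so that points on which the hypotheses disagree are loaded first. This estimate, the function names, and the three linear hypotheses are assumptions made for illustration.

```python
import statistics

def disagreement(x, hypotheses):
    """Variance of the function values that the hypotheses predict for x."""
    return statistics.pvariance([h(x) for h in hypotheses])

def select_candidates(candidates, hypotheses, k):
    """Select the k candidate points with the largest estimated data quantity."""
    ranked = sorted(candidates, key=lambda x: disagreement(x, hypotheses), reverse=True)
    return ranked[:k]

def predict(x, hypotheses, weights=None):
    """Predict a label value as a plain or weighted average of the hypotheses."""
    weights = weights or [1.0] * len(hypotheses)
    return sum(w * h(x) for h, w in zip(hypotheses, weights)) / sum(weights)

# Hypothetical hypotheses learned from different partial data sets.
hypotheses = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
selected = select_candidates([0.0, 1.0, 2.0], hypotheses, k=1)
```

In this example the point 2.0 is selected, since the three hypotheses differ most there, and `predict(1.0, hypotheses)` gives the plain average 2.0.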