1. Field of the Invention
The present invention relates to a data extracting method for extracting information useful for marketing and the like from an enormous volume of records to be processed, such as the utilization history of credit cards.
2. Description of the Related Art
Data mining has attracted attention as a technique for extracting knowledge from a large-scale database. Various data mining techniques have been proposed, such as decision trees, neural networks, association rule discovery, and clustering. Such techniques are expected to extract characteristic knowledge hidden in the database, which can then be applied to various fields such as marketing.
Retailers such as supermarkets perform customer management using cards. When a card is used, sales information, such as which customer has bought which item, can be obtained electronically. Analyzing the sales information reveals properties of the customers and the items, and the results can be utilized effectively as marketing information. A clustering technique is applied in such a situation to cluster customers having similar purchase tendencies.
An example of clustering will be described briefly. In the example, a certain retailer deals in three items (x, y, z), and the sales information of four customers A, B, C, D, each of whom has bought at least one of the items, is used for clustering.
It is assumed that the customer A buys the item x, the customer B buys the items y, z, the customer C buys the item y, and the customer D buys the items x, z.
A three-dimensional space is assumed for the three items, each dimension of the space corresponding to one of the items. The four customers are represented by points in the space, and the customers are clustered. Specifically, the value in each dimension is binary, “0” or “1”. The value is set to “1” when a customer has bought the corresponding item, and to “0” when the customer has not bought the item.
In such an x-y-z space, the coordinate of the customer A is (1, 0, 0), the coordinate of the customer B is (0, 1, 1), the coordinate of the customer C is (0, 1, 0), and the coordinate of the customer D is (1, 0, 1). When the distance between customers is represented by the Hamming distance between their three-dimensional vectors, the distance between customers having similar tendencies to buy items is small, and such customers form a cluster. For example, the Hamming distance between the customers B and C is “1”, and the Hamming distance between the customers A and D is “1”. Since these distances are smaller than those of the other combinations, two clusters (B, C) and (A, D) are formed.
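The distances in the example above can be checked with a short sketch. The code below is only an illustration of the Hamming-distance comparison, not a clustering method taken from any particular system; the names `purchases`, `hamming`, and `closest_pairs` are introduced here for illustration.

```python
from itertools import combinations

# Purchase vectors over the items (x, y, z), taken from the example above.
purchases = {
    "A": (1, 0, 0),
    "B": (0, 1, 1),
    "C": (0, 1, 0),
    "D": (1, 0, 1),
}


def hamming(u, v):
    """Count the positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Hamming distance for every pair of customers.
distances = {
    (p, q): hamming(purchases[p], purchases[q])
    for p, q in combinations(sorted(purchases), 2)
}

# In this small example, the pairs at the minimum distance are the clusters.
min_distance = min(distances.values())
closest_pairs = sorted(pair for pair, d in distances.items() if d == min_distance)

for pair, d in sorted(distances.items()):
    print(pair, d)
print("closest pairs:", closest_pairs)
```

Grouping the pairs at the minimum distance is sufficient here only because the example is so small; a practical system would use a general clustering algorithm.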
Conversely, when a four-dimensional space is assumed for the four customers, each dimension of the space corresponding to one of the customers, the three items are represented by points in the space. The items can then be clustered in the same manner as described above.
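The item-side view amounts to transposing the customer-by-item table. The following sketch, using the same purchase data as the example and the illustrative names `customer_vectors`, `item_vectors`, and `hamming`, shows one way the item coordinates in the customer space could be derived.

```python
customers = ["A", "B", "C", "D"]
items = ["x", "y", "z"]

# Rows are customers A to D; columns are items x, y, z (from the example).
customer_vectors = [
    (1, 0, 0),  # A buys x
    (0, 1, 1),  # B buys y, z
    (0, 1, 0),  # C buys y
    (1, 0, 1),  # D buys x, z
]

# Transposing the table gives each item a coordinate in the
# four-dimensional space whose axes are the customers A, B, C, D.
item_vectors = dict(zip(items, zip(*customer_vectors)))


def hamming(u, v):
    """Count the positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


for name, vector in item_vectors.items():
    print(name, vector)
print("x-y:", hamming(item_vectors["x"], item_vectors["y"]))
print("x-z:", hamming(item_vectors["x"], item_vectors["z"]))
print("y-z:", hamming(item_vectors["y"], item_vectors["z"]))
```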
There is originally no concept of order or distance in customer data of the customers A to D. Moreover, there is originally no concept of order or distance in item data of the items x to z.
Therefore, before the clustering is performed, it is necessary to select from the sales information a portion having a relation which can be assumed beforehand, such as the sales information corresponding to an item type or an age group of customers. For example, item data classified as audio items are selected from the total item data, and the customers are clustered in a space of, for example, speakers, CD players, cassette players and so on, each dimension of the space corresponding to one of the audio items. Likewise, to cluster the items which high school students prefer to buy, only the high school students are selected beforehand from all of the customer data, and the items are clustered in the space of the high school students, each dimension of the space corresponding to one of the high school students.
In clustering, when the customers A and B both often buy audio items but their purchasing habits (buying tendencies) differ completely for other types of items, the distance between the customers A and B increases, and in some cases the customers A and B do not belong to the same cluster. Similarly, when items X and Y are both often bought by high school students but are bought differently from each other by the other age groups of customers, the items X and Y do not belong to the same cluster in some cases.
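The effect described above can be illustrated with a small sketch that compares the distance between two customers in the full item space with their distance in the audio-only subspace. All of the data below (the item catalogue, the categories, and the purchase vectors) are hypothetical values invented for this illustration, and `project` is an illustrative helper name.

```python
# Hypothetical item catalogue: item name -> category (illustrative only).
item_category = {
    "speaker": "audio",
    "cd_player": "audio",
    "cassette_player": "audio",
    "rice": "food",
    "milk": "food",
    "shirt": "clothing",
    "shoes": "clothing",
}
items = list(item_category)

# Hypothetical purchase vectors, one entry per item in `items` order.
customer_a = (1, 1, 1, 1, 1, 0, 0)
customer_b = (1, 1, 0, 0, 0, 1, 1)


def hamming(u, v):
    """Count the positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def project(vector, category):
    """Keep only the dimensions whose item belongs to `category`."""
    return tuple(
        value for value, item in zip(vector, items)
        if item_category[item] == category
    )


# Distance over all items versus distance over the audio items only.
full_distance = hamming(customer_a, customer_b)
audio_distance = hamming(project(customer_a, "audio"), project(customer_b, "audio"))

print("full space:", full_distance)
print("audio subspace:", audio_distance)
```

With these assumed values the two customers are close in the audio subspace but far apart in the full space, which is why the selection step has to be made before clustering.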
When a relation between data can be clearly defined like the relation between an item category and a customer age group, clustering is effective. However, when the relation between the data cannot be defined beforehand, there is a problem that clustering cannot be applied.
Moreover, when the number of dimensions of the space in which the clustering is performed is enormous, it is disadvantageously difficult to extract similarities in the purchasing habits of the customers regarding an item group belonging to a specific category, or similarities in how the items are purchased by a specific customer group.