The conventional techniques classify users of a website as members or non-members based on whether the users have registered at the website. When a member registers at the website, the website may require the member to submit user attribute data, such as age, registration date, gender, location, registration source, industry, etc. The website stores the user attribute data in a database in association with an identification of the member. Generally, a single record stores the attribute data for the various user attributes of the member, as shown in Table 1.
TABLE 1
User Identification    Age    Registration Date    Gender    Location
Member A               29     Nov. 17, 2011        Female    Beijing
Member B               36     May 1, 2010          Male      Shanghai
Member C               19     Mar. 5, 2009         Female    Tianjin
In Table 1, each row represents a record. Each field of the record stores attribute data that the member submits for one user attribute. For example, the “age” field in each record stores attribute data that the members submit for the user attribute “age.”
As there may be huge differences among the attribute data that the members submit for their user attributes, the website may classify the members based on the attribute data of the user attributes. Generally, the members are classified into two classifications. One is a main classification and the other is a secondary classification. For example, the members may be classified as active members and non-active members. The active members are the main classification and the non-active members are the secondary classification. Corresponding services may then be provided to the members based on their classifications.
The conventional techniques, when classifying the members, obtain multiple attribute data intervals for each user attribute based on a large volume of attribute data of already-classified members. For example, the “age” user attribute may have three attribute data intervals, such as [10, 20], (20, 40], (40, 60]. The “location” user attribute may have four attribute data intervals, such as {Beijing, Shanghai, Tianjin, Chongqing}, {Hebei, Henan, Shanxi}, {Fujian, Jiangxi, Zhejiang}, and {Anhui, Gansu, Shandong}. The “registration date” user attribute may have three attribute data intervals, such as [1 Jan. 2001, 31 Dec. 2005], (1 Jan. 2006, 31 Dec. 2010], (1 Jan. 2011, 31 Dec. 2015]. After the multiple attribute data intervals for each user attribute are obtained, a Boolean characteristic is assigned to each attribute data interval of each user attribute. Each Boolean characteristic has a unique characteristic identification.
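The mapping from attribute data to Boolean characteristics can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the interval boundaries are the examples given above, while the characteristic identifications (such as "age_1"), the field names, and the helper functions are hypothetical and do not come from the conventional techniques themselves.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass(frozen=True)
class BooleanCharacteristic:
    char_id: str                       # hypothetical unique characteristic identification
    attribute: str                     # user attribute the interval belongs to
    contains: Callable[[object], bool]  # membership test for the interval


def ordered_interval(low, high, low_closed=False):
    """Membership test for (low, high], or [low, high] when low_closed is True."""
    def contains(value):
        above = value >= low if low_closed else value > low
        return above and value <= high
    return contains


def value_set(*members):
    """Membership test for a set of categorical values."""
    allowed = frozenset(members)
    return lambda value: value in allowed


# Intervals copied from the examples in the text; identifiers are made up.
CHARACTERISTICS: List[BooleanCharacteristic] = [
    BooleanCharacteristic("age_1", "age", ordered_interval(10, 20, low_closed=True)),
    BooleanCharacteristic("age_2", "age", ordered_interval(20, 40)),
    BooleanCharacteristic("age_3", "age", ordered_interval(40, 60)),
    BooleanCharacteristic("loc_1", "location",
                          value_set("Beijing", "Shanghai", "Tianjin", "Chongqing")),
    BooleanCharacteristic("loc_2", "location", value_set("Hebei", "Henan", "Shanxi")),
    BooleanCharacteristic("loc_3", "location", value_set("Fujian", "Jiangxi", "Zhejiang")),
    BooleanCharacteristic("loc_4", "location", value_set("Anhui", "Gansu", "Shandong")),
    BooleanCharacteristic("reg_1", "registration_date",
                          ordered_interval(date(2001, 1, 1), date(2005, 12, 31), low_closed=True)),
    BooleanCharacteristic("reg_2", "registration_date",
                          ordered_interval(date(2006, 1, 1), date(2010, 12, 31))),
    BooleanCharacteristic("reg_3", "registration_date",
                          ordered_interval(date(2011, 1, 1), date(2015, 12, 31))),
]


def active_characteristic_ids(record: dict) -> List[str]:
    """Return the identifications of the Boolean characteristics whose value is 1."""
    ids = []
    for characteristic in CHARACTERISTICS:
        value = record.get(characteristic.attribute)
        if value is not None and characteristic.contains(value):
            ids.append(characteristic.char_id)
    return ids


member_a = {"age": 29, "registration_date": date(2011, 11, 17), "location": "Beijing"}
print(active_characteristic_ids(member_a))
```

For Member A of Table 1, the sketch selects one characteristic per user attribute: the (20, 40] age interval, the first location set, and the third registration date interval.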
The conventional techniques, when classifying a member to be classified (referred to as the member for classification), possibly in real time, determine, for each user attribute, the attribute data interval into which the attribute data of the member for classification falls. The Boolean characteristic corresponding to the determined attribute data interval is assigned the value 1, and the characteristic identification of each Boolean characteristic whose value is 1 is stored. After the characteristic identification of the corresponding Boolean characteristic has been extracted for each user attribute, a probability that the member for classification belongs to the main classification is calculated based on the weight value of each of the Boolean characteristics. If the probability is higher than 50%, the member for classification is classified into the main classification. If the probability is not higher than 50%, the member for classification is classified into the secondary classification.
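The text does not state how the probability is derived from the weight values. One common choice, assumed here purely for illustration, is a logistic model in which the weights of the Boolean characteristics whose value is 1 are summed and passed through a sigmoid function. The weight values, the bias term, and the characteristic identifications in the following Python sketch are hypothetical.

```python
import math
from typing import Dict, Iterable


def main_class_probability(active_ids: Iterable[str],
                           weights: Dict[str, float],
                           bias: float = 0.0) -> float:
    """Assumed logistic model: sigmoid of the summed weights of active characteristics."""
    score = bias + sum(weights.get(char_id, 0.0) for char_id in active_ids)
    return 1.0 / (1.0 + math.exp(-score))


def classify(active_ids: Iterable[str],
             weights: Dict[str, float],
             bias: float = 0.0) -> str:
    """Apply the 50% threshold described in the text."""
    probability = main_class_probability(active_ids, weights, bias)
    return "main" if probability > 0.5 else "secondary"


# Hypothetical weights for three Boolean characteristics.
example_weights = {"age_2": 0.8, "loc_1": 0.3, "reg_3": -0.4}
print(classify(["age_2", "loc_1", "reg_3"], example_weights))
```

With these made-up weights the summed score is positive, so the probability exceeds 50% and the member falls into the main classification. A member with no active characteristic and a zero bias has a probability of exactly 50% and therefore falls into the secondary classification.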
FIG. 1 illustrates a flowchart of an example method of determining an attribute data interval of a user attribute in accordance with the conventional techniques.
At 102, a large volume of attribute data of classified members is extracted as training data. At 104, for each user attribute, each attribute data corresponding to the user attribute is treated as a separate attribute data interval. At 106, based on a Maximum A Posteriori (MAP) Bayes estimate rule, an evaluation value of the attribute data intervals obtained at 104 is calculated. At 108, adjacent intervals are merged to obtain multiple attribute data intervals and another evaluation value of the merged attribute data intervals is calculated.
At 110, if the evaluation value obtained at 106 is smaller than the evaluation value obtained at 108, the attribute data intervals obtained at 104 are determined as the final attribute data intervals of the user attribute.
At 112, if the evaluation value obtained at 106 is greater than or equal to the evaluation value obtained at 108, the attribute data intervals obtained at 104 are retained and adjacent intervals continue to be merged until a division into attribute data intervals with the smallest evaluation value is reached. The attribute data intervals of the division with the smallest evaluation value are determined as the final attribute data intervals of the user attribute.
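The flow of 104 through 112 can be sketched as a greedy bottom-up merge in Python. The sketch makes several assumptions: the MAP Bayes evaluation value is not specified in the text, so a simple stand-in cost (a fixed penalty per interval plus the label entropy inside each interval) is used in its place, and the function names, the penalty, and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def stand_in_cost(label_groups: Sequence[Sequence[str]], penalty: float = 1.0) -> float:
    """Stand-in evaluation value (NOT the MAP Bayes criterion): smaller is better."""
    cost = penalty * len(label_groups)
    for labels in label_groups:
        total = len(labels)
        for count in Counter(labels).values():
            cost -= count * math.log(count / total)  # entropy term, 0 for pure intervals
    return cost


def merge_intervals(samples: Sequence[Tuple[float, str]],
                    cost_fn: Callable = stand_in_cost):
    """Start with one interval per distinct value, then merge adjacent intervals."""
    by_value = {}
    for value, label in samples:
        by_value.setdefault(value, []).append(label)
    # 104: each distinct attribute value is its own interval (low, high, labels).
    intervals = [(v, v, by_value[v]) for v in sorted(by_value)]
    # 106: evaluation value of the initial intervals.
    best = cost_fn([labels for _, _, labels in intervals])
    while len(intervals) > 1:
        # 108: try every merge of two adjacent intervals and keep the cheapest.
        candidates = []
        for i in range(len(intervals) - 1):
            low, _, left = intervals[i]
            _, high, right = intervals[i + 1]
            merged = intervals[:i] + [(low, high, left + right)] + intervals[i + 2:]
            candidates.append((cost_fn([labels for _, _, labels in merged]), i, merged))
        value, _, merged = min(candidates, key=lambda c: (c[0], c[1]))
        if value > best:
            break  # 110: the unmerged intervals evaluate smaller, so they are final
        best, intervals = value, merged  # 112: keep merging
    return [(low, high) for low, high, _ in intervals], best


training = [(18, "active"), (19, "active"), (20, "active"),
            (45, "inactive"), (50, "inactive")]
final_intervals, final_cost = merge_intervals(training)
print(final_intervals, final_cost)
```

On this toy training data, merging stops once the active ages and the inactive ages each form one interval, because merging the two remaining intervals would mix the classifications and raise the stand-in cost.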
The conventional techniques divide each user attribute into attribute data intervals based on the attribute data in the training data, and then determine the attribute data interval into which the attribute data of the member for classification falls. However, when a member registers at the website, he/she may not fill in the attribute data for some user attributes. For example, if the member does not submit the attribute data for the user attribute “age,” the attribute data of the “age” user attribute is missing from the record of the member stored at the website. In a later classification, the attribute data interval for that user attribute cannot be accurately determined for the member, and thus the member cannot be accurately classified. Therefore, the accuracy of the conventional techniques in dividing attribute data intervals, and in classifying members based on them, is low.
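The effect of missing attribute data can be shown with a short Python sketch. It assumes the example “age” intervals [10, 20], (20, 40], (40, 60] given earlier; the function name and the record for "Member D" are hypothetical.

```python
from typing import Optional

# Example "age" intervals: the first is [10, 20], the others are (low, high].
AGE_INTERVALS = [(10, 20), (20, 40), (40, 60)]


def age_interval_index(age: Optional[int]) -> Optional[int]:
    """Return the index of the interval containing age, or None if it cannot be determined."""
    if age is None:
        return None  # attribute data was never submitted
    for index, (low, high) in enumerate(AGE_INTERVALS):
        lower_ok = age >= low if index == 0 else age > low
        if lower_ok and age <= high:
            return index
    return None


records = [
    {"id": "Member A", "age": 29},
    {"id": "Member D", "age": None},  # hypothetical member who left "age" blank
]
for record in records:
    print(record["id"], age_interval_index(record["age"]))
```

For the member without an age, no interval is found, so none of the Boolean characteristics of the “age” user attribute can be assigned the value 1 and the attribute contributes nothing to the classification.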