Search engines are typically configured to record user page views. For a large-scale search engine, the page views recorded over a given period can amount to a huge volume of data, and a high proportion of the user query keywords are repeated. For example, after a recent news event, the query keywords entered by different users are usually similar or even identical. Search engine service providers typically process user page views in order to provide better service. One basic processing technique is to merge identical query keywords, which greatly reduces the memory or disk space occupied by data storage. For example, assuming that there have recently been 2000 queries with the query keyword “Alibaba”, the resulting merged data takes the form “Alibaba 2000”, where “Alibaba” is the user query keyword and “2000” is the number of times the query keyword appears in the query log over a period of time. How to sample query keywords from such preliminarily processed statistical data so that the sampled data is close to the real distribution of query keywords, however, remains unsolved.
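The merging step described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `merge_query_log` and the sample log contents other than the 2000 “Alibaba” queries are hypothetical, and a real query log would be read from storage rather than built in memory.

```python
from collections import Counter


def merge_query_log(raw_queries):
    """Collapse a raw query log into keyword -> PV counts.

    Hypothetical helper for illustration; each element of raw_queries
    is one query keyword as entered by a user.
    """
    return Counter(raw_queries)


# Hypothetical raw log: 2000 "Alibaba" queries plus a few other keywords.
raw_log = ["Alibaba"] * 2000 + ["ecommerce"] + ["news"] * 3

merged = merge_query_log(raw_log)

# Print the merged data in the "query keyword PV" form, highest PV first.
for keyword, pv in merged.most_common():
    print(keyword, pv)
```

In this sketch, 2004 log entries are reduced to three “query keyword PV” records, which is the source of the storage saving.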
In existing systems, to perform statistical analysis on data in the format “query keyword, Page Views (PV)”, the system first calculates the proportion of each query keyword among all the query keywords. PV stands for the number of times a given query keyword appears on the search platform. Taking the query data “Alibaba 2000” as an example, the system first calculates the sum of the PV values of all query keywords in the query keyword collection. Assume that the sum of PV values is one million, which means that the users of the system have submitted one million query keywords in total. The system then calculates the proportion of the query keyword “Alibaba” among all query keywords as 2000/1000000=0.002, which means that the probability of randomly sampling the query keyword “Alibaba” from all query keywords is 0.002. Once the sampled probabilities of all the query keywords are determined, each query keyword in the collection is sampled according to its sampled probability. Based on analysis of the final sampled data for the corresponding query keywords, the distribution of user query keywords can be determined. For example, from a collection with a total of one million PVs, 10000 PVs are selected as samples to be analyzed. The specific query keyword sampling process is as follows: determine the sample size of each query keyword according to its sampled probability, that is, [the sample size of the query keyword]=[the expected sample size]*[the sampled probability of the query keyword], wherein the sample size of the query keyword and the expected sample size are positive integers. For example, assuming that the sampled probability of the query keyword “Alibaba” is 0.002, then 10000*0.002=20 queries with the keyword “Alibaba” are selected as query keyword samples.
In the same way, the sample sizes of the other query keywords are obtained according to the formula above, and the sample sizes of all query keywords sum to 10000. Compared with analyzing one million page views, analyzing and processing 10000 sampled page views requires far less work and far fewer calculation steps from data analysts, and efficiency increases accordingly.
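The proportional calculation described above can be sketched as follows. This is a minimal illustration, not the implementation of any existing system: the function name `proportional_sample_sizes` and the keywords other than “Alibaba” are hypothetical, and the use of Python's built-in `round` to obtain integer sample sizes is an assumption, since the description does not state how fractional sizes are handled.

```python
def proportional_sample_sizes(pv_by_keyword, expected_sample_size):
    """Allocate sample sizes in proportion to each keyword's PV.

    sample size = expected sample size * (keyword PV / sum of all PVs)
    Rounding to the nearest integer is an assumption of this sketch.
    """
    total_pv = sum(pv_by_keyword.values())
    return {
        keyword: round(expected_sample_size * pv / total_pv)
        for keyword, pv in pv_by_keyword.items()
    }


# Hypothetical merged data whose PV values sum to one million.
pv_by_keyword = {"Alibaba": 2000, "news": 498000, "shopping": 500000}

sizes = proportional_sample_sizes(pv_by_keyword, expected_sample_size=10000)

print(sizes["Alibaba"])     # 20, i.e. 10000 * 2000 / 1000000
print(sum(sizes.values()))  # 10000
```

With these particular numbers every product is a whole number, so the sample sizes sum exactly to the expected sample size; with other data the rounded sizes need not sum to it.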
Existing technology has a number of problems. If the data size to be sampled is large, the existing sampling analysis method can simulate the real data distribution to a certain extent. When the data size to be sampled is medium or small, however, the sampled result is greatly distorted relative to the real distribution of the data. This is because many statistical distributions of data have a “long tail”, i.e., a large number of entities or data items with low frequencies of appearance. Specifically, when users query keywords with a search engine, many query keywords appear only a few times, for example once or twice. Although the frequency of each such query keyword is very low, the low-frequency query keywords together account for a large proportion of the total number of query keywords. For a long-tail distribution, the existing sampling analysis method does not effectively sample low-frequency query keywords. For example, suppose the goal of an application is to sample 2000 query keywords from a collection whose PV values sum to one million. For query data such as “ecommerce 1”, the sampled probability is one in one million, so the sample size given by the formula is 2000*0.000001=0.002, far below one, and such low-frequency query keywords cannot be sampled by the method above. Because the distribution of the sampled data obtained by the existing sampling analysis method differs greatly from the distribution of the real data, sampling analysis of search engine query keywords often fails to capture user demand information and market trends, which constrains online commerce.
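The long-tail distortion can be made concrete with a small sketch. Everything in it beyond the one-million PV total and the 2000-keyword sample goal is an assumption chosen for illustration: the split into two head keywords and 400000 tail keywords with a PV of 1 each, the `tail-N` keyword names, the function name, and rounding to the nearest integer.

```python
def proportional_sample_sizes(pv_by_keyword, expected_sample_size):
    """Proportional allocation; nearest-integer rounding is an assumption."""
    total_pv = sum(pv_by_keyword.values())
    return {
        keyword: round(expected_sample_size * pv / total_pv)
        for keyword, pv in pv_by_keyword.items()
    }


# Hypothetical long-tail data set: two head keywords plus 400000 keywords
# that each appear once, for a total of one million PVs.
pv_by_keyword = {"Alibaba": 2000, "news": 598000}
tail_keywords = [f"tail-{i}" for i in range(400000)]
pv_by_keyword.update({keyword: 1 for keyword in tail_keywords})

total_pv = sum(pv_by_keyword.values())
sizes = proportional_sample_sizes(pv_by_keyword, expected_sample_size=2000)

# Share of real traffic carried by the tail versus its share of the sample.
tail_share_real = len(tail_keywords) / total_pv
tail_sampled = sum(sizes[keyword] for keyword in tail_keywords)
total_sampled = sum(sizes.values())

print(f"tail share of real PVs: {tail_share_real:.0%}")
print(f"tail keywords sampled: {tail_sampled} of {total_sampled}")
```

In this hypothetical data set the tail carries 40% of all page views, yet every tail keyword is allocated 2000*0.000001=0.002 samples, which rounds to zero, so the tail is absent from the sample and the sample also falls short of the 2000-keyword goal.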