Users usually browse a website with a particular purpose and intention. For the website, it is important to understand the true intention behind a user's visit. Websites usually classify visiting users by constructing a model of the users' browsing behavior trajectories and using it to train a classifier, or describe the user requirement by the popularity of queries on the website.
Intra-website searching is a behavior in which a user actively seeks information, and it can describe the user requirement to a certain extent. Traditional website query clustering technology relies on the query itself and performs its calculation on the literal overlap between words. The implementation scheme is generally as follows. Step 1: each keyword is literally split (character by character or by word segmentation), so that the split keyword can be expressed as a sequence with a word or phrase as the unit. Step 2: the similarity of each pair of keywords (Jaccard similarity, edit distance, etc.) is calculated one by one; that is, the degree of overlap between the word sequences of two queries is compared and a similarity metric is returned. Step 3: the keywords are clustered with a clustering algorithm, such as k-means clustering or hierarchical clustering. Different clustering algorithms are implemented differently but are the same in essence.
Because the traditional technology relates keywords only through their degree of literal overlap, it rigidly constructs a dependence relationship that does not reflect the actual situation, and the user requirement cannot be accurately explained. For example, there is no literal match between the Chinese name of Samsung Inc. and the Chinese name of Apple Inc., although the correlation between the two should be high; conversely, two Chinese words that are completely unrelated in meaning can still appear related literally when they share characters. Moreover, the existing website query clustering technology needs to calculate the similarity between every two keywords, so the number of comparisons grows quadratically with the number of keywords. This complexity is high and is not suitable for large-scale data mining.
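The traditional scheme described above can be illustrated with a minimal Python sketch. It is not taken from any particular implementation: the function names, the whitespace tokenizer, the threshold value, and the English sample queries are all assumptions chosen for illustration (a Chinese-language site would use a word segmenter in Step 1). For Step 3 the sketch merges every pair whose similarity reaches the threshold, which corresponds to single-linkage hierarchical clustering cut at that threshold.

```python
from itertools import combinations


def tokenize(query):
    """Step 1: split a query into word units (whitespace split is an assumption)."""
    return set(query.lower().split())


def jaccard(a, b):
    """Step 2: literal overlap of two token sets, |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(queries):
    """Compare every pair of queries: n * (n - 1) / 2 comparisons."""
    tokens = [tokenize(q) for q in queries]
    return {
        (i, j): jaccard(tokens[i], tokens[j])
        for i, j in combinations(range(len(queries)), 2)
    }


def cluster(queries, threshold=0.25):
    """Step 3: merge pairs with similarity >= threshold (single linkage)."""
    parent = list(range(len(queries)))

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), sim in pairwise_similarity(queries).items():
        if sim >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, q in enumerate(queries):
        groups.setdefault(find(i), []).append(q)
    return sorted(groups.values())


# Hypothetical sample queries, for illustration only.
queries = ["samsung phone", "apple phone", "apple pie recipe", "galaxy s handset"]
clusters = cluster(queries)
```

With these sample queries, "apple pie recipe" is grouped with the two phone queries because it shares the word "apple" with one of them, while "galaxy s handset" is left alone because it shares no word with "samsung phone". Both outcomes reflect the limitation of literal overlap noted above, and `pairwise_similarity` shows why the cost grows quadratically with the number of queries.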
The related art offers no effective solution to the problem that methods for analyzing webpage data rely only on the degree of literal overlap between queries, so that the data analysis results cannot accurately explain the user requirement.