A social network refers to a structure of social relationships formed by the interdependent relationships between nodes, such as individuals and groups, on the Web, in which each user's profile can be searched and new connections and information exchange are supported. As social networks expand, the types of services diversify, and there is a growing need for customized social network services that provide different services depending on the characteristics of users.
That is, a group-tailored social network service is needed that classifies individuals into groups of users having similar characteristics and provides services matching the characteristics of each group. The types of people in a social network can be recognized by analyzing patterns of behavior that are repeated in the social network.
Techniques for collecting and grouping data having repeated, similar characteristics have been studied in various ways. In particular, suffix tree indexing, which has been verified in the field of information retrieval, is an indexing scheme for effectively grouping similar words and phrases that occur repeatedly.
Suffix tree indexing is an effective technique when many suffixes share a common prefix, and diverse algorithms have been proposed for it.
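The benefit of shared prefixes can be illustrated with a minimal sketch. The code below is not one of the proposed algorithms; it builds a naive, uncompressed suffix trie in memory, and the function names and the sample string are assumptions chosen only for illustration. Suffixes that begin with the same characters reuse the same nodes, so the trie holds fewer nodes than the total number of characters in all suffixes.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive O(n^2) build)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Reuse the child if an earlier suffix already created this path.
            node = node.setdefault(ch, {})
    return root


def count_nodes(trie):
    """Count all nodes below the root."""
    return sum(1 + count_nodes(child) for child in trie.values())


def contains(trie, pattern):
    """Return True if pattern occurs in the indexed text (it is a prefix of some suffix)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


text = "banana"  # illustrative input only
trie = build_suffix_trie(text)

# Characters that would be stored if every suffix were kept separately.
total_suffix_chars = sum(len(text) - i for i in range(len(text)))

print(total_suffix_chars, count_nodes(trie))  # 21 15
print(contains(trie, "nan"), contains(trie, "nab"))  # True False
```

The more often suffixes share a prefix, the larger the gap between the two counts becomes, which is the situation in which this kind of index is most useful.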
However, the existing algorithms are structured so that suffixes are inserted into sub-trees stored on a disk, so frequent random disk accesses can occur while the tree is built. Further, although some algorithms include a concrete buffering strategy that uses the cache effectively by configuring every sub-tree that is accessed at the first stage, such an approach is effective only when a query is short compared to the entire sequence. Namely, when a query is long, the entire tree needs to be loaded into memory, so performance degrades. In addition, as the size of the entire sequence increases, an additional pre-processing cost is incurred for each suffix, and some partitioning schemes produce data skew.
Data skew is a problem that occurs because the frequency of suffixes sharing each prefix is not uniform when a string is partitioned by prefixes of equal length. For example, in the case of human genes, when the length of the longest common prefix (LCP) is 1, the prefixes A, C, T, and G account for about 30%, 20%, 20%, and 30% of the suffixes, respectively, so some of the resulting sub-trees may be large. In addition, when the LCP value is large, many partitions are generated, causing a resource load, and when the LCP value is small, a sub-suffix tree larger than the memory is generated, causing additional disk I/O.
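The skew described above can be measured by counting how many suffixes fall into each fixed-length prefix partition. The following is a minimal sketch; the helper names are hypothetical, and the 20-character sequence is made-up data constructed to mirror the 30%, 20%, 20%, and 30% proportions mentioned above, not real genomic data.

```python
from collections import Counter


def suffix_counts_by_prefix(sequence, prefix_len):
    """Count the suffixes that start with each prefix of exactly prefix_len characters."""
    counts = Counter()
    # Suffixes shorter than prefix_len are ignored in this sketch.
    for start in range(len(sequence) - prefix_len + 1):
        counts[sequence[start:start + prefix_len]] += 1
    return counts


def skew_ratio(counts):
    """Ratio of the largest partition to the smallest one (1.0 means no skew)."""
    return max(counts.values()) / min(counts.values())


# Made-up toy sequence: 6 A, 4 C, 4 T, 6 G.
sequence = "AGAGCTAGCTGACTAGGCAT"

counts_len1 = suffix_counts_by_prefix(sequence, 1)
counts_len2 = suffix_counts_by_prefix(sequence, 2)

print(dict(counts_len1))        # {'A': 6, 'G': 6, 'C': 4, 'T': 4}
print(skew_ratio(counts_len1))  # 1.5
print(len(counts_len1), len(counts_len2))  # 4 10
```

Increasing the prefix length from 1 to 2 raises the number of partitions from 4 to 10 for this toy input, which illustrates the trade-off between many small partitions and a few large ones.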
Thus, a suffix tree algorithm for solving the data skew problem generates a suffix tree using a variable-length scheme, dividing and merging partitions based on variable-length prefixes, and thereby builds a suffix tree for a large number of DNA sequences in memory within a short time. However, this suffix tree algorithm has a problem in that a large amount of memory and disk space is required to build the tree, and disk I/O is generated to merge sub-trees having a common prefix.
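The idea of variable-length prefixes can be sketched as follows. This is an assumption-laden illustration, not the algorithm referred to above: it covers only the dividing step, omits merging and all disk handling, and the function name, the `max_suffixes` budget, and the toy sequence are hypothetical. A partition whose suffix count exceeds the budget is split by extending its prefix by one character, so frequent prefixes grow longer while rare ones stay short.

```python
def variable_prefix_partitions(sequence, max_suffixes):
    """Return {prefix: [suffix start positions]} with at most max_suffixes per partition."""
    if max_suffixes < 1:
        raise ValueError("max_suffixes must be at least 1")
    # A unique terminator guarantees that every suffix can be extended until it is distinct.
    text = sequence + "$"
    pending = {"": list(range(len(sequence)))}
    final = {}
    while pending:
        prefix, starts = pending.popitem()
        if len(starts) <= max_suffixes:
            final[prefix] = starts
            continue
        # Partition is too large: extend the prefix by one character and regroup.
        children = {}
        for s in starts:
            children.setdefault(prefix + text[s + len(prefix)], []).append(s)
        pending.update(children)
    return final


# Made-up toy sequence (6 A, 4 C, 4 T, 6 G), not real genomic data.
sequence = "AGAGCTAGCTGACTAGGCAT"
partitions = variable_prefix_partitions(sequence, max_suffixes=4)

for prefix in sorted(partitions):
    print(prefix, len(partitions[prefix]))
```

For this input, the prefixes A and G each cover six suffixes and are split into two-character prefixes, while C and T stay at length one, so no partition exceeds the budget of four suffixes.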