In the field of telecommunication, solutions have been devised for enabling relevant and potentially attractive services that have been adapted to different service consumers according to their interests and needs in different situations. Some examples are directed advertising and personalised TV. There are also solutions for analysing users in a telecommunication network and identifying segments or “clusters” of users with common characteristics, such that services and marketing activities can be adapted to these user groups accordingly. This analysis work can be executed based on traffic data generated in communication networks.
Traffic data is generally available from Charging Data Records (CDR) generated and stored for the networks mainly to support the charging of users for executed calls and sessions. Traffic data can also be obtained by means of various traffic analysing devices, such as Deep Packet Inspection (DPI) units and other traffic detecting devices, which can be installed at various communication nodes in the network. The traffic data may refer to voice calls, SMS (Short Message Service), MMS (Multimedia Message Service), e-mails and game sessions, and to circuit-switched as well as packet-switched communication, in this description collectively referred to as “calls”. The traffic data may also contain further information on the calls related to the time of day, call duration, location of users and type of service used. As a result, the user data that can be derived from such traffic data is often “multi-dimensional” in the sense of involving plural user-related parameters for each user.
For operators and service providers, it has thus become important to understand the consumers in order to provide adapted or “tailor made” services and products. It is therefore useful to identify certain segments of similar users based on their behavioural characteristics. Information on the users can be derived from the available traffic data, which can thus be further analysed using various techniques for so-called “data mining”. For example, Machine Learning Algorithms (MLAs) and tools can be used for processing the traffic data, and the results may be utilised by operators and service providers when developing and adapting services.
A Data Mining Engine (DME) may further be employed that collects traffic data and extracts user information therefrom using various data mining and machine learning algorithms. FIG. 1 illustrates an example of how data mining can be employed for a communication network, according to the prior art. A DME 100 typically uses various MLAs 100a for processing traffic data TD provided from a data source 102, and further for identifying clusters of users. The data source 102 collects CDR information and DPI information from the network, which is then provided as traffic data TD to the DME 100. After processing the traffic data, the DME 100 provides the resulting cluster information as output data to various service providers 104 to enable adapted services, products and marketing activities.
Analysing large quantities of traffic data and other information in order to identify different groups of similar users based on their previously executed communications is typically a very complex and time-consuming process, sometimes involving millions of calls. In the field of data mining, quantization of various parameter values is widely used to facilitate the analysis work and to reduce the computing and processing capacity required. Quantization means that a measured parameter value is approximated to a representative value in discrete steps or intervals.
FIG. 2 is a schematic diagram illustrating how such quantization can be made in the case of two parameters X and Y whose values are obtained for a plurality of users in a communication network. For example, parameter X may be the cost in SEK (Swedish currency) of executed calls while parameter Y may be the duration in seconds of executed calls.
In this example, the quantization is made such that obtained parameter values within predefined ranges are approximated according to a numeric scale 1-8 of finite intervals for each parameter X and Y, thereby forming cells of equal size each covering a pair of intervals of the scales. Each cell is assigned an identity which is representative of any pair of measured parameter values falling within the ranges defined by the cell. The cell identity may be any type of code and each cell covers values within a predefined interval “m” for each parameter. Although this particular example refers to cells in a two-dimensional quantization scheme or “parameter space”, it can be readily understood that this kind of cell representation is also applicable to any other multi-dimensional quantization scheme. Basically, this cell model implies that cells in a quantized parameter space are of equal length in all the dimensions although the actual parameters may well be measured in different units, such as SEK and seconds, respectively.
In this example, a marked cell has been given the identity (8,5), which explicitly indicates the representative intervals 8 and 5 for parameters X and Y, respectively. Hence, the quantization is made such that any pair of measured X and Y values that falls within that cell is represented by the cell identity (8,5). This simplifies subsequent computation, which is based on the representative cell identity rather than on the original values, which are typically more detailed and precise, e.g. 1234 SEK and 56789 seconds, and therefore more complex to compute with.
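The uniform quantization described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function names and the predefined ranges (0-1400 SEK and 0-96000 seconds) are hypothetical and chosen so that the example values 1234 SEK and 56789 seconds fall in the cell (8,5); the source does not specify the actual ranges.

```python
def quantize(value, lower, upper, levels=8):
    """Map a measured value in [lower, upper] to an interval number 1..levels."""
    if not lower <= value <= upper:
        raise ValueError("value outside the predefined range")
    width = (upper - lower) / levels            # interval size "m" for this parameter
    index = int((value - lower) // width) + 1   # 1-based interval number
    return min(index, levels)                   # the upper bound belongs to the last interval


def cell_identity(x, y, x_range, y_range, levels=8):
    """Return the cell identity (X interval, Y interval) for a pair of values."""
    return (quantize(x, *x_range, levels), quantize(y, *y_range, levels))


# Hypothetical ranges (assumptions, not from the source):
# cost 0-1400 SEK, duration 0-96000 seconds.
cell = cell_identity(1234, 56789, (0, 1400), (0, 96000))
print(cell)  # (8, 5)
```

Any other pair of values within the same two intervals, for instance 1300 SEK and 50000 seconds under these assumed ranges, yields the same cell identity.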
When cells have been defined for a parameter space in the above manner, each pair of obtained parameter values is represented by a cell identity, which is done by reading or “scanning” input data related to multiple users in the communication network. FIG. 3 illustrates an example of how different cells are populated with users falling within their respective cells, as marked in the cell diagram. In this case, three different clusters 300, 302 and 304 of populated cells can be discerned in the diagram, while some “stray” cells 306 have also been populated with users that cannot be considered to fall within any discerned cluster; these stray cells are therefore disregarded.
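The population of cells and the separation of clusters from stray cells can be illustrated by the following Python sketch. The source does not state how clusters are discerned, so the rule used here is an assumption made purely for illustration: populated cells that share an edge are grouped together, and any group with fewer than `min_cells` cells is treated as stray. The function names and the `min_cells` parameter are hypothetical.

```python
from collections import Counter, deque


def populate_cells(cell_ids):
    """Count users per cell in one scan over the quantized input data."""
    return Counter(cell_ids)


def find_clusters(population, min_cells=2):
    """Group edge-adjacent populated cells (assumed rule); small groups are stray."""
    unvisited = set(population)
    clusters, stray = [], []
    while unvisited:
        start = unvisited.pop()
        group, queue = [start], deque([start])
        while queue:
            x, y = queue.popleft()
            for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    group.append(neighbour)
                    queue.append(neighbour)
        if len(group) >= min_cells:
            clusters.append(sorted(group))
        else:
            stray.extend(group)  # disregarded in the further analysis
    return clusters, stray


# One cell identity per user; two users fall in cell (1, 1).
users = [(1, 1), (1, 2), (2, 2), (1, 1), (6, 6), (6, 7), (4, 8)]
population = populate_cells(users)
clusters, stray = find_clusters(population)
```

With this made-up input, two groups of adjacent cells are found and the isolated cell (4, 8) is reported as stray.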
Quantization is thus used in clustering to achieve a short processing time, which is typically dependent merely on the number of cells in each dimension in the quantized space. Some clustering algorithms known as “WaveCluster”, “WaveCluster+” and “STING” use quantization in preparation for determining clusters in data. However, a problem with the quantization model described above is that a uniform size or granularity of cells must be selected in a trade-off between the accuracy, or “quality”, of the clusters and the computational complexity, i.e. the finer the cell granularity and the higher the accuracy/quality, the greater the complexity, and vice versa.
This has been addressed in the field of data mining by using adaptive quantization, i.e. varying the cell size depending on the distribution of input data and/or accuracy needed. Adaptive quantization techniques tend to use a small cell size in areas of the multi-dimensional parameter space with relatively dense data distribution, while a larger cell size is used in areas with more sparse data distribution, which is illustrated by an adaptive quantization scheme shown in FIG. 4. Two areas 400 and 402 with dense data distribution are thus given a fine cell granularity, while other areas with more sparse data distribution are given a “coarser” cell granularity.
Hence, adaptive quantization can provide good accuracy or quality in areas where it is really needed while lesser accuracy can be tolerated in other areas. However, in adaptive quantization, the input data must be read or scanned at least twice according to prior solutions: firstly to find out which areas in the multi-dimensional parameter space are densely populated, in order to define the adaptive cell sizes in different areas, and secondly to quantize the input data according to the adaptive cell definitions. It can be readily understood that reading the typically large amounts of input data more than once can be very time-consuming and also requires great processing capacity. Moreover, reading the data twice requires that all data has been received before the second scan can start, resulting in substantial delays.
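The two scans required by such prior solutions can be made concrete with the following Python sketch. It is a simplified, hypothetical scheme and not any particular published algorithm: the coarse cell size, the density threshold and the refinement factor are assumptions, and non-negative parameter values are assumed.

```python
from collections import Counter


def adaptive_quantize(points, coarse=4.0, refine=2, dense_threshold=3):
    """Two-pass adaptive quantization of 2-D points (illustrative only)."""
    points = list(points)  # the complete data set must be available up front

    # Pass 1: count points per coarse cell to find densely populated areas.
    density = Counter((int(x // coarse), int(y // coarse)) for x, y in points)
    dense = {cell for cell, count in density.items() if count >= dense_threshold}

    # Pass 2: read the data again; dense coarse cells are split into
    # refine x refine finer sub-cells, sparse ones keep the coarse identity.
    fine = coarse / refine
    result = []
    for x, y in points:
        cx, cy = int(x // coarse), int(y // coarse)
        if (cx, cy) in dense:
            sx = int((x - cx * coarse) // fine)
            sy = int((y - cy * coarse) // fine)
            result.append((cx, cy, sx, sy))
        else:
            result.append((cx, cy))
    return result


points = [(0.5, 0.5), (1.0, 3.5), (3.0, 1.0), (9.0, 9.0)]
cells = adaptive_quantize(points)
```

Note that the second loop cannot begin until the first has seen every point, which is exactly the delay and double-read cost described above.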