Today, large amounts of information on customers, subscribers and consumers are maintained in databases, from which the information can be retrieved for different purposes, e.g. for creating and offering relevant and attractive services that have been adapted to the different needs and preferences of those customers. In order to understand the customers' needs and preferences, their behavior can be studied by employing a process known as “machine learning” on stored data relating to various activities of the customers. This analysis work can thus be executed based on traffic data generated in communication networks, which is typically stored in huge databases as Call Detail Records (CDR) relating to executed calls and sessions.
The traffic data may refer to voice calls, SMS (Short Message Service), MMS (Multimedia Message Service), downloading sessions, e-mails, web games, etc. This type of information can be used to analyze the customers' behavioral characteristics in terms of communication habits and service usage, and Machine Learning Algorithms (MLA:s) can be used for processing the traffic data. A Data Mining Engine (DME) may further be employed that collects traffic data and extracts information therefrom using various data mining and machine learning algorithms.
FIG. 1 illustrates an example of how data mining and machine learning algorithms can be employed for a communication network, according to the prior art. A DME 100 typically uses various MLA:s 100a for processing traffic data TD provided from a database 102, and further to identify segments or clusters with customers having similar characteristics. The database 102 collects traffic data from the network and the DME 100 uses the traffic data TD as input data to one or more MLA:s 100a. After processing the traffic data, the DME 100 provides the resulting segment information as output data to various service providers 104 to enable adapted services and targeted marketing activities. Machine learning procedures with model training are also often used for stored data relating to a range of application fields, such as transactions at a company or enterprise, research study results, analysis of users, natural language processing, pattern recognition, search engines, fraud detection, and so forth.
The machine learning algorithms known today are usually configured to create a model of the stored data by iterating over the records in a dataset to derive different characteristics of interest from the stored data, which has some unknown underlying probability distribution. In a typical machine learning process, the data model is thus “trained” in order to reflect complex patterns inherent in the stored data. In this process, a so-called “back-end application” or similar is often used, which has functionality for fetching raw data from the database and training the model by applying the raw data to the model multiple times in an iterative manner.
This means that the raw data is applied to the model over and over again until the model has “converged” in some sense, wherein the model is updated after each iteration to minimize or at least reduce the divergence or difference between the raw data and the model. When this divergence has stabilized, i.e. no longer changes notably upon further iterations, the model is said to have converged and is stored as processed data. For example, the well-known “K-means clustering algorithm” can be employed, in which a squared error function is minimized.
By way of example, k initial “means” are first randomly selected from the dataset. Then, iteratively, k clusters are created by associating every observation from the data set with the nearest mean, and the “means” are updated by setting them equal to a centroid of each of the k clusters. This is then repeated with multiple iterations over the dataset, and in each iteration the “means” change their position step by step until no more noteworthy changes occur and then the model is deemed to have converged.
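The iterative procedure described above can be sketched as follows. This is a minimal illustration only, not a definitive implementation: the function name `kmeans`, the parameters `tol`, `max_iter` and `seed`, and the choice to treat the model as converged once no mean moves by more than a squared distance of `tol` are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Return (means, iterations_used) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # k initial "means" are randomly selected from the dataset.
    means = rng.sample(points, k)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Associate every observation with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, means[j]))
            clusters[nearest].append(p)
        # Update each mean to the centroid of its cluster.
        new_means = []
        for j, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                centroid = tuple(
                    sum(p[d] for p in cluster) / len(cluster) for d in range(dim)
                )
                new_means.append(centroid)
            else:
                # Assumption: an empty cluster keeps its previous mean.
                new_means.append(means[j])
        # Largest squared movement of any mean in this iteration.
        shift = max(squared_distance(m, n) for m, n in zip(means, new_means))
        means = new_means
        if shift <= tol:
            break  # no more noteworthy changes: deemed converged
    return means, iterations
```

Note that every iteration passes over the entire dataset, which is why the cost of accessing the records dominates when the dataset is large.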
However, there are some serious drawbacks with using a back-end application in the above manner, particularly when a very large dataset with many data records is involved. Firstly, the fetching operation can be quite time-consuming, e.g. depending on the limited bandwidth of the communication link used. In some cases, there may be millions of records in a dataset and it may be necessary to train the model hundreds of times before it is reasonably converged, each time fetching all the records from the database.
FIG. 2 illustrates that a back-end application 200 repeatedly fetches the same dataset from a database 202 in an action 2:1, in order to train a data model iteratively in an action 2:2 using a model training function 200a. Actions 2:1 and 2:2 are thus repeated multiple times, which can be very tedious. A suitable query language is used for the fetching operation, typically the well-known SQL (Structured Query Language), which is a database computer language designed for managing data in relational database management systems.
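The repeated fetching of FIG. 2 can be illustrated with the sketch below, which uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for database 202. The table `cdr`, its column `duration`, the sample values, and the trivial one-parameter model (a mean estimated by gradient steps) are hypothetical and serve only to show that one SQL query is issued per training iteration.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory database (hypothetical schema and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cdr (duration REAL)")
    conn.executemany(
        "INSERT INTO cdr VALUES (?)", [(d,) for d in (10.0, 20.0, 30.0, 40.0)]
    )
    conn.commit()
    return conn


def train_with_repeated_fetch(conn, iterations=50, rate=0.5):
    """Train a one-parameter model, fetching all records in every iteration."""
    model = 0.0
    fetches = 0
    for _ in range(iterations):
        # Action 2:1: fetch the same dataset from the database again.
        rows = conn.execute("SELECT duration FROM cdr").fetchall()
        fetches += 1
        # Action 2:2: one training step that reduces the squared error
        # between the model and the records.
        gradient = sum(model - d for (d,) in rows) / len(rows)
        model -= rate * gradient
    return model, fetches
```

With millions of records and hundreds of iterations, the number of records transferred grows as the product of the two, which is the cost discussed above.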
On the other hand, the fetching operation can be significantly rationalized by fetching the data only once and storing all records locally in the back-end application for use in the training process. FIG. 3 illustrates an example of this, where a back-end application 300 fetches the dataset only once from a database 302 in an action 3:1 and stores the records in a local memory 300a. A data model is then trained in an action 3:2, using a model training function 300b that reads the records from memory 300a for each iteration, which is much less time-consuming than fetching the records repeatedly for each iteration from database 302. However, this solution typically requires a memory with capacity for storing large datasets, and the records therein must still be fetched from the database 302, if only once.
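The fetch-once arrangement of FIG. 3 can be sketched in the same way. Again, the in-memory `sqlite3` database, the table `cdr`, the column `duration` and the one-parameter model are hypothetical illustrations, with a plain Python list standing in for local memory 300a.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory database (hypothetical schema and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cdr (duration REAL)")
    conn.executemany(
        "INSERT INTO cdr VALUES (?)", [(d,) for d in (10.0, 20.0, 30.0, 40.0)]
    )
    conn.commit()
    return conn


def train_with_local_copy(conn, iterations=50, rate=0.5):
    """Fetch all records once, then train from the local copy."""
    # Action 3:1: a single fetch; the whole dataset must fit in local memory.
    records = [d for (d,) in conn.execute("SELECT duration FROM cdr")]
    model = 0.0
    for _ in range(iterations):
        # Action 3:2: each iteration reads the records from local memory.
        gradient = sum(model - d for d in records) / len(records)
        model -= rate * gradient
    return model, len(records)
```

The database is queried only once, but the list `records` holds the complete dataset, so the memory requirement grows with the number of records.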
It is thus a problem that a back-end application for training a data model in a machine learning process is either required to fetch large amounts of data records multiple times, or must be equipped with a memory of great storage capacity. Conventional back-end applications for machine learning are equipped with a RAM (Random Access Memory) for temporary storage of data. The large amounts of data used for machine learning mostly do not fit into a RAM, and it may therefore be possible to use only a limited number of samples from the dataset that can be accommodated in the RAM for the training operation, naturally resulting in lost information and insufficient accuracy. This becomes a problem in machine learning where it is necessary to iterate over the entire dataset in order to achieve adequate model training.