Telecommunication data represent a branch of what is commonly referred to as big data. They are generated, for example, by the everyday use of mobile phones and collected by operators of telecommunication networks. In particular, telecommunication control data include information about the mobile phone and its user, where the user is typically identified by means of the International Mobile Subscriber Identity (IMSI). This information is stored on the Subscriber Identity Module (SIM), typically a small card (SIM card), which needs to be inserted into a mobile device for correct use. As a result, telecommunication control data can be used to track SIM cards. As a mobile device is typically carried by a person, tracking SIM cards indirectly enables the tracking of the movements and activities of persons. Besides devices carried by persons, other mobile devices can also be tracked, for example in the context of next-generation industrial automation.
In their raw state, the telecommunication control data are organized in terms of single telecommunication events. The corresponding data sets include information on the time of the event, the location of the antenna involved and the identifier that was transmitted by the mobile device. This information is used primarily for operating the network and carrying out telecommunication services. However, it can also be used for secondary purposes, for example to monitor the actions of single users, single devices or crowds of them and, thus, to monitor traffic, material or work flows. To this end, it is beneficial and, in many cases, also necessary to condition the raw telecommunication control data. This is typically done in several steps.
A first step can be to enrich the raw telecommunication control data by additional demographic, device and network attributes. These attributes can be obtained, for example, from the information that is provided by a user in a respective service contract. Such an enrichment of the raw data represents an auxiliary step. It is often performed because it allows for more useful queries and analyses, while, at the same time, the cost is relatively low.
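The enrichment step can be illustrated by a minimal sketch, assuming that each raw event carries the time, the antenna location and the transmitted identifier, and that contract attributes are available as a lookup table keyed by that identifier. The names Event, enrich, age_group and device_type are hypothetical and chosen for illustration only; they are not taken from any operator's actual data model.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One raw telecommunication event (hypothetical layout)."""
    timestamp: float                   # time of the event, seconds since epoch
    antenna_lat: float                 # latitude of the antenna involved
    antenna_lon: float                 # longitude of the antenna involved
    device_id: str                     # identifier transmitted by the device
    age_group: Optional[str] = None    # enrichment attribute (assumed)
    device_type: Optional[str] = None  # enrichment attribute (assumed)


def enrich(events: List[Event], contracts: Dict[str, Dict[str, str]]) -> List[Event]:
    """Return copies of the events with contract attributes attached.

    Events whose identifier has no contract entry keep None attributes.
    """
    enriched = []
    for ev in events:
        attrs = contracts.get(ev.device_id, {})
        enriched.append(
            replace(
                ev,
                age_group=attrs.get("age_group"),
                device_type=attrs.get("device_type"),
            )
        )
    return enriched
```

The raw events are left unchanged and enriched copies are returned, so the enrichment remains an auxiliary step that can be repeated or omitted.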
The next step is critical for many purposes. It deals with the problem that personalized information on the telecommunication activities of single users can reach very far into their privacy. Accordingly, the use of telecommunication data, in particular for secondary purposes, is often severely restricted by data protection regulations. A typical measure to achieve the necessary compliance with these regulations is to scramble the data in such a way that they cannot be traced back to a single person. The corresponding processes are often carried out in close accord with a third, independent party such as, for example, the federal commissioner for data protection and freedom of information or an independent technical inspection authority. After such scrambling, the scope of application that complies with data protection regulations is much broader. In particular, the data may now be stored for much longer terms and also be used by third parties.
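The scrambling method itself is not specified here. As one possible illustration, the identifier can be replaced by a keyed hash, so that events of the same device remain linkable while the original identifier cannot be read from the data. The following is a minimal sketch under that assumption; the function name pseudonymize and the key handling are hypothetical, and keyed hashing alone is not claimed to satisfy any particular data protection regulation.

```python
import hashlib
import hmac


def pseudonymize(device_id: str, secret_key: bytes) -> str:
    """Replace an identifier by a keyed SHA-256 hash (HMAC).

    The same identifier and key always yield the same pseudonym, so
    events of one device stay linkable. Without the key, the pseudonym
    cannot be recomputed from a known identifier.
    """
    digest = hmac.new(secret_key, device_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

In such a scheme, the key would have to be kept apart from the scrambled data, for example by the independent party involved in the process.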
Typically, it is also advantageous to presort the data in order to speed up typical queries. An example is to sort the telecommunication events with respect to single devices and arrange them in chronological order. In this way, movement or telecommunication profiles can be obtained. Such presorting is beneficial if the focus of the subsequent analysis is on tracking and analyzing the movement of a single device or a crowd of devices. Other ways of organizing the data are also possible. If the focus is, for example, on the operation and stability of the network, it is better to presort the events with respect to single antennas rather than single devices. This already shows that the processing of the data and the resulting data structures depend on the nature and characteristics of the subsequent analyses and queries.
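The two presorting variants described above can be sketched with a single grouping function. This is a minimal illustration, assuming that events are given as dictionaries with the hypothetical keys "timestamp", "device_id" and "antenna_id".

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_sorted(events: List[Dict[str, Any]], key: str) -> Dict[Any, List[Dict[str, Any]]]:
    """Group events by the given key and sort each group chronologically.

    key="device_id" yields one movement profile per device;
    key="antenna_id" yields one event history per antenna.
    """
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for ev in events:
        groups[ev[key]].append(ev)
    for group in groups.values():
        group.sort(key=lambda e: e["timestamp"])
    return dict(groups)
```

Grouping by "device_id" suits movement analyses, whereas grouping by "antenna_id" suits network analyses. The same raw events thus lead to different data structures depending on the intended query.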
After conditioning the data, more involved crowd analyses can be performed. To this end, algorithms need to be developed that search and characterize the data points beyond the raw characteristics that are provided by the user or the operator of the telecommunication network. These algorithms establish causal links between the events and can be used to organize the data points in larger structures. Such parsing can be based on both deterministic and probabilistic methods, where the statistics of the latter are often significant owing to the sheer number of devices and telecommunication events. Parsing is used to reduce the complexity of the data. It reveals or highlights correlations and, thus, facilitates a condensed view of the problem posed by the purpose of the crowd analyses.
As an example, different telecommunication events can be linked in a straightforward way by analyzing the locations of the involved antennas. Events associated with the same antenna indicate that the device did not leave the area or sector of a single antenna for a certain time (speed of movement equals 0 m/s), while different locations indicate that the device was moving. A more involved analysis is necessary if one is interested in the means of transportation that has been chosen, for example a car, a ship or a train. In this way, for example, the commuter traffic to and from a city can be analyzed.
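The linking of consecutive events by antenna location can be illustrated as follows. This is a minimal sketch, assuming a chronologically sorted profile of one device given as tuples of time in seconds, antenna latitude and antenna longitude in degrees. The function names are hypothetical. Since the antenna position only approximates the position of the device, the resulting speeds are rough estimates.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds(profile: List[Tuple[float, float, float]]) -> List[float]:
    """Estimate the speed in m/s between consecutive events of one device.

    Pairs of events with a non-positive time difference are skipped.
    Two events at the same antenna location yield a speed of 0 m/s.
    """
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(profile, profile[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        speeds.append(haversine_m(lat0, lon0, lat1, lon1) / dt)
    return speeds
```

Determining the means of transportation would require further information, for example typical speed ranges or the course of roads and railway lines, and is beyond this sketch.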
The results of such crowd analyses are used to draw general conclusions from the behavior of an otherwise incomprehensibly complex system such as a large crowd of individuals or devices. They can also be used for technological applications, for example to control or plan public transportation systems or to facilitate new forms of mobility such as car sharing or i-mobility, where it is necessary to predict the demand, availability, service intervals and other functional parameters.
The disadvantage of the analyses that have been employed so far is that the underlying technological problem often requires the design of new algorithms, parsing rules and search strategies and, subsequently, repeated and involved scans of the whole data set. Knowledge obtained in previous searches or queries is not reused, because the data structures are not backward compatible. Similarly, the data obtained by one algorithm are incompatible with those obtained by another algorithm. As a consequence, the present procedures do not allow for new or subsequent queries or search tactics which may emerge, for example, in the course of a manual or automated analysis, real-time monitoring or controlling, or in future campaigns.