1. Technical Field
Present invention embodiments relate to testing computerized analysis of communication data, such as call detail record data, and more specifically, to testing computerized analysis of a synthetic communication data set that realistically models a genuine communication data set with a manageable memory footprint.
2. Discussion of the Related Art
Communication data is frequently used to track details or attributes about various communications, including usage rates, usage patterns, communication locations (both originating and receiving locations), and communication duration. For example, a call detail record (“CDR”) is a data record that includes details or attributes of a telephone call, such as an initiation time, a source or originating phone number, a call duration, an identifier of a source phone, an identifier of a cell tower handling the call, etc. CDR data is typically generated by telecommunications equipment as a telephonic communication (e.g., a text message or phone call) passes therethrough and is frequently used by telecommunications companies for billing. Moreover, since CDR data includes a wealth of social and lifestyle information, CDR data is also important for commercial purposes (e.g., targeted advertising), law enforcement purposes, and national security investigations. However, real CDR data is often difficult to obtain without significant justification because CDR data is both massive (e.g., many billions of records per day) and extremely sensitive (commercially and personally). Accordingly, CDR data analysis tools must be built, developed, and tested with synthetic CDR data.
One approach for generating synthetic communication data is to generate random values for the attributes (e.g., date, time, duration, cell tower ID) of each communication. However, in genuine (i.e., real) CDR data, these attributes have complex and subtle correlations that are not accurately modeled by a random selection of values. Consequently, other approaches for generating synthetic CDR data rely on detailed and complex configuration data, such as a contact list of other communication devices, a time-of-day usage profile, and a day-of-week usage profile, for each communication device involved in a simulation. In these approaches, an agent is then effectively used to mimic the behavior of each communication device in the simulation in view of its configuration data. While these approaches may be suitable for generating highly realistic call patterns for a small number of phones, they require massive overhead to generate and store millions of phone configurations, which is undesirable for effectively testing analysis tools that are intended for communication data.
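By way of illustration only, the first approach discussed above (independent random values for each attribute) may be sketched as follows. The record layout, field names, value ranges, and function names are hypothetical and are chosen solely for this example; they are not drawn from any particular CDR format or from any described embodiment.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallDetailRecord:
    """Hypothetical, simplified CDR holding the attributes named in the text."""
    start_time: datetime   # initiation time of the communication
    source_number: str     # originating phone number
    duration_s: int        # call duration in seconds
    device_id: str         # identifier of the source phone
    cell_tower_id: int     # identifier of the cell tower handling the call


def random_cdr(rng, day_start, num_towers=100, max_duration_s=3600):
    """Draw every attribute independently and uniformly (assumed ranges)."""
    offset = timedelta(seconds=rng.randrange(86_400))  # any second of the day
    return CallDetailRecord(
        start_time=day_start + offset,
        source_number=f"555{rng.randrange(10_000_000):07d}",
        duration_s=rng.randrange(1, max_duration_s + 1),
        device_id=f"dev-{rng.randrange(1_000_000):06d}",
        cell_tower_id=rng.randrange(num_towers),
    )


def generate_random_cdrs(count, seed=0):
    """Generate `count` records; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    day_start = datetime(2000, 1, 1)  # arbitrary placeholder date
    return [random_cdr(rng, day_start) for _ in range(count)]


records = generate_random_cdrs(1000)
```

Because each field in this sketch is drawn independently, a call at 3 a.m. is as likely as a call at noon, and the same device may appear at any tower from one record to the next. This is the absence of correlation between attributes that the random-value approach suffers from.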