1. Field of the Invention
The present invention relates generally to communications networks. More particularly, the present invention relates to efficiently collecting large amounts of raw data over a communications network.
2. Background of the Invention
Service providers and network operators are becoming more competitive and offering increasingly diverse services. At the same time, subscribers to these services (personal and business) are demanding more niche services that they can exploit on their high-end wireless and broadband devices. There is increasing pressure on service providers to offer quality services tuned to the subscribers' needs. This involves monitoring a subscriber's use of a service or an application, as well as monitoring other real-time and event-related data.
Data monitored typically includes various packets of raw data generated at several points within and outside the network. This includes records related to usage of services and applications, as well as real-time information. These records are generated by a number of network nodes deployed across a large region such as a nationwide telecommunications network, including various locations, data centers, regional distribution centers, and Mobile Switching Centers (MSCs). An example of such raw data is a Call Detail Record (CDR), generated by a switching center or MSC when a user makes a telephone call. Similar records are generated when a user accesses a particular service or application. With today's IP-based networks, actions of a user can be tracked to provide quality personalized service and to increase the operator's revenue. Furthermore, compliance with recent federal wiretap laws requires an efficient and comprehensive database of calls made and services accessed.
The general idea is to collect and aggregate all the data and store it in what is called a data warehouse. At a high level, charging systems collect large volumes of raw data, correlate it to the services used, and then process the correlated information. For instance, a mediation unit within a billing system is a collection point for raw data. Mediation uses data such as billing records, charging records, and Call Detail Records (CDRs), and correlates the data according to charging rules. The data packets (wrappers) are then sent downstream to a billing system, which correlates and rates those records against the subscriber's profile. There are many similar uses for raw data. Packets of raw data from various sources can be sent to a data warehouse, or to specialized data marts, for purposes of service solution analytics, device identification analytics (using CDRs), network usage analytics, etc. Rich Internet Protocol (IP) services have advertising layers that require historical information combined with real-time information about the subscriber in order to provide such services.
The existing approach, however, is inefficient when dealing with the increasing quantity of information generated every day. Presently, each application typically has its own sorting and filtering mechanism and its own warehouse or data mart for storing the data. The network nodes have the duty of sorting the packets and sending them to the correct data warehouse. Alternatively, each node could send data to all of the warehouses, with the warehouses sorting the wrappers to place them in desired locations. What exists today involves a collection interface, a transformation layer that converts incoming data into a usable format (for example, an FTP file containing historical HTTP information that needs to be transformed to raw HTTP), and a data sifting layer that determines what information is usable and what is not. After this, the information is correlated into something meaningful.
This creates inefficiencies at both the nodal and warehouse levels. A specific CDR may be useful to more than one application or warehouse. However, a network switch programmed to deliver this CDR to multiple destinations must format the CDR to match the requirements of each destination node and then deliver the CDR to that node. Network nodes traditionally devote their processing capacity to their interfaces, so even sending CDRs to multiple places is processor intensive. A provider of an external application or a warehouse operator must burden the network operators with determining whether they can deliver multiple call records in parallel. The network operators do not want to be responsible for increasing the number of interfaces to which they send data, given their limited capability.
What is needed is the ability to collect large amounts of information from several network sources or web portal sources, and to stream that information to a collector. The collector should be able to orchestrate the information, dismiss duplicate packets, and send the information to multiple data warehouses or destinations.
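By way of illustration only, the collector behavior described above (receiving streamed records, dismissing duplicates, and sending each record to multiple destinations) might be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the names RawRecord, Collector, subscribe, and ingest are hypothetical, each record is assumed to carry a unique identifier by which duplicates can be recognized, and destinations are modeled as simple in-memory lists standing in for data warehouses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class RawRecord:
    record_id: str  # assumed unique identifier used to detect duplicates
    kind: str       # record type, e.g. "CDR"
    payload: str    # raw content as received from the network node


# A destination is modeled as any callable that accepts a record.
Destination = Callable[[RawRecord], None]


class Collector:
    """Hypothetical collector: deduplicates records and fans them out."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._routes: Dict[str, List[Destination]] = {}

    def subscribe(self, kind: str, destination: Destination) -> None:
        # Register a destination for every record of the given type.
        self._routes.setdefault(kind, []).append(destination)

    def ingest(self, record: RawRecord) -> bool:
        # Dismiss a record whose identifier has already been seen.
        if record.record_id in self._seen:
            return False
        self._seen.add(record.record_id)
        # Deliver the record once to each destination subscribed to its type.
        for deliver in self._routes.get(record.kind, []):
            deliver(record)
        return True


# Usage: two stand-in warehouses both receive CDRs from a single stream.
billing_warehouse: List[RawRecord] = []
analytics_warehouse: List[RawRecord] = []

collector = Collector()
collector.subscribe("CDR", billing_warehouse.append)
collector.subscribe("CDR", analytics_warehouse.append)

cdr = RawRecord(record_id="cdr-001", kind="CDR", payload="caller=A;callee=B")
first_accepted = collector.ingest(cdr)
second_accepted = collector.ingest(cdr)  # same record arriving a second time
```

In this sketch the network node sends each record once, and the fan-out to multiple destinations is performed by the collector rather than by the node.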