Data science has been widely used to extract insights from large volumes of data generated by customer behavior in order to drive business decisions. For example, in electronic commerce (“e-commerce”), merchants use data science to analyze the online activities of customers and to predict customer behavior and preferences, which enables them to strategize procurement, sales, inventory, transportation, delivery, and other aspects of their business processes. One of the major sources of information about the online activities of customers is web traffic data, such as log data generated when an individual visits a web data service (e.g., a website or a mobile application) using a device (e.g., a computer or a smartphone). In many situations, the web traffic data may be collected as a string of characters (e.g., a uniform resource identifier or “URI”) that records useful information representing a customer's interaction with the data service. Analysts may then use the collected web traffic data to perform an analysis.
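By way of illustration only, the following minimal sketch shows how information recorded in such a URI might be recovered for analysis. The function name `parse_traffic_uri`, the host `shop.example.com`, and the parameter names `event`, `item_id`, and `device` are hypothetical and are not taken from any particular existing solution.

```python
from urllib.parse import urlsplit, parse_qs


def parse_traffic_uri(uri: str) -> dict:
    """Split a logged URI into its host, path, and query parameters."""
    parts = urlsplit(uri)
    # parse_qs returns a list of values per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "params": params}


# Hypothetical logged URI describing one customer interaction.
record = parse_traffic_uri(
    "https://shop.example.com/log?event=add_to_cart&item_id=12345&device=smartphone"
)
```

In this sketch, `record["params"]` would hold the event type, the item identifier, and the device type as strings, which an analyst could then aggregate across many logged URIs.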
Some existing solutions for web traffic data collection and analysis are not adaptable to different formats of log data. For those solutions, an unannounced change to the log data format may cause inaccuracy in downstream analysis. Moreover, those solutions are not customizable enough to collect various types and structures of log data. Analysts may require different formats and contents of the log data, or may need to disregard or stop using some formats or contents of the log data. However, those existing solutions lack such capability, which may result in duplicate information in the same log data. Furthermore, those solutions might not be able to validate the correctness of the log data, such as whether the log data conforms to a required format or a required data type. When the log data is corrupted, those solutions might not be able to detect the corruption and inform the analysts of the log data.
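For illustration only, the kinds of checks that such solutions may lack can be sketched as follows. This is a minimal example under stated assumptions and does not represent any particular implementation: the schema `SCHEMA`, the function `validate_log_uri`, and the field names `event`, `item_id`, and `quantity` are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical, customizable schema: field name -> (required?, type converter).
SCHEMA = {
    "event": (True, str),
    "item_id": (True, int),
    "quantity": (False, int),
}


def validate_log_uri(uri: str, schema: dict = SCHEMA) -> list:
    """Return a list of problems found in the query string of a logged URI."""
    raw = parse_qs(urlsplit(uri).query)
    errors = []
    for name, (required, convert) in schema.items():
        values = raw.get(name)
        if values is None:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if len(values) > 1:
            errors.append(f"duplicate field: {name}")
        try:
            convert(values[0])  # check the required data type
        except ValueError:
            errors.append(f"wrong type for field: {name}")
    # Fields that are not in the schema may indicate an unannounced format change.
    for name in raw:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors
```

In this sketch, an empty list indicates that the logged URI passed every check, while a non-empty list could be reported to the analysts. Changing the entries of `SCHEMA` changes which fields are collected and validated without changing the function itself.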
Therefore, there is a need for dynamic, customizable, and near-real-time collection and validation of web traffic data.