In a typical distributed application deployment, the web server acts as the entry point for web requests coming from clients. Each request that passes through the web server is recorded in the server log as a log entry. A web server log therefore holds an entry for every event that occurs on the web server, and thus on the applications behind it. Each entry provides information about a single request made to the web application. Together, the entries help in understanding how end users use the application; in short, the user behavior with the application.
Depending on the configuration settings of the web server, some or all of the standard fields are logged in the production web logs. The web servers in production then automatically log the configured fields for every event invoked by an end user.
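As an illustration of what a single log entry carries, the following minimal sketch parses one line into named fields. It assumes the NCSA Common Log Format; the text above does not say which fields a given deployment logs, so the pattern, the field names and the sample line are assumptions for illustration only.

```python
import re
from datetime import datetime

# Assumed layout: NCSA Common Log Format. A real server may be configured
# to log a different set of fields, in which case the pattern must change.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_entry(line):
    """Return the fields of one log entry as a dict, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields that are useful as typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical sample entry, used only to exercise the parser.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_log_entry(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Each parsed entry corresponds to one request, which is the unit that any later analysis works from.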
MapReduce is a framework proposed by Google for processing large data sets by parallelizing the processing task across clusters of computers. The problem at hand is decomposed into many small portions of work, called map and reduce jobs. These jobs are passed on to worker nodes in the cluster, each acting as a mapper or a reducer, for processing.
A mapper accepts a set of key-value pairs as input and applies a user-defined map function to generate intermediate key-value pairs. The output from multiple mappers is grouped by intermediate key and passed on to the reducers. A reducer merges the intermediate values belonging to the same intermediate key using the user-defined reduce function.
The underlying MapReduce implementation thus takes care of parallelizing and executing the programs on a large cluster of computer nodes.
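The map, group and reduce steps described above can be sketched in a few lines. The example below is a single-process illustration, not a distributed implementation: `run_mapreduce`, `map_fn`, `reduce_fn` and the two-column record layout are hypothetical names and assumptions chosen for the sketch. It counts requests per URL.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: for one log line, emit an intermediate (url, 1) pair.

    key is the record offset (unused); value is assumed to look like
    "<client> <url>", a simplified layout chosen for this sketch.
    """
    parts = value.split()
    if len(parts) >= 2:
        yield parts[1], 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key and reduce sequentially in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # group intermediate values by key
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Hypothetical input records: (offset, log line).
records = [
    (0, "u1 /home"),
    (1, "u2 /home"),
    (2, "u1 /cart"),
]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

In a real framework the calls to `map_fn` and `reduce_fn` would run on different worker nodes, and the grouping step would be carried out by the framework between the two phases.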
Apache Hadoop is one such implementation of the MapReduce framework. It allows distributed computing in a reliable and scalable way. It follows a master/slave architecture. The master, called the jobtracker, decomposes the job into tasks and passes them on to its slaves, called tasktrackers, for execution and re-execution.
To support such distributed computing over the data, Hadoop also provides a file system that can itself be distributed across multiple computer nodes, named HDFS (Hadoop Distributed File System).
Analysis of user behavior in web applications is important because it provides the insights needed to improve customer satisfaction by offering a better experience. Web servers provide logs that contain information related to user behavior, such as the web pages the user traversed, the time the user spent performing an action, the time spent thinking, and so on. Analyzing web logs can therefore provide useful information about customer behavior.
With the increasing use of the internet, there are millions of users, resulting in huge log files, so there is a need for a scalable solution. There is also a need to increase the automation of log analysis so that less human intervention is needed as web applications change.
There are solutions that analyze web logs and provide transaction analytics when transaction definitions are given. There are also solutions that provide analytics at the URL level, which can then be aggregated at the transaction level if the transaction definitions are provided.
Log analytics are typically performed at the URL/page level by providing related metrics such as page views, workload by status code, response times, etc. For any transaction-level analysis, the transaction definition has to be provided by the business, which defines the URL sequences that make up each transaction. There are methodologies that can identify patterns in the URL sequences, providing a map-like structure of the URL accesses.
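The URL-level metrics mentioned above can be illustrated with a short sketch. It assumes log entries that have already been parsed into dictionaries; the field names `url`, `status` and `response_ms`, and the function name `url_level_metrics`, are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

def url_level_metrics(entries):
    """Compute page views per URL, request counts per status code,
    and mean response time per URL from parsed log entries."""
    page_views = Counter()
    status_counts = Counter()
    total_time = defaultdict(float)
    for e in entries:
        page_views[e["url"]] += 1
        status_counts[e["status"]] += 1
        total_time[e["url"]] += e["response_ms"]
    mean_response = {url: total_time[url] / page_views[url] for url in page_views}
    return page_views, status_counts, mean_response

# Hypothetical parsed entries, for illustration only.
entries = [
    {"url": "/login", "status": 200, "response_ms": 120},
    {"url": "/login", "status": 200, "response_ms": 80},
    {"url": "/cart", "status": 500, "response_ms": 300},
]
page_views, status_counts, mean_response = url_level_metrics(entries)
print(page_views, status_counts, mean_response)
```

Metrics of this kind describe individual URLs only; grouping them into transactions still requires knowing which URL sequences belong together.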
The drawback of the above-mentioned prior art is that, for the business to provide transaction definitions, it needs complete domain knowledge together with thorough site map information. This level of information about all the available resources is most often not available to the business, and the development teams do not have the necessary domain knowledge.
There is therefore a chance that key transaction definitions are not provided, so that critical information about those transactions is not identified. What follows is guesswork to predict end-user behavior, which adds to the effort required and to the inaccuracies in the results extracted. The transaction definitions also need to be updated whenever the web application is modified, so the web log analysis can go out of sync with the web application if it is not updated periodically.
Thus there is a need for a way to automatically identify the probable transactions from the historical log data collected. This provides a mechanism for discovering analytics at the business transaction level, where the transactions are identified from actual user behavior rather than from guesswork.
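To make the idea of identifying probable transactions from historical data concrete, the sketch below shows one naive approach: counting how many users followed each consecutive run of URLs and keeping the runs that recur. This is not necessarily the method of the present disclosure; the function name `candidate_transactions`, the `(user, timestamp, url)` tuple layout, and the `length` and `min_support` parameters are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def candidate_transactions(entries, length=2, min_support=2):
    """Return URL sequences of the given length followed by at least
    min_support distinct users. entries are (user, timestamp, url) tuples."""
    per_user = defaultdict(list)
    # Order each user's requests by time to recover the URL sequence.
    for user, _timestamp, url in sorted(entries, key=lambda e: (e[0], e[1])):
        per_user[user].append(url)
    counts = Counter()
    for urls in per_user.values():
        seen = set()
        for i in range(len(urls) - length + 1):
            seen.add(tuple(urls[i:i + length]))
        counts.update(seen)   # count each sequence at most once per user
    return {seq: n for seq, n in counts.items() if n >= min_support}

# Hypothetical historical log data.
entries = [
    ("u1", 1, "/login"), ("u1", 2, "/search"), ("u1", 3, "/checkout"),
    ("u2", 1, "/login"), ("u2", 2, "/search"),
    ("u3", 1, "/home"),
]
candidates = candidate_transactions(entries)
print(candidates)
```

Because each user's sequence is processed independently and the counts are merged by key, a computation of this shape maps naturally onto mappers and reducers.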
The present disclosure proposes the use of distributed file systems and MapReduce frameworks. This helps to reduce the resource requirements and the time consumed in running the necessary algorithms, making the complete process efficient and therefore feasible. The method of the present disclosure enables automation of transaction identification and transaction analytics. It improves on existing transaction analytics solutions by automating transaction identification from web logs.
The feature of the present invention lies in providing a better solution for transaction identification from web logs: a method for automated web transaction identification from web logs that applies the MapReduce framework to provide automated transaction analysis, so that processing can be parallelized and completed faster.