The present invention relates to software diagnostics, and more particularly to generating diagnostic data in a distributed software environment.
In a client-server environment or other distributed software environment, applications typically run as separate processes on different machines. Diagnosing software issues or bugs that result from an interaction between some of these applications requires diagnostic data from all the machines involved in the interaction. If the software issues are intermittent, diagnosing them poses a major challenge for software engineers, especially due to the distributed nature of the environment. In a client-server application running in two different virtual machines hosted on two different physical machines, an intermittent software issue may appear only after the application has run for an extended period of time. This period may span many days or weeks.
A programming bug that causes a software issue may reside on either side of a client-server program. Erroneous data representation over the network and protocol rule violations are common sources of such software issues. For example, when a client sends a remote request to a server to create a very large array, an out-of-memory error can result if the length of the array is misread at the receiving end of the request, or if the length is written incorrectly into the network stream when the request is sent and the incorrectly written length is a huge number. In this case, because the programming bug could reside on either side of the client-server environment, diagnostic data from only the failing end does not provide clear and complete diagnostic information about the programming bug.
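The array-length example can be illustrated with a minimal sketch. It assumes a hypothetical wire format in which the length is a 4-byte unsigned integer, and it assumes a byte-order mismatch as the particular bug; the function names and the format are illustrative only and are not taken from any specific system.

```python
import struct


def encode_request(length: int) -> bytes:
    """Client side (hypothetical): write the array length as 4 bytes, big-endian."""
    return struct.pack(">I", length)


def decode_request(payload: bytes) -> int:
    """Server side, correct: read the length using the same byte order."""
    (length,) = struct.unpack(">I", payload)
    return length


def decode_request_buggy(payload: bytes) -> int:
    """Server side, buggy: read the same 4 bytes as little-endian."""
    (length,) = struct.unpack("<I", payload)
    return length


payload = encode_request(16)             # client asks for a 16-element array
intended = decode_request(payload)       # length as the client meant it
misread = decode_request_buggy(payload)  # length as the buggy server sees it
```

In this sketch the failure would surface on the server as an oversized allocation, yet the bytes on the wire are consistent with either a reading error on the server or a writing error on the client. Distinguishing the two requires diagnostic data from both ends.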
In cloud-based solutions, microservices, and grid-based software, multiple individual software entities are written in different programming languages for different runtime environments, run on different operating systems, and interact according to diverse protocol rules. In such cases, highly intermittent software issues are challenging to debug, analyze, and eventually resolve.
Known diagnostic techniques include enabling diagnostics on all virtual machines involved for the entire lifespan of an application until an intermittent software issue occurs. These techniques are impractical because they degrade performance and generate extremely large amounts of diagnostic data over days or weeks. Generating such a large amount of diagnostic data may exhaust the space used to store the diagnostic data or cause earlier diagnostic data to be overwritten.
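The overwriting problem can be sketched as follows, assuming diagnostic records are kept in a fixed-capacity buffer that discards the oldest record when full. The capacity and record names are hypothetical and deliberately tiny for illustration.

```python
from collections import deque

TRACE_CAPACITY = 3  # hypothetical capacity, far smaller than any real trace buffer

# A bounded buffer: appending to a full deque silently drops the oldest entry.
trace = deque(maxlen=TRACE_CAPACITY)

for event_id in range(1, 6):
    trace.append(f"event-{event_id}")

retained = list(trace)  # only the most recent TRACE_CAPACITY records survive
```

If the records that explain an intermittent issue were among the earliest ones written, they may no longer be available by the time the issue is observed.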
Other known diagnostic techniques take process snapshots, such as system dumps, when a software issue occurs. A process snapshot is specific to a particular virtual or physical machine and does not provide information about what was happening on another virtual or physical machine that is involved in the application when the issue occurs. The process snapshot provides information about the state of various attributes at the specific point when the issue occurred but does not include any historical information leading up to that time. In a client-server environment, the scenarios leading to a software issue build up over a period of time, and a process snapshot cannot capture information related to that period; it captures only a static view of one virtual or physical machine at the specific time at which the issue occurs.