Generally speaking, a profiler is a tool that aids in analyzing the dynamic behavior of a computer program, e.g., to optimize its performance or to find bugs or other problems. Profiler results can be used to identify inefficient sections of the program, which can then be modified to run faster or more efficiently.
Common profiler technologies include instrumenting and statistical analysis. The instrumenting approach, generally speaking, alters the computer program (or “instruments” it) with additional instructions that report back or log each time a function of interest is entered or exited. Log data from the instrumented code is collected when the program is run. The log data can be used to reconstruct the program flow and examine how much time was spent in each function. Instrumenting can sometimes alter the program's dynamic operation or slow it down because of the extra code that is inserted. In a video game context, the delay introduced by heavy instrumentation can sometimes be large enough to make a video game unplayable or no longer representative of actual dynamic game play.
A statistical profiler does not necessarily significantly alter the computer program. Instead, it may periodically stop the program execution (e.g., based on a timer) to sample where the program is in its execution at that particular instant. By sampling thousands or millions of times, a statistically accurate view of the program execution can be reconstructed.
Traditionally, many or most statistical profilers have been designed as general purpose tools for profiling and analyzing a wide variety of programs. In the past, the specific nature of the program being profiled has often not been exploited in choosing a profiler sampling method. However, depending on the particular program being analyzed, purely deterministic sampling times, where each sample occurs at a fixed interval, may result in some areas of the code being oversampled and other areas being undersampled. This can happen, for example, when the program's own behavior is periodic, so that fixed-interval samples repeatedly land at the same points in its cycle and are not statistically well distributed. On the other hand, statistical profiling with a purely random distribution of sampling points can in some contexts tend to group or cluster samples together, resulting in poor reconstruction.
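By way of non-limiting illustration, the oversampling effect of fixed-interval sampling can be sketched as follows. The frame period, the sampling intervals and the function name are hypothetical values chosen for this sketch only; times are kept as integer microseconds so the arithmetic is exact.

```python
# Hypothetical frame period in microseconds (roughly 30 frames per second).
FRAME_US = 33_333


def fixed_interval_phases(interval_us, n_frames):
    """Return the set of within-frame offsets hit by fixed-interval sampling."""
    total_us = FRAME_US * n_frames
    return {t % FRAME_US for t in range(0, total_us, interval_us)}


# An interval that divides the frame period evenly: every frame is sampled
# at the same three offsets, so code running between them is never seen.
aligned = fixed_interval_phases(FRAME_US // 3, n_frames=1000)
print(sorted(aligned))  # [0, 11111, 22222]

# An interval that does not divide the frame period drifts through the frame
# and reaches many more distinct offsets over the same 1000 frames.
drifting = fixed_interval_phases(11_000, n_frames=1000)
print(len(drifting))
```

In this sketch the aligned interval only ever observes three points of the frame, no matter how many frames are profiled, which is the undersampling problem described above.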
Therefore, there is a need for an efficient method for choosing the sampling points during statistical profiling, based on the nature of the program being profiled.
In real-time video games and other real-time simulations, the video screen is typically refreshed at periodic rates at or above 30 times per second. This visual update frequency often dictates or at least informs the rate at which the simulation is recalculated and updated to provide new positions and animation points for the simulation display. As a result, similar processing takes place from one simulation frame to the next, such that the work done during each frame is roughly correlated to previous and subsequent frames. For example, during each frame, simulation entities may run their decision logic, movement logic, collision logic and other functions. This coherency from frame to frame causes similar code to be executed each simulation frame (e.g., every time through the simulation loop).
The exemplary illustrative non-limiting implementation herein uses the periodic nature of the video game frame refresh rate to provide improved statistical sampling of the executing code. This technique can compensate for the fact that similar code executes during each iteration through the simulation loop. However, instead of sampling at the same time each frame, the sampling points may be deterministically or otherwise specified at points that differ from one frame to the next.
One exemplary illustrative non-limiting method and apparatus for efficient statistical profiling of executing code in an embedded computing device chooses sample points based on characteristic properties of real-time video games and real-time simulation programs. A hybrid random distribution of sampling points can be used, wherein the sampling rate is fixed but is, for example, randomly or otherwise offset relative to the beginning of each simulation time frame. As a result of such a sampling distribution, accurate measurements and comparisons can be deduced from fewer samples. Such samples can have good statistical coverage and provide a good representation of the underlying behavior being sampled.
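By way of non-limiting illustration, the hybrid distribution can be sketched as follows: the sampling interval is fixed, and only the offset of the first sample from the start of each frame changes. All names and constants here are hypothetical. The `spread_offsets` helper uses a golden-ratio sequence, which is an assumption of this sketch rather than a technique stated in the text; it is shown only as one possible way to keep consecutive offsets well separated.

```python
import random

FRAME_US = 33_333    # hypothetical frame period in microseconds
INTERVAL_US = 4_000  # hypothetical fixed sampling interval


def frame_sample_times(frame_start, offset):
    """Sample times for one frame: first sample at frame_start + offset,
    then one sample every INTERVAL_US until the frame ends."""
    return list(range(frame_start + offset, frame_start + FRAME_US, INTERVAL_US))


def random_offset(rng):
    """Purely random offset in [0, INTERVAL_US)."""
    return rng.randrange(INTERVAL_US)


def spread_offsets(n_frames):
    """Assumed alternative: golden-ratio offsets, so that the offsets of
    consecutive frames never land close to one another."""
    golden_fraction = 0.6180339887498949
    return [int(((k * golden_fraction) % 1.0) * INTERVAL_US) for k in range(n_frames)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
random_schedule = [
    frame_sample_times(i * FRAME_US, random_offset(rng)) for i in range(4)
]
offsets = spread_offsets(16)
spread_schedule = [
    frame_sample_times(i * FRAME_US, off) for i, off in enumerate(offsets)
]
```

Within each frame the samples remain evenly spaced, so they cannot cluster, while the changing offset moves the sampled points through the frame from one frame to the next.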
Additionally, profiling executing code in an embedded device, such as a video game console, can confront system constraints such as limited on-board memory and relatively slow upload speed to a PC (personal computer) or other device. A statistical profiler may generate an extremely large amount of data in the embedded device relative to the available memory. Therefore, it is desirable to store the accumulated data efficiently, since otherwise it may not be possible to stream it to a connected PC or other device quickly enough without impacting overall system performance. Once profiling has been completed, an efficient representation of the created data is desirable, so that the collected data can then be transferred to a PC or other device in a reasonable amount of time. Transferring the collected raw data might take considerably longer. The disclosed exemplary illustrative non-limiting statistical profiler accomplishes these and other goals, resulting in operation fast enough that the game can be played while the profiling is being performed in real time.
In one exemplary illustrative non-limiting embodiment, a list of function addresses required for the profiling process is provided to or generated by the embedded device before profiling begins. The list does not necessarily contain all of the function addresses and their sizes, but may for example contain only the function start addresses. In this way, a desired reduction in the amount of stored information is achieved. In one exemplary implementation, the call stack data that is created by the statistical sampling profiler can be transformed to contain only start addresses. This allows sample data to be sorted and accumulated on the fly, dramatically reducing the amount of stored data. Such an exemplary efficient representation of statistical profiling samples can alleviate problems that can occur when profiling in an embedded device that lacks sufficient memory and/or adequate communication speed. The disclosed exemplary illustrative non-limiting technique of efficiently extracting and representing function addresses provides the ability to store thousands or millions of samples using orders of magnitude less memory than storing raw data, which also leads to faster transmission times to a connected PC or other analysis device.
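By way of non-limiting illustration, the transformation of sampled addresses to function start addresses, followed by on-the-fly accumulation, can be sketched as follows. The addresses and function names are hypothetical. Because this sketch assumes that only start addresses are known (no function sizes), an address past the last start is attributed to the last function.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical sorted list of function start addresses sent to the device.
FUNCTION_STARTS = [0x1000, 0x1040, 0x10A0]


def to_function_start(sorted_starts, address):
    """Map a sampled address to the start of the function containing it,
    using a binary search over the sorted start addresses."""
    i = bisect_right(sorted_starts, address) - 1
    if i < 0:
        return None  # address lies below every known function
    return sorted_starts[i]


def normalize_stack(sorted_starts, stack):
    """Replace every address in a sampled call stack by its function start."""
    return tuple(to_function_start(sorted_starts, a) for a in stack)


# Two raw samples that differ in every address but lie in the same functions.
raw_samples = [
    (0x1008, 0x1044, 0x10B0),
    (0x1010, 0x1050, 0x10A4),
]

# After normalization identical stacks collapse into one counted entry.
accumulated = Counter(normalize_stack(FUNCTION_STARTS, s) for s in raw_samples)
```

Once every address is a start address, samples taken anywhere inside the same chain of functions compare equal, which is what allows them to be counted rather than stored individually.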
Additional non-exhaustive exemplary non-limiting features include, for example:
A statistical sampling profiler that takes the repetitive nature of the executing program code being profiled into account when choosing the timing and spacing of the samples.
A sampling algorithm that waits a random amount of time from the beginning of each simulation frame to begin sampling and then samples at regular intervals until the completion of the frame, at which time the sampling ceases until a next or subsequent frame.
A sampling algorithm that ensures the random amount of time to wait from the beginning of a frame is significantly different from that of previous frames, so that samples across an arbitrary number of consecutive frames are well distributed with respect to one another.
A statistical sampling profiler that sends a list of the starting addresses of functions to the embedded device on which profiling will be performed.
During profiling on the embedded device, each sample call stack that is recorded is transformed so that each address represents the starting address of the function that was sampled.
Each address transformation is accomplished by performing a binary search on the list of starting addresses, finding a match in O(log N) time. Other comparable methods such as a hash table could also be employed.
Given each transformed sample call stack, a binary tree representation known as “first child-next sibling” is used to store the data. This represents the tree structure of the call graph and efficiently packs the data into an array that grows as more samples are added.
A parallel array stores the frequency counts of each node in the “first child-next sibling” call graph tree. This separation from the call graph binary tree makes both data structures more cache and memory friendly, since this frequency count array is updated more frequently than the call graph binary tree.
To retain individual samples, an additional step is to save a pointer to the leaf node function in the call graph tree for each sample taken. Coupled with the call graph, this reduces each profile sample down to a single pointer, as opposed to the individual addresses of the entire call stack.
If memory allows, the raw profiling data can be stored as it is collected and then transformed at the conclusion of profiling to reduce its memory size.
When profiling is complete, the call graph tree and either the frequency count array or the sample pointers are transmitted to the PC from the embedded device, thus achieving greater speed than transmitting the original raw sample data.
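By way of non-limiting illustration, a “first child-next sibling” call graph stored in growing arrays, with a parallel frequency count array and one leaf index per retained sample, can be sketched as follows. The class name, field names and addresses are hypothetical. This sketch assumes that each call stack is ordered from the outermost caller to the sampled function and that a sample is counted only at its leaf node.

```python
class CallGraph:
    """Call graph in first child-next sibling form, stored as parallel arrays."""

    def __init__(self):
        # Node 0 is a synthetic root; -1 means "no node".
        self.func = [None]        # function start address of each node
        self.first_child = [-1]   # index of the first child of each node
        self.next_sibling = [-1]  # index of the next sibling of each node
        self.counts = [0]         # parallel array of frequency counts

    def _child(self, parent, func):
        """Find the child of `parent` for `func`, appending it if missing."""
        node = self.first_child[parent]
        while node != -1:
            if self.func[node] == func:
                return node
            node = self.next_sibling[node]
        node = len(self.func)
        self.func.append(func)
        self.first_child.append(-1)
        # The new node becomes the head of the parent's child list.
        self.next_sibling.append(self.first_child[parent])
        self.counts.append(0)
        self.first_child[parent] = node
        return node

    def add_sample(self, stack):
        """Record one transformed call stack and return its leaf node index."""
        node = 0
        for func in stack:
            node = self._child(node, func)
        self.counts[node] += 1
        return node


MAIN, UPDATE, DRAW, COLLIDE = 0x1000, 0x1040, 0x10A0, 0x1100  # hypothetical
graph = CallGraph()
samples = [(MAIN, UPDATE, COLLIDE), (MAIN, DRAW), (MAIN, UPDATE, COLLIDE), (MAIN, UPDATE)]
# Each retained sample is reduced to a single index into the call graph.
leaves = [graph.add_sample(s) for s in samples]
```

In this sketch the tree arrays change only when a new call path is first seen, while the count array is touched on every sample, which reflects the separation between the two structures described above.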