An explosion of computing devices, both stationary and portable, is currently underway. These computing devices include smartphones, laptops, tablets, phablets, etc. Demand for them is made more acute by the rapid evolution of social media, which encourages people to communicate and stay in contact with others. This creates continual competition among manufacturers to produce better and faster computing devices.
However, an issue common to all these computing devices, both during development and afterwards, is reliability, i.e., the robustness, stability and interaction of their hardware and software. Most computer users have experienced the “blue screen of death,” i.e., the error screen that some computers display after a fatal system failure, whatever its cause. Launching a computing device that suffers from this type of issue can be very damaging to its manufacturer.
To prevent or minimize this issue, manufacturers expend great effort (in both time and money) to debug and resolve potential errors in new devices. The faster these errors are found and corrected, the sooner the devices can be launched, generating income and positive press.
Because manufacturers fear launching an error-prone computing device, some of them limit the initial production volume of a new device. Then, if customers or the manufacturer's engineers find a problem early, negative public feedback is minimized. However, this approach limits the manufacturer's income potential. When an error is found, the manufacturer's priority becomes identifying which hardware component or line(s) of code caused the problem, so that adjustments can be made and production can be corrected and ramped up. If the error detection process is lengthy, which may be the case for complex computing devices, the product's future may be compromised.
Thus, there is interest in mechanisms that quickly identify which hardware and/or code produces errors. A common and difficult error to solve in new computing devices is a system hang (or freeze). A hang can have several causes, two of which are now discussed. The first is a programming error, e.g., the computing device's software contains an infinite loop. More specifically, a kernel driver may spin indefinitely on a spinlock. The second is an access by the processor to a hardware block that is not clocked or is powered off. In this situation, the processor enters a true lockup, during which it freezes and does not respond to any interrupts. In the first example, it is possible to regain control of the processor, for example, via a high-priority interrupt that a watchdog triggers before it performs an actual system reset. In this interrupt handler, the operator may be able to dump as much information on the state of the device as possible (e.g., the content of the processor registers and of different memories). An engineer can then use this information to determine the cause of the system hang. In the second example, finding the root cause of the freeze is much more difficult. The locked-up processor does not service the watchdog interrupt, so no such state dump can be taken before the reset. Depending on the platform and hardware used in a device, different causes may be responsible for freezes or system failures.
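The difference between the two hang scenarios can be illustrated with a toy simulation. The sketch below is a hypothetical model, not a description of any particular watchdog. It assumes a two-stage watchdog, a behavior some watchdog timers provide: the first expiry raises a high-priority interrupt, and a second expiry without a kick resets the system. The names `Watchdog`, `run_until_reset` and the register values are illustrative assumptions. In the spinlock case the processor still answers the interrupt, so the handler saves its state. In the lockup case the interrupt goes unanswered and the reset arrives with nothing saved.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Hypothetical two-stage watchdog: the first expiry raises a high-priority
    interrupt; a second expiry without a kick resets the system."""
    timeout: int
    counter: int = 0
    warned: bool = False

    def kick(self) -> None:
        # Healthy software restarts the countdown periodically.
        self.counter = 0
        self.warned = False

    def tick(self):
        """Advance one time unit; return "irq", "reset" or None."""
        self.counter += 1
        if self.counter < self.timeout:
            return None
        self.counter = 0
        if not self.warned:
            self.warned = True
            return "irq"
        return "reset"


def run_until_reset(cpu_answers_irq: bool, max_ticks: int = 20):
    """Simulate a hung CPU that never kicks the watchdog; return the state
    dump saved by the interrupt handler, or None if none was taken."""
    wdt = Watchdog(timeout=3)
    registers = {"pc": 0x1234, "sp": 0x0F00}  # hypothetical processor state
    dump = None
    for _ in range(max_ticks):
        event = wdt.tick()
        if event == "irq" and cpu_answers_irq:
            # Spinlock case: the processor still takes interrupts, so the
            # handler can save its state before the reset.
            dump = dict(registers)
        elif event == "reset":
            break
    return dump


print(run_until_reset(cpu_answers_irq=True))   # prints {'pc': 4660, 'sp': 3840}
print(run_until_reset(cpu_answers_irq=False))  # prints None
```

The second call models the hardware lockup: the watchdog still resets the device, but the information an engineer would need has never been captured.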
One approach for identifying errors in a new computing device is now discussed with regard to U.S. Pat. No. 7,447,946 (the '946 patent herein), the entire content of which is incorporated herein by reference. As illustrated in FIG. 1, which corresponds to FIG. 1 of the '946 patent, a data processing apparatus 10 is a System-on-Chip (SoC) device having plural master devices 30 and 40 connected through a bus 20 to a slave device 60. Master devices 30 and 40 connect to bus 20 through a cache 50. Other system devices and peripherals 65 may be connected to bus 20. Some of these are master devices and others are slave devices.
Cache 50 is configured to store information related to the activities of the master devices. Based on this information, an engineer may identify an error in data processing apparatus 10. Thus, the '946 patent solution relies on an existing cache connected to plural master devices. This information is saved in that cache together with the data the cache would normally store. However, the master devices among the other system devices and peripherals 65 do not have a cache. Examples include modem hardware accelerators, graphics accelerators, direct memory access (DMA) controllers, video accelerators, and microcontrollers such as ARM's M0-M4 series. Consequently, if one of these master devices initiates a bus transaction that hangs the system, the engineer will find no information in cache 50 about that master device's activity prior to the freeze. This makes it impossible to determine why the system hung.
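The blind spot described above can be shown with a minimal model. The sketch below assumes that cache 50 can only log transactions that actually pass through it. It uses hypothetical labels: "master30" and "master40" for the FIG. 1 masters behind the cache, and "dma65" for a cacheless master among devices 65. It is an illustration of the argument, not the '946 patent's implementation. The last transaction, issued by the DMA master, is the one that hangs the bus, yet the most recent entry the cache can offer belongs to a different master.

```python
from collections import deque

# Hypothetical labels for the FIG. 1 masters: 30 and 40 reach bus 20 through
# cache 50; "dma65" stands for a cacheless master among devices 65.
CACHED_MASTERS = {"master30", "master40"}


class CacheTrace:
    """Toy model of activity information kept in cache 50."""

    def __init__(self, depth: int = 4):
        self.log = deque(maxlen=depth)  # most recent transactions only

    def observe(self, master: str, addr: int) -> None:
        # Only transactions that pass through cache 50 can be recorded there.
        if master in CACHED_MASTERS:
            self.log.append((master, addr))


def last_recorded(traffic):
    """Return the last transaction the cache saw before the hang, if any."""
    trace = CacheTrace()
    for master, addr in traffic:
        trace.observe(master, addr)
    return trace.log[-1] if trace.log else None


# The final access, issued by the cacheless DMA master, hangs the bus.
traffic = [("master30", 0x1000), ("master40", 0x2000), ("dma65", 0x4000_0000)]
print(last_recorded(traffic))  # prints ('master40', 8192)
```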
In addition, because the '946 patent focuses on reusing part of an existing cache, the data processing apparatus will typically trace transactions only when it is known in advance that a problem will be debugged or performance investigated. While this can be useful, it is very limiting and requires advance preparation, which may be a luxury when debugging a device. This is especially true for a modem. There, network signaling and current radio conditions can result in very different system behavior that may take a long time to reproduce during troubleshooting.
Further, reserving part of an existing cache to trace transactions during debug sessions is intrusive, because that part of the cache cannot be used to cache the “usual data” at the same time. Furthermore, the '946 patent does not address at all the loss of the information stored in the cache when the data processing apparatus is reset.
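The three drawbacks discussed above can be summarized in one toy model. The first is that tracing must be enabled in advance. The second is that traced lines are unavailable for usual data. The third is that the trace is lost on reset. The class `TracingCache` below and its one-entry-per-line assumption are hypothetical and chosen only for illustration. The reset behavior is modeled as described in the text and is not taken from any specific hardware.

```python
from dataclasses import dataclass, field


@dataclass
class TracingCache:
    """Toy model (hypothetical) of a cache that lends some lines to tracing."""
    total_lines: int
    trace_lines: int = 0  # 0 means tracing was not enabled for this session
    trace: list = field(default_factory=list)

    @property
    def data_lines(self) -> int:
        # Lines lent to tracing cannot hold the "usual data" at the same time.
        return self.total_lines - self.trace_lines

    def record(self, entry) -> None:
        if self.trace_lines == 0:
            return  # not set up in advance: the event is simply not captured
        self.trace.append(entry)
        # Assume one entry per line; keep only the most recent entries.
        del self.trace[:-self.trace_lines]

    def reset(self) -> None:
        # Modeled per the text: trace information held in the cache is lost on reset.
        self.trace.clear()


debug = TracingCache(total_lines=8, trace_lines=2)
for event in ["rd 0x1000", "wr 0x2000", "rd 0x4000"]:
    debug.record(event)
print(debug.data_lines, debug.trace)  # prints 6 ['wr 0x2000', 'rd 0x4000']
debug.reset()
print(debug.trace)  # prints []

production = TracingCache(total_lines=8)
production.record("rd 0x4000")
print(production.trace)  # prints []
```

The debug instance loses a quarter of its capacity and still forgets everything on reset. The production instance, which was never configured for tracing, captures nothing at all.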
Thus, there is a need for a mechanism and a method capable of recording and preserving information related to an error affecting a computing device, even if the computing device shuts down. Accordingly, it would be desirable to provide devices, systems and methods that avoid the aforementioned problems and drawbacks.