1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for managing hardware and software components. Still more particularly, the present invention provides a method and apparatus for automatically recognizing, tracing, diagnosing, and recovering from problems in hardware and software components to achieve functionality requirements.
2. Description of Related Art
Modern computing technology has resulted in immensely complicated and ever-changing environments. One such environment is the Internet, which is also referred to as an “internetwork.” The Internet is a set of computer networks, possibly dissimilar, joined together by means of gateways that handle data transfer and the conversion of messages from a protocol of the sending network to a protocol used by the receiving network. When capitalized, the term “Internet” refers to the collection of networks and gateways that use the TCP/IP suite of protocols. Currently, the most commonly employed method of transferring data over the Internet is to use the World Wide Web environment, also called simply “the Web.” Other Internet resources exist for transferring information, such as File Transfer Protocol (FTP) and Gopher, but have not achieved the popularity of the Web. In the Web environment, servers and clients effect data transactions using the Hypertext Transfer Protocol (HTTP), a known protocol for handling the transfer of various data files (e.g., text, still graphic images, audio, motion video, etc.). The information in various data files is formatted for presentation to a user by a standard page description language, the Hypertext Markup Language (HTML). The Internet also is widely used to transfer applications to users using browsers. Oftentimes, users may search for and obtain software packages through the Internet.
While computer technology has become more powerful, it has also become more complex. As the complexity and heterogeneity of computer systems continue to increase, it is becoming increasingly difficult to diagnose and correct hardware and software problems. As computing systems become more autonomic (i.e., self-regulating), this challenge will become even greater for several reasons. First, autonomic computing systems, being self-configuring, will tend to work around such problems, making it difficult to recognize that anything is wrong.
Second, problems will become harder to trace to their source because of the more ephemeral relationships among elements in the autonomic system. In other words, the set of elements that participated in the failure may no longer be connected to one another by the time the problem is noticed, making reconstruction of the problem very difficult. A number of publications address the topic of problem identification, but only in statically configured systems. See, for instance, Tang, D.; Iyer, R. K., “Analysis and modeling of correlated failures in multicomputer systems,” IEEE Transactions on Computers, Vol. 41 Issue 5, May 1992, pp. 567–577; Lee, I.; Iyer, R. K.; Tang, D., “Error/failure analysis using event logs from fault tolerant systems,” Digest of Papers, Twenty-First International Symposium on Fault-Tolerant Computing (FTCS-21), 1991, pp. 10–17; and Thottan, M.; Chuanyi Ji, “Proactive anomaly detection using distributed intelligent agents,” IEEE Network, Vol. 12 Issue 5, September–October 1998, pp. 21–27.
Today, human technical support personnel or system administrators manually perform most of the tasks associated with recognizing, diagnosing, and repairing hardware or software problems, often employing a good deal of trial and error and relying on their own memory or ability to recognize similar patterns of behavior. This is a laborious process, and as system complexity increases, there are progressively fewer system administrators who can do it competently. Thus, a need exists for technology to automate the recognition, tracing, diagnosis, and repair of problems in autonomic systems.