1. Field of the Invention
The present invention generally relates to computer data networks, and more particularly to methods and architectures for managing the health of computer servers across one or more data networks by constantly monitoring the status of vital components and parameters of the computer servers, wherein the status information is collected for analysis and breakdown prediction and/or subsequently used to determine what measures shall be taken before the computer servers actually break down.
2. Description of the Related Art
The Internet is a data communication network of interconnected computers and computer networks around the world and is rapidly evolving to the point where it integrates elements of telecommunications, computing, broadcasting, publishing, commerce, and information services into a revolutionary business infrastructure. The economy on the Internet is growing in every aspect of life; a wide range of businesses, including stock trading and the ordering of commodities, products, and services, are now conducted via the Internet. The infrastructure that supports the Internet economy is a network of numerous computer servers running nonstop all over the Internet. If one of the servers is down, the business relying on that server may be significantly affected. One widely publicized example is the well-known auction web site eBay, which suffered a breakdown lasting a few hours. As a result, eBay had to bear all costs related to the breakdown, in addition to receiving numerous complaints from users all over the world.
The health of a server is extremely important to an online business. A server typically comprises many components and executes numerous applications, separately or collectively, at the same time. Any one of the components and applications can malfunction due to unpredictable conditions or unknown reasons and subsequently cause the entire server to break down. To prevent damage to their operations, many online businesses use a backup system, namely a secondary server that is prepared to take over once the primary server breaks down. In reality, however, the backup server solution is not secure either. Not only does it cost nearly twice as much as a single server, but the secondary server is also just as likely as the primary server to break down at any time, although the likelihood that both servers break down at the same time is significantly lower. Further, a business typically becomes aware of a serious problem with a server only when the server is already in a poor or breakdown condition, by which time damage resulting from that condition may already have occurred.
There is therefore a great need for solutions that can automatically inform online businesses of the status of their servers in time and further provide solutions/measures/services to obviate any possible breakdown.
The present invention has been made in consideration of the above described problems and needs and can be advantageously used in a data network, such as a local area network or the Internet. With the present invention, the health condition of each computing device on the network can be monitored periodically, and managed care can be provided when a need arises.
According to one aspect of the present invention, a hardware or software module is installed on a computing device to be monitored. After the computing device is registered with a monitoring server remotely located with respect to the computing device, the hardware or software module can invoke a monitoring process that periodically samples values representing the health condition of the computing device. The sampled values are then sent back to the monitoring server for analysis.
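The sampling cycle described above can be sketched as follows. This is only an illustrative sketch, not the module of the invention: the JSON message layout, the function names `build_message` and `run_agent`, and the `sampler` and `send` callables are all assumptions introduced here, and a real module would read actual sensors and use a real network transport.

```python
import json
import time


def build_message(device_id, sampler, clock=time.time):
    """Sample once and package the values for the monitoring server.

    `sampler` is a hypothetical callable returning {parameter: value}.
    """
    return {"device_id": device_id, "timestamp": clock(), "values": sampler()}


def run_agent(device_id, sampler, send, interval_s, cycles, sleep=time.sleep):
    """Sample periodically and hand each serialized message to `send`.

    `send` stands in for whatever transport reaches the monitoring server.
    """
    for i in range(cycles):
        send(json.dumps(build_message(device_id, sampler)))
        if i < cycles - 1:
            sleep(interval_s)  # wait for the next sampling period
```

Passing `sampler`, `send`, and `sleep` as parameters keeps the sketch independent of any particular sensor interface or network protocol, neither of which is specified in the text.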
According to another aspect of the present invention, a monitoring server periodically receives messages from all of the computing devices on a network that are registered to be monitored. Each of the messages may include a plurality of sampled values, some of which are corresponding surrounding values. Typically, each of the values represents one of the parameters being specifically and periodically sampled.
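As a rough illustration of the receiving side, the sketch below accepts a message only if it comes from a registered device and splits it into one record per sampled parameter. The JSON layout (`device_id`, `timestamp`, `values`) and the names `receive`, `registered`, and `inbox` are assumptions made for this example; the text does not define a message format.

```python
import json


def receive(raw, registered, inbox):
    """Accept one message; return False if the sender is not registered.

    Assumed layout: {"device_id": ..., "timestamp": ..., "values": {...}}.
    """
    message = json.loads(raw)
    device_id = message.get("device_id")
    if device_id not in registered:
        return False
    # One record per sampled parameter carried in the message.
    for parameter, value in message["values"].items():
        inbox.append((device_id, parameter, message["timestamp"], value))
    return True
```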
The monitoring server maintains a database that includes information regarding each of the registered computing devices and one or more data areas. At least one of the data areas is used to keep a history for each of the parameters being monitored. At least another one of the data areas is used to keep the sampled values for a defined period and is refreshed after the sampled values are consolidated into the history data area. The historic data in the history data area are used to predict, based on the latest sampled value, how much time may remain before the parameter actually fails, so that necessary measures may be taken to prevent an actual breakdown.
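A minimal sketch of the two data areas and of a remaining-time estimate is given below. The text does not say how the prediction is computed; the straight-line least-squares trend used here is an assumption chosen for simplicity, as are the class name `ParameterStore`, the use of a per-period mean for consolidation, and the `failure_threshold` argument.

```python
from collections import defaultdict
from statistics import mean


class ParameterStore:
    """Two data areas for one registered device (hypothetical layout)."""

    def __init__(self):
        self.recent = defaultdict(list)   # parameter -> [(time, value)] for the current period
        self.history = defaultdict(list)  # parameter -> [(period end, mean value)]

    def record(self, parameter, timestamp, value):
        self.recent[parameter].append((timestamp, value))

    def consolidate(self):
        """Fold the current period into the history, then refresh the recent area."""
        for parameter, samples in self.recent.items():
            if samples:
                period_end = samples[-1][0]
                average = mean(v for _, v in samples)
                self.history[parameter].append((period_end, average))
        self.recent.clear()

    def remaining_time(self, parameter, latest_value, failure_threshold):
        """Estimate the time until the parameter reaches failure_threshold.

        Assumption: the historic values follow a linear trend. Returns None
        when there is too little history or no trend toward the threshold.
        """
        points = self.history.get(parameter, [])
        if len(points) < 2:
            return None
        t_bar = mean(t for t, _ in points)
        v_bar = mean(v for _, v in points)
        spread = sum((t - t_bar) ** 2 for t, _ in points)
        if spread == 0:
            return None
        slope = sum((t - t_bar) * (v - v_bar) for t, v in points) / spread
        gap = failure_threshold - latest_value
        if slope == 0 or gap / slope < 0:
            return None  # flat, or moving away from the threshold
        return gap / slope
```

The result is expressed in the same time unit as the recorded timestamps. Because the estimate divides the gap by the signed slope, the same routine covers parameters that fail by rising (a temperature) and by falling (a fan speed).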
According to still another aspect of the present invention, a health diagnostic system is provided that can automatically detect or predict a failure of a registered computing device. The prediction is performed by software and/or hardware units built into the system. Many critical parts of a registered computing device can be monitored, for example, the CPU thermostat, the motherboard thermostat, the chassis thermostat, the cooling fan speed, and the voltages at many critical points of the registered computing device. A proxy server is configured to gather data from the registered computing device for diagnostic and preventative purposes. Further, a monitoring server, which can be the proxy server or a centralized server coupled to the proxy server, is configured with an expert system that uses historic data collected over time to predict when the registered computing device may experience a breakdown due to the failure of one of the parameters being monitored. If the prediction is critical to the computing device, the owner of the device is notified of "the sickness" by, for example, email, pager or phone. Depending on the "sickness" level, different measures may be taken to restore the health of the computing device.
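The idea of grading the "sickness" and notifying the owner accordingly might be sketched as below. The level names, the hour cutoffs, and the mapping from level to channel are all assumptions made for illustration; the text names email, pager and phone as example channels but does not define levels or thresholds.

```python
# Hypothetical mapping from sickness level to notification channels.
CHANNELS = {
    "healthy": [],
    "mild": ["email"],
    "serious": ["email", "pager"],
    "critical": ["email", "pager", "phone"],
}


def sickness_level(remaining_hours):
    """Grade a predicted time to breakdown; the cutoffs are assumed values."""
    if remaining_hours is None:
        return "healthy"  # no breakdown is currently predicted
    if remaining_hours <= 1:
        return "critical"
    if remaining_hours <= 24:
        return "serious"
    return "mild"


def notifications(owner, device_id, remaining_hours):
    """Return (channel, owner, text) tuples for the owner of the device."""
    level = sickness_level(remaining_hours)
    text = f"{device_id}: {level} condition predicted"
    return [(channel, owner, text) for channel in CHANNELS[level]]
```

For instance, `notifications("ops@example.com", "srv-1", 12)` yields one email and one pager entry under these assumed cutoffs, while a device with no predicted breakdown produces no notifications.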