This invention relates generally to computer appliances and, more particularly, concerns automated recovery and reliability methods and systems.
Computer appliances are becoming ever more popular. A “computer appliance” is a computing device that is similar in some respects to a general purpose computer. That is, a computer appliance typically has many of the same constituent components as a general purpose computer (e.g. one or more microprocessors, storage devices, memory, an operating system and the like). Computer appliances differ, however, in that they typically have a fixed function that does not or cannot vary. Specifically, computer appliances are designed and programmed to implement very specific types of functionality. Many different types of computer appliances exist. For example, a server appliance can be designed to implement functionalities that include file sharing, Internet sharing, and print sharing. Other types of appliances include set-top boxes used for viewing multimedia presentations on a television, and hardware systems designed to control a home security system. In addition to having a fixed functionality, computer appliances often sell for much less than a general purpose computer. This is due, at least in part, to the fact that computer appliances are designed to do only a limited number of things. In addition, computer appliances often have a form factor that can be “transparent” to the owner or user. That is, a user can simply “tuck” the computer appliance away and, after a while, may not even be aware that it exists (except that the appliance is implementing a functionality that the user desires). Another characterizing feature of some computer appliances is that they can lack a user display and/or other mechanisms that allow a user to interact with them (e.g. a keyboard or mouse).
This is much different from a general purpose computer, which typically has a display through which it can communicate with a user, as well as input mechanisms such as a keyboard and mouse. This distinction is important when considering the problems that the current invention is directed to solving.
Computer appliances, by their very nature, are designed to execute software. That is, specific software applications and operating systems can be designed for operation in connection with the different appliances. And, because the functionalities of appliances can vary widely, so too can the software applications and operating systems with which they are used. Oftentimes, software applications (such as device drivers) and operating systems for these computer appliances are designed by third parties known as original equipment manufacturers (OEMs). However careful the designers of software and operating systems are, there are still instances when the software, or a particular resource that is designed to operate on the appliance, will fail. A “resource” refers to any type of hardware, software, or firmware resource that is used by the appliance to implement its functionality. For example, hardware resources can include, without limitation, communication lines, printers and the like. Software resources can include, without limitation, software applications, memory managers and the like.
It is highly desirable that computer appliances operate in a dependable, reliable manner. If a computer appliance experiences a system failure, for whatever reason, an end user is usually not going to be able to fix it (other than perhaps by shutting the appliance down and restarting it). This is quite different from a general purpose computer which, in many instances, will use the display to advise the user that there has been a particular system failure and might display a graphical user interface (GUI) to step the user through a remedial procedure. Many times, though, fixing a general purpose computer's system failure requires specialized knowledge that the end user simply does not have. In that case, the end user may have to call a “1-800” help line to have a troubleshooter fix the problem. In either case, system failures typically require human intervention.
In the context of many computer appliances, system failures are even more difficult to fix because of the absence of a display or user interface to advise the user of a problem.
Accordingly, this invention arose out of concerns associated with improving the operability and reliability of computer appliances and further enhancing the user experience thereof.
Two primary goals for an ideal computer appliance are that: (1) it run for an extended period of time (i.e. months) without user intervention, and (2) it run without a disruption of user services.
To achieve these and other goals, aspects of the invention provide methods and architectures for enhancing the reliability of computer appliances and reducing the likelihood that human intervention will be necessary in the event of a system failure or failure condition. The provided architecture is extensible and provides a generalized framework that is adaptable to many different types of computer appliances.
One aspect of the invention provides a boot-up redundancy component to ensure that a computer appliance can be appropriately booted. In the described embodiment, a single hard disk is configured for use in a single computer appliance. A second (mirror) disk can also be used to enhance reliability; in this case, the system BIOS boots from the mirror disk, which is configured exactly like the primary disk, if the primary disk fails (e.g. due to a hardware failure). A first operating system is resident on the single hard disk and is configured for booting the computer appliance. In addition, a second operating system is resident on the same hard disk and is also configured for booting the computer appliance. The second operating system serves as a backup for the first operating system in the event that the computer appliance cannot be booted from the first operating system.
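The disk layout described above (two operating systems on one disk, optionally mirrored onto a second, identically configured disk) can be modelled roughly as follows. This is a minimal sketch under stated assumptions: the `Partition` and `Disk` classes, the partition labels, and the `select_boot_disk` helper are hypothetical names used only to illustrate the BIOS falling back to the mirror disk; a real BIOS does not expose such an interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Partition:
    label: str     # hypothetical label, e.g. "primary" or "backup"
    os_image: str  # identifier of the operating system installed here


@dataclass
class Disk:
    name: str
    failed: bool = False
    partitions: List[Partition] = field(default_factory=list)


def make_appliance_disk(name: str, failed: bool = False) -> Disk:
    """Build a disk carrying a first (primary) and a second (backup) OS."""
    return Disk(name, failed, [Partition("primary", "os-primary"),
                               Partition("backup", "os-backup")])


def select_boot_disk(disks: List[Disk]) -> Optional[Disk]:
    """Mimic the BIOS: use the first disk, in order, that has not failed."""
    for disk in disks:
        if not disk.failed:
            return disk
    return None  # no usable disk at all


primary_disk = make_appliance_disk("disk0", failed=True)
mirror_disk = make_appliance_disk("disk1")  # configured exactly like disk0
chosen = select_boot_disk([primary_disk, mirror_disk])
print(chosen.name)  # disk1
```

Because both disks are built by the same helper, the mirror carries the same pair of operating systems, so partition-level fallback still works after a disk-level failover.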
Another aspect of the invention provides an appliance-booting method that first attempts to boot the appliance from a first partition of a hard drive containing a first operating system. If this attempt is unsuccessful, a second attempt to boot the computer appliance is made from a second partition of the hard drive. The second partition contains a second operating system that is configured as a backup for the first operating system. The backup operating system can serve either as a “pristine” operating system (in that it functions only to restore the first, or “primary”, operating system to a working state) or as a fully functional system providing end user services. The preferred configuration is for the backup operating system to serve as a pristine operating system. A pristine operating system restores the primary operating system by quick-formatting the primary operating system's partition and then installing a copy of the primary operating system onto the newly formatted partition. The pristine operating system can also restore the configuration settings of end user services when service configuration checkpointing (i.e. saving changes to the service configuration in a location accessible from both the primary and pristine operating systems) is employed from the primary operating system.
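The boot-then-restore sequence can be sketched as below. This is an illustrative model, not an implementation of any real boot loader: the disk is a plain dictionary of partitions, "quick format" is simulated by emptying a partition, and the `"shared"` partition, the `bootable` predicate, and all function names are hypothetical stand-ins for the checkpoint location reachable from both operating systems.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical model: the disk is a dict of partitions. "shared" stands in
# for a location readable from both the primary and the pristine OS.
Disk = Dict[str, dict]


def boot_appliance(disk: Disk, order: List[str],
                   try_boot: Callable[[dict], bool]) -> str:
    """Attempt each partition in order and return the first that boots."""
    for label in order:
        if try_boot(disk[label]):
            return label
    raise RuntimeError("no bootable partition")


def checkpoint_service_config(disk: Disk, config: dict) -> None:
    """Run from the primary OS whenever end-user service settings change."""
    disk["shared"]["service_config"] = copy.deepcopy(config)


def pristine_restore(disk: Disk, primary_image: dict) -> None:
    """Run from the pristine OS to bring the primary OS back to working order."""
    disk["primary"] = {}                                  # simulated quick format
    disk["primary"].update(copy.deepcopy(primary_image))  # fresh copy of the OS
    if "service_config" in disk["shared"]:                # reapply checkpoint
        disk["primary"]["service_config"] = copy.deepcopy(
            disk["shared"]["service_config"])


def bootable(partition: dict) -> bool:
    return partition.get("os") not in (None, "corrupt")


disk = {"primary": {"os": "primary-os"}, "backup": {"os": "pristine-os"},
        "shared": {}}
checkpoint_service_config(disk, {"file_sharing": True})
disk["primary"]["os"] = "corrupt"                      # simulate a failure
booted = boot_appliance(disk, ["primary", "backup"], bootable)
if booted == "backup":
    pristine_restore(disk, {"os": "primary-os"})
```

After the restore, the primary partition holds a fresh operating system plus the checkpointed service settings, so the next boot succeeds from the primary partition without user intervention.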
Another aspect of the invention makes use of a “boot count variable”. The boot count variable is a variable that keeps track of the number of times attempts are made to boot a particular appliance. Each time an attempt is made to boot the appliance from a selected disk partition, the boot count variable is incremented. When the boot count variable reaches a certain threshold value after the appliance has not been successfully booted from the selected disk partition, another disk partition is utilized, if available, to attempt to boot the appliance. In this manner, software redundancy is provided and reliability is enhanced.
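A boot count variable of this kind might behave as in the following sketch. The threshold of 3, the `BootState` record, and the idea that the operating system clears the counter after a successful start are all assumptions made for illustration; the text specifies only that the count is incremented per attempt and that another partition is used once a threshold is reached.

```python
from dataclasses import dataclass
from typing import List, Optional

BOOT_THRESHOLD = 3  # hypothetical; the text only says "a certain threshold"


@dataclass
class BootState:
    """Stands in for state that survives reboots (e.g. a file or NVRAM)."""
    partitions: List[str]
    current: int = 0      # index of the partition currently selected
    boot_count: int = 0   # attempts made on that partition so far


def choose_partition(state: BootState) -> Optional[str]:
    """Called by the boot loader before every boot attempt."""
    if state.boot_count >= BOOT_THRESHOLD:
        # Threshold reached without a successful boot: try the next partition.
        state.current += 1
        state.boot_count = 0
    if state.current >= len(state.partitions):
        return None  # every partition has been exhausted
    state.boot_count += 1
    return state.partitions[state.current]


def mark_boot_successful(state: BootState) -> None:
    """Assumed hook run by the OS once it is up; clears the counter."""
    state.boot_count = 0


state = BootState(["primary", "backup"])
attempts = [choose_partition(state) for _ in range(4)]  # primary never boots
print(attempts)  # ['primary', 'primary', 'primary', 'backup']
mark_boot_successful(state)
```

Keeping the counter in persistent storage is what makes the scheme work: a crash during boot leaves the incremented value behind, so repeated failures accumulate until the loader moves on.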
Another aspect of the invention provides an architecture for use in monitoring for, and attempting to remedy, failure conditions that are associated with various resources of a computer appliance. In the described embodiment, one or more resource monitoring components are provided. Individual resource monitoring components are programmed to monitor the status of an associated computer appliance resource and to detect a failure condition in which the resource cannot be used by the computer appliance for its intended purpose. At least some of the resource monitoring components are programmed to attempt to remedy a failure condition that they detect. An appliance monitoring service is provided and is configured to be in communication with the resource monitoring components. The appliance monitoring service is programmed to attempt to remedy failure conditions that cannot be remedied by the resource monitoring components. In the described embodiment, the resource monitoring components are implemented as programming objects having callable interfaces. In addition, the appliance monitoring service comprises an appliance monitor object and a global recovery object, both having callable interfaces. The appliance monitor object can be called by one or more of the resource monitoring objects to report a resource failure condition. The appliance monitor object can then call the global recovery object, which contains a collection of recovery algorithms that can be implemented to recover the appliance.
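The escalation path (resource monitor, then appliance monitor, then global recovery) can be sketched with ordinary Python classes standing in for the callable interfaces. The class names loosely mirror the objects described above, but every method name, the `PrintSpoolerMonitor` example, and the two recovery algorithms are hypothetical and chosen only to show the control flow.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

RecoveryAlgorithm = Callable[[str], bool]  # returns True if it recovered


class GlobalRecovery:
    """Ordered collection of appliance-wide recovery algorithms."""

    def __init__(self, algorithms: List[RecoveryAlgorithm]) -> None:
        self.algorithms = list(algorithms)

    def recover(self, resource: str) -> bool:
        # any() stops at the first algorithm that reports success.
        return any(algorithm(resource) for algorithm in self.algorithms)


class ApplianceMonitor:
    """Receives failure reports that resource monitors could not fix."""

    def __init__(self, recovery: GlobalRecovery) -> None:
        self.recovery = recovery
        self.reports: List[str] = []

    def report_failure(self, resource: str) -> bool:
        self.reports.append(resource)
        return self.recovery.recover(resource)


class ResourceMonitor(ABC):
    """Watches one appliance resource and tries a local fix first."""

    def __init__(self, name: str, appliance_monitor: ApplianceMonitor) -> None:
        self.name = name
        self.appliance_monitor = appliance_monitor

    @abstractmethod
    def is_healthy(self) -> bool: ...

    def try_local_remedy(self) -> bool:
        return False  # default: no resource-specific remedy

    def poll(self) -> bool:
        """True if no failure was found or it was remedied at some level."""
        if self.is_healthy() or self.try_local_remedy():
            return True
        return self.appliance_monitor.report_failure(self.name)


class PrintSpoolerMonitor(ResourceMonitor):
    """Hypothetical monitor for a print-sharing service."""

    def __init__(self, appliance_monitor: ApplianceMonitor,
                 running: bool, restartable: bool) -> None:
        super().__init__("print-spooler", appliance_monitor)
        self.running = running
        self.restartable = restartable

    def is_healthy(self) -> bool:
        return self.running

    def try_local_remedy(self) -> bool:
        if self.restartable:
            self.running = True  # stands in for restarting the service
        return self.running


attempted: List[str] = []


def restart_all_services(resource: str) -> bool:
    attempted.append("restart-services")
    return False  # pretend this did not help


def reboot_appliance(resource: str) -> bool:
    attempted.append("reboot")
    return True


monitor = ApplianceMonitor(GlobalRecovery([restart_all_services, reboot_appliance]))
PrintSpoolerMonitor(monitor, running=False, restartable=True).poll()   # fixed locally
PrintSpoolerMonitor(monitor, running=False, restartable=False).poll()  # escalated
```

Only the second failure reaches the appliance monitor; the global recovery object then tries its algorithms in order until one succeeds. New resource types plug in by subclassing `ResourceMonitor`, which is one way the extensibility described above could be realized.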
Embodiments of the invention provide an automated, flexible, extensible appliance recovery system that greatly reduces the chances that human intervention is needed to recover an appliance that has experienced a system failure. In addition, the inventive methods and systems reduce the possibility that a particular system failure will disrupt end user services.