1. Field of Invention
The present invention relates generally to the field of software applications used on an information network (such as a cable television network), and specifically to the logging, analysis, and control of events occurring on electronic devices used in the network during operation of the software.
2. Description of Related Technology
Software applications are well known in the prior art. Such applications may run on literally any type of electronic device, and may be distributed across two or more locations or devices connected by a network. Often, a so-called “client/server” architecture is employed, where one or more portions of applications disposed on client or consumer premises devices (e.g., PCs, PDAs, digital set-top boxes {DSTBs}, hand-held computers, etc.) are operatively coupled and in communication with other (server) portions of the application. Such is the case in the typical hybrid fiber coax (HFC) or satellite content network, wherein consumer premises equipment or CPE (e.g., DSTBs or satellite receivers) utilize the aforementioned “client” portions of applications to communicate with their parent server portions in order to provide downstream and upstream communications and data/content transfer.
Digital TV (DTV) is an emerging technology which utilizes digitized and compressed data formats (e.g., MPEG) for content transmission, as compared to earlier analog “uncompressed” approaches (e.g., NTSC). The DTV content may be distributed across any number of different types of bearer media or networks with sufficient bandwidth, including HFC, satellite, wireless, or terrestrial. DTV standards such as the OpenCable Application Platform (OCAP) middleware specification (e.g., Version 1.0, and incipient Version 2.0) require that applications be downloaded to CPE from the bearer or broadcast network in real-time. The OCAP specification is a middleware software layer specification intended to enable the developers of interactive television services and applications to design such products so that they will run successfully on any cable television system in North America, independent of set-top or television receiver hardware or operating system software choices.
Due to the broad variety of applications which can be downloaded over cable networks, and the broad variety of different CPE hardware and middleware that can receive such applications, application run-time and other software errors are somewhat inevitable. These errors can result in both significant frustration for the consumer, and the generation of many unnecessary service calls from the cable systems operator or other service provider. These deficiencies stem largely from the inability of existing cable/CPE devices to (i) log, analyze, and recover from fairly routine or non-critical errors; and (ii) communicate with the cable systems operator. Specifically, a network provider must be able to process events occurring within the CPE connected to their networks, including identifying (and ideally diagnosing and correcting) any errors. This CPE may include both leased equipment and retail consumer electronic equipment, and hence any corrective system must be adapted to interface with a variety of different equipment.
One type of error or event which can occur in cable network CPE is what is generally referred to as “resource exhaustion”. This term is applied to a group of different circumstances wherein one or more resources within the CPE (such as memory, CPU capacity, etc.) are at or near exhaustion, thereby indicating an incipient or prospective error condition within an application. As is well known, when resources such as memory become exhausted within an OCAP-compliant Host device (e.g., set-top box, integrated TV), the application manager within the OCAP system will begin destroying applications starting with the lowest priority application. Hence, the OCAP-compliant CPE employs a priority-based system of resource self-preservation. However, such systems are generally not capable of (uniquely) dealing with different types of resource exhaustion, logging data relating to the exhaustion event(s), or initiating corrective action for other types of events occurring within the CPE (such as thrown but uncaught Java exceptions), or reboot events which are not initiated by the middleware. Accordingly, the OCAP-compliant prior art CPE is generally not as robust as it could be, and does not afford the level of control over the CPE operations during error conditions that is desired by cable network operators.
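By way of illustration only, the priority-based self-preservation behavior described above can be sketched as follows. This is a minimal sketch and not the OCAP application manager itself; the `App` record, the `reclaim_memory` function, the memory figures, and the convention that a larger number denotes a higher priority are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int   # assumed convention: larger value = higher priority
    memory_kb: int  # memory released if the application is destroyed


def reclaim_memory(apps, needed_kb):
    """Destroy applications, lowest priority first, until needed_kb is freed.

    Returns (surviving_apps, names_of_destroyed_apps).
    """
    survivors = sorted(apps, key=lambda a: a.priority)
    destroyed = []
    freed = 0
    while survivors and freed < needed_kb:
        victim = survivors.pop(0)  # lowest remaining priority
        freed += victim.memory_kb
        destroyed.append(victim.name)
    return survivors, destroyed
```

Note that the sketch frees memory without recording why it was exhausted or which application caused the exhaustion, which is the limitation identified above.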
A variety of other approaches to error logging and handling within computer systems are taught in the prior art. These approaches generally range from bit-level systems such as those used in semiconductor applications, to higher-level functional or behavior logging systems for networked computers. For example, U.S. Pat. No. 3,999,051 to Petschauer issued Dec. 21, 1976 and entitled “Error logging in semiconductor storage units” discloses a maintenance procedure comprising a method of and an apparatus for storing information identifying the location of one or more defective bits, i.e., a defective memory element, a defective storage device or a failure, in a single-error-correcting semiconductor main storage unit (MSU) comprised of a plurality of large scale integrated (LSI) bit planes. The method utilizes an error logging store (ELS) comprised of 128 word-group-associated memory registers. A defective device counter (DDC) counts the set tag bits in the ELS and is utilized by the machine operator to schedule preventative maintenance of the MSU by replacing the defective bit planes. By statistically determining the number of allowable failures, i.e., the number of correctable failures that may occur before the expected occurrence of a noncorrectable double bit error, preventative maintenance may be scheduled only as required by the particular MSU.
U.S. Pat. No. 4,339,657 to Larson, et al. issued Jul. 13, 1982 and entitled “Error logging for automatic apparatus” discloses methods and apparatus for error logging by integrating errors over a given number of operations that provides long memory and fast recovery. Errors integrated over a selected number of associated operations are compared to a criterion. An exception is logged each time the number of errors is not less than the criterion but if the number of errors is less than the criterion, the exception log is cleared.
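The integrate-and-compare logic summarized above might be sketched as follows. The function name `integrate_errors`, the use of fixed, non-overlapping groups of operations, and the representation of the exception log as a list of per-group error counts are assumptions made for illustration and are not taken from the referenced patent.

```python
def integrate_errors(outcomes, window, criterion):
    """Integrate errors over groups of `window` operations.

    outcomes: iterable of booleans, True meaning the operation had an error.
    An exception is logged for each group whose error count is not less
    than `criterion`; a group below the criterion clears the log.
    """
    exception_log = []
    errors = 0
    for count, failed in enumerate(outcomes, start=1):
        if failed:
            errors += 1
        if count % window == 0:  # end of a group of operations
            if errors >= criterion:
                exception_log.append(errors)
            else:
                exception_log.clear()
            errors = 0
    return exception_log
```

Clearing the log after a single good group is what gives the scheme its fast recovery, while the accumulation of exceptions across bad groups provides the long memory.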
U.S. Pat. No. 4,604,751 to Aichelmann, Jr., et al. issued Aug. 5, 1986 and entitled “Error logging memory system for avoiding miscorrection of triple errors” discloses apparatus by which miscorrection of triple errors is avoided in a memory system by providing a double bit error logging technique. The address of each fetched word is logged in which a double bit error is detected. The address of each fetched word in which a single bit error is detected is compared with all logged addresses. If a coincidence is found between the compared addresses, a triple bit error alerting signal is generated and error recovery procedures are initiated.
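A minimal sketch of the address-comparison step described above is given below. The class name `DoubleErrorLog`, the `on_fetch` method, and the returned status strings are hypothetical; the number of bits in error is assumed to be supplied by separate error-detection logic that is not modeled here.

```python
class DoubleErrorLog:
    """Hypothetical model of a log of addresses with double bit errors."""

    def __init__(self):
        self._double_error_addresses = set()

    def on_fetch(self, address, bits_in_error):
        """Classify a fetched word given the detected number of bad bits."""
        if bits_in_error == 2:
            # Log the address of every word with a detected double bit error.
            self._double_error_addresses.add(address)
            return "double-logged"
        if bits_in_error == 1:
            # A single bit error at a logged address is treated as a
            # possible triple error rather than being corrected.
            if address in self._double_error_addresses:
                return "triple-alert"
            return "single-corrected"
        return "ok"
```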
U.S. Pat. No. 5,121,475 to Child, et al. issued Jun. 9, 1992 and entitled “Methods of dynamically generating user messages utilizing error log data with a computer system” discloses methods of error logging and correction in a communications software system. An error log request is generated by a component of the system; the error log request is analyzed and compared to entries in one of a plurality of records in a message look-up table. If there is a match between the fields of the error log request and selected entries of a record in the look-up table, a user message request is generated which facilitates the display of a pre-existing user friendly message as modified with data included in the generated user message request.
U.S. Pat. No. 5,155,731 to Yamaguchi issued Oct. 13, 1992 and entitled “Error logging data storing system” discloses an error logging data storing system containing a first storing unit for storing error logging data corresponding to an error of high importance, and a second storing unit for storing error logging data corresponding to an error of either high or low importance. A first indicating unit indicates whether or not the first storing unit is occupied by error logging data the diagnosing operation of which is not completed. A second indicating unit indicates whether or not the second storing unit is occupied by error logging data the diagnosing operation of which is not completed. A storage control unit stores error logging data corresponding to an error of high importance in the second storing unit when the first indicating unit indicates that the first storing unit is occupied by error logging data the diagnosing operation of which is not completed and the second indicating unit indicates that the second storing unit is not occupied by error logging data the diagnosing operation of which is not completed.
U.S. Pat. No. 5,245,615 to Treu issued Sep. 14, 1993 and entitled “Diagnostic system and interface for a personal computer” discloses a personal computer having a NVRAM comprising an error log for storing predetermined error log information at predetermined locations therein. The information is accessible by various programs such as a POST program, a diagnostics program, and an operating system program. Access is made by BIOS interrupt calls through a BIOS interface. The NVRAM also stores vital product data and system setup data.
U.S. Pat. No. 5,463,768 to Cuddihy, et al. issued Oct. 31, 1995 and entitled “Method and system for analyzing error logs for diagnostics” discloses an error log analysis system comprising a diagnostic unit and a training unit. The training unit includes a plurality of historical error logs generated during abnormal operation or failure from a plurality of machines, and the actual fixes (repair solutions) associated with the abnormal events or failures. A block finding unit identifies sections of each error log that are in common with sections of other historical error logs. The common sections are then labeled as blocks. Each block is then weighted with a numerical value that is indicative of its value in diagnosing a fault. In the diagnostic unit, new error logs associated with a device failure or abnormal operation are received and compared against the blocks of the historical error logs stored in the training unit. If the new error log is found to contain block(s) similar to the blocks contained in the logs in the training unit, then a similarity index is determined by a similarity index unit, and solution(s) is proposed to solve the new problem. After a solution is verified, the new case is stored in the training unit and used for comparison against future new cases.
U.S. Pat. No. 5,790,779 to Ben-Natan, et al. issued Aug. 4, 1998 and entitled “Method and system for consolidating related error reports in a computer system” discloses a method and system for consolidating related error reports. In a preferred embodiment, a facility preferably implemented in software (“the facility”) receives error reports and success reports generated by programs. When the facility receives a novel error report specifying an error source for which no error state is set, it sets an error state corresponding to the error report. The facility also preferably generates a consolidated error report at this point, which is delivered to an error state reporting subsystem. The error state reporting subsystem may add the consolidated error report to an error log and/or display it to a user. When the facility receives a redundant error report specifying an error source for which an error state is already set, the facility preferably does not set a new error state, nor does it generate a consolidated error report. When the facility receives a success report specifying an error source, it clears any error states that are set for the specified error source, and preferably generates a consolidated success report. The performance of the facility is preferably optimized by processing success reports asynchronously.
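The consolidation behavior summarized above can be illustrated with the following sketch. The class `ConsolidatingFacility` and its method names are hypothetical. Two simplifications are assumed: one error state is kept per error source, and a consolidated success report is emitted only when an error state was actually set. Asynchronous processing of success reports is omitted.

```python
class ConsolidatingFacility:
    """Hypothetical model of a facility that consolidates error reports."""

    def __init__(self):
        self.error_states = set()  # error sources with an error state set
        self.consolidated = []     # reports passed to the reporting subsystem

    def report_error(self, source):
        if source in self.error_states:
            return  # redundant report: no new state, no consolidated report
        self.error_states.add(source)
        self.consolidated.append(("error", source))

    def report_success(self, source):
        if source in self.error_states:
            self.error_states.discard(source)
            self.consolidated.append(("success", source))
```

For example, three consecutive error reports from the same source followed by one success report yield only two consolidated reports.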
U.S. Pat. No. 5,862,316 to Hagersten, et al. issued Jan. 19, 1999 and entitled “Multiprocessing system having coherency-related error logging capabilities” discloses protocol agents involved in the performance of global coherency activity that detect errors with respect to the activity being performed. The errors are logged by a computer system such that diagnostic software may be executed to determine the error detected and to trace the error to the erring software or hardware. In particular, information regarding the first error to be detected is logged. Subsequent errors may receive more or less logging depending upon programmable configuration values. Additionally, those errors which receive full logging may be programmably selected via error masks. The protocol agents each comprise multiple independent state machines which independently process requests. If the request which a particular state machine is processing results in an error, the particular state machine may enter a freeze state. Information regarding the request which is collected by the state machine may thereby be saved for later access. A state machine freezes upon detection of the error if a maximum number of the multiple state machines are not already frozen and the aforementioned error mask indicates that full error logging is employed for the detected error. Therefore, at least a minimum number of the multiple state machines remain functioning even in the presence of a large number of errors. Still further, prior to entering the freeze state, the protocol state machines may transition through a recovery state in which resources not used for error logging purposes are freed from the erring request.
U.S. Pat. No. 6,381,710 to Kim issued Apr. 30, 2002 and entitled “Error logging method utilizing temporary defect list” discloses an error logging method utilizing a temporary defect list to store errors produced at or above a predetermined occurrence frequency during a defect detecting test. The method includes the steps of: determining whether an error is recorded on a temporary defect list, determining whether the error is recorded on an error frequency list when the error is not recorded on the temporary defect list, adding the error to the error frequency list if the error is not recorded on the error frequency list, increasing the occurrence frequency of the error if the error is on the error frequency list, and adding the error to the temporary defect list if the error has an occurrence frequency greater than or equal to a threshold value established as a standard for classifying an error as a defect. The temporary defect list can be used as a final error list, and thereby reduce memory requirements.
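The steps recited above can be sketched as a single function. The name `log_error`, the use of a set for the temporary defect list and a dictionary for the error frequency list, and the initial count of one for a newly added error are assumptions made for illustration.

```python
def log_error(error, temp_defects, frequency, threshold):
    """Record one occurrence of `error` during a defect detecting test.

    temp_defects: set serving as the temporary defect list.
    frequency: dict serving as the error frequency list (error -> count).
    """
    if error in temp_defects:
        return  # already classified as a defect
    if error not in frequency:
        frequency[error] = 1   # add the error to the frequency list
    else:
        frequency[error] += 1  # increase its occurrence frequency
    if frequency[error] >= threshold:
        temp_defects.add(error)
```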
U.S. Pat. No. 6,532,552 to Benignus, et al. issued Mar. 11, 2003 and entitled “Method and system for performing problem determination procedures in hierarchically organized computer systems” discloses a method and system for performing problem determination procedures in a hierarchically organized computer system. The hardware components of the data processing system are interconnected in a manner in which the components are organized in a logical hierarchy. A hardware-related error occurs, and the error is logged into an error log file. At some point in time, a diagnostics process is initiated in response to the detection of the error. The logged error may implicate a particular hardware component, and the hardware component of the data processing system is analyzed using a problem determination procedure. In response to a determination that the hardware component does not have a problem, the logically hierarchical parent hardware component of the hardware component is selected for analysis. The logically hierarchical parent hardware component is then analyzed using a problem determination procedure. The method continues to analyze the logically hierarchical parent components until the root component is reached or until a faulty component is found.
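The upward traversal described above might be sketched as follows. The function `find_faulty_component`, the `parent_of` mapping, and the `has_problem` callback (standing in for a problem determination procedure) are hypothetical, and the sketch assumes that the root component is itself analyzed before the search ends.

```python
def find_faulty_component(start, parent_of, has_problem):
    """Analyze `start`, then each logical parent, until a fault is found.

    parent_of: dict mapping a component to its hierarchical parent; the
    root component has no entry.
    has_problem: callable returning True if a component is faulty.
    Returns the faulty component, or None if the root is reached first.
    """
    component = start
    while component is not None:
        if has_problem(component):
            return component
        component = parent_of.get(component)  # None once past the root
    return None
```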
U.S. Pat. No. 6,505,298 to Cerbini, et al. issued Jan. 7, 2003 and entitled “System using an OS inaccessible interrupt handler to reset the OS when a device driver failed to set a register bit indicating OS hang condition” discloses a method and system for providing a reset after an operating system (OS) hang condition in a computer system, the computer system including an interrupt handler not accessible by the OS. The method includes determining if an interrupt has been generated by a watchdog timer; monitoring for an OS hang condition by the interrupt handler if the interrupt has been generated and after it is known that the OS is operating; and resetting the OS if a device driver within the OS has not set a bit in a register, the bit for indicating that the OS is operating. The method and system in accordance with the present invention uses existing hardware and software within a computer system to reset the OS. The invention uses a method by which a critical hardware watchdog periodically wakes a critical interrupt handler of the computer system. The critical interrupt handler determines if the OS is in a hang condition by polling a shared hardware register that a device driver, running under the OS, will set periodically. If the critical interrupt handler does not see that the device driver has set the register bit, it will assume the OS has hung and will reset the system. In addition, the critical interrupt handler will store the reset in non-volatile memory. The reset can be logged into the system error log. Because the method and system in accordance with the invention uses existing hardware and software within the computer system, instead of requiring an additional processor, it is ostensibly cost efficient to implement while also providing a reset of the OS without human intervention.
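The polling scheme described above can be modeled in a few lines. The class `HangMonitor` and its members are hypothetical; the sketch assumes the OS is already known to be operating, assumes the handler clears the bit after each successful poll so that the driver must set it again, and uses an in-memory list in place of non-volatile storage.

```python
class HangMonitor:
    """Hypothetical model of a watchdog-driven OS hang check."""

    def __init__(self):
        self.alive_bit = False  # register bit set by the device driver
        self.reset_log = []     # stands in for non-volatile memory

    def driver_heartbeat(self):
        """Called periodically by the device driver running under the OS."""
        self.alive_bit = True

    def on_watchdog_interrupt(self, tick):
        """Return True if the OS is presumed hung and a reset is issued."""
        if self.alive_bit:
            self.alive_bit = False  # assumed: cleared for the next period
            return False
        self.reset_log.append(tick)  # record the reset
        return True
```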
United States Patent Publication No. 20010007138 to Iida, et al. published Jul. 5, 2001 and entitled “Method and system for remote management of processor, and method and system for remote diagnosis of image output apparatus” discloses a method and system for remote management of processors and a method and system for remote diagnosis of processors such as image output apparatus. Operation information about contents of operation performed by a processor during an operational preset period or a preset number of executions of processing is recorded. An operation log is formed by combining the operation information and is transferred to a remote management apparatus connected to the processor by a communication line. The remote management apparatus performs remote management of the condition of the processor on the basis of the transmitted operation log. An error log containing information about occurrences of errors having occurred in the processor is also formed and transferred to the remote management apparatus.
United States Patent Publication No. 20020083214 to Heisig, et al. published Jun. 27, 2002 and entitled “Protocol adapter framework for integrating non-IIOP applications into an object server container” discloses a method and apparatus for providing access to objects and methods via arbitrary remote protocols in a computer with object server. This includes a mechanism known as the protocol adapter framework that allows protocol adapters to manage remote socket sessions, encrypt communication on this session, translate text to the local character set, perform security validation of the remote user, log incoming work requests, classify the incoming work request for differentiated service purposes, and queue the work for execution. Also, included is a mechanism to invoke the protocol adapter in order to manipulate output from the execution of a method on a server object and send it back to the original requester. This allows the implementers of objects and methods that reside in the object server rather than the owner of the object server to provide a protocol adapter that allows communication with remote clients using any arbitrary protocol that the object implementer deems appropriate. In this way, the object implementer can enjoy benefits such as differentiated service, workload recording, server object process management, process isolation, error logging, systems management and transactional services of running objects in a robust object server container.
United States Patent Application Publication No. 20020144193 to Hicks, et al. published Oct. 3, 2002 and entitled “Method and system for fault isolation methodology for I/O unrecoverable, uncorrectable error” discloses a method and system for managing uncorrectable data error (UE) conditions from an I/O subsystem as the UE passes through a plurality of devices in a central electronic complex (CEC). The method and system comprises detecting an I/O UE by at least one device in the CEC; and providing an SUE-RE (Special Uncorrectable Data Error-Recoverable Error) attention signal by at least one device to a diagnostic system that indicates the I/O UE condition. The method and system further includes analyzing the SUE-RE attention signal by the diagnostic system to produce an error log with a list of failing parts and a record of the log. The invention provides a fault isolation methodology and algorithm, which allows for the determination of an error source and provides appropriate service action if and when the system fails to recover from the UE condition.
United States Patent Application Publication No. 20030041291 to Hashem, et al. published Feb. 27, 2003 and entitled “Method and system for tracking errors” discloses a system and method for tracking errors, the system residing on a user's desktop communicating with a central database over a network. The system comprises an error log including error recording tools for enabling the user to record an error; error resolution tools for enabling the user to resolve the error; error follow-up tools for enabling a user to follow up on resolved errors; error reporting tools for enabling a user to generate error reports from the user's desktop; and communication tools for enabling the user to transmit logged errors to the central database and to receive reports generated from errors logged in the central database.
United States Patent Application Publication No. 20030056155 to Austen, et al. published Mar. 20, 2003 and entitled “Method and apparatus for filtering error logs in a logically partitioned data processing system” discloses a method, apparatus, and computer implemented instructions for reporting errors to a plurality of partitions. Responsive to detecting an error log, an error type for the error log is identified. If the error log is identified as a regional error log, an identification of each partition to receive the error log is made. Then, the error log is reported to each partition that has been identified to receive the error log.
United States Patent Application Publication No. 20030105995 to Schroath, et al. published Jun. 5, 2003 and entitled “Method and apparatus for rebooting a printer” discloses detection and logging of printer errors in an error log. If the same printer error has occurred within a predetermined time period, an error message is generated on the printer's control panel and a network administrator is notified of the printer errors. If the same printer error has not occurred within the predetermined time period, the printer is rebooted. If the same printer error has occurred a predetermined number of consecutive times, an error message is generated on the printer's control panel and a network administrator is notified of the printer errors. If the same printer error has not occurred a predetermined number of times, the printer is rebooted.
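The time-window variant of the decision described above might be sketched as follows. The function `handle_printer_error`, the representation of the error log as a list of (timestamp, code) pairs, and the returned action strings are assumptions made for illustration; the consecutive-count variant is not shown.

```python
def handle_printer_error(error_code, now, error_log, window_seconds):
    """Log a printer error and choose between notifying and rebooting.

    error_log: list of (timestamp, error_code) pairs, updated in place.
    Returns "notify" if the same error was already logged within the
    preceding window_seconds, otherwise "reboot".
    """
    repeated = any(
        code == error_code and now - ts <= window_seconds
        for ts, code in error_log
    )
    error_log.append((now, error_code))
    return "notify" if repeated else "reboot"
```

A first paper jam would therefore trigger a reboot, while a second jam shortly afterward would instead raise the control panel message and notify the network administrator.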
United States Patent Application Publication No. 20030140285 to Wilkie published Jul. 24, 2003 and entitled “Processor internal error handling in an SMP server” discloses a system and method for handling processor internal errors in a data processing system. The data processing system typically includes a set of main microprocessors that have access to a common system memory via a system bus. The system may further include a service processor that is connected to at least one of the main processors. In addition, the system includes internal error handling hardware configured to log and process internal errors generated by one or more of the main processors. The internal error hardware may include error detection logic configured to receive internal error signals from the main processors. By incorporating error logging and handling into dedicated hardware tied directly to the processor internal error signals, the invention ostensibly provides a lower cost, lower response latency mechanism for handling processor internal errors in high performance multiprocessor systems.
The well known Windows® NT operating system manufactured by Microsoft Corporation includes an error logging capability (“Event Viewer”) that may be used on, e.g., data networks including servers. The Event Viewer is a tool used to examine the three NT event logs: System, Security, and Application.
Each message within the Windows NT error logger has an event ID number. The maximum size of logs can be set, and overriding of log entries can be set depending on available disk space. System errors include: (i) Information—a significant event has occurred, but the event is not critical; (ii) Warning—this is a caution indication of a possible significant event which may or may not affect future operations; and (iii) Error—indicates a problem that has caused a failure of service.
Security Log errors include: (i) Success Audit—a successful audited security event has occurred; and (ii) Failure Audit—a failed audited security event has occurred.
The exemplary Windows NT Event viewer display includes information relating to the date, time, source, category, Event ID number, user, and computer to which a given error is related.
The Windows NT system uses a registry to locate files (.EXE or .DLL) that contain resource strings. RegisterEventSource and ReportEvent functions are provided to log messages to the event log service. The name specified as a parameter to RegisterEventSource must match the name of the key in the registry. With Windows NT, each system maintains its own log files; there is no central storage location.
Similarly, other third party products such as the EventReporter product sold by Adiscon GmbH monitor Windows NT/2000/XP/Server 2003 event logs and report via syslog or email. Automated monitoring is provided to assist in early detection of problems on the network. For applications with a larger number of servers, a centralized log is maintained via syslog servers available for Windows, Unix, Linux and other operating systems. See also the “Snare” freeware product, which collects and processes Windows NT Event Log information from multiple event logs, and converts the information to tab or comma delimited text format and delivers it via UDP to a remote server.
A recently proposed Home Audio Video Interoperability (HAVi) specification is a consumer electronics (CE) industry standard designed to permit digital audio and video devices that conform to this standard, regardless of manufacturer, to interoperate when connected via a network in the consumer's home. The HAVi standard (e.g., Version 1.1) uses the digital IEEE-1394 network standard for data transfer between devices and the 1394 A/VC protocols for device control.
The HAVi standard focuses on the transfer and processing (for example, recording and playback) of digital content between networked devices. HAVi-compliant devices will include not only familiar audio and video components but also cable modems, digital set-top boxes and “smart” storage devices such as personal video recorders (PVRs).
By employing modular software, the HAVi standard allows consumer electronics devices to identify themselves and what they can do when plugged into the host. The software functions by assigning a device control ID module to each hardware component of a system. Each system also is assigned multiple functional component modules, containing information about an individual device's capabilities, for example, whether a camcorder operates in DV format, or whether a receiver is designed to process AC3 audio.
All HAVi APIs involving messaging (e.g., those APIs where the Communication Type is “M” or “MB”) use a “status” structure consisting of two fields: an API code and an error code. Generally the different software elements will define their own error codes (see Annex 11.7 of HAVi Version 1.1). Additionally, there are several “general purpose” error codes that can be used by any software element. These general error codes are: (i) SUCCESS: the operation has succeeded (this is the normal return value in Status and not an error); (ii) EUNKNOWN_MESSAGE: the receiver of a HAVi message does not support the API indicated by the Operation Code contained within the message; (iii) EACCESS_VIOLATION: the caller of an API does not have permission to perform the operation; (iv) EUNIDENTIFIED_FAILURE: an error of unknown origin has occurred; (v) ERESERVED: the operation is refused because the FCM (or, in the case of a DCM, one of the FCMs involved in the DCM operation) is reserved by another software element and the invoking software element (possibly a secondary client) is not allowed to perform this operation; (vi) ENOT_IMPLEMENTED: the receiver of a HAVi message does not implement the optional API indicated by the Operation Code contained within the message; (vii) EINVALID_PARAMETER: one or more parameters in a HAVi message contain invalid values; (viii) ERESOURCE_LIMIT: the operation failed due to resource limitations at the destination device; (ix) EPARAMETER_SIZE_LIMIT: one or more parameters in a HAVi message exceed their safe parameter size limit and the receiver is unable to handle the parameter(s); (x) EINCOMPLETE_MESSAGE: the length of a HAVi message is shorter than the length required for compliant messages (using the Operation Code contained within the message); (xi) EINCOMPLETE_RESULT: one or more out parameters in a HAVi message are correct but incomplete (note that this may only occur when one or more parameters are at least the safe parameter size); (xii) ELOCAL: the caller of a “local” API (as indicated in the “Services Provided” tables) is not on the same device as the provider of the API; and (xiii) ESTANDBY: the operation is refused because the target device is in standby state.
The error code appearing in the status value returned by a HAVi API is either: one of the general codes listed above, a Messaging System error code, or an API-specific error code (one that is listed in the “Error codes” section following the description of the API). If the Status value returned by a HAVi API contains one of the “general error codes” listed above (including SUCCESS), the API code is that used in invoking the API, otherwise it is the API code associated with the contained error (as identified in Annex 11.7). If the contained error is not listed in the “Error codes” section following the description of the API or the contained error has an invalid API code, the client of the API shall interpret the contained error as EUNIDENTIFIED_FAILURE. Therefore, if the client is a Java client, the corresponding message-sending method of the client class, server helper class (see section 7.3.8.1.2) or the SoftwareElement class throws HaviUnidentifiedFailureException in these cases.
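The client-side interpretation rule described above can be illustrated with the following sketch, which works on symbolic code names only. The function `interpret_error` and its parameters are hypothetical and are not part of any HAVi API; numeric code values and the check for an invalid API code are deliberately omitted, and Messaging System error codes are passed in by the caller because they are not enumerated here.

```python
# General error codes, by name, as listed in the preceding description.
GENERAL_CODES = frozenset({
    "SUCCESS", "EUNKNOWN_MESSAGE", "EACCESS_VIOLATION",
    "EUNIDENTIFIED_FAILURE", "ERESERVED", "ENOT_IMPLEMENTED",
    "EINVALID_PARAMETER", "ERESOURCE_LIMIT", "EPARAMETER_SIZE_LIMIT",
    "EINCOMPLETE_MESSAGE", "EINCOMPLETE_RESULT", "ELOCAL", "ESTANDBY",
})


def interpret_error(error_code, api_error_codes, messaging_error_codes=()):
    """Map a returned error code to the code the client should act on.

    api_error_codes: codes listed in the API's "Error codes" section.
    Any code that is not general, messaging, or API-specific is
    interpreted as EUNIDENTIFIED_FAILURE.
    """
    known = GENERAL_CODES | set(api_error_codes) | set(messaging_error_codes)
    return error_code if error_code in known else "EUNIDENTIFIED_FAILURE"
```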
In terms of resource limitations, some of the HAVi APIs have specifications that would allow unbounded sizes for some parameters. However, each FAV and IAV will only have a limited amount of memory. These limitations can differ from controller to controller and thus hamper interoperability between controllers. Therefore, for variable sized (input or output) parameters in HAVi APIs a “safe parameter size limit” is specified. Such limits indicate that compliant software elements will be able to handle messages where the size of the parameter in question is less than or equal to the safe parameter size limit. However, accepting parameters of size larger than the safe parameter size limit is allowed.
The safe parameter size limit puts a requirement to support the indicated parameter size at both sending and receiving sides. At the receiving side (in parameters for servers, out parameters for clients) this means being able to receive and handle. At the sending side (out parameters for servers, in parameters for clients) this means being able to construct and send.
The server may return the EPARAMETER_SIZE_LIMIT error if it cannot handle the request due to the safe parameter size of an in parameter being exceeded.
The server returns the EINCOMPLETE_RESULT error if the parameters it returns are valid but incomplete. Note that a server may only return this error when one or more of the parameters it returns are at least the safe parameter size.
The server returns ERESOURCE_LIMIT if it fails to process a request due to lack of resources. If the server generates an incomplete or potentially incomplete response, i.e., one where values of the out parameters are valid but may be incomplete, this error is not returned.
Despite the foregoing, no suitable methodology or architecture for both logging and responding to errors (such as repetitive boots or uncaught thrown Java exceptions) encountered during operation of networked systems has been disclosed in the prior art. This is particularly true in the context of leased set-top boxes and OpenCable compliant Host devices. Prior art solutions also do not provide the ability to (i) tailor delivery of error and reboot reports to a network agent, and (ii) transfer recovery of exhausted system resources from CPE manufacturer control to network operator control.
Accordingly, there is a need for improved apparatus and methods for providing error logging, diagnosis, operation, and control of applications within such networks. These improved apparatus and methods would meet these needs while also enabling compliance with industry standard requirements within the network.