Distributed nodal systems of processors, also called “embedded” processor systems, are being employed in a wide variety of applications and in ever-increasing numbers. In distributed nodal systems of processors, for example, in a control system, overall system control is distributed among two or more processor nodes in the system or product.
An advantage of such systems is that problem diagnosis and repair are simplified because functions are isolated to different areas of the system. Further, such systems can be expanded by adding components and processor nodes without replacing the entire system. The nodes of a distributed control system are usually interconnected by one or more communication networks, herein called a “network”.
One example of a control system comprising a distributed nodal system of processors comprises an automated data storage library, such as an IBM 3584 Ultra Scalable Tape Library. The processors of the 3584 library are embedded with various components of the library, communicate over a bus network, and operate the components and, thereby, the library. A discussion of an automated data storage library with a distributed nodal system of processors is provided in U.S. Pat. No. 6,356,803, issued Mar. 12, 2002. Repair actions for such systems may comprise replacing an individual component, a processor node, or a processor at the node. The library is formed of one or more “frames”, each comprising a set or subset of library components, such as storage shelves for storing data storage media; one or more data storage drives for reading and/or writing data with respect to the data storage media; a robot accessor for transporting the data storage media between the storage shelves and data storage drives; a network; and a plurality of processor nodes for operating the library. The library may be expanded by adding one or more frames and/or one or more accessors or other nodes.
Another example of a control system comprising a distributed nodal system of processors comprises an automobile multi-processor network.
In order to communicate over the network, the components and/or the processor nodes must have node addresses, such as those employed with CAN buses or Ethernet networks, as is known to those of skill in the art. When a frame is added, the processor node(s) added to the network may have no node address, or only a partial node address, and node addresses must be given to the new processor nodes. When an individual component, processor node, or processor at a node is replaced, whether with a new part or by swapping in one taken from elsewhere, the processor node may have no node address or, if swapped, may employ its previous node address. Further, processor node cards may be interchangeable among all of the components, which simplifies parts handling, diagnosis, and repair, but precludes static addressing in which there is a separate part number for each node address, and precludes permanently fixing the node address at each processor node card.
One way of providing a new node address is for an operator or repair person to assign a node address. In one example of a complex node address, a component may have a function portion of an address coded into a card, and a frame number is supplied to the processor node, and the function address and frame number are combined to calculate a node address. Alternatively, automatic modes of providing new node addresses may be employed. As one example, a cable is designed with certain lines tied high or low to provide a binary number that may be employed to calculate the node address. As another example, as discussed in U.S. patent application Ser. No. 09/854,865, filed May 14, 2001, a pulse generator and delay signal generator may provide an automatic frame count, which may be used with the function address to calculate the node address. As another example, a server may employ a dynamic host configuration protocol (DHCP) to give a processor node an IP address.
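The calculation of a node address from a function portion and a frame number, with the frame number read from cable lines tied high or low, can be illustrated with a minimal Python sketch. The 4-bit field widths, the most-significant-bit-first reading of the cable lines, and the names `strap_pins_to_frame` and `node_address` are hypothetical assumptions for illustration only; an actual library may lay out its node addresses quite differently.

```python
# Hypothetical field widths, assumed for illustration only.
FUNCTION_BITS = 4   # function portion coded into the processor node card
FRAME_BITS = 4      # frame number from operator, strapped cable, or frame counter


def strap_pins_to_frame(pins):
    """Interpret cable lines tied high (1) or low (0) as a binary frame number,
    most significant bit first (an assumed convention)."""
    value = 0
    for level in pins:
        if level not in (0, 1):
            raise ValueError("each strap line must read 0 (low) or 1 (high)")
        value = (value << 1) | level
    return value


def node_address(function_code, frame_number):
    """Combine the frame number (high bits) and function portion (low bits)."""
    if not 0 <= function_code < (1 << FUNCTION_BITS):
        raise ValueError("function code out of range")
    if not 0 <= frame_number < (1 << FRAME_BITS):
        raise ValueError("frame number out of range")
    return (frame_number << FUNCTION_BITS) | function_code


frame = strap_pins_to_frame([0, 0, 1, 1])  # lines low, low, high, high
address = node_address(2, frame)           # function code 2 in that frame
```

Under this assumed layout, cable lines read as 0, 0, 1, 1 give frame number 3, and function code 2 in frame 3 yields node address 50 (0x32). Because the frame number occupies its own bit field, nodes with the same function in different frames receive distinct addresses, provided the frame number is read correctly.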
In either a manual or an automated mode, failure is possible. For example, the operator may misjudge the placement or function of the processor node. In an automatic mode, the binary cable might become defective or be misplugged, or the wrong cable might be used. As another example, the frame counter circuit might become defective, or the cabling could be misplugged.
In such a case, the component may have no node address when on the network, may have a wrong address, or may present an address that duplicates that of another component on the network. A duplicate address is possible, for example, when a processor node at a component that performs the same function as another component misreads the frame number and employs the erroneous frame number in its address. Alternatively, a processor node may be swapped from one system to another and be at a different location in the new system. Addressing errors, such as the presence of an unknown component or of a duplicate address on the network, can render all or part of the system inoperable and require maintenance actions.
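Two of the failure modes just described, a node with no address and a node presenting a duplicate address, can be detected with a simple check, sketched below in Python. The mapping from a node identifier (for example, a card serial number) to its claimed address is a hypothetical data structure, and the addresses in the example assume a frame number shifted left by four bits and combined with a function code. Note that such a check cannot, by itself, detect a node whose address is wrong but unique.

```python
from collections import defaultdict


def find_address_errors(claims):
    """claims maps a node identifier to its claimed node address, or to None
    if the node has no address. Returns (duplicates, unaddressed)."""
    by_address = defaultdict(list)
    unaddressed = []
    for node, addr in claims.items():
        if addr is None:
            unaddressed.append(node)
        else:
            by_address[addr].append(node)
    # Keep only addresses claimed by more than one node.
    duplicates = {addr: sorted(nodes)
                  for addr, nodes in by_address.items() if len(nodes) > 1}
    return duplicates, sorted(unaddressed)


# Two accessor nodes share function code 2; the one in frame 2 misreads
# its frame number as 1 and so computes the same address as frame 1.
claims = {
    "accessor-frame1": (1 << 4) | 2,
    "accessor-frame2": (1 << 4) | 2,  # frame number misread as 1 instead of 2
    "drive-frame2": (2 << 4) | 3,
    "new-card": None,                 # replacement card not yet addressed
}
duplicates, unaddressed = find_address_errors(claims)
# duplicates  -> {18: ['accessor-frame1', 'accessor-frame2']}
# unaddressed -> ['new-card']
```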
Failures of products are becoming less tolerable as systems and customer expectations move toward a concept of continuous availability, such as the well known “24×7×365” availability.
As an example, automated data storage libraries provide a means for storing large quantities of data on data storage media that are not permanently mounted on data storage drives, and that are stored in a readily available form on storage shelves. One or more robot accessors retrieve selected data storage media from storage shelves and provide them to data storage drives. Typically, data stored on data storage media of an automated data storage library, once requested, is needed quickly. Thus, it is desirable that an automated data storage library be maintained in an operational condition on a continuous basis as much as possible.
Automated data storage libraries may comprise systems operated by a plurality of processors working together, such as a central controller, which interfaces with the host systems through an external interface and provides a constantly updated inventory of the locations and content of the data storage media within the library, and a robot control system, which identifies precise locations of the data storage drives and the storage shelves and calculates the best operation of the robot accessor(s) to efficiently transport data storage media between the various storage shelves and data storage drives. Many of the components are redundant, so that a processor node may fail while the overall system continues to operate, but all depend upon a proper network addressing structure to work together.
Global addressing is known. For example, U.S. Pat. No. 5,987,506 allocates addresses across multiple “clouds” by carefully segmenting the addresses so that the same address is not allocated in more than one cloud, but provides no method of handling addressing failures.
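The general idea of avoiding duplicates by segmenting the address range can be sketched in Python: each cloud is given a disjoint block of addresses and allocates only from its own block. This is an illustrative simplification, not the method of U.S. Pat. No. 5,987,506; the class name `SegmentedAllocator`, the equal segment sizes, and the cloud labels are hypothetical.

```python
class SegmentedAllocator:
    """Hands out addresses from disjoint per-cloud segments (hypothetical layout)."""

    def __init__(self, clouds, segment_size, base=0):
        # Cloud i owns the half-open range [base + i*size, base + (i+1)*size),
        # so no address can ever be allocated in two clouds.
        self.segments = {}
        for i, cloud in enumerate(clouds):
            start = base + i * segment_size
            self.segments[cloud] = iter(range(start, start + segment_size))

    def allocate(self, cloud):
        """Return the next unused address in the cloud's segment."""
        try:
            return next(self.segments[cloud])
        except StopIteration:
            raise RuntimeError(f"address segment for {cloud!r} is exhausted") from None


alloc = SegmentedAllocator(["A", "B"], segment_size=4)
first_a = alloc.allocate("A")   # 0
first_b = alloc.allocate("B")   # 4
second_a = alloc.allocate("A")  # 1
```

Segmentation prevents duplicates by construction, but, consistent with the limitation noted above, nothing in such a scheme handles a node that presents an address outside its segment or has no address at all.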
There are many examples of conflicts in addressing. For example, U.S. Pat. No. 5,386,515 resolves a conflict by shifting an address space of a hardware adapter to a next sequential address space, ignoring the conflicting address space. However, the system is down and not operational until the conflict is resolved. IBM Technical Disclosure Bulletin Vol. 41, No. 01, January 1998, pp. 703–705, forces a reply by a host in response to duplication of both sender and receiver IP addresses, so that the duplicate host turns off its interface and posts a warning message. The replying host may keep using the IP addresses until the duplication is corrected manually, reducing the disruption to the replying host. When a node logs on, all hosts on the network will receive the duplicate IP addresses, and all may thus go through the reply process. Only after the duplicate host receives the reply and turns off its interface can the replying hosts communicate with the original IP address owner.
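Resolving a conflict by moving to the next sequential address space can be sketched in Python as below. This illustrates only the general idea and is not the implementation of U.S. Pat. No. 5,386,515; representing address spaces as (start, size) ranges, the `limit` parameter, and the function name `relocate_address_space` are assumptions.

```python
def relocate_address_space(requested_start, size, occupied, limit):
    """Starting at requested_start, shift by whole blocks of `size` until the
    block no longer overlaps any (start, size) range in `occupied`.
    Returns the new start address, or None if no free block fits below limit."""
    if size <= 0:
        raise ValueError("size must be positive")
    start = requested_start
    while start + size <= limit:
        # Two half-open ranges are disjoint if one ends before the other begins.
        if all(start + size <= o_start or o_start + o_size <= start
               for o_start, o_size in occupied):
            return start
        start += size  # skip the conflicting space, try the next sequential one
    return None


moved = relocate_address_space(0x300, 0x20, occupied=[(0x300, 0x20)], limit=0x400)
# moved == 0x320: the requested space at 0x300 was taken, so the next one is used
```

The sketch covers only the address arithmetic; it does nothing about the downtime incurred while such a conflict is being detected and resolved.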