1. Field of the Invention
This invention relates to the connection and control of distributed parallel computing systems. More particularly, this invention relates to the connection and control of network-connected computer systems to form a virtual network structure for large multiprocessing systems.
2. Description of Related Art
In the past, video streaming servers required that a file be fully present before the server could start streaming the file. This imposed a considerable restriction, as typical DVD or broadcast quality videos may be several Gigabytes in size, and thus imposed a large latency before a viewer could start viewing a video.
Video is the most dominant medium in entertainment and is rapidly becoming a critical part of computing as well. Video is often used in CD-ROM titles, for example, to mimic personal or virtual environments, increasing an application's appeal and usability. Video has a large information carrying capacity and is heavily used in capturing and conveying complicated situations such as news events, live interviews, scientific experiments, tourist attractions, and many others.
With the increasing availability of high bandwidth networks, video on-demand applications are gaining popularity on global digital communications networks such as the Internet as well as private and corporate digital communication internal networks commonly referred to as Intranets. Example applications include online training, news broadcasts, educational programming, corporate information, and virtual seminars directly to every desktop computing system or workstation. Similarly, video kiosks can be set up in enterprises and university campuses to display live video and up-to-the-minute news, without ever needing an on-site upgrade.
Video files, however, occupy huge amounts of space on computers. It requires about 10 MB to store one minute of video in most standard compression and decompression video formats, including the Moving Picture Experts Group standard MPEG-1, Apple Computer, Inc. QuickTime, Intel Corp. Indeo, and SuperMac, Inc. Cinepak. That translates into 1.2 GB of space for two hours of video, the length of an average feature film. These tremendous storage requirements make effective on-demand sharing of video files at least as important as conventional file sharing.
However, conventional file servers do not address video's unique requirements and cannot effectively support video sharing. Full-motion video, inherited from analog TV, is a sequence of images played out at constant intervals. The two most common analog video formats are the National Television System Committee (NTSC) format, used in the United States and Japan, and the Phase Alternating Line (PAL) format, used in Europe. NTSC plays video at 30 frames per second, while PAL plays it at 25 frames per second. The sequence of images in a video clip must be relayed at a constant interval, or else the perceptual quality degrades rapidly: the motion jumps and the sound breaks. This rigid periodic timing property is referred to as the isochronous requirement. Conventional file servers are designed for minimal transfer latency. Files are thus transferred as quickly as possible to maintain the minimum latency. The files are interleaved with other digital communication traffic on the network and thus arrive non-isochronously. Without explicit mechanisms to ensure isochronism, delivery rates are irregular, resulting in erratic playback quality at the client computing system.
To avoid erratic playback, the usual approach is to download whole files from the server to the client computing system before starting video playback. This approach results in unacceptable delays for most video files, which are large. For example, even with transfer rates as fast as 1.5 Mb/second, the initial start-up delay is 60 seconds for a one minute video clip.
It is thus desirable to deliver video streams isochronously so that video playback is guaranteed to have smooth motion and sound. The file server must now transfer or stream the files such that each section of the file is transferred at a regular interval of time. The even interval allows the file to arrive isochronously, with the first section displayed before any of the remaining sections of the file have arrived at the client system. This allows a video clip to begin practically instantaneously.
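The contrast between whole-file download and evenly paced delivery can be illustrated with a minimal sketch. The function names, the one-second section size, and the assumption that the clip is encoded at the same 1.5 Mb/second rate as the link are illustrative choices, not details taken from any of the systems described here.

```python
def full_download_delay(file_bytes, link_bps):
    """Seconds a client waits if it must receive the whole file before playback."""
    return file_bytes * 8 / link_bps


def send_schedule(file_bytes, section_bytes, bitrate_bps):
    """Return (byte_offset, send_time_seconds) for each section of a file.

    Sections are released at a constant interval equal to the playback
    duration of one section, which is the isochronous pacing idea.
    """
    interval = section_bytes * 8 / bitrate_bps  # seconds of video per section
    schedule = []
    offset = 0
    index = 0
    while offset < file_bytes:
        schedule.append((offset, index * interval))
        offset += section_bytes
        index += 1
    return schedule


# Hypothetical one-minute clip encoded at 1.5 Mb/second (11,250,000 bytes).
clip_bytes = 11_250_000
delay = full_download_delay(clip_bytes, 1_500_000)
schedule = send_schedule(clip_bytes, 187_500, 1_500_000)
print(delay, len(schedule), schedule[1])
```

Under these assumptions the whole-file approach waits for the full transfer, while the paced schedule releases the first section at time zero and each later section one section-duration after the previous one.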
The rapid advances in the speeds of microprocessors, storage, and network hardware may give a false impression that video on-demand (VOD) solutions do not need special purpose video streaming software. Video streaming allows efficient playback of full motion videos over networks with guaranteed quality using isochronous timing.
When an operating system's default file transfer mode is used to stream a video file, faster hardware may accelerate the operating system's transfer rate, but this improved hardware still cannot change the fundamental, erratic behavior of a file transfer. By default, the file transfer process does not respect the isochronous nature of a video stream. This typically results in a jerky and poor-quality playback of a video stream. The dominant factors of a system's overall streaming performance are the higher level client/server and networking processes, and are not the raw power of the low level physical devices.
U.S. Pat. No. 5,805,821 (Saxena, et al.) teaches a video optimized media streamer user interface employing non-blocking switching to achieve isochronous data transfers. The media streamer includes at least one control node; a user interface having an output coupled to the at least one control node; at least one storage node for storing a digital representation of at least one video presentation; and a plurality of communication nodes each having an input port for receiving a digital representation of at least one video presentation therefrom. The video presentation requires a time T to present in its entirety, and is stored as a plurality of N data blocks. Each data block stores data corresponding to a T/N period of the video presentation. Each communication node further has a plurality of output ports for outputting a digital representation. A circuit switch is connected between the at least one storage node and the input ports of communication nodes for coupling one or more input ports to the at least one storage node. The user interface includes a capability for specifying commands for execution, and the at least one control node is responsive to individual ones of the commands for controlling at least one of the at least one storage node and at least one of the plurality of communication nodes, in cooperation with the circuit switch, so as to execute a function associated with individual ones of the commands. The commands may include video cassette recorder-like commands that include commands selected from a group that includes a Load command, an Eject command, a Play command, a Slow command, a Fast Forward command, a Pause command, a Stop command, a Rewind command, and a Mute command. The commands may also include commands selected from a group that includes a Play List command, a Play Length command, and a Batch command.
A synchronous application program interface (API) is provided for coupling, via the user interface, a user application program to the at least one control node. The API includes Remote Procedure Call (RPC) procedures.
U.S. Pat. No. 5,550,577 (Verbiest, et al.) illustrates a video on demand network, including a central video server and distributed video servers with random access read/write memories. The video on demand network transmits video signals to user stations pursuant to the receipt of control signals issued by these user stations. In order to optimize the retrieval costs, this video on demand network maintains a large video library in a central video server and stores locally popular video signals in a plurality of local distributed video servers from which the latter video signals are transmitted to the user stations. The video signals provided by the local distributed servers are updated from the central server based upon the changing popularity of the video signals. The video on demand network of Verbiest proposes in particular to store the video signals in the local distributed servers in random access read/write memories, e.g., electronic RAMs, magnetic or optical disks from which the video signals can flexibly be supplied on-line to the user stations and to store the video signals in the central server in sequential access memories, e.g. Digital Audio Tapes (DAT) and CD-ROMs (CDR), providing cheap mass storage.
U.S. Pat. No. 5,712,976 (Falcon, Jr., et al.) illustrates a video data streamer for simultaneously transferring data blocks stored from a storage node to any of multiple communication nodes. The media streamer includes at least one storage node including mass storage for retaining a digital representation of at least one video presentation. The mass storage is comprised of a plurality of mass storage units. A video presentation requires a time T to present in its entirety, and is stored as a plurality of N data blocks. Each data block stores data corresponding to approximately a T/N period of the video presentation. The media streamer further includes a plurality of communication nodes each having at least one input port that is coupled to an output of the storage node for receiving a digital representation of a video presentation therefrom. Each of the plurality of communication nodes further includes a plurality of output ports. Each output port transmits a digital representation as a data stream to a consumer of the digital representation. The N data blocks of the digital representation are partitioned into X stripes, wherein data blocks 1, X+1, 2*X+1, etc., are associated with a first one of the X stripes, data blocks 2, X+2, 2*X+2, etc., are associated with a second one of the X stripes, etc., and wherein individual X stripes are each stored on a different mass storage unit. The mass storage units preferably store a single copy of a digital representation of a video presentation. The X stripes are read out in such a manner as to enable multiple data streams to simultaneously convey the N data blocks, or are read out in such a manner as to enable multiple data streams to simultaneously convey different data blocks.
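The block-to-stripe rule described above (blocks 1, X+1, 2*X+1, etc. on the first stripe) can be written as a short sketch. The function names are hypothetical and the code only illustrates the numbering rule; it is not taken from the cited patent.

```python
def stripe_of(block, num_stripes):
    """Return the 1-based stripe that holds a 1-based data block number."""
    if block < 1 or num_stripes < 1:
        raise ValueError("block and num_stripes must be positive")
    return (block - 1) % num_stripes + 1


def blocks_in_stripe(stripe, num_blocks, num_stripes):
    """List the data blocks (1..num_blocks) stored on the given stripe."""
    return list(range(stripe, num_blocks + 1, num_stripes))


# Example with N = 10 data blocks and X = 4 stripes.
for stripe in range(1, 5):
    print(stripe, blocks_in_stripe(stripe, 10, 4))
```

Because consecutive blocks fall on different stripes, and each stripe is on a different mass storage unit, streams reading neighboring blocks do not queue on the same unit.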
U.S. Pat. No. 6,101,547 (Mukherjee, et al.) describes an inexpensive, scalable and open-architecture media server. The multi-media server provides client systems with streaming data requiring soft real-time guarantee and static data requiring a large amount of storage space. The servers use a pull-mode protocol to communicate with client systems through a real-time network. Separate data and control channels enhance the soft real-time capability of the server. The data channel conforms to an open standard protocol such as Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Real-time Transport Protocol (RTP). A switched data link layer for the control channel permits separate intrahost control messages that may be multicast and broadcast. The distributed file system selects a specific data block size based upon the compression technique employed to enhance soft real-time guarantee. A hierarchical data structure combined with merging empty data blocks minimizes disk fragmentation. Data blocks are striped across multiple disks to improve disk utilization. A local buffer and a queue for both read and write requests provide support for simultaneous read and write data streams.
“Performance Evaluation of QuickVideo OnDemand (QVOD) Server,” InfoValue Computing, Inc. Technical Report IV-TR-QVOD-1999-07-1-1, Jul. 8, 1999, InfoValue Computing, Inc., Elmsford, N.Y. describes a video on-demand system developed for high performance, effective and flexible, network-based, on-demand sharing of videos. QuickVideo On Demand provides streaming throughput for broadband applications. Further, QuickVideo On Demand allows a linearly scalable clustering mechanism which provides support for higher throughputs, if required. QuickVideo On Demand supports all video formats, codecs, networks and applications, and is compatible with any open application platform.
“Network Video Computing Via QuickVideo Suite,” InfoValue Technical White Paper, InfoValue Computing, Inc., Elmsford, N.Y., 1999, describes Network Video Computing, the core of which is video streaming. Video streaming allows the efficient playing of full-motion video content over networks with guaranteed quality. The rigid timing property of full motion video is referred to as isochronous timing. File servers are designed to minimize transfer latency during conventional network transfers, and are insensitive to video's unique timing requirement. As a result, delivery rates are irregular and produce erratic playback as described above. Video streaming technologies are real-time network transfers that maintain the video's critical timing property throughout the entire delivery period. This white paper describes an open architecture with a streaming core.
“Web Distribution Systems: Caching and Replication,” Chandbok, Ohio State University, 1999, found at http://www.cis.ohio-state.edu/~jain/cis788-99/web_caching/index.html, Aug. 15, 2000, provides an overview of the current techniques for caching and replication of digital data on computer systems interconnected through a global or local digital communication network.
As described above, video files may be very large, on the order of 1.2 GB for a two hour movie or video presentation. In digital communication networks, the files are generally formed into data packets for transfer. These data packets may not arrive at a designated client system in the correct order for processing. This requires reception of the complete file before processing may begin. If the file is an audio or video file requiring isochronous presentation, the file must be totally received before processing, or the file must be segmented or partitioned into portions to allow smaller units of the file to be processed.
The simplest computing system, as shown in FIG. 1, consists of a single processor 2, its memory 8 and associated external storage media 10 usually implemented as some form of magnetic direct access storage device (DASD). The performance of this system is determined by the speed of the processor 2, the characteristics of the memory bus 6 (bandwidth and latency), the size of the memory 8 and the performance of the DASD 10, which is determined by such factors as seek access time and the bandwidth of the connection to memory 8. Such a machine is referred to as a uniprocessor system. At any given time the state of the semiconductor processing technology determines the maximum performance of such a machine. Efforts to achieve performance above that available to the current technology involve the introduction of multiple disks, processors, and memories.
The first way that performance can be improved beyond that of a simple uniprocessor system is by adding processors 4 as shown in FIG. 1. Such a machine is called a symmetric multiprocessor (SMP). Such machines can, theoretically, multiply the performance of a uniprocessor by the number of processors 4 that are added. In practice, as the number of processors 4 increases, this ideal scaling is not achieved because the processors compete for resources, both logical and physical.
In an SMP design, the memory bus 6 bandwidth may not be sufficient to meet the needs of all the processors 2 and 4 seeking data from memory 8. The bus 6 bandwidth can be increased by using more aggressive technology, but this method of increasing performance is limited. Often caches are added to the memory subsystem that is associated with each processor. In this way, as long as the fetch requests of the processors 2 and 4 are not directed at the same address ranges, contention for the memory bus 6 is avoided. When various caches contain the same object and the object is changed by a processor 2 or 4, then the objects in the other caches must be invalidated. This problem limits the number of processors 2 and 4 that can be supported in an SMP design. Another way to increase SMP performance is to introduce interleaved memory. In an interleaved memory 8, the memory 8 is divided into banks so that simultaneous requests to separate banks are supported. This adds significant complexity and cost to the memory subsystem, and so it is a limited solution that does not scale indefinitely. In addition, the design of an operating system that runs on SMP machines becomes complex, difficult, and time consuming to implement. Despite these difficulties the SMP design is a popular way to increase performance.
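The idea of interleaved memory banks can be illustrated with a minimal sketch. It assumes low-order interleaving, in which consecutive words are assigned to consecutive banks; the text does not specify the interleaving scheme, and the function names and the 8-byte word size are hypothetical.

```python
def bank_of(address, num_banks, word_bytes=8):
    """Bank that serves a byte address under assumed low-order interleaving."""
    return (address // word_bytes) % num_banks


def has_bank_conflict(addresses, num_banks, word_bytes=8):
    """True if two of the simultaneous requests map to the same bank."""
    banks = [bank_of(a, num_banks, word_bytes) for a in addresses]
    return len(set(banks)) < len(banks)


# Four processors reading consecutive words hit four different banks.
print(has_bank_conflict([0, 8, 16, 24], num_banks=4))
# Two requests one full rotation apart land on the same bank.
print(has_bank_conflict([0, 32], num_banks=4))
```

Requests that map to distinct banks can proceed in parallel, while requests that map to the same bank must be served one after another.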
Another way to increase performance is to cluster machines on some kind of network, as shown in FIGS. 2 and 3. In a cluster, the machines within the cluster are called nodes 12. It will be noted that these nodes 12 may actually be clusters of processors themselves. Many networks have been used for this purpose. These networks can be divided into two types: topology transparent and topology visible.
FIG. 4a illustrates a network where the topology is visible. When the nodes 16 are connected in a topology where the communication bandwidth and distance between nodes 16 are not uniform, the topology is visible. In a topology visible network, the topology of the network shows through to the application and can affect performance. This is usually because the latency and/or bandwidth existing between any two processors (nodes 16) on the network are not the same for all pairs of processors. Examples of such networks include two dimensional meshes and n dimensional cubes. One of the drawbacks of these schemes is that the mapping of tasks onto the topology becomes a performance issue. This mapping is, in general, thought to be an NP-hard problem, but for restricted sets of applications this type of network may be used.
In the topology transparent network, as shown in FIG. 4b, the latency and/or bandwidth existing between any two nodes 18 is the same for all pairs of nodes 18. These networks are generally implemented by crossbar switches, multistage networks, or multicast networks such as Ethernet or token ring.
The topology transparent networks consist of nodes 18 configured as a clustering of computer processors. These multi-computer networks have the advantage that they require very little special software and hardware, and a great deal of readily available software and hardware can be used for their implementation.
When processor clusters are used to implement high performance data servers, these clusters must allow for processor failures, or a single processor failure will cause the whole cluster to fail. If a processor in a cluster fails then tasks must not be assigned to that failed processor. A technique for monitoring the physical, logical identity, and operational state of processors within a cluster is commonly referred to as transmission of a heartbeat message.
U.S. Pat. No. 4,710,926 (Brown, et al.) describes fault recovery in a distributed processing system in which a heartbeat is broadcast among the processors once during each major processing cycle. The heartbeat message indicates the physical and logical identity of the transmitting processor with respect to the system arrangement as well as the processor's present operational state. By monitoring the heartbeats from other processors, spare processors can autonomously take over the functions of failed processors without being required to consult or obtain the approval of an executive processor. The new physical location of a replaced processor will be automatically recorded by the other processors.
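Heartbeat monitoring of this kind can be sketched as follows. The class, its method names, the "active" state label, and the timeout rule are hypothetical illustrations; the cited patent is described here only at the level of the message contents (physical identity, logical identity, operational state). Timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the most recent heartbeat seen from each processor."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds without a heartbeat before failure is assumed
        self.last_seen = {}     # physical id -> (timestamp, logical id, state)

    def record(self, physical_id, logical_id, state, now):
        """Store a heartbeat carrying identity and operational state."""
        self.last_seen[physical_id] = (now, logical_id, state)

    def failed(self, now):
        """Processors whose heartbeat is overdue or that report a non-active state."""
        return sorted(
            pid
            for pid, (seen, _logical, state) in self.last_seen.items()
            if now - seen > self.timeout or state != "active"
        )

    def assignable(self, now):
        """Processors that may still be given tasks."""
        down = set(self.failed(now))
        return sorted(pid for pid in self.last_seen if pid not in down)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("p1", "A", "active", now=0.0)
monitor.record("p2", "B", "active", now=0.0)
monitor.record("p1", "A", "active", now=2.0)  # p2 stays silent
print(monitor.failed(now=4.0), monitor.assignable(now=4.0))
```

A spare processor running such a monitor could use the recorded logical identity of a failed processor to decide which role to take over.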
In addition to the processor performance problems that have been discussed above, there are disk access performance issues as well, and these issues become more important in multiple processor systems. There are two general performance issues. The first has to do with single threaded disk access requests through a single processor, and the second has to do with simultaneous requests for data from a single file. In the case where the file system is read only, as in the case of video files, the solutions to these problems are not complicated by consistency issues.
In the first instance, if a file resides on a disk that is only accessible from a single processor complex, then requests for data from that disk are single threaded through a single processor complex. This can become a performance bottleneck. The solution to this problem involves allowing multiple processor connections to the disk controller. In this way, disk requests can be honored by any processor connected to the disk controller.
In the second instance, if multiple processors wish to read from the same file, then the read requests will be single threaded through the single disk on which the file resides. The solution to this problem involves dividing the file into physical segments that are striped onto multiple physical volumes. This striping of data is accomplished by the file system and is known to the file system, and is different from the striping that occurs in a RAID device. RAID devices can support this kind of file segmentation and are supported in the methods described in this application. When this is done, multiple simultaneous requests for the same file can be supported, unless the same physical segment is requested. For files that are under heavy contention, multiple copies of segments can be created.
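Segment placement and the remaining contention case can be sketched in a few lines. The round-robin placement of segments on volumes, the function names, and the 1024-byte segment size are assumptions made for illustration; the text does not prescribe a placement policy.

```python
from collections import Counter


def locate(offset, segment_bytes, num_volumes):
    """Return (segment index, volume index) for a byte offset in the file.

    Assumes segments are placed round-robin across the volumes.
    """
    segment = offset // segment_bytes
    return segment, segment % num_volumes


def contended_segments(offsets, segment_bytes):
    """Segments requested by more than one simultaneous reader."""
    counts = Counter(offset // segment_bytes for offset in offsets)
    return sorted(seg for seg, n in counts.items() if n > 1)


# Three readers at different segments proceed in parallel.
print(contended_segments([0, 1024, 2048], segment_bytes=1024))
# Two readers inside segment 0 contend for the same physical segment.
print(contended_segments([0, 100, 2048], segment_bytes=1024))
```

Segments that appear in the contended list are the candidates for the additional copies mentioned above.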
U.S. Pat. No. 6,014,669 (Slaughter, et al.) describes a highly-available distributed cluster configuration database. The cluster configuration database has a consistent copy of the configuration database maintained on each active node of the cluster. Each node in the cluster maintains its own copy of the configuration database and configuration database operations can be performed from any node. Configuration database updates are automatically propagated to each node in a lock-step manner. If any node experiences a failure, the configuration database uses a reconfiguration protocol to ensure consistent data in each node of the cluster. The cluster configuration database uses a two-level consistency update framework to ensure consistent data among the nodes. Each local configuration database uses a self-contained consistency record to uniquely identify and stamp each copy of the configuration database. The consistency of each individual copy of the configuration database can be verified from the consistency record. Additionally, the cluster configuration database uses a two-phase commit protocol to guarantee the copies of the configuration database are consistent among the nodes. Cluster operations may be performed from any node in the cluster. Cluster operations include restore, checkpoint, update, query, and consistency checking. Because each node maintains a local copy of the configuration database, configuration database queries are local operations and can be performed in parallel among the nodes. Update operations are global operations that involve each active node of the cluster. The cluster configuration database uses a global locking mechanism that implements single writer/multiple reader semantics.
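The general two-phase commit idea mentioned above can be illustrated with a generic sketch. This is a textbook-style outline, not the protocol of the cited patent; the Node class, its methods, and the "healthy" flag are hypothetical.

```python
class Node:
    """A cluster node holding its own local copy of the configuration."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.config = {}
        self.staged = None

    def prepare(self, update):
        """Phase 1: stage the update and vote; an unhealthy node votes no."""
        if not self.healthy:
            return False
        self.staged = dict(update)
        return True

    def commit(self):
        """Phase 2: apply the staged update to the local copy."""
        self.config.update(self.staged)
        self.staged = None

    def abort(self):
        """Phase 2 alternative: discard the staged update."""
        self.staged = None


def two_phase_update(nodes, update):
    """Apply the update on every node, or on none of them."""
    votes = [node.prepare(update) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False


a, b = Node("a"), Node("b")
print(two_phase_update([a, b], {"port": 8080}), a.config, b.config)
```

Because a change is applied only after every node has voted to accept it, the local copies either all receive the update or all keep their previous contents, which is what allows queries to be answered from any local copy.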
“XTP as a Transport Protocol for Distributed Parallel Processing,” Strayer et al., Proceedings of the USENIX Symposium on High-Speed Networking, USENIX Assoc., August 1994, illustrates Xpress Transfer Protocol (XTP) as a flexible transport layer protocol designed to provide efficient service without dictating the communication paradigm or the delivery characteristics that qualify the paradigm. XTP provides the tools to build communication services appropriate to the application.