Multicast functionality delivers a packet of network data to ideally all recipients registered in a multicast group. More specifically, a group of receivers register an interest in receiving a particular data stream. This group does not have any physical or geographical boundaries, i.e., the receivers can be located anywhere on the network.
Multicast functionality can be directly supported by the network infrastructure or it may be emulated. For example, in the BladeFrame product by Egenera the multicast functionality is emulated.
In the BladeFrame platform architecture, the communication mechanism is point-to-point and provided in a virtual interface (VI)-modeled switched Giganet fabric. Multicast functionality is emulated using a control node to distribute multicast packets to processor nodes. A single multicast packet is received by the control node which then sends the packet to each processor node in the multicast group.
The BladeFrame platform includes a set of processing nodes connected to the switch fabric via high speed interconnects. The switch fabric is connected to at least one control node that is in communication with an external internet protocol (IP) network or other data communication networks. Processing nodes, control nodes, and at least two switch fabrics are interconnected with a fixed, pre-wired mesh of point-to-point links.
As shown in FIG. 1, an embodiment of a BladeFrame hardware platform 100 includes a set of processing nodes 105a-n connected to switch fabrics 115a,b via high-speed interconnects 110a,b. The switch fabrics 115a,b are also connected to at least one control node 120a,b that is in communication with an external IP network 125 (or other data communication networks), and with a storage area network (SAN) 130. A management application 135, for example, executing remotely, may access one or more of the control nodes via the IP network 125 to assist in configuring the platform 100 and deploying virtualized processing area networks (PANs).
In certain embodiments, about 24 processing nodes 105a-n, two control nodes 120, and two switch fabrics 115a,b are contained in a single chassis and interconnected with a fixed, pre-wired mesh of point-to-point (PtP) links. Each processing node 105 is a board that includes one or more (for example, four) processors 106j-l, one or more network interface cards (NICs) 107, and local memory (for example, greater than 4 Gbytes) that, among other things, includes some BIOS firmware for booting and initialization. There is no local disk for the processors 106; instead all storage, including storage needed for paging, is handled by SAN storage devices 130.
Each control node 120 is a single board that includes one or more (for example, four) processors, local memory, and local disk storage for holding independent copies of the boot image and initial file system that is used to boot operating system software for the processing nodes 105 and for the processors 106 on the processing nodes. Each control node communicates with SAN 130 via 100 megabyte/second Fibre Channel adapter cards 128 connected to Fibre Channel links 122, 124 and communicates with the Internet (or any other external network) 125 via an external network interface 129 having one or more gigabit Ethernet NICs connected to gigabit Ethernet links 121, 123. (Many other techniques and hardware may be used for SAN and external network connectivity.) Each control node includes a low-speed Ethernet port (not shown) as a dedicated management port, which may be used instead of remote, web-based management via management application 135.
The switch fabrics are composed of one or more 30-port Giganet switches 115, such as the NIC-CLAN 1000 (collapsed LAN) and Clan 5300 switch, and the various processing and control nodes use corresponding NICs for communication with such a fabric module. Giganet switch fabrics have the semantics of a Non-Broadcast Multiple Access (NBMA) network. All inter-node communication is via a switch fabric. Each link is formed as a serial connection between a NIC 107 and a port in the switch fabric 115. Each link operates at 112 megabytes/second.
Under software control, the platform supports multiple, simultaneous and independent processing area networks (PANs). Each PAN, through software commands, is configured to have a corresponding subset of processors that may communicate via a virtual local area network that is emulated over the point-to-point mesh.
Certain embodiments allow an administrator to build virtual, emulated LANs using virtual components, interfaces, and connections. Each of the virtual LANs can be internal and private to the platform 100, or multiple processors may be formed into a processor cluster externally visible as a single IP address.
In certain embodiments, the virtual networks so created emulate a switched Ethernet network, though the physical, underlying network is a PtP mesh. The virtual network utilizes IEEE MAC addresses, and the processing nodes support address resolution protocols, for example, IETF ARP processing to identify and associate IP addresses with MAC addresses. Consequently, a given processor node replies to an ARP request consistently whether the ARP request came from a node internal or external to the platform.
The control node-side networking logic maintains data structures that contain information reflecting the connectivity of the LAN (for example, which nodes may communicate with which other nodes). The control node logic also allocates and assigns virtual interface (VI) or reliable virtual interface (RVI) mappings to the defined MAC addresses and allocates and assigns VIs (or RVIs) between the control nodes and between the control nodes and the processing nodes.
As each processor boots, BIOS-based boot logic initializes each processor 106 of the node 105 and, among other things, establishes a (or discovers the) VI to the control node logic. The processor node then obtains from the control node relevant data link information, such as the processor node's MAC address, and the MAC identities of other devices within the same data link configuration. Each processor then registers its IP address with the control node, which then binds the IP address to the node and an RVI (for example, the RVI on which the registration arrived). In this fashion, the control node will be able to bind IP addresses for each virtual MAC for each node on a subnet. In addition to the above, the processor node also obtains the RVI or VI-related information for its connections to other nodes or to control node networking logic.
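The registration sequence described above can be pictured as two lookup tables held by the control node. The following is only a minimal sketch under stated assumptions: the class name `ControlNodeRegistry`, its method names, and the use of plain integers as RVI identifiers are hypothetical and are not taken from the platform's actual software.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ControlNodeRegistry:
    """Hypothetical in-memory model of the control node's binding tables."""

    # node id -> virtual MAC address handed out at boot
    mac_by_node: Dict[str, str] = field(default_factory=dict)
    # IP address -> (node id, RVI id on which the registration arrived)
    binding_by_ip: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    def assign_mac(self, node_id: str, mac: str) -> None:
        """Record the virtual MAC address given to a booting node."""
        self.mac_by_node[node_id] = mac

    def register_ip(self, node_id: str, ip: str, arrival_rvi: int) -> None:
        """Bind an IP address to the node and to the RVI the request used."""
        if node_id not in self.mac_by_node:
            raise KeyError(f"node {node_id!r} has no MAC assignment yet")
        self.binding_by_ip[ip] = (node_id, arrival_rvi)

    def resolve(self, ip: str) -> Tuple[str, int]:
        """Return the (virtual MAC, RVI id) pair bound to an IP address."""
        node_id, rvi = self.binding_by_ip[ip]
        return self.mac_by_node[node_id], rvi
```

In this sketch, a later lookup by IP address yields both the virtual MAC address and the RVI, which mirrors the binding of IP addresses to virtual MACs on a subnet described above.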
Thus, after boot and initialization, the various processor nodes should understand their Layer 2, data link connectivity. As will be explained below, Layer 3 (IP) connectivity and specifically Layer 3 to Layer 2 associations are determined during normal processing of the processors as a consequence of the address resolution protocol.
FIG. 2A details the processor-side networking logic 210 and FIG. 2B details the control node-side networking logic 310 of certain embodiments. The processor-side logic 210 includes IP stack 305, virtual network driver 310, ARP logic 350, RCLAN layer 315, and redundant Giganet drivers 320a,b. The control node-side logic 310 includes redundant Giganet drivers 325a,b, RCLAN layer 330, virtual cluster proxy logic 360, virtual LAN server 335, ARP server logic 355, virtual LAN proxy 340, and physical LAN drivers 345.
The IP stack 305 is the communication protocol stack provided with the operating system (e.g., Linux) used by the processing nodes 105. The IP stack provides a Layer 3 interface for the applications and operating system executing on a processor 106 to communicate with the simulated Ethernet network. The IP stack provides packets of information to the virtual Ethernet layer 310 in conjunction with providing a Layer 3 IP address as a destination for that packet. The IP stack logic is conventional except that certain embodiments avoid checksum calculations and logic.
The virtual Ethernet driver 310 will appear to the IP stack 305 like a “real” Ethernet driver. In this regard, the virtual Ethernet driver 310 receives IP packets or datagrams from the IP stack for subsequent transmission on the network, and it receives packet information from the network to be delivered to the stack as an IP packet.
The stack builds the MAC header. The “normal” Ethernet code in the stack may be used. The virtual Ethernet driver receives the packet with the MAC header already built and the correct MAC address already in the header.
For any multicast or broadcast type messages, the virtual Ethernet driver 310 sends the message to the control node on a defined VI. The control node then clones the packet and sends it to all nodes (excluding the sending node) and the uplink accordingly. Further details regarding the virtual Ethernet driver, the RCLAN layer, the virtual interfaces, and generally the processor-side networking logic are described in International Publication Number WO 02/086712, published on 31 Oct. 2002, entitled “Virtual Networking System and Method in a Processing System,” the entire teachings of which are herein incorporated by reference.
On the control node side, the virtual LAN server logic 335 facilitates the emulation of an Ethernet network over the underlying NBMA network. The virtual LAN server logic manages membership to a corresponding virtual LAN; provides RVI mapping and management; handles ARP processing and IP-to-RVI mapping; provides broadcast and multicast services; facilitates bridging and routing to other domains; and manages service clusters.
Administrators configure the virtual LANs using management application 135. Assignment and configuration of IP addresses on virtual LANs may be done in the same way as on an “ordinary” subnet. The choice of IP addresses to use is dependent on the external visibility of nodes on a virtual LAN. If the virtual LAN is not globally visible (either not visible outside the platform 100, or from the Internet), private IP addresses should be used. Otherwise, IP addresses must be configured from the range provided by the internet service provider (ISP) that provides the Internet connectivity. In general, virtual LAN IP address assignment must be treated the same as normal LAN IP address assignment. Configuration files stored on the local disks of the control node 120 define the IP addresses within a virtual LAN. For the purposes of a virtual network interface, an IP alias just creates another IP to RVI mapping on the virtual LAN server logic 335. Each processor may configure multiple virtual interfaces as needed. The primary restrictions on the creation and configuration of virtual network interfaces are IP address allocation and configuration.
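The addressing rule above (private addresses for virtual LANs that are not globally visible, ISP-provided addresses otherwise) can be checked mechanically. The sketch below uses Python's standard `ipaddress` module; the function name `address_matches_visibility` and its `globally_visible` flag are illustrative assumptions, not part of the management application 135.

```python
import ipaddress


def address_matches_visibility(ip: str, globally_visible: bool) -> bool:
    """Check an address against the visibility rule for a virtual LAN.

    A virtual LAN that is not globally visible should use private
    addresses; a globally visible one needs a non-private address.
    (Hypothetical helper; whether a public address really comes from
    the ISP-provided range cannot be verified here.)
    """
    is_private = ipaddress.ip_address(ip).is_private
    return is_private != globally_visible
```

For example, `address_matches_visibility("192.168.1.10", globally_visible=False)` accepts a private address on an internal virtual LAN, while the same address would be rejected for a globally visible one.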
Each virtual LAN server 335 is configured to manage exactly one broadcast domain, and any number of Layer 3 (IP) subnets may be present on the given Layer 2 broadcast domain. The servers 335 are configured and created in response to administrator commands to create virtual LANs.
With regard to processor connections, as nodes register with the virtual LAN server 335, the virtual LAN server creates and assigns virtual MAC addresses for the nodes, as described above. In conjunction with this, the virtual LAN server logic maintains data structures reflecting the topology and MAC assignments for the various nodes. The virtual LAN server logic then creates corresponding RVIs for the unicast paths between nodes. These RVIs are subsequently allocated and made known to the nodes when the nodes boot. Moreover, the RVIs are also associated with IP addresses during the virtual LAN server's handling of ARP traffic. The RVI connections are torn down if a node is removed from the topology.
With respect to broadcast and multicast services, broadcasts are handled by receiving the packet on a dedicated RVI. The packet is then cloned by the server 335 and unicast to all virtual interfaces 310 in the relevant broadcast domain.
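The clone-and-unicast step can be sketched as a simple loop. This is an illustrative model only: the function name `reflect_broadcast`, the string node identifiers, and the `unicast` callback are assumptions, and skipping the sender follows the earlier statement that the control node sends to all nodes excluding the sending node.

```python
from typing import Callable, Iterable


def reflect_broadcast(
    packet: bytes,
    sender: str,
    members: Iterable[str],
    unicast: Callable[[str, bytes], None],
) -> int:
    """Clone a broadcast packet and unicast one copy to each other member.

    Returns the number of copies transmitted.
    """
    sent = 0
    for member in members:
        if member == sender:
            continue  # the originating node does not get its own packet back
        unicast(member, bytes(packet))  # each recipient receives its own copy
        sent += 1
    return sent
```

Because the underlying fabric is point-to-point, one received broadcast becomes one unicast transmission per remaining member of the broadcast domain.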
The same approach is used for multicast. All multicast packets will be reflected off the virtual LAN server. Under some alternative embodiments, the virtual LAN server will treat multicast the same as broadcast and rely on IP filtering on each node to filter out unwanted packets.
When an application wishes to send or receive multicast traffic, it must first join a multicast group. When a process on a processor performs a multicast join, the processor virtual network driver 310 sends a join request to the virtual LAN server 335 via a dedicated RVI. The virtual LAN server then configures a specific multicast MAC address on the interface and informs the LAN proxy 340 as necessary. The proxy 340 must keep track of use counts on specific multicast groups so that a multicast address is removed only when no processor belongs to that multicast group.
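The use-count bookkeeping attributed to the proxy amounts to reference counting per multicast address. The sketch below is a hedged illustration; the class name `MulticastUseCounts` and the boolean return convention are hypothetical and chosen only to show when an address would be configured or removed.

```python
from collections import Counter


class MulticastUseCounts:
    """Hypothetical per-group use counter, as a LAN proxy might keep."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def join(self, group_mac: str) -> bool:
        """Count a join; return True if the address must now be configured."""
        self._counts[group_mac] += 1
        return self._counts[group_mac] == 1

    def leave(self, group_mac: str) -> bool:
        """Count a leave; return True if the address can now be removed."""
        if self._counts[group_mac] == 0:
            raise ValueError(f"no members recorded for {group_mac}")
        self._counts[group_mac] -= 1
        if self._counts[group_mac] == 0:
            del self._counts[group_mac]  # last member left the group
            return True
        return False
```

Only the first join and the last leave change the configured address set; intermediate joins and leaves just adjust the count.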
The BladeFrame architecture uses the control node for multicast functionality as illustrated in FIG. 3. Multicast packets, whether they originate from an external source or a processor node, are processed by the control node 412. The control node 412 replicates the multicast packet and forwards the multicast packet to each processor node 414a . . . n. All multicast emulation is centralized in the control nodes.
Recipients that are interested in receiving data flowing to a particular multicast group join the group using Internet Group Management Protocol (IGMP). The control node 412 keeps track of the multicast group membership of processor nodes 414a . . . n using IGMP snooping, a Layer 2 function for handling multicast in a Layer 2 switching environment. For every multicast packet received, the control node iterates over the list of processor nodes according to the multicast membership information and transmits the packet to each member processor node. Because all multicast processing takes place in the control node, scalability issues arise in certain contexts.
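The centralized forwarding loop, and why its cost grows with group size, can be illustrated with a small model. This is a sketch under assumptions: the class `SnoopingControlNode`, its method names, and the `transmissions` counter are hypothetical, and no real IGMP messages are parsed; membership changes are simply passed in as calls.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class SnoopingControlNode:
    """Hypothetical model of a control node that tracks group membership."""

    def __init__(self) -> None:
        self.members: Dict[str, Set[str]] = defaultdict(set)  # group -> nodes
        self.transmissions = 0  # total unicast copies sent so far

    def on_membership_report(self, node: str, group: str) -> None:
        """Record that a node was observed joining a group."""
        self.members[group].add(node)

    def on_leave(self, node: str, group: str) -> None:
        """Record that a node was observed leaving a group."""
        self.members[group].discard(node)

    def forward(self, group: str, packet: bytes) -> List[Tuple[str, bytes]]:
        """Return one (node, packet) transmission per current group member."""
        recipients = sorted(self.members.get(group, ()))
        self.transmissions += len(recipients)
        return [(node, packet) for node in recipients]
```

In this model every received multicast packet costs the control node one transmission per member, so the load on that single node scales with packet rate multiplied by group size, which is the scalability concern noted above.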