Ethernet in the World's Top500 Supercomputers


White Paper
Ethernet in the World's Top500 Supercomputers
Updated June 2006

Introduction: The Top500 Supercomputing Elite

The world's most powerful supercomputers continue to get faster. According to top500.org, which maintains the list of the 500 supercomputers with the highest Linpack performance, the aggregate performance of the listed computers has grown 21% in the last seven months and 65% in the last year. This growth rate is slower than for other recent lists, but continues to compare favorably with the rate of improvement predicted by Moore's Law (2x every 18 months). The #1 supercomputer on the June 2006 list is still the BlueGene/L, whose performance is unchanged at 280.6 TeraFlops, while the performance needed to hold the #500 position has again risen since November.

The somewhat slower rate of performance improvement is notable throughout the list. At the top, seven of the Top 10 systems from the November 2005 list were able to maintain a Top 10 position; for the November 2005 list, only four systems from the June 2005 list held onto their Top 10 status. At the bottom of the new list, performance improvement caused 158 systems from November to be de-listed, compared to more than 200 systems that fell off the previous time.

All 500 listed supercomputers use architectures that employ large numbers of processors (from as many as 131,072 to as few as 40) to achieve very high levels of parallel performance. A typical modern supercomputer is based on a large number of networked compute nodes dedicated to parallel execution of applications, plus a number of I/O nodes that deal with external communications and with access to data storage resources. Top500.org categorizes the supercomputers on its list in the following way:

Clusters: Parallel computer systems assembled from commercially available systems/servers and networking devices, with each compute or I/O node a complete system capable of standalone operation. The current list includes 364 cluster systems (up from 360 in 11/05), including the #6 and #7 systems in the Top 10.

Constellations: Clusters in which the number of processors in a multi-processor compute node (typically an n-way Symmetric Multi-Processing or SMP node) exceeds the number of compute nodes. There are 38 constellations listed in the Top500 (up from 36 in 11/05), with the highest performing system listed at #5. Not counting the #5 computer, the highest performing constellation is at #67.

Massively Parallel Processors (MPPs): Parallel computer systems comprised in part of specialized, purpose-built computing and/or networking systems that are not commercially available as separate components. MPPs include vector supercomputers, DM-MIMD (distributed memory, multiple instruction stream/multiple data stream), and SM-MIMD (shared memory, multiple instruction stream/multiple data stream) supercomputers available from HPC computer vendors such as IBM, Cray, SGI, and NEC. MPP systems account for 98 entries on the current list (down from 104 in 11/05), including #1 through #4 and three more of the Top 10.

For a more detailed discussion of supercomputer designs and topologies, see the following white papers on the Force10 Networks website: "Building Scalable, High Performance Cluster/Grid Networks: The Role of Ethernet" and "Ethernet in High Performance Computing Clusters".

Among the major trends in recent Top500 lists is the emergence of highly scalable switched Gigabit Ethernet as the most widely deployed system interconnect (SI).
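To make the three categories concrete, the sketch below applies the rules as described: a system built from commodity parts counts as a constellation when the processors per SMP compute node exceed the number of compute nodes, otherwise as a cluster, while purpose-built systems fall into the MPP bucket. The field names and sample records are hypothetical illustrations, not actual Top500 entries.

```python
def classify(system):
    """Bucket a system record using the Top500 categories described above.

    Fields (hypothetical): 'nodes' = number of compute nodes,
    'procs_per_node' = processors per compute node,
    'commodity' = True if built from commercially available servers/switches.
    """
    if system["commodity"]:
        # Constellation: more processors inside each SMP node than nodes in the system.
        if system["procs_per_node"] > system["nodes"]:
            return "constellation"
        return "cluster"
    # Purpose-built compute or interconnect that is not sold separately.
    return "MPP"

# Illustrative records only (not actual Top500 entries).
examples = [
    {"name": "small SMP farm", "nodes": 16,  "procs_per_node": 32, "commodity": True},
    {"name": "rack cluster",   "nodes": 512, "procs_per_node": 2,  "commodity": True},
    {"name": "vector system",  "nodes": 640, "procs_per_node": 8,  "commodity": False},
]

for s in examples:
    print(s["name"], "->", classify(s))
```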
The remainder of this document focuses on the key roles that Ethernet plays as a multi-purpose interconnect technology used by virtually all supercomputers on the Top500 list.

The Rise of Clusters and Ethernet Cluster Interconnect

Cluster systems have become the dominant category of supercomputer largely because of the unmatched price/performance ratio they offer. As shown by the top curve in Figure 1, the total number of clusters on the list continues to grow, with an increase of 26% in the last two years. As clusters have become both more powerful and more cost-effective, they have helped to make High Performance Computing (HPC) considerably more accessible to corporations for speeding up numerous parallel applications in research, engineering, and data analysis. As shown by the middle curve in Figure 1, the number of Top500 clusters owned by industrial enterprises has grown by 53% over the last two years. The Top500 list places supercomputer owners in the following categories: industry, research, government, vendor, academic, and classified.

Figure 1. Growth of cluster systems on the Top500

Over the last three years, the adoption of HPC clusters by industrial enterprises has spurred a 26% increase in the total number of industrial systems on the Top500, as shown by Figure 2. Clusters have now become the dominant computer architecture for the industrial component of the Top500. In June 2006, clusters account for over 88% of the 257 industrial supercomputers on the list.

Figure 2. Growth of industry-owned systems on the Top500

The cost-effectiveness of clusters as a category of supercomputer is driven by three major factors:

1. Availability of high volume server products incorporating very high performance single and multi-core microprocessors, minimizing hardware costs. Enterprises can even build high performance clusters using the same models of server already being deployed in the data center for mainstream IT applications.

2. Linux is the cluster operating system of choice, minimizing software licensing costs. Linux is the operating system used by 367 supercomputers on the list, up from 334 one year ago. Most Linux systems on the list are clusters, although some MPPs, including the IBM BlueGenes and Cray XT3s, also run Linux. Because Linux is increasingly popular as an enterprise server operating system, no new expertise is required and applications can readily be migrated from conventional Linux servers to Linux clusters.

3. Gigabit Ethernet (GbE) is the most cost-effective networking system for cluster interconnect (inter compute-node communication). GbE is particularly attractive to enterprises because it is already a familiar mainstream technology in data centers and campus LAN networks. In addition, most high end Linux servers come with integral GbE at no extra cost. As shown by the bottom curve on Figure 1, GbE is the cluster interconnect for 94% of the industrial enterprise clusters on the Top500 (212 out of 226). The top performing GbE cluster, at #27 on the list with performance of 12.3 TeraFlops, is a Geoscience industry computer built by IBM. The system is a BladeCenter LS20 with 5,000 AMD Opteron processor cores.

Networking for Supercomputers

Regardless of whether the supercomputer architecture is a cluster, constellation, or MPP, the compute nodes that house the multiple processors must be supported by a network or multiple networks to provide the system connections for the following functions:

IPC Fabric: Also known as simply the "Interconnect", an essential aspect of multi-processor supercomputing is the interprocessor communications (IPC) that allow large numbers of processors/compute nodes to work in a parallel, yet coordinated, fashion. Depending on the application, the bandwidth and latency of transfers between processors can have a significant impact on overall performance. For example, processors may waste time in an idle state waiting to receive intermediary results from other processors. All compute nodes are connected to the IPC fabric. In some cases, I/O nodes are also connected to this fabric.

Management Fabric: A separate management fabric allows system nodes to be accessed for control, troubleshooting, and status/performance monitoring without competing with the IPC for bandwidth. In general, every compute node and I/O node is attached to the management fabric.

I/O Fabric: This fabric connects the supercomputer I/O nodes to the outside world, providing user access and connection to complementary computing resources over the campus LAN, WAN, or Internet.

Storage Fabric: A common practice is to attach file servers or other storage subsystems to I/O nodes. This isolates the compute nodes from the overhead of storage access. A separate storage fabric may be used to provide connectivity between I/O nodes and file servers. In a few cases, the compute nodes are attached to storage resources via a SAN, which then acts as the storage fabric.

Top500.org focuses most of its attention on the Interconnect (IPC) fabric because that is obviously the network connection that has the primary impact on system performance. Figure 3 shows the number of supercomputers on recent Top500 lists that use each type of IPC Interconnect fabric. The continued rapid growth of GbE to its current position as the #1 Interconnect fabric in the Top500 (51% of the 6/06 list) reflects the accelerated adoption of clusters discussed previously. Virtually all the GbE IPC fabric systems shown in the chart are clusters.

Figure 3. Growth of Gigabit Ethernet vs. other IPC interconnects in the Top500

In Figure 3, the "Other" category includes vendor-specific Interconnect fabrics that computer vendors incorporate in their products, such as the IBM and Cray 3D torus networks, IBM's SP and Federation networks, the NEC crossbar, and the SGI NumaLink. Myrinet and Quadrics are commercially available proprietary HPC switching systems that were both specifically designed to provide low latency IPC system interconnect.

Figure 4. System interconnects of the Top 10 supercomputers (June 2006 Top500 list). For each system, the table summarizes the processor count, system type (DM-MIMD, SM-MIMD, cluster, or vector), IPC interconnect fabric, control/management network, external network I/O, storage fabric, and Linpack TeraFlops; the individual systems are described in detail later in this document.
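The latency and bandwidth sensitivity described for the IPC fabric is commonly characterized with a simple two-node MPI ping-pong test. The sketch below is a minimal illustration of such a test; it assumes the mpi4py package and an MPI launcher are available on the cluster, which is an assumption rather than anything specified in this paper, and the message sizes are arbitrary examples.

```python
# Minimal MPI ping-pong sketch (assumes mpi4py is installed; run with e.g.
# "mpirun -np 2 python pingpong.py"). Rank 0 sends a message to rank 1,
# which echoes it back; half the round-trip time approximates one-way latency,
# and large messages give an estimate of effective bandwidth.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in (8, 4096, 1_048_576):          # bytes per message (example sizes)
    buf = np.zeros(size, dtype="u1")
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        one_way = elapsed / (2 * REPS)
        print(f"{size:>8} B: latency ~{one_way * 1e6:8.1f} us, "
              f"bandwidth ~{size / one_way / 1e6:8.1f} MB/s")
```

On a GbE cluster the small-message result is dominated by host-side TCP/IP processing plus switch latency; that host component is what the TOE and RDMA NICs discussed below aim to reduce.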

InfiniBand is an industry standard, low latency, general purpose system interconnect. InfiniBand is the only IPC interconnect besides GbE that is capturing a growing share of the Top500 list. As this chart indicates, the Top500 list as a whole is moving away from proprietary interconnect technologies. On the June 2006 list, all but one of the Quadrics systems are listed as clusters, while Myrinet systems include 53 clusters and 33 constellations (i.e., nearly all the constellations in the Top500 use Myrinet as the IPC Interconnect fabric). The recent decline in the number of Myrinet systems on the list is due partly to the growth of Ethernet clusters and partly to the decline in the number of constellations that make the list (down to 38 in June 2006 from 70 in June 2005).

Networking for the Top 10 Supercomputers

Figure 4 provides an overview of the Top 10 systems on the June 2006 top500.org list with respect to the networking fabrics deployed in each of the functional areas described above. As can be noted from column four of the table, all of the systems in the Top 10 use low latency IPC Interconnect fabrics to help achieve high performance. Six of the systems in the table rely on the computer vendors' proprietary interconnects, while the clusters use commercially available interconnects (InfiniBand or Quadrics). It is notable that commercially available proprietary interconnects are losing popularity in the Top 10 as well as throughout the list.

Figure 5 summarizes typical performance levels for these more specialized interconnects, as well as Gigabit Ethernet and 10 Gigabit Ethernet, in terms of Message Passing Interface (MPI) latency and bandwidth. There are a number of Ethernet NIC technologies now on the market (RDMA, TOE, kernel bypass, etc.) that reduce the host component of IP/Ethernet MPI latency. Therefore, MPI latency for Ethernet is expected to continue to decline towards a figure closer to the switch latency, on the order of 10 microseconds. The impact of TOE NICs is seen in the last row of the table. The TOE data comes from recent testing of 10 Gigabit Ethernet cluster interconnect by Chelsio Communications, a leading Ethernet NIC supplier, and Los Alamos National Laboratory. The results demonstrate compelling performance levels compared with Myrinet and InfiniBand. 10 GbE switches and TOE NICs are now available at volume price levels, with additional improvements to come over the next couple of years. As these technologies ride further down the cost curve, Ethernet clusters should be able to continue to enhance their share of the Top500 by delivering ever-improving performance even without significant increases in processor counts.

As shown in columns five and six of Figure 4, all of the Top 10 supercomputers use switched Gigabit Ethernet or Fast Ethernet networks as the management fabric and general I/O fabric. Gigabit Ethernet is also the predominant fabric used to connect I/O nodes to file server resources.
Figure 5. Latency and bandwidth of IPC interconnect fabrics. The table compares MPI short-message, single-hop latency and unidirectional MPI bandwidth (MB/s) for NumaLink 4 (SGI), the IBM 3D torus, QsNet II (Quadrics), the Cray SeaStar 3D torus, InfiniBand 4X (Voltaire), the NEC 640 x 640 crossbar, Myrinet XP2 (Myricom), Gigabit Ethernet, and 10 Gigabit Ethernet (the last with and without TOE). Source: IBM, NEC, Sandia, Chelsio, and SGI.

Therefore, although none of these Top 10 systems is categorized as a system with GbE interconnect, all of the systems make extensive use of Fast Ethernet, Gigabit Ethernet, and/or 10 Gigabit Ethernet for non-IPC system interconnect.

A more complete description of the Top 10 supercomputers on the list is included toward the end of this document. If the other 490 systems in the Top500 were examined as closely, we would expect to see that scalable Ethernet switching always plays an important system interconnect role in more than one of the four required functional areas.

Top500 Performance by System Type

Figure 6 is a column chart that shows the Linpack performance of all systems in the Top500, where the type of system is identified by the color of the column. Clusters are increasingly dominant between #100 and #500 on the list, accounting for 80% of the systems. In addition, clusters have made significant inroads among the Top 100 positions on the list. Clusters now occupy 45 positions in the Top 100, including 37 with low latency interconnects and 7 with GbE interconnect. If current trends continue, we can expect to see clusters becoming even more dominant for at least the next one or two list iterations.

Figure 6. Performance of Top500 computers by system architecture

Networking for Gigabit Ethernet Clusters

For supercomputer clusters that use GbE as the Interconnect fabric, switched Ethernet technology can be chosen to satisfy all of the system networking requirements. Figure 7 provides a conceptual example of how this may be done. Highly scalable switches with 10 GbE and GbE ports are connected in a mesh forming a "fat tree" that serves as the IPC fabric connecting the compute nodes and the I/O nodes. Additional 10 GbE or GbE ports in the mesh of switches serve as the storage fabric that provides connectivity between the I/O nodes and file servers. Another set of switch ports and logical interfaces can play the role of an I/O fabric connecting the supercomputer I/O nodes to external resources and users. A separate set of meshed Fast Ethernet switches can be used to construct an out-of-band management fabric. Fast Ethernet has more than adequate bandwidth for the management fabric function and is very inexpensive. Frequently the management fabric can be built using re-purposed high density Fast Ethernet switches previously used for server connectivity in data centers or earlier generations of clusters.

Figure 7. Cluster using Ethernet for all four system interconnect fabrics
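To give a sense of the switching scale behind a GbE cluster fabric of the kind just described, the sketch below works through the port arithmetic for a two-tier fat tree with GbE ports facing the nodes and 10 GbE uplinks into a second tier. The node count, per-switch port counts, and uplink count are hypothetical examples, not a recommendation or a description of any particular product.

```python
# Back-of-envelope port math for a two-tier "fat tree" GbE cluster fabric.
# All parameters are illustrative assumptions, not vendor specifications.
import math

nodes = 1024                 # compute + I/O nodes attached with GbE
leaf_gbe_ports = 48          # GbE node-facing ports per leaf switch (assumed)
leaf_10gbe_uplinks = 4       # 10 GbE uplinks per leaf switch (assumed)

leaves = math.ceil(nodes / leaf_gbe_ports)
uplink_capacity_gbps = leaf_10gbe_uplinks * 10
downlink_capacity_gbps = leaf_gbe_ports * 1
oversubscription = downlink_capacity_gbps / uplink_capacity_gbps

spine_ports_needed = leaves * leaf_10gbe_uplinks   # 10 GbE ports on the second tier

print(f"leaf switches:      {leaves}")
print(f"oversubscription:   {oversubscription:.1f}:1 per leaf")
print(f"spine 10 GbE ports: {spine_ports_needed}")
```

Adding uplinks per leaf, or using chassis switches dense enough to collapse the tree to a single tier, lowers the oversubscription ratio and with it the contention that parallel applications see on the IPC fabric.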

The Top 10 Supercomputers on the June 2006 Top500 List

This section provides additional information on the systems in the Top 10. Information is limited to that which the owners or vendors of the systems have placed on their web sites or elsewhere on the Internet. Links to some of these information sources are included in the Appendix at the end of the document.

#1 Lawrence Livermore National Labs (LLNL) Blue Gene/L

The highest performing supercomputer on the Top500 list is the LLNL Blue Gene/L, whose performance has risen to 280.6 TeraFlops from 137 TeraFlops a year earlier. Blue Gene/L is an MPP system with 65,536 dual-processor compute nodes and 1,024 I/O nodes. The compute nodes run a stripped down version of the Linux kernel and the I/O nodes run a complete version of the Linux operating system. The full system consists of 64 racks, with each rack housing 1,024 compute nodes and 16 I/O nodes.

Blue Gene/L (BG/L) uses three specialized networks for IPC: a 3D torus with 1.4 Gbps of bidirectional bandwidth for the bulk of message passing via MPI, a tree network for collective operations, and a synchronization barrier/interrupt network. Interfaces for all three of these networks are integrated on the node processor ASICs, as shown in Figure 8.

Figure 8. Block diagram of the Blue Gene/L processor ASIC

In addition to the IPC networks, further connectivity is provided by two separate Ethernet networks, as shown in Figure 9. Each compute node has a 100/1000 Ethernet interface dedicated to control and management, including system boot, debug, and performance/health monitoring (control information can also be transmitted via the ASIC's JTAG interface). Each of the 1,024 I/O nodes uses Gigabit Ethernet for file access and external communications. Therefore, the LLNL BG/L system incorporates 65,536 ports of Fast Ethernet or GbE in the control/management network and 1,024 ports of GbE in the I/O and file server network.

Figure 9. High level view of the Blue Gene/L system

One of the key design guidelines for the BG/L was to optimize performance per watt of power consumed rather than maximizing performance per processor. The result is the ability to integrate 1,024 dual-processor compute nodes into a rack 0.9 m wide, 0.9 m deep, and 1.9 m high that consumes 27.5 kW of total power. For example, BG/L yields 25 times more performance per kW than the NEC Earth Simulator at #10 on the current list.

Because of the large number of nodes in a single rack, more than 85% of the inter-node connectivity is contained within the racks. The corresponding dramatic reduction in connectivity across racks allows for higher density, higher reliability, and a generally more manageable system. Because the design philosophy led to a very large number of processors, the decision was made to provide the system with a very robust set of Reliability, Availability, and Serviceability (RAS) features. The BG/L design team was able to exploit the flexibility afforded by an ASIC-level design to integrate a number of RAS features typically not found on commodity servers used in cluster implementations.

As supercomputers continue to scale up in processor count, RAS is expected to become an increasingly critical aspect of HPC system design.

BG/L has been designed to be applicable to a broad range of applications in the following categories: simulations of physical phenomena, real-time data processing, and off-line data analysis. Accordingly, IBM has made the BG/L into a standard product line, which it intends to sell to both the traditional HPC market and the broader enterprise market. The Linux-based IBM eServer Blue Gene Solution is available from 1 to 64 racks, with peak performance up to 5.7 TeraFlops per rack. A one-rack entry version sells for approximately $1.5M. This price/performance point is likely to be attractive for enterprises with compute-intensive, mission critical applications that can be accelerated through parallelization. As a result of this eServer Blue Gene Solution initiative, we can expect to see an increasing number of Blue Gene systems appearing on the Top500 list for some time to come. There are 24 eServer Blue Gene Solution computers on the current list.

#2 IBM Thomas J. Watson Research Center Blue Gene/W

At #2 on the Top500 list, with performance of 91.3 TeraFlops, is another Blue Gene system installed at the IBM Thomas J. Watson Research Center (BG/W). BG/W uses the same system design as BG/L but is a 20-rack system with 20,480 compute nodes and 320 I/O nodes. Therefore, the IBM BG/W system incorporates 20,480 ports of Fast Ethernet or GbE in the control/management network and 320 ports of GbE in the I/O and file server network.

#3 Lawrence Livermore National Laboratory ASC Purple IBM

At #3, with upgraded performance of 75.8 TeraFlops for the June 2006 list, is the ASC Purple built by IBM for LLNL. ASC Purple currently consists of 1,536 nodes, including 1,280 compute nodes and 128 I/O nodes. Purple comprises 131 node racks, 90 disk racks, and 48 switch racks. Each p575 Squadron IH node is an 8-way SMP server that is powered by eight Power5 microprocessors running at 1.9 GHz and is configured with 32 GB of memory.

As shown in Figure 10, the ASC Purple IPC interconnect fabric is an IBM 3-stage, dual-plane Federation switch with 1,536 dual ports. The switch array is built from multi-port switch elements and 9,216 cables, and the fabric provides 8 GBps of peak bi-directional bandwidth. Purple has 2 million gigabytes of storage furnished by SATA and FibreChannel RAID arrays with over 11,000 disks; more than 2,000 FibreChannel links are required for storage access. In addition, the system has two Squadron 64-way Power5 nodes, logically partitioned into four login nodes. Each login node has eight 10 GbE ports for parallel FTP access to the archive and two GbE ports for NFS and SSH (login) traffic. System management functions are facilitated with a separate Ethernet management fabric with over 1,536 Fast Ethernet ports.

Figure 10. High level view of the ASC Purple system
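A quick bit of arithmetic on the login-node figures above shows how much external Ethernet capacity ASC Purple's design adds up to; the sketch below is only a port-count calculation based on the numbers quoted in this paper.

```python
# Aggregate external I/O arithmetic for the ASC Purple login nodes, using only
# the counts quoted above (four login nodes, eight 10 GbE ports each, plus two
# GbE ports each for NFS/SSH traffic).
login_nodes = 4
ten_gbe_ports_per_login = 8
gbe_ports_per_login = 2

ftp_ports = login_nodes * ten_gbe_ports_per_login          # 32 x 10 GbE
ftp_capacity_gbps = ftp_ports * 10                          # 320 Gbps to the archive
nfs_ssh_ports = login_nodes * gbe_ports_per_login           # 8 x GbE
nfs_ssh_capacity_gbps = nfs_ssh_ports * 1

print(f"parallel FTP: {ftp_ports} x 10 GbE = {ftp_capacity_gbps} Gbps")
print(f"NFS/SSH:      {nfs_ssh_ports} x GbE = {nfs_ssh_capacity_gbps} Gbps")
```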

#4 NASA Columbia SGI Altix 3700

The fourth system on the Top500 list, at 51.9 TeraFlops, is the NASA Columbia system consisting of 20 SGI Altix 3700 Superclusters, as shown in Figure 11. Each Supercluster contains 512 Itanium 2 processors with 1 Terabyte of global shared memory across the cluster, and each Supercluster runs a single image of the Linux operating system. The primary fabric for IPC is NumaLink, a low-latency proprietary SGI interconnect with 24 Gbps of bidirectional bandwidth. Each supernode is also connected with InfiniBand and two 10 Gigabit Ethernet ports for I/O and storage system access. Therefore, this system design requires 40 ports of 10 Gigabit Ethernet switching.

Figure 11. NASA's Columbia System

#5 Commissariat à l'Énergie Atomique (CEA) Tera-10

The fifth computer on the list is the Tera-10 supercomputer, with performance of 42.9 TeraFlops, owned by the French nuclear energy agency. The Tera-10 is a Linux cluster of Bull NovaScale 602 servers consisting of 544 compute nodes and 56 I/O nodes, as shown conceptually in Figure 12. Each NovaScale 602 server node has 16 Itanium 2 processors in an SMP configuration. The SMP is based on FAME (Flexible Architecture for Multiple Environments) internal switches that are used to provide individual processors with access to I/O and shared memory. Note that the system has been incorrectly categorized as a constellation in the Top500 list. Quadrics QsNet II provides the IPC fabric connecting compute and I/O nodes, FibreChannel on the I/O nodes is used for storage connect, and Ethernet is used for data I/O and management.

Figure 12. CEA's Tera-10 System

#6 Sandia National Labs Thunderbird Dell PowerEdge Cluster

At the #6 position, the Sandia Thunderbird is the second highest performing cluster on the list, with 38.3 TeraFlops of performance. Thunderbird is constructed of 4,096 compute nodes consisting of Dell PowerEdge servers. Each PowerEdge server has two single-core Intel 64-bit (EM64T) Xeon 3.6 GHz processors, for a total of 8,192 processors. The IPC fabric is provided by 10 Gbps InfiniBand. A large switched Ethernet network with 4,600 GbE ports and forty 10 GbE ports serves as the management fabric, I/O fabric, and storage fabric of the cluster. The management fabric spans the compute nodes, the InfiniBand switches, and the storage.

#7 Tokyo Institute of Technology TSUBAME Sun Fire Cluster

The #7 computer on the list, at 38.2 TeraFlops, is the Tokyo Tech TSUBAME, based on 655 Sun Fire x64 servers with a total of 10,480 AMD Opteron processor cores and Sun InfiniBand-attached storage. Each Sun Fire server uses a Galaxy 4 8-way SMP processor configuration. All nodes are interconnected via InfiniBand DDR (20 Gbps) for IPC communications, as well as for storage interconnect and network I/O via an InfiniBand/Ethernet gateway. The TSUBAME, therefore, is based on the version of a converged server fabric being promoted by InfiniBand vendors. The Sun Fire servers also use ClearSpeed's Advance floating-point co-processors to accelerate floating point operations. The Advance board can reportedly deliver 25 GigaFlops of number-crunching performance while consuming only 10 watts of power. The Advance co-processor is a multi-core, special-purpose parallel processor implemented as a system on a chip. The co-processor uses a MultiThreaded Array Processor (MTAP) with 96 floating point cores, a high-speed network interconnecting them, and dedicated DDR2 memory.
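The appeal of a co-processor like the Advance board is easiest to see as performance per watt. The sketch below works from the figures reported above (25 GigaFlops and 10 watts per board); the one-board-per-server count is a deliberately hypothetical example used for illustration, not TSUBAME's actual configuration.

```python
# Performance-per-watt arithmetic for a floating-point co-processor, using the
# reported 25 GFlops / 10 W per board. The boards-per-node value is hypothetical.
board_gflops = 25.0
board_watts = 10.0
boards_per_node = 1          # assumption for illustration only
nodes = 655                  # Sun Fire servers in TSUBAME (from the text)

print(f"co-processor efficiency: {board_gflops / board_watts:.1f} GFlops per watt")

added_tflops = nodes * boards_per_node * board_gflops / 1000
added_kw = nodes * boards_per_node * board_watts / 1000
print(f"with {boards_per_node} board per node: +{added_tflops:.1f} peak TFlops "
      f"for +{added_kw:.1f} kW")
```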

Clusters based on multi-core SMP compute nodes and multi-core co-processors, perhaps interconnected in a grid of clusters, appear to be a fruitful direction in the pursuit of higher performance that is less constrained by either physical size or power consumption difficulties.

#8 Forschungszentrum Juelich (FZJ) JUBL BlueGene/L

At #8 on the Top500 list, with performance of 37.3 TeraFlops, is another Blue Gene system installed at FZJ. The Juelicher BlueGene/L (JUBL) uses the same system design as BG/L but is an 8-rack system with 8,192 compute nodes and 288 I/O nodes. Therefore, the JUBL system incorporates 8,192 ports of Fast Ethernet or GbE in the control/management network and 288 ports of GbE in the I/O and file server network.

#9 Sandia National Labs Red Storm Cray XT3

Sandia worked closely with Cray to develop Thor's Hammer, the first supercomputer to use the Red Storm architecture. Cray has now leveraged the design to create its next generation product, the Cray XT3 supercomputer, and Thor's Hammer is now listed as the Red Storm Cray XT3. Red Storm currently comprises 5,184 dual-core Opteron compute nodes housed in 108 cabinets. In addition, there are 256 service and I/O nodes housed in 16 cabinets. The compute nodes run a microkernel derived from Linux and developed at Sandia, while the service and I/O nodes run a complete version of Linux. The system architecture allows the number of processors to be increased to 30,000, potentially upping performance from the current 36.2 TeraFlops.

The installation at Sandia will operate as a partitioned network configuration with a classified section (Red) and an unclassified section (Black), as shown in Figure 13. The machine can be rapidly reconfigured to switch 50% of all the compute nodes between the classified and unclassified sections. In Figure 13, the switchable compute cabinets are shown in white. In normal operation, three-quarters of the compute nodes are in either the Red or Black section.

Red Storm uses a 27 x 16 x 24 3D torus IPC fabric to interconnect its compute nodes. The peak bi-directional bandwidth of each link is 7.6 GBps, with a sustained bandwidth in excess of 4 GBps. The torus leverages the Opteron's HyperTransport interfaces and is based on Cray's SeaStar chip. The Cray torus interconnect carries all message passing traffic as well as the traffic between the compute nodes and the I/O nodes, as shown in Figure 13. As with other clusters, Fast Ethernet is used for management of the compute and I/O nodes, for a total of over 2,582 ports. In addition, Red Storm will incorporate more than 80 ports of 10 GbE to connect the system to file servers and another 40 ports for external I/O to other computing resources such as a new "Visualization Cluster" for 3D modeling.

Figure 13. High level view of the Sandia Red Storm
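Part of the attraction of a 3D torus such as Red Storm's (or Blue Gene's) is that each node needs only six links while the worst-case hop count grows slowly with system size. The sketch below computes the node count, maximum hop distance, and link count for a torus with the 27 x 16 x 24 dimensions quoted above; it illustrates general properties of the topology, not Cray's routing implementation.

```python
# Basic properties of a 3D torus interconnect with wrap-around links in each
# dimension. Dimensions taken from the Red Storm description above.
dims = (27, 16, 24)

nodes = dims[0] * dims[1] * dims[2]
# With wrap-around, the farthest node in each dimension is floor(d / 2) hops away.
max_hops = sum(d // 2 for d in dims)
# Each node has 6 links (+x, -x, +y, -y, +z, -z); each link is shared by 2 nodes.
links = nodes * 6 // 2

print(f"torus positions:     {nodes}")      # 10,368
print(f"max hops:            {max_hops}")   # 13 + 8 + 12 = 33
print(f"bidirectional links: {links}")
```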

#10 NEC Earth Simulator System (ESS)

The #10 system is the NEC Earth Simulator, in Japan. The Earth Simulator is a special purpose machine, made by NEC with the same vector processing technology used in the NEC SX-6 commercial product. The decision by NEC to base the design entirely on vector processors was something of a departure from previous approaches to supercomputer design.

The Earth Simulator consists of 640 shared memory vector supercomputers that are connected by a massive high-speed interconnect network. The interconnection network (IN) consists of a 640 x 640 single-stage crossbar switch with approximately 100 Gbps of bi-directional bandwidth per port. The aggregate switching capacity of this interconnect network is over 63 Tbps. This high level of performance was achieved by splitting the switch into 128 data switch units, each consisting of a byte-wide 640 x 640 switch. The 128 data switch units are housed in 65 racks and require over 83,000 cables.

Each supercomputer node contains eight vector processors, each with a peak performance of 8 GFlops, and a high-speed memory of 16 GBytes. The total number of processors is 5,120 (8 x 640), which translates to a total of approximately 40 TeraFlops of peak performance and a total main memory of 10 Terabytes. However, the SX-6 processors consume considerable power and space. With only 16 processors per rack, 320 racks are required for the processors alone. A special building, 65 m x 50 m in area, was constructed to house the ESS, as shown in Figure 14.

Figure 14. Special building for the Earth Simulator

The system layout for the ESS is similar to the one shown in Figure 15, which is from a large NEC SX-8 based system that adheres to the same general architecture. The compute nodes are connected by three switched networks: the 640 x 640 crossbar (IN or IXS), GbE for I/O and management, and a Fibre Channel SAN for storage access. Therefore, the ESS uses a total of 640 ports of GbE switching.

Figure 15. Earth Simulator block diagram
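The Earth Simulator's headline numbers follow directly from the per-node and per-port figures above; the sketch below simply reproduces that arithmetic as a consistency check on the roughly 40 TeraFlops of peak performance and 63+ Tbps of crossbar capacity.

```python
# Aggregate peak performance and crossbar capacity of the Earth Simulator,
# computed from the per-node and per-port figures quoted above.
nodes = 640
processors_per_node = 8
gflops_per_processor = 8
crossbar_gbps_per_port = 100        # approximate bi-directional bandwidth per port

peak_tflops = nodes * processors_per_node * gflops_per_processor / 1000
crossbar_tbps = nodes * crossbar_gbps_per_port / 1000

print(f"peak performance:   ~{peak_tflops:.0f} TeraFlops")   # ~41
print(f"aggregate crossbar: ~{crossbar_tbps:.0f} Tbps")       # ~64
```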

Conclusion

Switched Ethernet technology is making an increasingly significant contribution to the advancement of supercomputing and HPC. Within the Top500, Ethernet has achieved the following milestones:

GbE is now the leading IPC fabric, used by 51% of the supercomputers on the list.

GbE is the leading IPC fabric for clusters; 69% of clusters use GbE.

The cost-effectiveness of GbE is helping make supercomputing accessible to more industrial enterprises; 94% of all industrial clusters use GbE for the IPC fabric.

Driven by GbE cluster technology, supercomputing is being more widely adopted by industry. With the growth in industrial clusters that began in earnest in June 2003, 88% of all industrial supercomputers are now clusters and 51% of the supercomputers on the list are now owned by industrial enterprises.

Although they are not listed as being based on GbE interconnect, the Top 10 supercomputers in the world make extensive use of high density Fast Ethernet, GbE, and 10 GbE switching for non-IPC fabric functions: management, network I/O, and storage I/O.

The cost-effectiveness and accessibility of supercomputing based on GbE clusters have been well demonstrated within the Top500. This is encouraging more enterprises to identify opportunities to derive business benefit from parallel applications and HPC, even in areas such as financial analysis and database processing. As a result, GbE clusters are expected to continue to grow in significance, both as a mainstream technology of the enterprise data center and as a component of the Top500 list.

Appendix: Links for Additional Information

Top500 lists and database: top500.org

#1 LLNL Blue Gene/L: more on how the control Ethernet is used, and general information on IBM's Blue Gene/L

#2 BGW, IBM's Blue Gene at Watson: /4.BGW_Overview.pdf and, on the applications, /7.BGW_Mission_Utilization.pdf

#3 LLNL ASC Purple

#4 NASA Columbia

#5 Commissariat à l'Énergie Atomique (CEA) Tera-10

#6 Sandia Thunderbird

#7 Tokyo Institute of Technology TSUBAME Sun Fire Cluster

#8 Forschungszentrum Juelich (FZJ) JUBL BlueGene/L

#9 Sandia Red Storm: NewsRelease.html and pr10_21_03.html

#10 Earth Simulator

Force10 Networks, Inc., 350 Holger Way, San Jose, CA, USA

2006 Force10 Networks, Inc. All rights reserved. Force10 Networks and the Force10 logo are registered trademarks, and EtherScale, FTOS, SFTOS, and TeraScale are trademarks of Force10 Networks, Inc. All other brand and product names are trademarks or registered trademarks of their respective holders. Information in this document is subject to change without notice. Certain features may not yet be generally available. Force10 Networks, Inc. assumes no responsibility for any errors that may appear in this document.


Brocade Solution for EMC VSPEX Server Virtualization Reference Architecture Brocade Solution Blueprint Brocade Solution for EMC VSPEX Server Virtualization Microsoft Hyper-V for 50 & 100 Virtual Machines Enabled by Microsoft Hyper-V, Brocade ICX series switch,

More information

Using PCI Express Technology in High-Performance Computing Clusters

Using PCI Express Technology in High-Performance Computing Clusters Using Technology in High-Performance Computing Clusters Peripheral Component Interconnect (PCI) Express is a scalable, standards-based, high-bandwidth I/O interconnect technology. Dell HPC clusters use

More information

Finite Elements Infinite Possibilities. Virtual Simulation and High-Performance Computing

Finite Elements Infinite Possibilities. Virtual Simulation and High-Performance Computing Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Finite Elements Infinite Possibilities. Virtual Simulation and High-Performance Computing Microsoft Windows Compute Cluster Server Runs

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Building Enterprise-Class Storage Using 40GbE

Building Enterprise-Class Storage Using 40GbE Building Enterprise-Class Storage Using 40GbE Unified Storage Hardware Solution using T5 Executive Summary This white paper focuses on providing benchmarking results that highlight the Chelsio T5 performance

More information

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband A P P R O I N T E R N A T I O N A L I N C Steve Lyness Vice President, HPC Solutions Engineering slyness@appro.com Company Overview

More information

State of the Art Cloud Infrastructure

State of the Art Cloud Infrastructure State of the Art Cloud Infrastructure Motti Beck, Director Enterprise Market Development WHD Global I April 2014 Next Generation Data Centers Require Fast, Smart Interconnect Software Defined Networks

More information

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL

More information

Optimizing Infrastructure Support For Storage Area Networks

Optimizing Infrastructure Support For Storage Area Networks Optimizing Infrastructure Support For Storage Area Networks December 2008 Optimizing infrastructure support for Storage Area Networks Mission critical IT systems increasingly rely on the ability to handle

More information

Michael Kagan. michael@mellanox.com

Michael Kagan. michael@mellanox.com Virtualization in Data Center The Network Perspective Michael Kagan CTO, Mellanox Technologies michael@mellanox.com Outline Data Center Transition Servers S as a Service Network as a Service IO as a Service

More information

New Storage System Solutions

New Storage System Solutions New Storage System Solutions Craig Prescott Research Computing May 2, 2013 Outline } Existing storage systems } Requirements and Solutions } Lustre } /scratch/lfs } Questions? Existing Storage Systems

More information

Network Bandwidth Measurements and Ratio Analysis with the HPC Challenge Benchmark Suite (HPCC)

Network Bandwidth Measurements and Ratio Analysis with the HPC Challenge Benchmark Suite (HPCC) Proceedings, EuroPVM/MPI 2005, Sep. 18-21, Sorrento, Italy, LNCS, Springer-Verlag, 2005. c Springer-Verlag, http://www.springer.de/comp/lncs/index.html Network Bandwidth Measurements and Ratio Analysis

More information

HPC Update: Engagement Model

HPC Update: Engagement Model HPC Update: Engagement Model MIKE VILDIBILL Director, Strategic Engagements Sun Microsystems mikev@sun.com Our Strategy Building a Comprehensive HPC Portfolio that Delivers Differentiated Customer Value

More information

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services ALPS Supercomputing System A Scalable Supercomputer with Flexible Services 1 Abstract Supercomputing is moving from the realm of abstract to mainstream with more and more applications and research being

More information

SMB Direct for SQL Server and Private Cloud

SMB Direct for SQL Server and Private Cloud SMB Direct for SQL Server and Private Cloud Increased Performance, Higher Scalability and Extreme Resiliency June, 2014 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server

More information

QUADRICS IN LINUX CLUSTERS

QUADRICS IN LINUX CLUSTERS QUADRICS IN LINUX CLUSTERS John Taylor Motivation QLC 21/11/00 Quadrics Cluster Products Performance Case Studies Development Activities Super-Cluster Performance Landscape CPLANT ~600 GF? 128 64 32 16

More information

SummitStack in the Data Center

SummitStack in the Data Center SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution Extreme Networks offers a highly virtualized, centrally manageable

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

High Performance Computing in the Multi-core Area

High Performance Computing in the Multi-core Area High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable

More information

You re not alone if you re feeling pressure

You re not alone if you re feeling pressure How the Right Infrastructure Delivers Real SQL Database Virtualization Benefits The amount of digital data stored worldwide stood at 487 billion gigabytes as of May 2009, and data volumes are doubling

More information

All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at www.frankdenneman.

All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at www.frankdenneman. WHITE PAPER All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at www.frankdenneman.nl 1 Monolithic shared storage architectures

More information

7 Real Benefits of a Virtual Infrastructure

7 Real Benefits of a Virtual Infrastructure 7 Real Benefits of a Virtual Infrastructure Dell September 2007 Even the best run IT shops face challenges. Many IT organizations find themselves with under-utilized servers and storage, yet they need

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Martin County Administration. Information Technology Services. Proposal For. Storage Area Network Systems. Supporting WinTel Server Consolidation

Martin County Administration. Information Technology Services. Proposal For. Storage Area Network Systems. Supporting WinTel Server Consolidation Martin County Administration Information Technology Services Proposal For Storage Area Network Systems Supporting WinTel Server Consolidation February 17, 2005 Version: DRAFT 1.4 Tim Murnane, Systems Administrator

More information

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a

More information

Virtualizing the SAN with Software Defined Storage Networks

Virtualizing the SAN with Software Defined Storage Networks Software Defined Storage Networks Virtualizing the SAN with Software Defined Storage Networks Introduction Data Center architects continue to face many challenges as they respond to increasing demands

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

THE SUN STORAGE AND ARCHIVE SOLUTION FOR HPC

THE SUN STORAGE AND ARCHIVE SOLUTION FOR HPC THE SUN STORAGE AND ARCHIVE SOLUTION FOR HPC The Right Data, in the Right Place, at the Right Time José Martins Storage Practice Sun Microsystems 1 Agenda Sun s strategy and commitment to the HPC or technical

More information