Clusters: Mainstream Technology for CAE
Alanna Dwyer, HPC Division, HP
Linux and Clusters Sparked a Revolution in High Performance Computing!
- Supercomputing performance is now affordable and accessible
- Linux enabled the use of industry-standard technologies
- Many more users and new applications
- Cluster growth rate is over 50% per year (clusters account for half of HPC volume)
- Clusters are now a critical resource in meeting today's CAE challenges
  - Increasingly complex CAE analysis demands more and larger models, more jobs to run, and longer runs
- The market is responding, adding enterprise RAS features to clusters
  - Treating clusters like products, not custom deployments
- Integration with large SMP systems allows one to optimize resource deployment
  - Some jobs just can't be distributed
Why cluster?
- Budget: price-performance (a 10+ GFLOPs system today costs less than $4K; see the quick arithmetic below)
- Scale beyond practical SMP limits
- Faster time to market and profit, improved insights
- Resource consolidation: centralized management, optimized utilization
- Clusters aren't just compute engines
  - The same principles can be applied to file systems and visualization
  - This helps deal with the exponential growth in the volume of simulation data
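A quick back-of-the-envelope check of the price-performance bullet above, using only the figures on the slide (a roughly 10 GFLOPs system for under $4,000); everything else is plain arithmetic.

```python
# Price-performance check using the figures quoted on the slide above.
system_price_usd = 4000   # upper bound quoted on the slide
peak_gflops = 10          # approximate system performance quoted on the slide

print(f"<= ${system_price_usd / peak_gflops:.0f} per GFLOP")  # about $400 per GFLOP or less
```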
Application Experience
User application (courtesy of NTUST): a large-scale FE model (nonlinear continuum mechanics)
- Computing time of 80 days was needed with 1 CPU in the year 2000
- 14 AMD Athlon 1600+ processors with Myrinet: 67 hours
- 96 processor cores of HP Opteron 270 on the NTUST cluster: < 12 hours (see the elapsed-time arithmetic below)
- A home-made application, ported in less than a day
NTUST: National Taiwan University of Science and Technology
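A quick worked comparison of the elapsed times quoted above. Note that the three runs use different hardware generations, so these ratios mix parallel speedup with per-core performance gains; treating the 80-day figure as wall-clock time is my assumption.

```python
# Elapsed-time ratios for the NTUST FE model runs quoted above.
# The three runs are on different hardware generations, so these ratios
# combine parallel speedup with per-core performance improvements.
baseline_hours = 80 * 24   # 1 CPU in year 2000: 80 days
athlon_hours = 67          # 14 AMD Athlon 1600+ processors with Myrinet
opteron_hours = 12         # 96 HP Opteron 270 cores (upper bound: "< 12 hours")

print(f"14-CPU Athlon cluster:   {baseline_hours / athlon_hours:.0f}x faster than the 2000 baseline")
print(f"96-core Opteron cluster: at least {baseline_hours / opteron_hours:.0f}x faster")
# -> roughly 29x and 160x or better, respectively
```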
SMP vs. Cluster (Farm) Example
[Chart: MSC.Nastran XLTDF comparison; total elapsed time vs. number of processes (1, 2, 4) for an Integrity rx5670 4-way SMP, a ProLiant DL145 G2 2-node cluster, and an Integrity rx2620 2-node cluster]
CAE Application Sub-Segments
- Pre/Post: serial (SMP*); job scalability 32-64 GB; typical solution: workstation or SMP server
- Structures: SMP (MPI*); job scalability 1-4 (8*) cores; typical solution: Integrity SMP or farm
- Impact: MPI; job scalability 2-16 (32*) cores; typical solution: x64 cluster
- Fluids: MPI; job scalability 4-128 (256*) cores; typical solution: x64 cluster
CPU cycles across all jobs - Auto: structures 30%, impact 60%, fluids 10%; Aero: structures 30%, impact 20%, fluids 50%
(* emerging capability)
A rough concurrency estimate based on the per-job core counts above appears below.
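A rough concurrency estimate derived from the job-scalability figures above. The 512-core cluster size and the choice of the upper end of each non-emerging core range as "typical cores per job" are illustrative assumptions, not sizing guidance.

```python
# Rough concurrency estimate from the job-scalability figures above.
# The 512-core cluster size and the per-job core counts (upper end of each
# range, excluding emerging capability) are illustrative assumptions only.
cluster_cores = 512

typical_cores_per_job = {
    "structures": 4,   # 1-4 (8*) cores
    "impact": 16,      # 2-16 (32*) cores
    "fluids": 128,     # 4-128 (256*) cores
}

for domain, cores in typical_cores_per_job.items():
    print(f"{domain:10s}: up to {cluster_cores // cores} concurrent jobs at {cores} cores each")
```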
HPC Cluster Implementation Challenges
- System and workload management
- Scalable performance
- Scalable data management
- Interconnect/network complexity
- Application availability and scalability
- Power and cooling
- Acquisition and deployment
Latest Advancements in Clustering
- Multi-core delivering continued price-performance improvements
- Improvements in clustering software and tools
- More applications are being developed and tuned to leverage cluster/DMP solutions
- Principles of compute clusters being applied to storage and visualization
- InfiniBand now established in HPC
- Solutions now coming to market that address power and cooling concerns
HP Unified Cluster Portfolio
Powerful Solver Technology
Applications: ISVs standardizing on HP-MPI include Molpro (University of Cardiff) and AMLS.
"One of the top reasons that we went with HP-MPI is that we've had a great working relationship with HP. It was a win-win for ANSYS, HP and our customers - in terms of cost, interconnects, support and performance compared to other message passing interfaces for Linux and Unix. In addition, I've always had great turnaround from HP in response to hardware and software issues." - Lisa Fordanich, Senior Systems Specialist, ANSYS (www.ansys.com/services/ss-interconnects.htm)
"HP-MPI is an absolute godsend," notes Keith Glassford, director of the Materials Science division at San Diego, CA-based Accelrys Software Inc. "It allows us to focus our energy and resources on doing what we're good at, which is developing scientific and engineering software to solve customer problems."
CAE Reference Architecture
[Architecture diagram: client and remote workstations; a front end with an HA job scheduler; a pre/post SMP with RGS and a direct-attached disk array (or SFS); compute SMPs; multiple compute clusters; an InfiniBand switched-fabric interconnect; a LAN; a Scalable File Share serving metadata and object data; and a visualization cluster]
A Cluster Alternative to Direct-Attached Storage: HP Scalable File Share (SFS)
Applying the principles of clusters to file systems and storage enables sharing of data sets without a performance penalty.
MSC.Nastran is fast on HP SFS: replace extra-disk "fat" nodes with flexible storage (see the scratch-selection sketch below).
- Traditional approach: special nodes in the cluster with multiple local JBOD disks
  - Expensive and hard to manage
- New approach: a fast, centralized, virtualized HP SFS file system
  - Similar performance, lower cost: shared rather than dedicated storage
  - Easier to use: any node in the cluster can run Nastran
  - Higher reliability: RAID 6 instead of RAID 0
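A minimal sketch of the storage choice described above: prefer a shared SFS scratch area so any node can run the job, falling back to node-local disk only if the shared file system is not mounted. The mount points and job naming are hypothetical placeholders, not HP defaults, and the actual solver invocation is omitted.

```python
# Minimal sketch of the scratch-selection logic implied above: use the shared,
# centralized SFS file system when available (any node can then run Nastran);
# fall back to node-local JBOD scratch otherwise.
# The mount points below are hypothetical placeholders, not HP defaults.
import os

SFS_SCRATCH = "/sfs/scratch"      # assumed SFS mount point (shared, RAID 6)
LOCAL_SCRATCH = "/local/scratch"  # assumed node-local scratch (dedicated, RAID 0)

def pick_scratch_dir(job_id: str) -> str:
    """Return a per-job scratch directory, preferring the shared SFS mount."""
    base = SFS_SCRATCH if os.path.ismount(SFS_SCRATCH) else LOCAL_SCRATCH
    path = os.path.join(base, job_id)
    os.makedirs(path, exist_ok=True)
    return path

if __name__ == "__main__":
    print(pick_scratch_dir("nastran_xxcmd_001"))  # hypothetical job name
```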
MSC.Nastran Benchmark XXCMD
- Standard MSC benchmark XXCMD: solution of the natural frequencies of an automotive body
- Performs a medium amount of I/O compared with real-life customer datasets in industry (4 TB of I/O with a 256 KB block size; see the quick arithmetic below)
- Multiple jobs run simultaneously, with no shared data
  - Customers typically use direct-attached storage for each host; 1 controller and 5 drives per job are recommended for good throughput
- SFS performance: 1 Object Storage Server node and 4 enclosures (with arrays of SATA drives) for every 4 hosts achieved excellent performance
  - No degradation for up to 16 hosts, and small degradation from 16 to 32 hosts
  - Significant (~6x) advantage vs. a small SCSI configuration
[Chart: elapsed time in seconds (smaller is better) vs. number of hosts (1 to 32) at 2 jobs per host, for SFS, MSA, and SCSI storage]
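A quick check of what the quoted I/O volume implies per job, assuming binary units (TiB and KiB); the slide does not say whether decimal or binary units are meant.

```python
# How many transfers does "4 TB of I/O with a 256 KB block size" imply?
# Binary units (TiB / KiB) are assumed; the slide does not specify.
total_io_bytes = 4 * 2**40    # 4 TiB
block_bytes = 256 * 2**10     # 256 KiB

print(f"{total_io_bytes // block_bytes:,} I/O transfers per job")  # 16,777,216
```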
Key Considerations in Designing a Solution
- What processor and interconnect for the mix of jobs?
- Centralized resource or single-purpose systems?
  - Can applications co-exist? Economics of consolidation
- Environmentals: power, cooling, weight, space
- Roll your own system or acquire a total solution?
- Production scalability requirements
  - Performance
  - Availability and reliability
  - Manageability (provisioning, booting, monitoring, upgrades)
- Budget, of course, and TCO
For more information:
- www.hp.com/go/hptc
- Cluster Platform Express: www.hp.com/go/cp-express
- alanna.dwyer@hp.com
Implementations of the CAE Reference Architecture: AMD Opteron Example
Fast - HP xw9300 workstation (Opteron workstation for pre/post)
- Two dual-core Opteron 2.6 GHz CPUs
- Two internal 146 GB drives
- 32 GB memory, DVD
Faster - ProLiant DL585 server with disk array (Opteron server for structural analysis)
- DL585 in a 22U rack with factory integration
- Four dual-core Opteron CPUs
- Two internal 146 GB drives
- 32 GB memory
- MSA30 dual-bus disk array
Fastest - CP4000 cluster (Opteron cluster for CFD and impact analysis)
- HP Cluster Platform 4000 compute cluster in a 42U rack, Sidewinder option
- DL385 head node for cluster administration
- DL145 G2 compute nodes with two dual-core Opteron CPUs, each node with one internal drive and 4 GB memory (1 GB/core)
- DL585 front-end node with 64 GB for grid generation and domain decomposition (see the toy decomposition sketch below)
- XC Software Operating Environment support
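To illustrate the domain-decomposition step the front-end node performs before work is farmed out to the compute nodes, here is a minimal, generic sketch. The even block partitioning and the cell/partition counts are illustrative assumptions; real CAE solvers use graph partitioners that also minimize the interfaces between subdomains.

```python
# Toy illustration of domain decomposition: split a mesh's cells into
# near-equal contiguous blocks, one per compute node / MPI rank.
# Real solvers use graph partitioners (minimizing subdomain interfaces);
# this only shows the basic idea with illustrative numbers.

def decompose(num_cells: int, num_parts: int) -> list[range]:
    """Return one contiguous range of cell indices per partition."""
    base, extra = divmod(num_cells, num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        size = base + (1 if p < extra else 0)  # spread the remainder evenly
        parts.append(range(start, start + size))
        start += size
    return parts

# Example: 1,000,000 cells split across 96 ranks (illustrative numbers).
for rank, cells in enumerate(decompose(1_000_000, 96)[:3]):
    print(f"rank {rank}: cells {cells.start}..{cells.stop - 1} ({len(cells)} cells)")
```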