SUN GRID ENGINE & SGE/EE: A CLOSER LOOK
Carlo Nardone, HPC Consultant, Sun Microsystems, GSO
Agenda
  Sun and Grid Computing
  Sun Grid Engine: Architecture
  Campus Grid Model: Sun Grid Engine Enterprise Edition
  Global Grid Model & Interoperability

Grid Computing: A New Computing Utility Model
  Problem-solving through resource pooling in virtual systems
  Virtualization of resources into a dynamic, single compute resource
  Transparent scalability from federated assets: CPU cycles, storage, devices
  Access that is dependable, consistent, pervasive, inexpensive
The Grid Computing Solution
  Breaks boundaries
  Brings resource diversity/scalability
  Enables efficient use of resources
  Optimal development environment for users
  Paves the way to hosting and outsourcing compute on demand

Grid Computing Models: Cluster Grids
  Usage
    Simplest grid deployment
    Single team: project, department
    Single site firewall
  Benefit
    Optimal alignment of resources, tasks and budgets
Grid Computing Models: Campus Grids
  Usage
    Multiple teams in organizations share one or more Cluster Grids
    Single site to enterprise-wide
  Benefit
    Maximum ROI and utility

Grid Computing Models: Global Grids
  Usage
    Linked Cluster and Campus Grid Models across many organizations
    Typically used for research
  Benefit
    Creates large virtual system
    Facilitates collaboration between organizations
Grid@Sun Timeline
  1985: "The Network is the Computer"
  1992+: GridEngine Product Family (Genias, Gridware)
  1995+: Java, JINI, JXTA...
  1996+: EU Grid Projects (Eroppa, Unicore, Autobench, Julius...)
  2000: Acquisition of Gridware, Sun Grid Engine 5.2 for Solaris
  2001: Sun Grid Engine for Linux, AIX, Tru64, Irix, HP-UX
  2001: SGE Open Source, GGF DRMAA Standard
  2001: SGE Enterprise Edition beta / Campus Broker
  Grid Software Stack: TCP, SGE, Broker, HPCClusterTools, iPlanet, SunMC, JXTA, SRM, DevKits (Forte, SunMC)
  Sun Grid Steering Committee, Sun Grid Advisory Board
  ~4000 Cluster & Campus Grids powered by Grid Engine

Key Software Technologies for the Grid
  Cluster Grid Infrastructure
  Sun HPC Cluster Tools
  Forte tools
  Sun MC
  Web Interface
  iPlanet Technical Computing Portal
  Campus Grid Infrastructure
  Sun Grid Engine Enterprise Edition
  Grid Broker *
  Distributed Resource Management
  Sun Grid Engine Family
  Solaris Operating Environment
  Global Grid Infrastructure
  Globus Toolkit *
  Avaki *
  Sun Enterprise and Sun Fire Servers
  Sun StorEdge Systems and HPC SAN
  Sun's Ultra, Blade Desktops and Sun Ray Information Appliances
  * Available from partners, non-Sun products/research toolkit
  * Under development from Sun
Grid Computing Adoption Steps
  CLUSTER GRID MODEL: Single Team, Single Organization
  CAMPUS GRID MODEL: Multiple Teams, Single Organization
  GLOBAL GRID MODEL: Multiple Teams, Multiple Organizations
  Academic & Research, Business

Sun Grid Engine
  AGGREGATES THE COMPUTE POWER OF ALL RESOURCES AND DELIVERS COMPUTE POWER AS A NETWORK SERVICE
  Diagram: User, Jobs, Sun Grid Engine, Dispatch, Results
  Resource selection: e.g. most important job first, most expensive license onto fastest machine, first in first out, job-specific control,...
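The resource-selection examples above ("most important job first", "first in first out", fastest machine) can be sketched as a simple ordering rule. This is a minimal illustration only, not Sun Grid Engine's scheduler: the `Job` and `Machine` classes, the `priority` and `speed` fields and all job and machine names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_submit_order = count()  # hypothetical submission counter, gives FIFO order


@dataclass
class Job:
    name: str
    priority: int = 0  # assumption: a larger value means a more important job
    seq: int = field(default_factory=lambda: next(_submit_order))


@dataclass
class Machine:
    name: str
    speed: float  # assumption: relative speed rating, larger is faster


def dispatch_order(jobs):
    """Most important job first; equal priorities fall back to first in, first out."""
    return sorted(jobs, key=lambda j: (-j.priority, j.seq))


def place(jobs, machines):
    """Pair jobs, in dispatch order, with machines from fastest to slowest."""
    fastest_first = sorted(machines, key=lambda m: m.speed, reverse=True)
    return [(j.name, m.name) for j, m in zip(dispatch_order(jobs), fastest_first)]


jobs = [Job("render"), Job("simulate", priority=5), Job("report")]
machines = [Machine("slow", 1.0), Machine("fast", 4.0), Machine("mid", 2.0)]
print(place(jobs, machines))
```

Here the high-priority job lands on the fastest machine, and the two default-priority jobs keep their submission order.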
Host Types
  Master-Host, Exec-Host, Submit-Host, Admin-Host
  Daemons (each marked optional or mandatory per host type): sge_commd, sge_shadowd, sge_schedd, sge_execd, sge_qmaster

Architecture
  Clients: Qmon, Qsub, Qconf, Qrsh, Qtcsh, Qmake, Qmod, Qrls, Qhost, Qacct
  GDI (internal interface)
  Daemons: Execd, Shepherd, Commd, Qmaster, Schedd, Shadowd
  O/S: Solaris, Linux, other Unix,...
Information Flow
  1) Submit
  2) Notify
  3) Job Placement
  4) Dispatch
  5) Load Report
  6) Control
  7) Inform when done
  8) Record
  Components: qsub, qmaster, Schedd, Execd (several), accounting

Sun Cluster Grid Model Solution
  Maximize resources for single projects, teams, departments
  Prioritize jobs
  Manage jobs from start to finish
  SUN GRID ENGINE POWERS MORE THAN 118,000 CPUs WORLDWIDE
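The eight numbered steps of the Information Flow slide can be traced with a toy model. This is a sketch under stated assumptions, not the real daemons: the classes only borrow the component names from the slide, the direction of each message is one reading of the diagram, and the least-loaded placement rule, host names and load values are hypothetical.

```python
class Execd:
    """Stand-in for an execution daemon on one host."""

    def __init__(self, host, load=0):
        self.host, self.load = host, load

    def start(self, job):
        self.load += 1

    def finish(self, job):
        self.load -= 1
        return f"{job} done on {self.host}"

    def load_report(self):
        return self.host, self.load


class Schedd:
    """Stand-in scheduler; assumption: it simply picks the least loaded host."""

    def place(self, job, loads):
        return min(loads, key=loads.get)


class Qmaster:
    def __init__(self, schedd, execds):
        self.schedd = schedd
        self.execds = {e.host: e for e in execds}
        self.loads = dict(e.load_report() for e in execds)  # last known loads
        self.accounting = []
        self.trace = []

    def submit(self, job):
        self.trace.append("1) submit")           # job arrives from qsub
        self.trace.append("2) notify")           # scheduler is told about it
        host = self.schedd.place(job, self.loads)
        self.trace.append("3) job placement")    # scheduler returns a host
        execd = self.execds[host]
        execd.start(job)
        self.trace.append("4) dispatch")         # job is sent to that host
        reported_host, load = execd.load_report()
        self.loads[reported_host] = load
        self.trace.append("5) load report")      # host reports its new load
        self.trace.append("6) control")          # placeholder: no action modeled
        message = execd.finish(job)
        self.trace.append("7) inform when done")
        self.accounting.append((job, host))
        self.trace.append("8) record")           # accounting entry is written
        return message


qmaster = Qmaster(Schedd(), [Execd("node1", load=2), Execd("node2", load=0)])
print(qmaster.submit("job-1"))
print(qmaster.trace)
```

In a real system load reports arrive periodically rather than once per job; the sketch collapses that into a single report so the trace follows the slide's numbering.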
Why Campus Grid Model?
  Untapped resources are available for everyone.
Campus Grid Model: Key Challenge
  Lack of Trust
  My resources won't be available when I need them.
  Untapped resources are available for everyone.

Distributed Resource Management
  ESSENTIAL COMPONENT FOR COMPUTE FARMS
  Priority Management
  Policy Management
  Load Management
SGE, Enterprise Edition: Dynamic Scheduling
  Maintains active low-level control of workload during execution
  Supports multiple policies
  Keeps resource utilization aligned with policies
  Correlates all workload elements
  Responds to ad hoc needs

Sun Cluster Grid Model Solution: Policies and Monitoring
  Owners negotiate policies
  Automated tools enforce policies
  Exceptions for specific needs/events provide flexibility
  Monitoring ensures policies are enforced
Sun Grid Engine Enterprise Edition: Multiple Owners, One Location
  Campus Grid Model with multiple owners
  Diagram: Department resource demand for Project A; SGE / Enterprise Edition; Policies
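One way to picture "keeps resource utilization aligned with policies" for several owners is to compare each owner's negotiated share with its actual usage and favor whoever lags furthest behind. The sketch below is an illustration under that assumption, not the SGE Enterprise Edition algorithm; the function name, department names, share values and usage figures are all hypothetical.

```python
def most_underserved(shares, usage):
    """Return the owner whose actual usage lags furthest behind its entitlement."""
    total_shares = sum(shares.values())
    total_usage = sum(usage.values()) or 1  # avoid dividing by zero when idle

    def deficit(owner):
        entitled = shares[owner] / total_shares        # negotiated fraction
        actual = usage.get(owner, 0) / total_usage     # consumed fraction
        return entitled - actual

    return max(shares, key=deficit)


# Hypothetical policy: shares negotiated between three departments.
shares = {"dept_a": 50, "dept_b": 30, "dept_c": 20}
# Hypothetical CPU-hours consumed so far.
usage = {"dept_a": 700, "dept_b": 250, "dept_c": 50}

print(most_underserved(shares, usage))
```

With these numbers dept_a has used more than its share, so the next free resource would go to dept_c, which is furthest below its entitlement.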
SGE Interoperability with Globus
  DELIVERING THE GLOBAL GRID MODEL
  Joint effort with Globus team announced at SC2001
  Demo'd at Argonne National Lab (ANL), Army Research Lab (ARL), Raytheon, and San Diego (SDSC)
  Globus/SGE interaction through GRAM (Globus Resource Allocation Manager) scripts
  Globus jobs from ANL submitted to ARL cluster
  Next step: SGE/EE on top of Globus
    SGE acting as the resource broker for Globus
    Globus: multi-site communication, authentication, security, file transfers,...
    SGE/Broker submits jobs to remote systems and tracks them using Globus services
SGE Interoperability with Avaki
  DELIVERING THE GLOBAL GRID MODEL
  Avaki over multiple SGE instances (knits together resources)
  SGE does cluster management within an admin domain and FS area
  Diagram: Organizations A, B and C, each running SGE, hold data items a to h (including proprietary data); the data is mapped and made available through the Avaki Data Grid; clients view: Avaki HPC, / a b c d e f g h

Sun Powers the Grid
  Complete suite of Grid software-stack components
  Over 3000 SGE grids worldwide
  Collaboration with key technology providers (Globus, Legion/Avaki, Cactus, Punch...)
  Open source, open standards
    Grid Engine Projects: www.gridengine.sunsource.net
    ClusterTools community source access
    Forte/NetBeans development tools
    DRMAA standard initiative within Global Grid Forum: Distributed Resource Management Application API
      Application portability across compliant DRM systems
For further info
  Wolfgang.Gentzsch@sun.com (Director of Grid Computing @ Sun)
  www.sun.com/grid
  www.sun.com/gridware
  www.gridengine.sunsource.net
  www.gridforum.org
  www.globus.org
  www.avaki.com
  CARLO NARDONE, carlo.nardone@sun.com
Additional slides
  "Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations."
  I. Foster, C. Kesselman, S. Tuecke, "The Anatomy of the Grid", Int. J. Supercomp. Appl., 2001.
Sun / Gridware Timeline
  CODINE & GRD from Genias GmbH since 1993, Gridware Inc.
  Aug 2000: Sun acquires Gridware
  Sept 2000: Sun launches Sun Grid Engine (formerly CODINE) as free download
  July 2001: SGE goes Open Source
  Nov 2001: SGE Enterprise Edition announced (formerly GRD)
  Nov 2001: more than 12,000 downloads, more than 118,000 CPUs (3,000 grids) under SGE management
  Ongoing integration with Sun SW stack & with Global Grid toolkits (Globus, Legion/Avaki...)

DRM Product Space
  Chart labels: Functionality, Advanced RMS, GRID Computing ++, Sun Grid Engine Enterprise Edition, Sun Grid Engine, Load Management, Standard, Capability
Job Types
  Job types, a mixture of:
    Batch
    Interactive (qsh, qrsh, qlogin)
    Parallel (mpi, pvm, qmake,...)
    Checkpointing (CPR, Hibernator, Unicos, user defined,...)
    Array Jobs (unlimited size, massive scalability)
    Transfer (to other cluster/queuing systems)
  Submitted with a request profile
  Dynamically changeable while pending

Queues
  Where a job executes
  Job class container/description
  Bound to a host
  Queue slots = number of concurrently executing jobs
  Different queue types: batch/interactive...
  Queues have attributes (e.g. available memory)
  Users can be owners of queues
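The queue properties listed above (bound to a host, a fixed number of slots, a type, attributes that a job's request profile is matched against) map naturally onto a small data structure. The following is a minimal sketch under those assumptions; the `Queue` class, the `mem_mb` attribute and the queue, host and job names are hypothetical, and real request matching is richer than a numeric comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    host: str                      # a queue is bound to one host
    slots: int                     # number of concurrently executing jobs
    qtype: str = "batch"           # e.g. "batch" or "interactive"
    attributes: dict = field(default_factory=dict)
    running: list = field(default_factory=list)

    def can_run(self, request):
        """True if a slot is free and every requested attribute is satisfied."""
        if len(self.running) >= self.slots:
            return False
        return all(self.attributes.get(k, 0) >= v for k, v in request.items())

    def start(self, job, request):
        """Occupy a slot for the job if its request profile fits this queue."""
        if not self.can_run(request):
            return False
        self.running.append(job)
        return True


q = Queue("short.q", "node1", slots=2, attributes={"mem_mb": 2048})
print(q.start("job-a", {"mem_mb": 1024}))  # fits: a slot is free, memory is enough
print(q.start("job-b", {"mem_mb": 4096}))  # rejected: asks for more memory
print(q.start("job-c", {}))                # fits: takes the second slot
print(q.start("job-d", {}))                # rejected: both slots are busy
```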
Complexes
  Queue Complex
    All attributes being queue related and requestable (in principle)
    Definition of attribute characteristics (e.g. data type)
  Host/Global Complex
    All parameters managed on a host/global level, e.g. load (memory), SW licenses, also total of queue slots
    Definition of attribute characteristics
  User Defined Complexes
    Free definition and grouping of additional attributes
    Definition of attribute characteristics

Consumables
  Capacity management for limited resources
    Available memory
    Free software licenses
    Available disk space
    Available network bandwidth...
  Cluster global -> host related -> queue specific (inheritance)
  Link with standard and user defined load sensors
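Capacity management for a consumable, together with the global -> host -> queue inheritance, can be sketched as follows. This is an illustration only: it assumes that a queue-level definition overrides a host-level one, which in turn overrides the cluster-global one, and the `Consumable` class, the `resolve` helper and the resource names `licenses` and `mem_mb` are hypothetical.

```python
class Consumable:
    """A limited resource with a fixed capacity, e.g. licenses or memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used

    def reserve(self, amount):
        """Book `amount` units if available; return whether it succeeded."""
        if amount > self.free():
            return False
        self.used += amount
        return True

    def release(self, amount):
        """Give units back, never dropping below zero in use."""
        self.used = max(0, self.used - amount)


def resolve(name, queue_level, host_level, global_level):
    """Find a consumable, letting the most specific level win (assumed rule)."""
    for level in (queue_level, host_level, global_level):
        if name in level:
            return level[name]
    raise KeyError(name)


global_level = {"licenses": Consumable(3)}   # defined once for the cluster
host_level = {"mem_mb": Consumable(4096)}    # defined per host
queue_level = {"mem_mb": Consumable(1024)}   # queue overrides the host value

licenses = resolve("licenses", queue_level, host_level, global_level)
print(licenses.reserve(2))   # first job takes two of three licenses
print(licenses.reserve(2))   # refused: only one license is left
licenses.release(2)
print(licenses.reserve(2))   # succeeds again after the release
print(resolve("mem_mb", queue_level, host_level, global_level).capacity)
```

A load sensor could feed such an object by updating `used` from measured values instead of from reservations.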