OSG Operational Infrastructure
|
|
|
- Cori Gaines
- 10 years ago
- Views:
Transcription
1 OSG Operational Infrastructure December 12, 2008 II Brazilian LHC Computing Workshop Rob Quick - Indiana University Open Science Grid Operations Coordinator
2 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 2
3 The Open Science Grid QuickTime and a decompressor are needed to see this picture.
4 What is the GOC? Modeled off of the hugely successful Global Research Network Operations Center ( The GRNOC services worldwide research and education networks such as Internet 2, NLR, AMPATH, TransPAC, MAN LAN, and others with highly responsive network support. Collocated with the GOC to leverage 24x7 support. Leveraged other systems such as ticketing, internal monitoring, people and knowledge of Operations Centers 6 FTE Effort (4 OSG and 2 IU) Split into separate Support and Service Infrastructure groups. OSG is a distributed grid, so operational effort is not entirely at IU, FNAL, BNL, and others offer contributions Hardware located at IU Bloomington, IU Indianapolis, and FNAL Services are run locally but usually borrowed from other sources Some bridging development is necessary for these services 4
5 What do we do? Support Services User, Administrator, Application Developer, and Other Collaborator Support Problem Tracking and Troubleshooting Communication Hub Participation in just about every aspect of OSG Hardware Services Physical Facility Infrastructure System Administration Testing Environments Compute Services Information Services Monitoring Services Communication Services Administrative Services Metric and Reporting Services Collaborative Documentation Services 5
6 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 6
7 OSG Support Services User, Administrator, Application Developer, and Other Collaborator Support 24x7 Problem Reporting Webform Submission ( Phone ( ) (goc, support, abuse, incident, Emergency Response to Security Issues and Critical Service Outages A GOC Engineer is contact 24x7 While ticket creation is 24x7 unless there is an security incident or a critical service is down most response is postponed until the next business day Problem Tracking and Troubleshooting Footprints Trouble Ticketing System (Leveraged from the GRNOC) GOC Support members gather information and troubleshoot problems received and either find solutions or pass them to the responsible party in the OSG organization ~6200 Tickets in since the OSG was formed, ~70 open tickets being worked at any given time. VO Support Centers handle most VO-centric issues and a lot of times the GOC only acts as a conduit for exchange tickets Automated ticket exchange is set up with WLCG GGUS, BNL RT, FNAL Remedy, VDT RT, and other collaborators 7
8 OSG Support Services (Continued) Communication Hub Community Notifications News RSS ( Collaborative Documentation Twiki ( Participation in just about every aspect of OSG Integration Software and Tools Security WLCG Interoperability Education and Outreach Site Admins Group VO Users Group Storage Executive Board and Council 8
9 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 9
10 Hardware Services Physical Facility Infrastructure State of the Art Machine Room Facilities (Indianapolis and Bloomington) Power and Cooling (24x7 machine room operators on site) Institutional Backup System Institutional Security (Physical and Software) Network Facilities (GRNOC in house) System Administration One Full Time Administrator Approximately 25 Machines Mostly Dell PowerEdge RHEL 5.1 and Gentoo OS Testing Environments For each production instance of a service there is at least one integration or test instance Test Instances Moving to VMs 10
11 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 11
12 Compute Services (Information Services) BDII Defined as a Critical level service by CMS and ATLAS CEMon Client pushes raw GLUE Schema to collectors Data is put into BDII which allows the information to be collected via LDAP queries Currently running queries at about 60Hz, testing has show we should be OK up to ~5 times this level on one BDII instance DNS Roundrobining is currently being set up this will allow failover and load balancing Pushed up to WLCG top level BDII QuickTime and a decompressor are needed to see this picture. ReSS Housed at FNAL CEMon pushes raw GLUE data to the ReSS collector Uses Condor friendly class-ads 12
13 Compute Services (Monitoring Services) RSV Local test probes test basic grid functionality CondorCron Scheduler Gratia Collector Puts monitoring back into the hands of the Admin Results are published locally and to the GOC ATLAS and CMS results are forwarded to WLCG in a specified format Alpha version of central display tools at Allows individually filtered views VOMS Monitor Sends queries to VOMS machines to see if they are available Tabular display at VORS Serial heartbeat monitoring Soon to be deprecated Display at GIP Validator Tests for sanity of GIP data reported by resources Being redesigned by CMS 13
14 Compute Services (Other Services) Communication Services Footprints Trouble Ticket System Trouble Ticket Exchange Notification Tools News Feeds Webform Ticket Submissions Downtime Reporting Tools Weekly Operations Meeting Administrative Services OSG Information Management Database Software Caches Metric and Reporting Services Several daily reports about status of OSG Availability and Reliability Statistics to WLCG Trouble ticket metrics Collaborative Documentation Service TWiki 14
15 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 15
16 Lessons Learned Supporting a wide range of collaborators takes a wide range of support mechanisms. The number one cause of problems is solutions. While distributed hardware works well, some central coordinating and reporting services will always be necessary. There is always more work than effort. A centralized troubleshooting group can not find and solve every customer issue. Bad documentation can be worse than no documentation. Everyone has need of a different view of the OSG. (Admin, User, GOC Engineer, VO Coordinator, Support Center) Conway s Law: Any piece of software reflects the organizational structure that produced it. Any data that does not exist in two places, may not exist at all. Don t try to race up a mountain with Horst, he ll win Despite its seeming complexity, OSG does provide significant resources to the scientific community. And, even better than the compute resources, their human resources are 16
17 Lessons to be Learned The LHC turn up will create an explosion in the user base, how will the OSG infrastructure and support mechanisms handle this load. Can we effectively consolidate the information sources into a single display. How will collaborators embrace RSV? We can only report issues, we can not fix them. Can social networking technologies help build community? How will growth and expansion affect operations. 17
18 Contents Introduction to the OSG Operations Center (GOC) Support Services Hardware and Facility Services Compute Services Lessons Learned as Operations Coordinator Questions and Comments 18
Batch and Cloud overview. Andrew McNab University of Manchester GridPP and LHCb
Batch and Cloud overview Andrew McNab University of Manchester GridPP and LHCb Overview Assumptions Batch systems The Grid Pilot Frameworks DIRAC Virtual Machines Vac Vcycle Tier-2 Evolution Containers
Monitoring Clusters and Grids
JENNIFER M. SCHOPF AND BEN CLIFFORD Monitoring Clusters and Grids One of the first questions anyone asks when setting up a cluster or a Grid is, How is it running? is inquiry is usually followed by the
Virtualisation Cloud Computing at the RAL Tier 1. Ian Collier STFC RAL Tier 1 HEPiX, Bologna, 18 th April 2013
Virtualisation Cloud Computing at the RAL Tier 1 Ian Collier STFC RAL Tier 1 HEPiX, Bologna, 18 th April 2013 Virtualisation @ RAL Context at RAL Hyper-V Services Platform Scientific Computing Department
LHC schedule: what does it imply for SRM deployment? [email protected]. CERN, July 2007
WLCG Service Schedule LHC schedule: what does it imply for SRM deployment? [email protected] WLCG Storage Workshop CERN, July 2007 Agenda The machine The experiments The service LHC Schedule Mar. Apr.
CERN local High Availability solutions and experiences. Thorsten Kleinwort CERN IT/FIO WLCG Tier 2 workshop CERN 16.06.2006
CERN local High Availability solutions and experiences Thorsten Kleinwort CERN IT/FIO WLCG Tier 2 workshop CERN 16.06.2006 1 Introduction Different h/w used for GRID services Various techniques & First
Summer Student Project Report
Summer Student Project Report Dimitris Kalimeris National and Kapodistrian University of Athens June September 2014 Abstract This report will outline two projects that were done as part of a three months
The dcache Storage Element
16. Juni 2008 Hamburg The dcache Storage Element and it's role in the LHC era for the dcache team Topics for today Storage elements (SEs) in the grid Introduction to the dcache SE Usage of dcache in LCG
Global Grid User Support - GGUS - in the LCG & EGEE environment
Global Grid User Support - GGUS - in the LCG & EGEE environment Torsten Antoni ([email protected]) Why Support? New support groups Network layer Resource centers CIC / GOC / etc. more to come New
An objective comparison test of workload management systems
An objective comparison test of workload management systems Igor Sfiligoi 1 and Burt Holzman 1 1 Fermi National Accelerator Laboratory, Batavia, IL 60510, USA E-mail: [email protected] Abstract. The Grid
GMOC Service Desk Support. GMOC Experimenter Support Proposal GEC13-2012
GMOC Service Desk Support GMOC Experimenter Support Proposal GEC13-2012 Existing GMOC Service Desk Support Description and Overview Currently the GMOC Service Desk provides the following support for the
Managing and Maintaining Windows Server 2008 Servers (6430) Course length: 5 days
Managing and Maintaining Windows Server 2008 Servers (6430) Course length: 5 days Course Summary: This five-day instructor-led course provides students with the knowledge and skills to implement, monitor,
CONDOR CLUSTERS ON EC2
CONDOR CLUSTERS ON EC2 Val Hendrix, Roberto A. Vitillo Lawrence Berkeley National Lab ATLAS Cloud Computing R & D 1 INTRODUCTION This is our initial work on investigating tools for managing clusters and
Novell ZENworks Asset Management
Novell ZENworks Asset Management Administrative Best Practices and Troubleshooting www.novell.com APRIL 19, 2005 2 GETTING THE MOST OUT OF NOVELL ZENWORKS ASSET MANAGEMENT The award-winning asset tracking
ENABLING VIRTUALIZED GRIDS WITH ORACLE AND NETAPP
NETAPP AND ORACLE WHITE PAPER ENABLING VIRTUALIZED GRIDS WITH ORACLE AND NETAPP Generosa Litton, Network Appliance, Inc. Monica Kumar, Frank Martin, Don Nalezyty, Oracle March 2008 WP-7037-0208 EXECUTIVE
Why a Server Infrastructure Refresh Now and Why Dell?
Why a Server Infrastructure Refresh Now and Why Dell? In This Paper Outdated server infrastructure contributes to operating inefficiencies, lost productivity, and vulnerabilities Worse, existing infrastructure
Database Services for Physics @ CERN
Database Services for Physics @ CERN Deployment and Monitoring Radovan Chytracek CERN IT Department Outline Database services for physics Status today How we do the services tomorrow? Performance tuning
CEMON installation and configuration procedure
CEMON installation and configuration procedure Introduction Tanya Levshina, Rohit Mathur, Steve Timm Draft This document is intended for administrators responsible for installing and configuring ITB Compute
DSA1.4 R EPORT ON IMPLEMENTATION OF MONITORING AND OPERATIONAL SUPPORT SYSTEM. Activity: SA1. Partner(s): EENet, NICPB. Lead Partner: EENet
R EPORT ON IMPLEMENTATION OF MONITORING AND OPERATIONAL SUPPORT SYSTEM Document Filename: Activity: Partner(s): Lead Partner: Document classification: BG-DSA1.4-v1.0-Monitoring-operational-support-.doc
Grid Computing in Aachen
GEFÖRDERT VOM Grid Computing in Aachen III. Physikalisches Institut B Berichtswoche des Graduiertenkollegs Bad Honnef, 05.09.2008 Concept of Grid Computing Computing Grid; like the power grid, but for
Global Grid User Support - GGUS - start up schedule
Global Grid User Support - GGUS - start up schedule GDB Meeting 2004-07 07-13 Concept Target: 24 7 support via time difference and 3 support teams Currently: GGUS FZK GGUS ASCC Planned: GGUS USA Support
Leveraging Virtualization for Disaster Recovery in Your Growing Business
Leveraging Virtualization for Disaster Recovery in Your Growing Business Contents What is Disaster Recovery?..................................... 2 Leveraging Virtualization to Significantly Improve Disaster
ARDA Experiment Dashboard
ARDA Experiment Dashboard Ricardo Rocha (ARDA CERN) on behalf of the Dashboard Team www.eu-egee.org egee INFSO-RI-508833 Outline Background Dashboard Framework VO Monitoring Applications Job Monitoring
Designing and Deploying Messaging Solutions with Microsoft Exchange Server 2010 Service Pack 2 20465B; 5 days, Instructor-led
Designing and Deploying Messaging Solutions with Microsoft Exchange Server 2010 Service Pack 2 20465B; 5 days, Instructor-led Course Description This five-day, instructor-led course provides you with the
Converged and Integrated Datacenter Systems: Creating Operational Efficiencies
Insight Converged and Integrated Datacenter Systems: Creating Operational Efficiencies Rob Brothers IDC OPINION Enterprise IT has moved away from siloed servers, storage, information, and processes and
Data Center Knowledge, Vision Control
Data Center Knowledge, Vision Control Objective Overview of the progressive trends in Data Centers, driven by Intelligent Infrastructure Solutions Data Center Layout Secured Storage Back up Core Backbone
Monitoring Evolution WLCG collaboration workshop 7 July 2014. Pablo Saiz IT/SDC
Monitoring Evolution WLCG collaboration workshop 7 July 2014 Pablo Saiz IT/SDC Monitoring evolution Past Present Future 2 The past Working monitoring solutions Small overlap in functionality Big diversity
The dashboard Grid monitoring framework
The dashboard Grid monitoring framework Benjamin Gaidioz on behalf of the ARDA dashboard team (CERN/EGEE) ISGC 2007 conference The dashboard Grid monitoring framework p. 1 introduction/outline goals of
An Integrated CyberSecurity Approach for HEP Grids. Workshop Report. http://hpcrd.lbl.gov/hepcybersecurity/
An Integrated CyberSecurity Approach for HEP Grids Workshop Report http://hpcrd.lbl.gov/hepcybersecurity/ 1. Introduction The CMS and ATLAS experiments at the Large Hadron Collider (LHC) being built at
Report from SARA/NIKHEF T1 and associated T2s
Report from SARA/NIKHEF T1 and associated T2s Ron Trompert SARA About SARA and NIKHEF NIKHEF SARA High Energy Physics Institute High performance computing centre Manages the Surfnet 6 network for the Dutch
Techniques for implementing & running robust and reliable DB-centric Grid Applications
Techniques for implementing & running robust and reliable DB-centric Grid Applications International Symposium on Grid Computing 2008 11 April 2008 Miguel Anjo, CERN - Physics Databases Outline Robust
EXTEND YOUR FEDERATION ENTERPRISE HYBRID CLOUD SOLUTION
EXTEND YOUR FEDERATION ENTERPRISE HYBRID CLOUD SOLUTION Accelerate the transition to ITaaS The Federation Enterprise Hybrid Cloud solution establishes a sound foundation for delivering IT as a service
Best of Breed of an ITIL based IT Monitoring. The System Management strategy of NetEye
Best of Breed of an ITIL based IT Monitoring The System Management strategy of NetEye by Georg Kostner 5/11/2012 1 IT Services and IT Service Management IT Services means provisioning of added value for
CMS Tier-3 cluster at NISER. Dr. Tania Moulik
CMS Tier-3 cluster at NISER Dr. Tania Moulik What and why? Grid computing is a term referring to the combination of computer resources from multiple administrative domains to reach common goal. Grids tend
Forschungszentrum Karlsruhe in der Helmholtz - Gemeinschaft. Holger Marten. Holger. Marten at iwr. fzk. de www.gridka.de
Tier-2 cloud Holger Marten Holger. Marten at iwr. fzk. de www.gridka.de 1 GridKa associated Tier-2 sites spread over 3 EGEE regions. (4 LHC Experiments, 5 (soon: 6) countries, >20 T2 sites) 2 region DECH
An Oracle White Paper November 2010. Oracle Real Application Clusters One Node: The Always On Single-Instance Database
An Oracle White Paper November 2010 Oracle Real Application Clusters One Node: The Always On Single-Instance Database Executive Summary... 1 Oracle Real Application Clusters One Node Overview... 1 Always
Overcoming Backup & Recovery Challenges in Enterprise VMware Environments
Overcoming Backup & Recovery Challenges in Enterprise VMware Environments Daniel Budiansky Enterprise Applications Technologist Data Domain Dan Lewis Manager, Network Services USC Marshall School of Business
How Boston Scientific Lowered TCO of Credit Card Acceptance and PCI Compliance
How Boston Scientific Lowered TCO of Credit Card Acceptance and PCI Compliance Heidi Dallal, Boston Scientific Eric Bushman, VP, Solutions Engineering, Paymetric In This Session: Examine the innovative
Confidently Virtualize Business-critical Applications in Microsoft Hyper-V with Symantec ApplicationHA
WHITE PAPER: VIRTUALIZE BUSINESS-CRITICAL APPLICATIONS.............. WITH..... CONFIDENCE..................... Confidently Virtualize Business-critical Applications in Microsoft Hyper-V with Symantec ApplicationHA
Building and Managing a Standard Operating Environment
Building and Managing a Standard Operating Environment Dirk Herrmann Head of Strategic Consulting Central Europe, Red Hat Todd Warner Satellite Product Manager, Red Hat Milan Zázrivec Satellite Software
Das HappyFace Meta-Monitoring Framework
Das HappyFace Meta-Monitoring Framework B. Berge, M. Heinrich, G. Quast, A. Scheurer, M. Zvada, DPG Frühjahrstagung Karlsruhe, 28. März 1. April 2011 KIT University of the State of Baden-Wuerttemberg and
DSA1.5 U SER SUPPORT SYSTEM
DSA1.5 U SER SUPPORT SYSTEM H ELP- DESK SYSTEM IN PRODUCTION AND USED VIA WEB INTERFACE Document Filename: Activity: Partner(s): Lead Partner: Document classification: BG-DSA1.5-v1.0-User-support-system.doc
A (Web) Face for Radio. NPR and Drupal7 David Moore
A (Web) Face for Radio NPR and Drupal7 David Moore Who am I? David Moore Developer at NPR Using Drupal since 4.7 Focus on non-profit + Drupal CrookedNumber on drupal.org, twitter, etc. What is NPR? A non-profit
Solving the Top 5 Virtualized Application and Infrastructure Problems
Solving the Top 5 Virtualized Application and Infrastructure Problems By David Davis, vexpert Co-Founder, ActualTech Media May, 2015 Table of Contents Table of Contents... 2 Introduction... 3 Key Factors...
PULSE SECURE CARE PLUS SERVICES
DATASHEET PULSE SECURE CARE PLUS SERVICES Service Overview In today s dynamic marketplace, organizations are under constant pressure to meet market demand while maintaining or increasing return on investment.
CA Cloud Overview Benefits of the Hyper-V Cloud
Benefits of the Hyper-V Cloud For more information, please contact: Email: [email protected] Ph: 888-821-7888 Canadian Web Hosting (www.canadianwebhosting.com) is an independent company, hereinafter
Monitoring & Testing
Rivo provides a total monitoring, analysis, testing and reporting solution. Monitor environmental and other enterprise risk and performance metrics such as air, water and land waste/emissions. Monitor
Consulting Solutions Disaster Recovery. Yucem Cagdar
Consulting Solutions Disaster Recovery Yucem Cagdar Disaster Recovery Strategy How efficient is your DR Plan? Many are not prepared: 42% are not adequately armed with modern disaster recovery solutions,
NOS for Network Support (903)
NOS for Network Support (903) November 2014 V1.1 NOS Reference ESKITP903301 ESKITP903401 ESKITP903501 ESKITP903601 NOS Title Assist with Installation, Implementation and Handover of Network Infrastructure
Planning and Administering Windows Server 2008 Servers
Planning and Administering Windows Server 2008 Servers MOC6430 About this Course Elements of this syllabus are subject to change. This five-day instructor-led course provides students with the knowledge
Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies
Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer [email protected] Agenda Session Length:
Administration Guide. . All right reserved. For more information about Specops Inventory and other Specops products, visit www.specopssoft.
. All right reserved. For more information about Specops Inventory and other Specops products, visit www.specopssoft.com Copyright and Trademarks Specops Inventory is a trademark owned by Specops Software.
Server Monitoring: Centralize and Win
Server Monitoring: Centralize and Win Table of Contents Introduction 2 Event & Performance Management 2 Troubleshooting 3 Health Reporting & Notification 3 Security Posture & Compliance Fulfillment 4 TNT
ATLAS job monitoring in the Dashboard Framework
ATLAS job monitoring in the Dashboard Framework J Andreeva 1, S Campana 1, E Karavakis 1, L Kokoszkiewicz 1, P Saiz 1, L Sargsyan 2, J Schovancova 3, D Tuckett 1 on behalf of the ATLAS Collaboration 1
2014 Defense Health Information Technology Symposium Service Desk Strategy for the Defense Health Agency
Mr. Wayne Speaks - Acting Branch Chief Mr. Bill Novak - Acting Branch Chief of Operations DHA Infrastructure and Operations 2014 Defense Health Information Technology Symposium Service Desk Strategy for
HDA Integration Guide. Help Desk Authority 9.0
HDA Integration Guide Help Desk Authority 9.0 2011ScriptLogic Corporation ALL RIGHTS RESERVED. ScriptLogic, the ScriptLogic logo and Point,Click,Done! are trademarks and registered trademarks of ScriptLogic
Juniper Care Plus Services
Juniper Care Plus Services Service Overview In today s dynamic marketplace, organizations are under constant pressure to meet market demand while maintaining or increasing return on investment. IT departments
The Importance of Information Delivery in IT Operations
The Importance of Information Delivery in IT Operations David Williams Notes accompany this presentation. Please select Notes Page view. These materials can be reproduced only with written approval from
IP Address Management: Smoothing the Way to Cloud-Based Services
White Paper IP Address Management: Smoothing the Way to Cloud-Based Services What You Will Learn Cloud computing offers many operational advantages to service providers. An important element of successful
RaMP Data Center Manager. Data Center Infrastructure Management (DCIM) software
RaMP Data Center Manager Data Center Infrastructure Management (DCIM) software One solution for both IT and facilities managers As data centers become more complex, it becomes increasingly important to
Splunk for VMware Virtualization. Marco Bizzantino [email protected] Vmug - 05/10/2011
Splunk for VMware Virtualization Marco Bizzantino [email protected] Vmug - 05/10/2011 Collect, index, organize, correlate to gain visibility to all IT data Using Splunk you can identify problems,
Server Infrastructure Optimization
Best Practices to Reduce IT Operational Costs Abstract This paper shows technical decision makers and IT managers how organizations can reduce costs and improve their IT efficiency by optimizing their
Confidently Virtualize Business-Critical Applications in Microsoft
Confidently Virtualize Business-Critical Applications in Microsoft Hyper-V with Veritas ApplicationHA Who should read this paper Windows Virtualization IT Architects and IT Director for Windows Server
AdvancedHosting SM Solutions from SunGard Availability Services
AdvancedHosting SM Solutions from SunGard Availability Services A SINGLE POINT OF CONTACT A COMPLETE MANAGED SERVICES SOLUTION Higher levels of availability Continuous investment in people, technology
Virtual Patching: a Proven Cost Savings Strategy
Virtual Patching: a Proven Cost Savings Strategy An Ogren Group Special Report December 2011 Executive Summary Security executives, pushing the limits of traditional labor-intensive IT patch processes
BMC Remedy IT Service Management Suite 7.6.04 Installing and Configuring Server Groups
BMC Remedy IT Service Management Suite 7.6.04 Installing and Configuring Server Groups January 2011 www.bmc.com Contacting BMC Software You can access the BMC Software website at http://www.bmc.com. From
