Interconnecting network monitoring and ticketing systems at CERN
Gyorgy Balazs, Veronique Lefebure
SIG-NOC meeting @ Stuttgart, 08.04.2015
Network/Telecom environment
Multi-purpose, multi-vendor network infrastructure for:
- General connectivity
- Technical instruments
- WLCG
- Internet Exchange Point (CIXP)
Key figures:
- 3 distinct 10-100 gigabit backbones
- 2 datacenters (Geneva + Budapest)
- 150+ high-performance routers
- 3 700+ subnets
- 3 000+ switches
- 50 000 active user devices
- 80 000 sockets
- 5 000 km of UTP cable
- 400+ starpoints (from 20 to 1 000 outlets)
- 5 000 km of CERN-owned fibers
- 500 Gbps of WAN connectivity
Telecom services:
- 12 000 fixed telephone lines (analogue, ISDN, IP, Lync)
- 6 000 mobile phones on partly CERN-operated infrastructure
- 300 TETRA digital radio handsets on CERN-managed infrastructure
Extremely dynamic environment:
- 2x more visitors than staff
- 1 500 connection and change requests / month
NOC structure (network and telecom incident management)
Standard incidents and user support for CERN users (on-site outsourcing):
- IT Helpdesk (network)
- Service Desk
- Field technicians
- Switchboard & telecom lab (telephony)
Service monitoring and advanced support:
- CERN accelerator and experiment control rooms
- Network and Telecom Operations
- Computer Centre operator (24/7)
Advanced incident and problem management:
- Network engineer teams
- Telecom engineer teams
External support and collaboration:
- External entities (vendors, CIXP, LHCOPN...)
- External entities (vendors, telecom operators)
Infrastructure monitoring
Monitoring tool: CA Spectrum, fed by the in-house developed NMS.
Used for:
- Monitoring network & telecom devices and related servers, PDUs, temperature sensors, and selected hosts for experiment instruments
- Tracking alarms and sending notifications
- Collecting data for statistics
Service management / ticketing
Integrated ITSM tool: ServiceNow.
Used for:
- User incidents and requests
- Intra-NOC ticketing
- Knowledge base
- Change management
- Intervention planning
- Service catalogue and portal
- Service status board
- OWH support calendars
- Reporting (e.g. SLA tracking)
Why interconnect monitoring + ticketing?
Automatic mapping of ticket ID to alarm:
- Live tracking of the troubleshooting process from the alarm screen
- Easier updates of the service status board to keep users informed
- Documented solution accessible from the alarm history
Automatic ticket creation:
- Automated incident assignment to support groups
- Ensures analysis of even short outages for long-term improvement
- Procedures placed directly in the ticket based on alarm and equipment type
- No more copy-pasting when creating tickets
- Automatic clock start for SLA tracking and post-mortem analysis
Interconnection via centrally managed message broker interface
- Ticket automatically created when an alarm shows up
- Alarm contains a link to the corresponding ticket
- Ticket contains:
  - Incident type
  - Device(s) concerned
  - Service concerned
  - Link to the corresponding procedure
  - Device alarm history with ticket references
- Alarms are grouped into a single ticket by service and geolocation
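The grouping rule above (one ticket per service and geolocation, so a single outage opens a single incident) can be sketched as follows. This is a minimal illustration, not the actual CERN implementation: the field names (`service`, `geolocation`, `device`) and the `group_alarms` helper are assumed names.

```python
from collections import defaultdict

def group_alarms(alarms):
    """Group alarms into prospective tickets by (service, geolocation).

    Alarms sharing both keys land in the same ticket, which avoids
    flooding support groups with one ticket per affected device.
    (Field names are illustrative, not Spectrum's actual schema.)
    """
    tickets = defaultdict(list)
    for alarm in alarms:
        key = (alarm["service"], alarm["geolocation"])
        tickets[key].append(alarm)
    return dict(tickets)

# Two switch alarms in the same building collapse into one ticket,
# while the router alarm elsewhere gets its own.
alarms = [
    {"device": "sw-513-1", "service": "GPN", "geolocation": "B513"},
    {"device": "sw-513-2", "service": "GPN", "geolocation": "B513"},
    {"device": "rtr-874-1", "service": "TN", "geolocation": "B874"},
]
grouped = group_alarms(alarms)
# → 2 groups: ("GPN", "B513") with 2 alarms, ("TN", "B874") with 1
```

The key choice here mirrors the granularity trade-off discussed later: too coarse a key floods one ticket with unrelated alarms, too fine a key misses the common cause.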
How does it work?
- Spectrum's SpectroServer raises an alarm; the Alarm Notifier (with filters) hands it to a script
- The script publishes the alarm via a STOMP call to the Message Broker
- On the ServiceNow side, alarms are grouped into incidents and an INC ticket is created via the REST API
- Return messages carrying the ticket ID are picked up by a listener script and written back into Spectrum
- Notifications go out by SMS and e-mail
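A minimal sketch of the message body such a notifier script might publish to the broker, assuming a JSON payload; the field names, message type, and `build_incident_message` helper are assumptions for illustration, not the deck's actual interface. The real chain would then STOMP-SEND this body to the broker, where the ServiceNow consumer creates the INC and replies with the ticket ID.

```python
import json

def build_incident_message(alarm):
    """Build the JSON body a notifier script could publish to the
    message broker for ticket creation. The schema shown here is
    hypothetical; it only mirrors the fields the deck says a ticket
    contains (incident type, device, service, procedure link)."""
    return json.dumps({
        "type": "network.alarm",
        "incident_type": alarm["cause"],
        "device": alarm["device"],
        "service": alarm["service"],
        "procedure_url": alarm.get("procedure_url"),
    })

# One alarm from the Alarm Notifier, serialized for the STOMP call.
msg = build_incident_message({
    "cause": "DEVICE_UNREACHABLE",
    "device": "rtr-cc-01",
    "service": "GPN",
})
```

Keeping the broker in the middle (rather than calling the ServiceNow REST API directly from the notifier) decouples the two systems: Spectrum keeps raising alarms even when ServiceNow is briefly unavailable, and the replies carrying ticket IDs travel back over the same channel.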
Challenges
- Granularity of alarm grouping: not to flood, not to miss
- Mapping of alarms to:
  - Support groups
  - Procedures
  - Corresponding alarm history
- Granularity of procedures: per device, service, or support group?
- Automatic update of the ticket when the device status changes
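The mapping challenge above is essentially a routing table from alarm characteristics to a support group and a procedure. A sketch under assumed names (the table entries, group names, and knowledge-base IDs are all hypothetical):

```python
# Illustrative routing table: (alarm type, device class) ->
# (support group, procedure ID). Entries are made up for the example;
# the real mapping and its granularity are exactly what the deck
# lists as an open challenge.
ROUTING = {
    ("DEVICE_UNREACHABLE", "switch"): ("network-ops", "KB0001"),
    ("DEVICE_UNREACHABLE", "router"): ("network-engineers", "KB0002"),
    ("HIGH_TEMPERATURE", "pdu"): ("facilities", "KB0003"),
}

def route_alarm(alarm_type, device_class):
    """Return (support_group, procedure_id) for an alarm, falling
    back to a default dispatcher when no specific mapping exists,
    so no alarm is silently dropped."""
    return ROUTING.get((alarm_type, device_class), ("noc-dispatch", None))
```

The explicit fallback is one way to resolve the "not to miss" side of the granularity trade-off: unknown combinations still produce a ticket, just an unrouted one.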
Results so far
- Ticket dispatching could be outsourced to the Computer Centre operator (24/7): no specialization needed, the procedure is in the ticket
- Improved response-time consistency: the time between the incident and ticket assignment to the field technician is precisely controlled
- The network operations team is refocusing on proactive instead of reactive tasks (post-mortem analysis, indicative monitoring events), with visible improvement in detecting anomalies
- Engineers can check resolution status directly from the alarm screen or alarm history