Software-Defined Multi-Domain Performance Monitoring
Prasad Calyam, Ph.D.
calyamp@missouri.edu
perfSONAR FTW @ OARnet, January 2015
Topics of Discussion
- Today's Applications and Network Monitoring
  - Needs from domain application users
  - Needs for network operators
- Multi-domain Monitoring Challenges
- Solution: Science DMZ Use Cases
  - PhysicalTherapy-as-a-Service
  - SoyKB
- Conclusion
History (2001-2002): H.323 Beacon, Internet2 End-to-End Performance Initiative, E2E piPEs
[Figure: H.323 Beacon deployment. Beacon Clients at Los Angeles and Chicago; Beacon Servers at the Sioux Falls, Los Angeles, Tampa, and Chicago POPs and at Austin, Texas.]
- Scenario: a network administrator located at Los Angeles wants to have a video conference session with a network administrator located at Chicago.
- The end user / network administrator at Los Angeles requests H.323 session statistics data from Chicago to Los Angeles in web format.
- The end user / network administrator at Chicago requests H.323 session statistics data from Los Angeles to Chicago in web format.
- H.323 statistics include video frame rate, round-trip time, throughput, video jitter, audio jitter, and packet loss.
History
The number of deployments today is significant, and we need to celebrate! :)
[Figure: deployments as of January 2014]
Data-intensive Applications Today
Science DMZ
- A new trend is emerging across university campuses to deploy Science DMZs (demilitarized zones)
- Supports science drivers that involve, e.g., data-intensive applications needing access to remote instruments or public clouds
- Friction-free flow acceleration using a gatekeeper proxy, bypassing the enterprise firewall
[Figure: Science DMZ overview. Labels: Normal Application, Web Application, Science Application, Campus Network, Campus Access Network, IP Network, Gatekeeper Proxy Middleware, Direct Connect Network, Public Cloud, Instrument Site on Campus, Extended VLAN Overlay, Software-Defined Network, Remote Collaborator]
The Future looks more interesting
User Need for Network Monitoring: ADTS Case Study
Advanced Data Transfer Service (ADTS) FTP* for data movement in data-intensive applications
- Uses the RoCE (RDMA over Converged Ethernet) protocol
- High-speed application-layer protocol that is highly sensitive to packet loss
- Does not use congestion-control schemes
- A Ferrari car for broadband network roads
- Direct memory-to-memory transfer
*P. Calyam, A. Berryman, E. Saule, H. Subramoni, P. Schopis, G. Springer, U. Catalyurek, D. K. Panda, Wide-area Overlay Networking to Manage Accelerated Science DMZ Flows, IEEE International Conf. on Computing, Networking and Communications (ICNC), 2014.
We inadvertently rate-limit our networks to run 10x slower!
Problem: Static Network Policy Management
Case of a Google Fiber Home in Kansas City
40 Mbps performance on a 1 Gbps access network connection!!
Account # xxxxxxxxxxxxx, phone call with customer:
- Tested @ 2:38 PM 1/8/2015: 91/92 (HP)
- Tested @ 2:40 PM 1/8/2015: 87/91 (Asus)
- Tested @ 2:42 PM 1/8/2015: 90/86 (HP)
- Tested @ 2:44 PM 1/8/2015: 88/66 (Asus)
- Checked Speed & Duplex settings of HP device. Changed to Full 1 GBPS setting.
- Tested @ 2:49 PM 1/8/2015: 91/92 MBPS (HP)
- Restarted HP b/c it appeared that Speed & Duplex settings didn't take effect.
- Tested @ 2:54 PM 1/8/2015: 94/92 MBPS (HP).
- CX advised me that he has a 3rd party router installed in between the NB and his PC's. CX asked if this would cause issues. I educated CX that it would b/c some routers will limit speeds to 100 MBPS.
- I requested the CX bypass the 3rd party router and HW directly to the NB. Only the Asus device could be HW'd directly to the NB. The Asus device is not GIG capable, and would not provide best results.
Multi-domain Network Monitoring
Network Operator Need for Network Monitoring
What does the red mean?
Need for performance intelligence: end-to-end visibility across multi-domain paths
Integrated Application and Network Monitoring: Users and Network Operators together
- Closely integrated: we use system and network measurements to infer application-level performance
- In isolation: we also use application performance independently to validate system and network performance
- Need for Programmability: ability for flexible measurement scheduling (analogous to SDN)
- Need for Extensibility: ability to add custom application/network metrics
OnTimeDetect Tool: Anomaly Detection with the perfSONAR framework
Leverages Uncorrelated (APD scheme) and Correlated (Topology-aware scheme) Anomaly Detection
OnTimeDetect Tools integration into DOE infrastructure
- ESnet perfSONAR Nagios-plugin for network operations
- Nagios-plugin for RACF perfSONAR Dashboards (e.g., for US Atlas)
- E-Center Anomaly Detection Service (ADS) for DOE lab sites monitoring
Source Code: http://anonsvn.internet2.edu/svn/perfsonar-ps/branches/osc-apd-Nagios/perfSONAR_PS-Nagios/
Integration in ESnet's perfSONAR Nagios Plugins

Usage: check_apd_throughput.pl
    -u --url <service-url>
    -s --source <source-addr>
    -d --destination <dest-addr>
    -b --bidirectional
    -r <number-seconds-in-past>
    -z --sensitivity <Value>
    -W --swc <window-size>
    -w --elevation1 <elevation1-threshold>
    -c --elevation2 <elevation2-threshold>
    -a --algorithm <algorithmselection>
    -o --output-file

Sample Output:
    ./check_apd_throughput.pl -u http://bnl-pt1.es.net:8085/perfsonar_ps/services/psb -r 36000000 -w 0.2 -c 0.5 -s 198.124.238.38 -d 198.129.254.58
    PS_CHECK_THROUGHPUT CRITICAL - Metric is Throughput Source: 198.124.238.38 Destination:198.129.254.58 {Critical{1304565397:1.65043e+08Gbps};Warning{1304531930:1.72025e+08Gbps};} TotalDatum(ForwardDirection)=200;; OK=178;; WARNING=1;; CRITICAL=1;;
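A status line like the sample output above is easier to feed into a dashboard once it is split into fields. The following is a minimal Python sketch of such a parser, not part of the plugin itself. The function name `parse_apd_status` and the returned dictionary layout are hypothetical, and the sample string assumes the gap in "1.65043e +08" on the slide is an extraction artifact.

```python
import re

# Sample line copied from the slide, with the gap in "1.65043e +08" closed
# (an assumption about what the plugin originally printed).
SAMPLE = (
    "PS_CHECK_THROUGHPUT CRITICAL - Metric is Throughput "
    "Source: 198.124.238.38 Destination:198.129.254.58 "
    "{Critical{1304565397:1.65043e+08Gbps};"
    "Warning{1304531930:1.72025e+08Gbps};} "
    "TotalDatum(ForwardDirection)=200;; OK=178;; WARNING=1;; CRITICAL=1;;"
)


def parse_apd_status(line):
    """Split one plugin status line into status, events, and counters."""
    status = re.search(r"PS_CHECK_THROUGHPUT\s+(\w+)", line)
    # Each event looks like Critical{<unix-timestamp>:<value>Gbps}
    events = [
        {"level": level.upper(), "timestamp": int(ts), "value": float(value)}
        for level, ts, value in re.findall(
            r"(Critical|Warning)\{(\d+):([0-9.eE+-]+)Gbps\}", line
        )
    ]
    # Counters look like OK=178;; WARNING=1;; CRITICAL=1;;
    counts = {
        name: int(number)
        for name, number in re.findall(r"\b(OK|WARNING|CRITICAL)=(\d+)", line)
    }
    total = re.search(r"TotalDatum\(ForwardDirection\)=(\d+)", line)
    return {
        "status": status.group(1) if status else None,
        "events": events,
        "counts": counts,
        "total": int(total.group(1)) if total else None,
    }


result = parse_apd_status(SAMPLE)
print(result["status"], result["counts"])
```

The values are kept exactly as printed (including the "Gbps" label); the sketch does not try to reinterpret units.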
Adaptive Plateau Detector (APD)
- APD detects uncorrelated anomalies in measurement data obtained through perfSONAR Measurement Archive (MA) web services
- Performs better than using Static Plateau Detection (SPD) thresholds
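The contrast between a static threshold and one derived from recent history can be illustrated with a small Python sketch. This is not the APD algorithm, which the slides do not spell out. It is a simplified stand-in that flags a sample when it deviates from the mean of a sliding window by more than a multiple of that window's standard deviation. The names `window_size`, `sensitivity`, and `min_spread` are hypothetical and only loosely echo the plugin options, and the throughput samples are made up for illustration.

```python
from statistics import mean, pstdev


def static_anomalies(samples, floor):
    """Flag every sample below one fixed threshold."""
    return [i for i, x in enumerate(samples) if x < floor]


def adaptive_anomalies(samples, window_size=5, sensitivity=3.0, min_spread=1.0):
    """Flag samples that deviate strongly from their own recent history."""
    flagged = []
    for i in range(window_size, len(samples)):
        window = samples[i - window_size:i]
        mu = mean(window)
        # min_spread keeps the band from collapsing on a perfectly flat window
        sigma = max(pstdev(window), min_spread)
        if abs(samples[i] - mu) > sensitivity * sigma:
            flagged.append(i)
    return flagged


# Hypothetical throughput samples (Mbps) on a path that normally runs near 90
slow_path = [90, 91, 89, 90, 92, 90, 45, 91, 90, 89]

print(static_anomalies(slow_path, floor=500))  # a threshold tuned for a faster path
print(adaptive_anomalies(slow_path))
```

With a floor chosen for a much faster path, the static check flags every sample on this path, while the windowed check flags only the single drop to 45 Mbps. That is the kind of false-alarm reduction the slide attributes to adaptive thresholds.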
Topology-Aware Correlated Anomaly Detection
Common Hop Analysis: Common Hop Matrix (>65%)
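One way to picture a common hop matrix is as a pairwise comparison of the router hops on each measured path. The Python sketch below is an illustration under stated assumptions, not the tool's actual computation. It assumes the percentage is the number of shared hops divided by the length of the shorter path, and it keeps the pairs above the 65% mark from the slide. The path names and hop labels are hypothetical.

```python
from itertools import combinations


def common_hop_fraction(path_a, path_b):
    """Share of hops two paths have in common, relative to the shorter path."""
    a, b = set(path_a), set(path_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def common_hop_pairs(paths, threshold=0.65):
    """Return the path pairs whose common-hop fraction exceeds the threshold."""
    pairs = {}
    for (name_a, hops_a), (name_b, hops_b) in combinations(sorted(paths.items()), 2):
        fraction = common_hop_fraction(hops_a, hops_b)
        if fraction > threshold:
            pairs[(name_a, name_b)] = fraction
    return pairs


# Hypothetical paths, each given as a list of router hop labels
paths = {
    "A->B": ["r1", "r2", "r3", "r4"],
    "A->C": ["r1", "r2", "r3", "r5"],
    "D->E": ["r6", "r7", "r8"],
}

print(common_hop_pairs(paths))
```

Paths that share most of their hops are natural candidates for a correlated anomaly, since a fault on a shared hop would show up on both.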
Common Event Analysis: Common Event Matrix (>70%)
Ranking critical anomaly events
- SPD (Static), 2 weeks: noisy grid due to SPD's high false alarms
- APD (Adaptive), 3 months: APD's accurate detection improves the usability of NTA-CAD grids
Data Sanity Checking and Certainty Analysis
- Importance of measurement tool calibration
- Certainty verification in cases of presence/absence of anomaly events
[Figure panels: Low Certainty with Anomaly Event; High Certainty without Anomaly Event]
OnTimeSecure: Resource Protection in perfSONAR
Need for Resource Protection Security: prevent abuse of monitoring infrastructure
Case Study: Secured Middleground in a Multi-campus Testbed
- Importance of deciding your security and data/information sharing posture
- Measurement federation across campuses using Internet2 InCommon
- Risk assessment and threat modeling study using the NIST method
- Open-pS compared with RPS-pS (RBAC and ABAC)
Putting it all together: Software-defined Measurement and Monitoring
- 5 principles: Intelligent, Extensible, Programmable, Secure, Social
- Easy to use!
- DOE STTR Phase I & II project in collaboration with Samraksh, Dublin OH
Sign up for a FREE trial: www.naradametrics.com
Architecture
[Figure: architecture diagram. Recoverable labels:]
- Metrics: Throughput, Delay, Jitter, Loss, MOS, Video
- Central Intelligence System: Push Service
- Measurement Orchestration Service: Centralized Control, Conflict-free Measurements, Automated Discovery, Control API
- Measurement Analysis Service: Anomaly Detection, Comparison
- Measurement Presentation Service: Dashboard, Event Notifications
- Data Management Service: Archive Results, Filter Manager, Query API
- HTML5, Command Line Tool, Web Interface
- VDMS: Clients, Network Operators, Third-Party tools
- Measurement Point Appliances 1 and 2: Client, Test Using Selected Metric, Collector API, Archive API
- Protected by the Resource Protection Service: Policy, Federated Policy Engine, Templates, IAM
- Archive Appliance
Interoperable with perfSONAR
Can test to perfSONAR measurement points from Narada Metrics MPAs and vice versa; query data into a custom dashboard
[Figure: screenshots. Custom Dashboard; Measurement Requests; Measurement Point Appliance; Setup policies for Measurement Data/Resource Sharing with other domains]
- Automated MPA Discovery and Management
- MPA Discovery in CIS
- Custom Metric Integration
- Measurement Federation Portal
- Programmable Interface
MU Science DMZ!
MU-OSU Dual-ended Science DMZs
[Figure: dual-ended Science DMZ workflow. Labels: Performance Engineer, Authenticated Researcher, Gatekeeper Proxy Middleware (Service Engine, Routing Engine, Measurement Engine), OpenFlow Controller, Campus-A Edge, Campus-B Edge, Campus-A Firewall, Campus-B Firewall, IP Network, Non-IP Network, Extended VLAN Overlay, Imaging Microscope, Image Processing Cluster. Legend: Data Flow, Control Flow]
1. Define application end-points and monitoring objectives
2. Provision policy-directed flow rules
3. Install HTC flow; install measurement flow
4. Authorized HTC flow; authorized measurement flow; non-Science DMZ flow
MU's PhysicalTherapy-as-a-Service
More than videoconferencing
[Figure: two video views. PT / Home: in-home patient view; Home / PT Clinic: physical therapist view]
Synchronous Big Data Problem
Narada Metrics Integration with PTaaS
Troubleshooting Google Fiber!
When everything works with static IP
The 40 Mbps mystery!!
Importance of vantage points for perfSONAR
Fun of working with ISPs and users together :)
- MU<->KC = 12 Mbps
- MU<->OSU = 360 Mbps
- KC Home to 150.199.7.214 = 50 Mbps
- MU to 150.199.4.146 = 347 Mbps
- MU to 150.199.7.214 = 350 Mbps
- OSU to 150.199.4.146 = 840 Mbps
Route path from KC to MU

Hop  RTT 1   RTT 2   RTT 3   Host [address]                                             Link
 1   <1 ms   <1 ms   <1 ms   networkbox.home [192.168.1.1]                              ?
 2   1 ms    1 ms    2 ms    10.26.0.25                                                 ?
 3   1 ms    1 ms    1 ms    ae9.bng01.mci122.googlefiber.net [192.119.17.142]          ?
 4   1 ms    1 ms    1 ms    ae1.pr01.mci103.googlefiber.net [192.119.17.33]            1x10G
 5   3 ms    2 ms    3 ms    kc-core-01-tengige0-0-0-6-2.mo.more.net [150.199.7.205]    1x10G
 6   1 ms    2 ms    1 ms    umicn-kc-tengige0-1-0-1.um.more.net [150.199.7.253]        1x10G
 7   6 ms    4 ms    4 ms    umicn-cn-tengige0-0-0-1-1.um.more.net [150.199.4.169]      1x10G
 8   12 ms   6 ms    4 ms    umc-nn-i2.bb.missouri.edu [150.199.4.198]                  2x10G
 9   4 ms    4 ms    4 ms    SOLO-xe010.1710.bb.missouri.edu [128.206.1.74]             2x10G
10   4 ms    10 ms   13 ms   TEXUS3-SOLO-area0.bb.missouri.edu [128.206.130.194]        2x10G
11   4 ms    4 ms    4 ms    RSCT1-TEXUS3.bb.missouri.edu [128.206.130.45]              2x10G
12   4 ms    4 ms    4 ms    RSCN1-v300.bb.missouri.edu [128.206.2.203]                 1x10G
13   6 ms    4 ms    4 ms    ecaas1.rnet.missouri.edu [128.206.119.182]                 1Gig host

Bottleneck segment?
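The "Bottleneck segment?" question can be approached by first ruling out the links whose capacity is known. The Python sketch below is illustrative only: it takes the capacity annotations from the route listing, finds the narrowest annotated link, and lists the hops whose capacity is marked "?". Treating "2x10G" as 20 Gbps of aggregate capacity is an assumption (a single flow may see only 10 Gbps), and the function name `narrowest_known_link` is hypothetical.

```python
# (hop number, host, annotated capacity in Gbps or None where the slide shows "?")
HOPS = [
    (1, "networkbox.home", None),
    (2, "10.26.0.25", None),
    (3, "ae9.bng01.mci122.googlefiber.net", None),
    (4, "ae1.pr01.mci103.googlefiber.net", 10.0),
    (5, "kc-core-01-tengige0-0-0-6-2.mo.more.net", 10.0),
    (6, "umicn-kc-tengige0-1-0-1.um.more.net", 10.0),
    (7, "umicn-cn-tengige0-0-0-1-1.um.more.net", 10.0),
    (8, "umc-nn-i2.bb.missouri.edu", 20.0),  # 2x10G, assumed to aggregate
    (9, "SOLO-xe010.1710.bb.missouri.edu", 20.0),
    (10, "TEXUS3-SOLO-area0.bb.missouri.edu", 20.0),
    (11, "RSCT1-TEXUS3.bb.missouri.edu", 20.0),
    (12, "RSCN1-v300.bb.missouri.edu", 10.0),
    (13, "ecaas1.rnet.missouri.edu", 1.0),  # 1Gig host
]


def narrowest_known_link(hops):
    """Return the lowest-capacity annotated hop and the hops with no annotation."""
    known = [hop for hop in hops if hop[2] is not None]
    unknown = [hop[0] for hop in hops if hop[2] is None]
    narrowest = min(known, key=lambda hop: hop[2])
    return narrowest, unknown


narrowest, unknown = narrowest_known_link(HOPS)
print("Narrowest annotated link: hop", narrowest[0], "at", narrowest[2], "Gbps")
print("Hops with unknown capacity:", unknown)
```

The narrowest annotated link is the 1 Gbps end host, which is still far above the 12 Mbps measured between MU and KC. The annotations alone therefore do not explain the result, and the unannotated hops near the home are where additional vantage points would help.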
MU's SoyKB
Narada Metrics Integration with SoyKB
Expectation Management
- L3 performance
- L2 performance
- Set up patrol checks to verify expectations
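A patrol check can be as simple as comparing the latest measurement on each path against what that path is expected to deliver. The Python sketch below shows the idea under stated assumptions. The function name `patrol_check`, the tolerance, and all expected values are hypothetical; only the two measured figures (MU<->KC and MU<->OSU) come from the vantage point slide, and "MU<->XYZ" is a made-up path with no data.

```python
def patrol_check(measured_mbps, expected_mbps, tolerance=0.2):
    """Compare measured throughput with the expectation for each path."""
    report = {}
    for path, expected in expected_mbps.items():
        measured = measured_mbps.get(path)
        if measured is None:
            report[path] = "NO DATA"
        elif measured >= expected * (1 - tolerance):
            report[path] = "OK"
        else:
            report[path] = "BELOW EXPECTATION"
    return report


# Measured values taken from the vantage point slide (Mbps)
measured = {"MU<->KC": 12, "MU<->OSU": 360}

# Hypothetical expectations (Mbps), chosen only for illustration
expected = {"MU<->KC": 300, "MU<->OSU": 300, "MU<->XYZ": 100}

for path, verdict in patrol_check(measured, expected).items():
    print(path, verdict)
```

Running such a check on a schedule turns an expectation into something that is verified continuously rather than discovered during troubleshooting.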
Conclusion
- Sign up for a FREE trial at http://www.naradametrics.com
- Software-defined monitoring for application users and network operators
- Welcome to contact me for more discussions! :) calyamp@missouri.edu
Thank you for your attention!