Cisco IT Hadoop Journey
Alex Garbarini, IT Engineer, Cisco
© 2015 MapR Technologies

Agenda
- Hadoop Platform Timeline
- Key Decisions / Lessons Learnt
- Data Lake
- Hadoop's place in IT Data Platforms
- Use Cases
© 2013 Cisco and/or its affiliates. All rights reserved. Cisco Confidential

Bringing Hadoop into Cisco IT in 2011-2012
- Paradigm shift from the database-based application development of the last two decades at Cisco IT
  - Cost Structure
  - Development Methodology & Project lifecycle
  - Programming Model
  - Maturity curve of the technology is different
- FUD: Fear, Uncertainty and Doubt
- Availability of skilled workforce
- Rapid pace of innovation and constantly changing industry dynamics

Hadoop Journey in Cisco IT
- 2011: POCs
- July 2012: Multi-tenant Shared Platform
- Starting 2013: Use Cases Deployment; Growth & Expanding Ecosystem
- 2014: Enterprise Data Lake

Key Decisions and Rationale
- Open Source vs Distribution: Operational Excellence, Availability, Performance, Skill set
- Architecture: UCS Common Platform Architecture; Support Growth & Leverage Ecosystem
- Ecosystem: Hive (SQL), Mahout, HBase; Cost
- Environment Lifecycle: Production, Stage, Development & Technical POC (isolate usage by risk and development lifecycle)
- Data Lake: Data Governance, Reduce cost, Eliminate duplication

Lessons from Technology Journey
- Architecture Choice(s)
  - Multi-tenant
  - Mission critical features
- Start Small & Grow
- Support: Open Source or Distribution
- Leverage Skills: use components that let users apply their existing skills, such as Informatica and SQL
- Tiered Integrated Architecture to manage data across multiple platforms

Lessons from Technology Journey
- Hive doesn't support ANSI SQL
- Reusable UDFs for Hive were created
- Tidal Enterprise Scheduler allowed for easy workload management and error handling
- Hadoop scales linearly and our platform grew 100% in the first year. Invest in architecture that allows you to grow.

Data Platform Reference Architecture v3
[Diagram: Data Sources, Data Storage and Processing, Data Consumption (Mobile / Browser / Data Service)]

Data Sources
- Databases: Customer Registry, ERP, SFDC
- Docs, Cases, Content, Social Media, Clickstream
- Customer Network, Product Usage, Internet of Everything (IoE)
- IT App & System Logs & Config.
- All other sources

Data Storage and Processing
- Cisco Data Virtualization (Composite): logical data abstraction layer across transactional, SaaS, Big Data & DW
- Big Data Platform (Hadoop & Spark on UCS): Machine Learning, Data Archiving, Data Science
- Network of Truth (SAP HANA on UCS): Predictive Engine, Real-time BI, Mission Critical Reporting
- Legacy EDW: Financial SSOTs, stable core, controlled change
- Index & Search: Operational Intelligence

Data Consumption
- Experience Toolkit: Rapid Prototyping / Data Integration / Data Services
- Agile Analytics: Self Service Dashboard, Rapid Business Intelligence
- Analytics & Modeling (HANA, Hadoop & Spark, SAS)
  - Data Exploration
  - Real-time Predictive Data Analysis, Analytics
  - Mission Critical Operational Reports
  - Text Machine Learning, Statistical Analysis (R)
  - Machine Data Insights (e.g. in supply chain)
  - Financial Reporting & Extract
- Operational Intelligence (Splunk UI)

Shared Data! Rich Analytics
Enterprise Platform(s)
- Engineering
- Advanced Services
- Cisco Services
- Marketing
- IT
- Sales
- Security
- Finance
- Supply Chain

Enterprise Data Lake
- Metadata-driven utilities to automate ingestion of data
- Access management driven by metadata
- Scalable
- Cost effective

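One way to picture a metadata-driven ingestion utility is a small catalog of source tables from which the import commands are generated, rather than written by hand. The sketch below is only an illustration under stated assumptions: the `SourceTable` record, the `/datalake/raw` root, the per-group directory layout, and the example JDBC URLs are all hypothetical and do not come from the slides. The Sqoop options used (`--connect`, `--table`, `--target-dir`, `--num-mappers`) are standard `sqoop import` arguments.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceTable:
    """One metadata record describing a table to ingest (hypothetical schema)."""
    jdbc_url: str
    table: str
    owner_group: str  # group whose directory receives the landed data
    mappers: int = 4  # parallel import tasks


def target_dir(entry, lake_root="/datalake/raw"):
    """Derive the landing directory from metadata (assumed layout)."""
    return f"{lake_root}/{entry.owner_group}/{entry.table.lower()}"


def sqoop_import_args(entry):
    """Build the argument list for one `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", entry.jdbc_url,
        "--table", entry.table,
        "--target-dir", target_dir(entry),
        "--num-mappers", str(entry.mappers),
    ]


def ingestion_plan(catalog):
    """Render one shell-safe command line per catalog entry."""
    return [shlex.join(sqoop_import_args(entry)) for entry in catalog]


# Hypothetical catalog; a real one would live in a metadata store.
CATALOG = [
    SourceTable("jdbc:oracle:thin:@//db.example.com:1521/ERP", "ORDERS", "finance"),
    SourceTable("jdbc:mysql://db.example.com/crm", "LEADS", "marketing", mappers=2),
]

if __name__ == "__main__":
    for command in ingestion_plan(CATALOG):
        print(command)
```

Adding a source then means adding one catalog record. Because the landing directory is derived from the owning group, access rules could be attached per directory from the same metadata.
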
Hadoop Use Cases: Organization (vs) Adoption Level
Adoption levels: Production, Pipeline
EDS
CSTG
- icam
- Party Ranking Service
- Teradata ETL Offload
- Data Lake
- Connected Analytics Network Deployment (CAND)
- Smart Call Home
- Cloud Consumption (Sentinel)
- NOS Online
- Network SSOT
Marketing
- Multi-Channel Scoring
- Automatic Qualified Leads
CWCS Metadata
- Content Auto-Tagging
CITS
- Cisco Partner Annuity Initiative
- Social Media Services
GIS
- Collaboration Dashboard
- Item, BOM & Compliance Data Analytics
Legal
Supply Chain
- Data Warehouse Expansion
- Measurement
- ACTS
- TST

Cisco IT Use Cases for Hadoop in Production
- Data Platform Option to Reduce Cost
  - Migrate ETL Processing from EDW (Teradata)
  - Data Lake & Ad hoc Data Analysis
  - Data Archiving
- Marketing & Content Management
  - Customer Segmentation
  - Multi-Channel Scoring
  - Content Autotagging
- Services
  - Smart Analytics Offerings
  - Service Opportunity Identification
- Risk & Compliance
  - Organization Network Analytics
  - Engineering Source Code Monitoring

Hadoop Distribution: MapR Advantage(s) for Cisco IT
- High Availability
  - Distributed Name Node
  - Snapshots
  - Volume Based Disaster Recovery
- Performance
  - Higher performance and fewer nodes ($)
- Operational Cost / Productivity
  - HBase (MapR DB) and Hadoop on the same cluster
  - NFS (Fully Read & Write)
  - Multiple simultaneous versions on same cluster

Thank You

Cisco Hadoop Platform Physical Architecture
- Multi UCS cluster Hadoop environment
- Multi-Tenant model for PROD and DEV/Stage

[Diagram labels]
- N7K
- Cisco UCS 62XXUP Fabric Interconnects (per domain): 8 x 10 Gb/s each, 80 Gb/s
- Cisco Nexus 2232PP 10 GE Fabric Extenders (per rack)
- Cisco Unified Computing System C240 M3
- Scalability, High Performance, High Availability, Operational Simplicity, Unified Management

Production Capacity (Components: Details)
- OS: RHEL 6.4
- Distribution: MapR (M7)
- Server (node): UCS 240 M3, 16 cores (32 with Hyper-Threading)
- Processor: E5-2655
- Memory/Node: 256 GB
- Storage/Node: 24*1 TB (22 HDFS)
- No. of Nodes: 54
- Cores: 864 (Hyper Threading enabled)
- Total Memory: 13824 GB
- Storage: 1188 TB
- No-SQL: HBase (MapR M7)
- Services: ZooKeeper, CLDB, WebServer, JobTracker on 3 nodes each; File Server and TaskTracker across all nodes; Platfora on 4 nodes

Hadoop Lifecycles

Components                                         POC               DEV                    QA                Production
Software
OS                                                 RHEL 6.4          RHEL 6.4               RHEL 6.4          RHEL 6.4
Hadoop distribution                                MapR M7 3.1.0     MapR M7 3.1.0          MapR M7 3.1.0     MapR M7 3.1.0
Server-Cluster
Cisco UCS servers                                  UCS C210 M2       UCS C210 M2 / C240 M3  UCS C240 M3       UCS C240 M3
Processor                                          Intel Xeon X5675  Intel Xeon X5675       Intel Xeon X5675  Intel Xeon E5-2655
Memory per node                                    48 GB             48 GB / 256 GB         256 GB            256 GB
Storage per node (HDFS, 7200 RPM SATA)             14*1 TB           14*1 TB / 22*1 TB      22*1 TB           22*1 TB
Rack Level
No. of nodes                                       4                 18                     8                 54
Processors/cores                                   48                240                    128               864
Memory                                             4x48 = 192 GB     12x48 + 6x256 GB       8x256 GB          54x256 = 13.8 TB
Storage capacity (3-way replication, compression)  4x18 = 72 TB      12x14 + 6x22 = 257 TB  150 TB            1188 TB

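The rack-level totals in the lifecycle table can be cross-checked with a few lines of arithmetic. The sketch below is a sanity check, not part of the original deck. The per-node core counts (12 and 16) are an assumption, inferred by dividing the slide's core totals by the node counts (48/4, 128/8, 864/54), and the 12 + 6 node split for DEV is taken from the slide's own memory formula. `NodeGroup` and the function names are hypothetical. Raw HDFS disk capacity is computed for Production only, where 54 nodes x 22 x 1 TB matches the slide's 1188 TB; the storage figures for the other environments do not follow from disk count alone, so the sketch does not try to reproduce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A set of identical nodes within one environment."""
    nodes: int
    cores_per_node: int      # assumption: inferred from the slide's totals
    memory_gb_per_node: int  # from the "Memory per node" row


# Node mix per environment, as read from the lifecycle table.
ENVIRONMENTS = {
    "POC": [NodeGroup(4, 12, 48)],
    "DEV": [NodeGroup(12, 12, 48), NodeGroup(6, 16, 256)],
    "QA": [NodeGroup(8, 16, 256)],
    "Production": [NodeGroup(54, 16, 256)],
}


def total_cores(groups):
    """Sum cores over all node groups in an environment."""
    return sum(g.nodes * g.cores_per_node for g in groups)


def total_memory_gb(groups):
    """Sum memory in GB over all node groups in an environment."""
    return sum(g.nodes * g.memory_gb_per_node for g in groups)


def raw_hdfs_tb(nodes, disks_per_node, disk_tb=1):
    """Raw disk capacity before replication or compression."""
    return nodes * disks_per_node * disk_tb


if __name__ == "__main__":
    for name, groups in ENVIRONMENTS.items():
        print(name, total_cores(groups), "cores,", total_memory_gb(groups), "GB")
    print("Production raw HDFS:", raw_hdfs_tb(54, 22), "TB")
```

With these inputs the core totals come out to 48, 240, 128 and 864, and Production memory to 13824 GB, which agrees with the figures on the slides.
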
Cisco UCS Big Data Common Platform Architecture (CPA)
A Highly Scalable Architecture Designed to Meet a Variety of Scale-Out Application Demands
- UCS Fabric Interconnects provide high-speed, fully redundant, active-active connectivity
  - Unified fabric (single wire management)
  - 66% reduction in switch ports
  - 66% reduction in cables
- Powered by UCS C-Series Rack servers
  - Form factor extension to UCS blade system
- UCS Manager
  - Global view of the cluster
  - Proactive monitoring of health
  - 1 Click system software management
- UCS Central
  - Unified management across cluster (up to 10,000 nodes)
  - Application isolation

Business Benefits
- Operational Simplification: Simplified and policy-based management
- Modular Solution: Modular framework that can scale from small to very large
- Risk Reduction: Pre-validation, tighter integration and optimizations reduce integration and deployment risk
- Lower TCO: Unified fabric, unified management and infrastructure optimized for performance lower TCO significantly

Architectural Benefits
- Scalability: Modular building block, scalable up to 7.2 PB with single management domain
- Performance: Best-in-class performance of compute and network for massively scale-out applications
- Management and Monitoring: Unified management across cluster (up to 10,000 nodes)

Hadoop Requirements
- Distributed powerful computing
- Reliable Hardware
- Local storage in PB
- Low Latency
- Low Cost
- Scalability and Performance
- Manageability

Hadoop Platform Security: Current State
[Diagram labels]

Users and clients
- Hadoop Admins
- Business User
- Hadoop Developer / Data Analyst
- Pentaho BI & DI Platform
- Tableau Dashboards
- icam Servers

Access controls
- Generic User ID replication used for authentication
- Port opened for Hadoop services (CLDB, JobTracker, File System & ZooKeeper), load balanced
- Admin ACL to limit access
- Secure Shell login and job submission

Cluster services
- CLDB
- MapR-FS
- JobTracker
- ZooKeeper

Edge Servers
- Sqoop: a tool for moving data to/from non-Hadoop data stores
- Pig: a high-level data flow language
- Hive: SQL-like language to query and analyze data using MR
- Impala: interactive SQL tool on Hadoop
- Mahout: data mining algorithms using MR
- R: statistical & machine learning language
- Oozie: a job control workflow
- Flume: tool to ingest/stream log data
- TES Agent: allows scheduled jobs to execute
