Big Data Blueprint for Cloud Architecture - Cognizant
Prabhu Inbarajan, Srinivasan Thiruvengadathan, Muralicharan Gurumoorthy, Praveen Codur
(c) 2012, Cognizant
Next 30 minutes
- Big Data / cloud challenges and opportunities
- Cognizant's solution framework
- Solution deep dive
- Results
- Future opportunities
Introduction - About Us
- Solution architects for e-commerce / search & advertising systems, with a focus on Big Data and cloud.
- Projects: large-scale cloud transformation of enterprise data warehouses and analytics environments using open BI technologies.
- Why it's a big deal: high business expectations, complex environments, and no frame of reference or standard stacks.
Expectations
Business expectations:
- Effective cost utilization
- Business expansion to other regions
- Zero tolerance for data / traffic loss
- Product/business available 24x7
- Business continuity plan
Technical translation: high availability, elasticity, effective backup, spanning across regions, efficiency, and remaining seamless and transparent to end users.
Recurring questions
- Build on premises or source from the cloud?
- What is the return on assets / cost of computing / economics?
- What is my frame of reference?
- What are the technology choices?
- What is the optimal technology stack?
- What is the optimal time to market?
- What are the operational challenges, and how do we mitigate them?
(Diagram: Goal -> Tool -> Outcome)
Cost - optimization potential vs. reality
Environment: 10 extra-large CPU instances, 60 large CPU instances, and 30 small CPU instances.
(Chart: seasonality of traffic at the primary and failover sites)
Because capacity is provisioned for peak traffic plus failover, whichever option we choose we pay 100% of the cost while utilization averages under 30%.
Moore says it's not enough
On-demand pricing, Jan-10 vs. Mar-12:
- Small Linux instance (N. Virginia): $0.085 -> $0.080
- Quadruple Extra Large Windows instance (N. California): $3.160 -> $2.504
- Data transfer in: free till June 2010 -> free
- Data transfer out (per GB, depending on total monthly volume): $0.10-$0.17 -> $0.05-$0.12
- Storage (EBS, per allocated GB per month): $0.10 -> $0.10
- I/O requests (per million): $0.10 -> $0.10
Concern: cloud pricing is not keeping pace with Moore's law - cloud computing capacity improves 4-10% per year, while physical hardware improves about 100% every 18 months.
Cloud optimization models - cost & utilization vs. complexity
(Chart: planned capacity steps overlaid on the peak-traffic distribution)

Parameter     Fixed capacity   2-step scaling   3-step scaling   4-step scaling
Utilization   30-35%           50-55%           60-70%           75-80%
Complexity    0%               30%              50%              75%

The sweet spot balances the utilization gained against the operational complexity added with each extra scaling step.
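The utilization gains in the table above can be reproduced with a toy model. This sketch is not from the deck: the traffic curve and step levels are made-up numbers chosen only to show why snapping capacity to more steps raises average utilization.

```python
# Illustrative sketch: average utilization when capacity snaps to the
# smallest configured step that covers demand. All numbers are invented.

def utilization(traffic, steps):
    """Average utilization for a given set of allowed capacity steps.

    traffic: demand samples (same units as steps)
    steps:   allowed capacity levels; the largest must cover the peak
    """
    total_demand = sum(traffic)
    total_capacity = 0
    for t in traffic:
        # pick the smallest capacity step that covers this demand sample
        cap = next(s for s in sorted(steps) if s >= t)
        total_capacity += cap
    return total_demand / total_capacity

# A toy hourly traffic curve peaking at 100 units
traffic = [10, 8, 5, 5, 8, 15, 30, 50, 80, 100, 90, 70,
           55, 45, 40, 35, 45, 60, 70, 65, 45, 30, 20, 12]

fixed = utilization(traffic, [100])              # always provisioned for peak
two_step = utilization(traffic, [50, 100])
four_step = utilization(traffic, [25, 50, 75, 100])

print(f"fixed:  {fixed:.0%}")
print(f"2-step: {two_step:.0%}")
print(f"4-step: {four_step:.0%}")
```

Even with this crude model, each added step recovers a large slice of the capacity that fixed peak provisioning leaves idle, at the cost of more scaling transitions to operate.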
Mission - value proposition
Mission: create a framework that abstracts all of the complexities below from application developers and operators, and provide a blueprint implementation for Big Data enterprise applications on the cloud:
- Cloud infrastructure
- Operating systems
- Security
- Monitoring
- Application stack for BI, visualization, data collection and processing
- Data stores
In short: what the LAMP stack is for web applications, this is for the cloud and Big Data space.
Value proposition: it encapsulates the body of knowledge around cloud and open BI in an automated solution, resulting in higher productivity, higher efficiency, repeatability, and a highly optimized stack.
Solution - SAHANA, a blueprint for a cloud BI stack
Layers: Collection | BI & Visualization | Scheduling | Data Integration | Distributed Processing | Big Storage | Storage | Orchestration | Provisioning | Monitoring | Security | Infrastructure & OS
SAHANA v1.0
Layers: Collection | BI & Visualization | Scheduling | Data Integration | Distributed Processing | Big Storage | Storage | Orchestration | Provisioning | Monitoring | Security | Infrastructure & OS
Complex architecture
(Diagram: reports/display/summary flows for publishers and advertisers - load balancer, Ad Center API cluster, ad-server middle tier, HDFS Map-Reduce master/slaves, MemCache, job scheduler, HBase, ETL, activity history on SQL Server 2008, Pentaho reporting, data warehouse)
Infrastructure equipped for ad serving and analytics for a top-tier search engine: an atlas of functional components - front ends, processing layers, data stores - spanning over 500 machines and forecast to grow to 5000+.
USER FACING
Architectures
- 2-tier logging system
- 3-tier front end
- N-tier front end
Design considerations
- Scale up or down as demand on the application fluctuates.
- Back up critical data to a persistent store (object-based, such as S3).
Scaling strategy
Scaling parameters: CPU, memory, disk/network IO, system load, response latency.
Scaling up: treat 80% as the threshold for each metric above, and scale horizontally if any threshold is breached. Add 10% capacity per burst; if load falls back under the threshold, stop, otherwise keep adding 10% until it does.
Scaling down: depends on the nature of the application. If the app is fault tolerant, scale-down can be automatic; if data must be backed up first, scale-down requires human intervention. Scale down when the parameters above drop below 70%.
DR: 20% of capacity runs as hot standby; after a failover, add systems in bursts based on traffic.
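The scale-up/scale-down rules above can be condensed into a single decision function. A minimal sketch: the 80%/70% thresholds, the 10% burst size, and the 20% hot-standby figure come from the slide; the metric names and the reading of the hot-standby rule as a scale-in floor are illustrative assumptions.

```python
# Minimal sketch of the threshold-based scaling rules (thresholds and burst
# size per the slide; metric names and structure are illustrative).

SCALE_UP_THRESHOLD = 0.80    # scale out when any metric reaches 80%
SCALE_DOWN_THRESHOLD = 0.70  # scale in only when all metrics drop below 70%
BURST_FRACTION = 0.10        # add/remove capacity in 10% increments
HOT_STANDBY_FRACTION = 0.20  # keep 20% of capacity as a hot-standby floor

def scaling_decision(metrics, current_nodes, fault_tolerant=True):
    """Return the node count after one evaluation cycle.

    metrics: dict of utilization ratios (cpu, memory, io, load, latency...)
    """
    burst = max(1, round(current_nodes * BURST_FRACTION))
    if any(v >= SCALE_UP_THRESHOLD for v in metrics.values()):
        # keep adding 10% per cycle until load falls back under threshold
        return current_nodes + burst
    if all(v < SCALE_DOWN_THRESHOLD for v in metrics.values()):
        if not fault_tolerant:
            # per the slide: scale-down needs human intervention when data
            # must be drained or backed up first
            return current_nodes
        floor = max(1, round(current_nodes * HOT_STANDBY_FRACTION))
        return max(floor, current_nodes - burst)
    return current_nodes  # within the 70-80% band: hold steady

print(scaling_decision({"cpu": 0.85, "memory": 0.40}, 60))  # scales out to 66
print(scaling_decision({"cpu": 0.50, "memory": 0.40}, 60))  # scales in to 54
```

In practice the function would be driven by a monitoring feed on a fixed evaluation interval, with the burst repeated each cycle until the breach clears.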
Deployment strategy
- Server templates: deploy from a server template of the running application component. An auto-scaling group, defined on the scaling parameters, scales up by launching instances from the templates.
- Configuration management: alternatively, bring up a base server and configure it for its application role using a tool such as Chef.
(Chart: deployment time in minutes - Chef vs. server templates - at 20%, 50% and 80% of peak capacity)
OFFLINE SYSTEMS
Design considerations
01 Store historical data for analytics queries.
02 Keep it readily available for ad-hoc querying.
03 Apply a tiered data retention policy.
04 Use Amazon S3 as the data backbone.
05 Choices available: AWS EMR, a Hadoop cluster, Hive, HBase.
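With S3 as the data backbone, the tiered retention policy can be expressed as a bucket lifecycle configuration. The sketch below builds the dict shape that boto3's `put_bucket_lifecycle_configuration` accepts; the prefixes, tier ages, and expiry windows are assumptions for illustration, not values from the deck.

```python
# Sketch: a tiered retention policy as an S3 lifecycle configuration.
# Prefixes and day counts are illustrative assumptions.

def tiered_lifecycle(tiers):
    """Build a lifecycle-configuration dict from retention tiers.

    tiers: list of (prefix, days_to_glacier, days_to_expiry) tuples;
           use None to skip the archival transition or the expiry.
    """
    rules = []
    for prefix, to_glacier, expiry in tiers:
        rule = {
            "ID": f"retention-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
        }
        if to_glacier is not None:
            rule["Transitions"] = [
                {"Days": to_glacier, "StorageClass": "GLACIER"}
            ]
        if expiry is not None:
            rule["Expiration"] = {"Days": expiry}
        rules.append(rule)
    return {"Rules": rules}

config = tiered_lifecycle([
    ("raw-unprocessed/", 30, 365),  # archive after a month, keep a year
    ("raw-historical/", 90, None),  # archive after a quarter, keep forever
    ("summarized/", None, None),    # stays hot for historical comparisons
])
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)
```

The same structure extends naturally to per-tier storage classes (e.g. infrequent-access tiers) as the retention policy evolves.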
Architecture - Hadoop batch processing systems
(Diagram: job clients feed a message queue consumed by the job tracker; hosts 1..N run task-tracker/data-node daemons and hold shards A-D, replicated across hosts)
Architecture, continued - the stack
- ETL tools, BI reporting, RDBMS (integrated via Sqoop)
- Pig (data flow) and Hive (SQL)
- MapReduce (job scheduling/execution system)
- HBase (key-value store)
- HDFS (Hadoop Distributed File System)
Scaling strategy
Scaling parameters: index size for the traditional OLTP system; task-handler capacity for the batch processing system; processing time of a query/job.
Scaling up: add a shard server to the OLTP tier; add a task tracker to the batch processing system.
Scaling down: reduce task-tracker nodes when processing capacity exceeds demand; redistribute the data to other nodes before shutting a node off.
DR: 0% capacity active at any time for the batch processing system; on failover, launch a new Hadoop batch processing cluster or use EMR.
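A sizing rule for the batch tier can be derived from the backlog and per-node task-handler capacity. This is an illustrative sketch: the slot count per tracker and the headroom margin are assumptions, not figures from the deck.

```python
# Illustrative sizing rule for the batch tier: grow the task-tracker pool to
# cover the pending-task backlog, shrink to zero when idle. Slot counts and
# the headroom margin are assumed values.

SLOTS_PER_TRACKER = 8  # map+reduce task slots per task-tracker node (assumed)
HEADROOM = 1.2         # keep 20% spare slot capacity (assumed)

def target_trackers(pending_tasks, min_trackers=0):
    """Task-tracker nodes needed to cover the backlog with headroom.

    min_trackers=0 mirrors the slide: the batch tier can sit at 0% active
    capacity and be launched on demand (or replaced with EMR) on failover.
    """
    slots_needed = int(pending_tasks * HEADROOM)
    needed = -(-slots_needed // SLOTS_PER_TRACKER)  # ceiling division
    return max(min_trackers, needed)

print(target_trackers(200))  # trackers for a 200-task backlog
print(target_trackers(0))    # idle batch tier scales to zero
```

Before retiring any node chosen by this rule, its data must be redistributed to the remaining nodes, as the slide notes.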
Deployment strategy - offline system
- Server templates: a template of the data-node / task-tracker role; a launched server comes online and starts processing data.
- Configuration management: launch a base server and configure it as a cluster node using a tool such as Chef.
(Chart: deployment time in minutes - Chef vs. server templates - at 20%, 50% and 80% of peak capacity)
DATA STORES
Data stores
The stores hold the raw, meta and summarized data sets. Summarized data is derived by processing raw data and is used for historical comparisons. Once operational, the DR site receives the meta and summarized data sets from the data bus.
(Chart: share of total data - 20%, 5%, 10%, 70% - across the raw unprocessed, raw historical, summarized and meta sets)
(Table: access frequency - frequent / moderate / rare - marked per data type)
Scaling strategy
Scaling parameter: storage capacity.
Scaling up: add capacity when utilization goes above 80%.
Scaling down: not applicable.
DR: replicate the raw unprocessed, meta and summary data; sync raw unprocessed data from the archive on an as-needed basis.
Orchestration layout
(Diagram: Ops provisions, configures and monitors applications and virtual machines through models, recipes and a dashboard)
- Chef
- AWS CloudFormation
- Automated launch scripts and server templatization
- Application configuration for multiple regions
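The CloudFormation side of the orchestration layer can be sketched as a template built programmatically. Everything here is a placeholder assumption - the AMI id, instance type, resource names and the role tag - shown only to illustrate templatized, repeatable launches of one application tier.

```python
import json

# Sketch: a minimal CloudFormation template (as a Python dict) for one
# templatized app-server tier. All names and ids are placeholders.

def app_tier_template(ami_id, instance_type="m1.large", count=2):
    """Build a CloudFormation template dict for the app-server role."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "App tier launched from a server template (AMI)",
        "Resources": {
            f"AppServer{i}": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": ami_id,
                    "InstanceType": instance_type,
                    # the role tag lets the Chef run pick the right recipes
                    "Tags": [{"Key": "Role", "Value": "app-server"}],
                },
            }
            for i in range(1, count + 1)
        },
    }

template = app_tier_template("ami-12345678", count=2)
print(json.dumps(template, indent=2))
```

Launching the same template with region-specific parameters (a different AMI id per region) gives the multi-region application configuration the slide calls for.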
Monitoring layout
- Zenoss (system monitoring): system metrics via SNMP, DFS metrics, JMX-exposed Java application metrics, Apache/Tomcat stats, system state and counts.
- Ganglia: MapReduce metrics, utilization metrics, active nodes, functional stats, system load.
New hosts should register themselves with the monitoring systems via the API - either a Chef recipe runs to add the node to monitoring, or auto-discovery picks it up. When scaling down, the server must be removed from monitoring.
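The self-registration step can be sketched as a small first-boot routine. The endpoint, payload fields, and `collector` parameter below are hypothetical stand-ins; a real deployment would use the Zenoss or Ganglia APIs (or the Chef recipe) mentioned above.

```python
import json
import socket

# Sketch: a freshly launched node announces itself to the monitoring system.
# The payload fields and the (commented) endpoint are hypothetical.

def registration_payload(role, collector="default"):
    """Describe this host to the monitoring API on first boot."""
    hostname = socket.gethostname()
    return {
        "action": "register",  # use "deregister" on scale-down
        "host": hostname,
        "role": role,          # drives which checks get attached
        "collector": collector,
    }

payload = registration_payload("datanode")
body = json.dumps(payload).encode()
# urllib.request.urlopen("http://monitor.example/api/hosts", body)
print(payload["action"], payload["role"])
```

The same routine, with the action flipped to deregister, covers the scale-down requirement that servers be removed from monitoring before termination.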
OUR BI SYSTEM IN ACTION
Normal operation
(Diagram: managed DNS routes to the primary site - load balancer, app server, log servers, concentrator and secondary concentrator, data store with name node / secondary name node / data nodes 1..n, job scheduler, RDBMS master and replica - with the orchestration layer and Ops alongside the idle failover/DR site)
DR in transition
(Diagram sequence, three steps: the orchestration layer brings the DR site up - first the log servers and concentrators are duplicated at the failover site, then the name node / secondary name node and data nodes 1..n are launched, and finally the job scheduler and RDBMS come online at the DR site)
Fallback to primary site
(Diagram: same topology as normal operation, with managed DNS pointing back at the primary site)
Results
- Productivity: implementation time for a base Big Data / ETL system reduced from multiple months to less than a day; developers focus on the business rather than the infrastructure.
- Efficiency: system resources run at 30% higher utilization than the benchmark; performance optimized to 10% above the stock configuration.
- Repeatability: a complete BI stack in less than a day, regardless of scale.
- Opex: at least 50% better than the benchmark.
- TCO: on a 3-year horizon, at least 35% lower than the benchmark.
Future Opportunities
Cognizant - global technology consulting
- $7.2 billion gross revenue; 1000+ customers; 50+ delivery centers; 160,000 employees; 23+ verticals (e-commerce, banking, insurance, ...)
- E-commerce: dedicated practice for internet businesses; large-scale, complex implementations with emerging technologies; 1000+ enterprise architects
- Search & advertising: dedicated practice for search, advertising and analytics; mature Big Data / cloud solutions and frameworks
- Research & development: innovation and patents
Free assessment of your challenges / environment.
Contact us: Prabhu.Inbarajan@cognizant.com, Muralicharan.Gurumoorthy@cognizant.com, Praveen.Codur@cognizant.com
Credits: Laxmana Gunta, Viral Shah, Paramasivam Kumarasamy, Sundaramoorthy