Big Data blueprint for cloud architecture


Big Data blueprint for cloud architecture - Cognizant. Prabhu Inbarajan, Srinivasan Thiruvengadathan, Muralicharan Gurumoorthy, Praveen Codur. 2012, Cognizant

The next 30 minutes: Big Data / cloud challenges and opportunities; Cognizant's solution framework; solution deep dive; results; future opportunities.

Introduction - about us: solution architects for e-commerce / search & advertising systems, with a focus on Big Data and cloud. Projects: large-scale cloud transformation projects for enterprise data warehouses & analytics environments using open BI technologies. Why it's a big deal: high business expectations, a complex environment, no frame of reference, and no standard stacks.

Expectations and their technical translation. Business expectations: effective cost utilization; business expansion to other regions; zero tolerance for data / traffic loss; product/business available 24x7; a business continuity plan. Technical translation: high availability; elasticity; effective backup; spanning across regions; efficient, seamless, and transparent to end users. Popular questions:

Recurring questions: build on premises or source from the cloud? What is the return on assets / cost of computing / economics? What is my frame of reference? What are the technology choices? What is the optimal technology stack? What is the optimal time to market? What are the operational challenges, and how do we mitigate them? Goal, tool, outcome.

Cost - optimization potential vs. reality. Environment: 10 Extra Large CPU instances, 60 Large CPU instances, and 30 Small CPU instances. [Chart: seasonality of traffic, with a failover site provisioned alongside.] Based on the peak traffic levels and failover utilization, whichever option we choose we pay 100% of the cost while utilization averages under 30%.

Moore says it's not enough - AWS on-demand pricing, Jan-10 vs. Mar-12:
Small Linux instance, N. Virginia: $0.085 -> $0.080
Quadruple Extra Large Windows instance, N. California: $3.160 -> $2.504
Data transfer in: free until June 2010 -> free
Data transfer out, per GB depending on total monthly volume: $0.10-$0.17 -> $0.05-$0.12
Storage (EBS), per allocated GB per month: $0.10 -> $0.10
I/O requests, per million I/O: $0.10 -> $0.10
Concern: cloud pricing is not keeping pace with Moore's law - cloud computing capacity improves 4-10% per year, while physical hardware improves about 100% every 18 months.

Cloud optimization models - cost & utilization vs. complexity (planned capacity stepped against the traffic distribution, from fixed capacity up to 4-step scaling):
Parameter      Fixed capacity   2-step scaling   3-step scaling   4-step scaling
Utilization    30-35%           50-55%           60-70%           75-80%
Complexity     0%               30%              50%              75%
The sweet spot lies between the extremes, where the utilization gained still outweighs the complexity added.
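The trade-off in the table above can be modeled in a few lines: with more capacity steps, provisioned capacity tracks traffic more closely, so average utilization rises. This is an illustrative sketch only; the traffic profile and the even spacing of capacity levels are made-up assumptions, not figures from the deck.

```python
def avg_utilization(traffic, steps):
    """traffic: load samples as fractions of peak; steps: number of
    capacity levels available (1 = fixed capacity at peak)."""
    peak = max(traffic)
    # Evenly spaced capacity levels up to peak (an assumption).
    levels = [peak * (i + 1) / steps for i in range(steps)]
    used, provisioned = 0.0, 0.0
    for t in traffic:
        # Provision the smallest capacity step that covers the load.
        cap = next(level for level in levels if level >= t)
        used += t
        provisioned += cap
    return used / provisioned

# Hypothetical daily traffic profile, as fractions of peak load.
traffic = [0.2, 0.3, 0.5, 0.9, 0.6, 0.3]
```

Running this with 1 step (fixed capacity) versus 4 steps shows utilization climbing, mirroring the 30-35% vs. 75-80% range in the table.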

Mission - value proposition. Mission: create a framework/solution that abstracts all of the complexities below from application developers and operators, and provides a blueprint implementation for Big Data enterprise applications on the cloud: cloud infrastructure; operating systems; security; monitoring; the application stack for BI, visualization, data collection and processing; data stores. The equivalent, for the cloud and Big Data space, of what the LAMP stack is for web apps. Value proposition: encapsulates the body of knowledge around cloud and open BI into an automated solution, resulting in higher productivity, higher efficiency, repeatability, and a highly optimized stack.

Solution - SAHANA, a blueprint for the cloud BI stack: Collection; BI & Visualization; Scheduling; Data Integration; Distributed Processing; Big Storage; Storage; Orchestration; Provisioning; Monitoring; Security; Infrastructure & OS.

SAHANA v1.0: Collection; BI & Visualization; Scheduling; Data Integration; Distributed Processing; Big Storage; Storage; Orchestration; Provisioning; Monitoring; Security; Infrastructure & OS.

Complex architecture: reports, display, and summaries for publishers and advertisers; load balancer; Ad Center API cluster; ad server middle tier; HDFS MapReduce masters and slaves; MemCache; job scheduler; HBase; ETL; activity history in SQL Server 2008; Pentaho reporting; data warehouse. Infrastructure equipped for ad serving and analytics capabilities for a top-tier search engine - an atlas of functional components across front ends, processing layers, and data stores, spanning over 500 machines and forecast to grow to 5000+.

USER FACING

Architectures: 2-tier logging system; 3-tier front end; N-tier front end.

Design considerations: scale up or down as demand on the application fluctuates; back up critical data on a persistent store (object-based, like S3).

Scaling strategy. Scaling parameters: CPU, memory, disk/network IO, system load, response latency. Scaling up: all of the listed metrics should stay below 80%; scale horizontally if any parameter's threshold is breached; add 10% capacity per burst, and keep adding 10% until load comes back under the threshold. Scaling down: depends on the nature of the application - if the app is fault tolerant, scaling down can be automatic, but if data needs to be backed up first, scaling down requires human intervention; scale down when the above parameters fall below 70%; 20% of capacity runs as hot standby. DR: burst addition of systems after failover, based on traffic.
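The policy above can be sketched as a simple decision function. The 80%/70% thresholds, the metric names, and the 10% burst increment come from the slide; the function itself and the exact scale-down floor are a simplified interpretation, not the deck's implementation.

```python
SCALE_UP_THRESHOLD = 0.80    # scale out when any metric breaches 80%
SCALE_DOWN_THRESHOLD = 0.70  # scale in only when all metrics are below 70%
BURST_STEP = 0.10            # add/remove capacity in 10% increments
HOT_STANDBY = 0.20           # keep 20% of capacity as hot standby

def scaling_decision(metrics, capacity):
    """metrics: dict of utilization ratios (cpu, memory, io, load, latency).
    Returns the new target capacity after applying one policy step."""
    if any(v >= SCALE_UP_THRESHOLD for v in metrics.values()):
        # Any breached threshold -> scale horizontally by one 10% burst;
        # repeated evaluations keep adding bursts until load recovers.
        return capacity * (1 + BURST_STEP)
    if all(v < SCALE_DOWN_THRESHOLD for v in metrics.values()):
        # All metrics comfortably low -> shed one burst, but never drop
        # below the hot-standby reserve.
        return max(capacity * (1 - BURST_STEP), HOT_STANDBY * capacity)
    return capacity
```

In practice this would be driven by a monitoring loop; the sketch only captures the threshold logic.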

Deployment strategy (time in minutes). One approach uses server templates of the running application components: an auto-scaling group is defined on the scaling parameters and scales up by launching instances from the templates. The other approach brings up a base server and configures it for the application role using a config tool like Chef. [Chart: deployment time in minutes, Chef vs. server templates, at 20%, 50%, and 80% of peak capacity.]

OFFLINE SYSTEMS

Design considerations: (1) a store for historical data for analytics queries; (2) readily available for ad-hoc querying; (3) a tiered data retention policy; (4) Amazon S3 as a data backbone; (5) choices available: AWS EMR, a Hadoop cluster, Hive, HBase.

Architecture - Hadoop batch processing system: [Diagram: job clients submit work through a message queue to the JobTracker; hosts 1..N run TaskTracker/DataNode (TT/DN) processes, each holding replicas of shards A-D.]

Architecture continued - ETL tools, BI reporting, and RDBMS integration on top of Pig (data flow), Hive (SQL), and Sqoop; MapReduce (job scheduling/execution system); HBase (key-value store); HDFS (Hadoop Distributed File System).

Scaling strategy. Scaling parameters: index size for a traditional OLTP system; number of task handlers (capacity) for a batch processing system; processing time of a query/job. Scaling up: add a shard server to the OLTP system; add a TaskTracker to the batch processing system. Scaling down: reduce TaskTracker nodes if processing capacity exceeds demand; distribute the data to other nodes before shutting a node off. DR: 0% capacity active at any time for the batch processing system; on failover, launch a Hadoop batch processing system or use EMR.
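The batch-tier rule above (add TaskTrackers when work outstrips handler capacity, drain and remove them when capacity is idle) can be sketched as a sizing function. The queue-depth bound and the one-node-at-a-time scale-down are hypothetical choices for illustration, not figures from the deck.

```python
def tasktracker_target(pending_tasks, slots_per_tracker, trackers,
                       max_queue_per_slot=2):
    """Return the desired TaskTracker count for the batch cluster.

    pending_tasks: map/reduce tasks waiting to run
    slots_per_tracker: task slots each TaskTracker provides
    trackers: current TaskTracker count
    max_queue_per_slot: hypothetical bound on backlog per slot
    """
    capacity = trackers * slots_per_tracker
    if pending_tasks > capacity * max_queue_per_slot:
        # Scale up: enough trackers to bring the backlog under the bound
        # (ceiling division without math.ceil).
        return -(-pending_tasks // (slots_per_tracker * max_queue_per_slot))
    if pending_tasks < capacity // 2 and trackers > 1:
        # Scale down one node at a time; per the slide, redistribute its
        # data to other nodes before shutting it off.
        return trackers - 1
    return trackers
```

A scheduler loop would call this periodically and reconcile the cluster toward the returned target.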

Deployment strategy - offline system (time in minutes). Server templates of the DataNode / TaskTracker roles: a launched server comes online and starts processing data. Or, through a configuration management tool: launch a base server and configure it as a cluster node. [Chart: deployment time in minutes, Chef vs. server templates, at 20%, 50%, and 80% of peak capacity.]

DATA STORES

Data stores hold the raw, meta, and summarized data sets. Summarized data is derived by processing raw data and is used for historical comparisons. The DR site, once operational, gets the meta and summarized data sets from the data bus. [Chart: share of total data (20/5/10/70%) across the Raw Unprocessed, Raw Historical, Summarized, and Meta sets, with access frequency ranging from frequent through moderate to rare.]

Scaling strategy. Scaling parameter: storage capacity. Scaling up: add capacity if utilization goes above 80%. Scaling down: not applicable. DR: replicate the raw unprocessed, meta, and summary data; sync raw unprocessed data from the archive on an as-needed basis.

Orchestration layout: provision, configure, and monitor applications and virtual machines from the ops system (model, recipe, dashboard) using Chef and AWS CloudFormation - automated launch scripts and server templatization, with application configuration for multiple regions.
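A minimal sketch of what the CloudFormation side of this orchestration might look like: one launch configuration whose user data hands the node to Chef, wrapped in an auto-scaling group. The resource names, AMI ID, instance type, role name, and group sizes are all hypothetical placeholders, not values from the deck.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch: auto-scaled app tier bootstrapped by Chef (illustrative values)",
  "Resources": {
    "AppLaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-00000000",
        "InstanceType": "m1.large",
        "UserData": {
          "Fn::Base64": "#!/bin/bash\nchef-client -r 'role[app-server]'\n"
        }
      }
    },
    "AppScalingGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "LaunchConfigurationName": { "Ref": "AppLaunchConfig" },
        "MinSize": "2",
        "MaxSize": "10"
      }
    }
  }
}
```

The same template, parameterized per region, would cover the "application configuration for multiple regions" point above.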

Monitoring layout. The dashboard is fed by: system metrics / SNMP, DFS metrics, Java apps with exposed JMX, and Apache/Tomcat stats via Zenoss (system monitoring, system state, system count); MR metrics and utilization metrics via Ganglia (active nodes, functional stats, system load). New hosts should register themselves with the monitoring systems using the API - via auto discovery, or by running the Chef recipe that adds the node to monitoring. While scaling down, the server needs to be removed from monitoring.
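The register-on-boot / deregister-on-scale-down flow above can be sketched as a reconciliation step. The MonitoringAPI class here is hypothetical, standing in for the Zenoss/Ganglia registration API the slide refers to.

```python
class MonitoringAPI:
    """Hypothetical stand-in for the monitoring system's host inventory."""

    def __init__(self):
        self.hosts = set()

    def register(self, hostname):
        # Called from the node's first Chef run so metrics start flowing.
        self.hosts.add(hostname)

    def deregister(self, hostname):
        # Called before termination so the dashboard stops alerting on it.
        self.hosts.discard(hostname)

def sync_fleet(api, live_hosts):
    """Reconcile the monitoring inventory with the running fleet."""
    for host in live_hosts - api.hosts:
        api.register(host)          # new hosts from a scale-up
    for host in api.hosts - live_hosts:
        api.deregister(host)        # hosts removed by a scale-down
```

Running sync_fleet after every scaling event keeps the dashboard's host list matching reality, which is the auto-discovery behavior the slide describes.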

OUR BI SYSTEM IN ACTION

Normal operation (failover - DR): [Diagram: managed DNS routes to the primary site - load balancer, app server, log servers, data store, primary and secondary concentrators - with the orchestration layer managing the NameNode, Secondary NameNode, DataNodes 1..n, ops, job scheduler, and RDBMS master/replica.]

DR in transition: [Diagram: the DR site comes up in stages - first the app tier (load balancer, app server, log servers) and the NameNode / Secondary NameNode, then additional log servers and concentrators, and finally DataNodes 1..n with the job scheduler and the RDBMS replica alongside the primary's master.]

Fallback to the primary site: [Diagram: managed DNS points traffic back to the restored primary - load balancer, app server, log servers, data store, concentrators, orchestration layer with NameNode, Secondary NameNode, DataNodes 1..n, ops, job scheduler, RDBMS master/replica - and the DR site returns to standby.]

Results. Productivity: implementation time for a base Big Data / ETL system reduced from multiple months to less than a day; developers focus on the business rather than infrastructure. Efficiency: better use of system resources, operating at 30% higher utilization than the benchmark; optimized for performance, 10% higher than the stock configuration. Repeatability: a complete BI stack in less than a day, regardless of scale. Opex: at least 50% better than the benchmark. TCO: on a 3-year horizon, at least 35% lower than the benchmark.

Future Opportunities

Cognizant - global technology consulting: $7.2 billion gross revenue; 1000+ customers, 50+ delivery centers; 160,000 employees; 23+ verticals (e-commerce, banking, insurance, ...); 1000+ enterprise architects. E-commerce: a dedicated practice for internet businesses; large-scale, complex implementations with emerging technologies. Search & advertising: a dedicated practice for search, advertising, and analytics; mature Big Data / cloud solutions and frameworks. Research & development: innovation & patents; free assessment of your challenges / environment. Contact us: Prabhu.Inbarajan@cognizant.com, Muralicharan.Gurumoorthy@cognizant.com, Praveen.Codur@cognizant.com. 2012, Cognizant. Credits: Laxmana Gunta, Viral Shah, Paramasivam Kumarasamy, Sundaramoorthy.