NextGen Infrastructure for Big DATA Analytics.


So What is Big Data?
"Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it."
Edd Dumbill, program chair for the O'Reilly Strata Conference

Big Data Characteristics
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text. Structured (logs, business transactions), semi-structured and unstructured.
Velocity: Streaming data and large-volume data movement. Moves at very high rates.
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs).
Valuable for mining patterns, trends and relationships.
Source: IBM Research
Extracting business insights from a large volume, variety and velocity of data, beyond what was previously possible!
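The "1K TBs" and "1B TBs" shorthand on the Volume line can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units; the constant and function names are illustrative and do not come from the slides.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; the
# slide does not say whether decimal or binary units are meant).
TB = 10**12
PB = 10**15
ZB = 10**21

def in_terabytes(num_bytes):
    """Express a byte count as a whole number of terabytes."""
    return num_bytes // TB

print(in_terabytes(PB))  # 1000        -> "1K TBs"
print(in_terabytes(ZB))  # 1000000000  -> "1B TBs"
```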

Big Data Product Metrics Choices

Big Data Enriches the Information Management Ecosystem
o Active Archive
o Cost Optimization
o Master Data Enrichment via Life Events, Hobbies, Roles, +++
o Establishing Information as a Service
o Audit MapReduce Jobs and Tasks
o OLTP Optimization (SAP, checkout, +++)
o Managing a Governance Initiative
o Who Ran What, Where, and When?

The Infrastructure
The internet has spawned an explosion in data growth in the form of data sets, called Big Data, which are so large that they are difficult to store, manage and analyze using traditional database and storage architectures. This new data is heavily unstructured, voluminous, streams in rapidly and is difficult to harness. Even looking at Volume alone among the 3 Vs of Big Data (Volume, Variety and Velocity), the following figures give a sense of the massive infrastructure required:
o A jet engine produces ~10 PB of data every 30 minutes of flight time.
o Google processes ~20 PB of data per day.
o If one Exabyte's worth of data were placed on DVDs, stored in thin jewel cases and loaded into Boeing 747 aircraft, it would take 13,513 planes to transport that one Exabyte.
To capitalize on the Big Data trend, a new breed of Big Data technologies such as Hadoop, Hive, Pig, Avro and NoSQL has emerged. These technologies use parallelized processing on commodity hardware to capture and analyze the new data sets, and provide price/performance that is 10 times better than existing database, data warehousing and business intelligence systems.
In higher-end systems, a lot of data comes from all of the business processes, from managing inventories to analyzing trends for future products. These systems run many different applications that use the same massive amount of data in many different ways. Here, we look at how all of these pieces fit together with the evolving need for new infrastructure around storage, networking, virtualization and cloud.

Key Technologies: Infrastructure Required for Big Data
o Cloud Infrastructure
o Virtualization
o Networking
o Storage
o In-Memory Database
o Tiered Storage Software
o De-Duplication
o Data Protection

Cloud Infrastructure

Virtualization Infrastructure: Workload Consolidation / TCO Savings

Big Data Infrastructure - MapReduce
Typical hardware:
o 20-40 nodes/rack
o 16 cores
o 48 GB RAM
o 6-12 x 2 TB disks
o 1-2 GigE to each node
Easy to use: the developer writes a few functions
Moves compute to data
o Schedules work on the HDFS node that holds the data
o Scans through the data
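To illustrate "the developer writes a few functions", here is a minimal single-process sketch of the map and reduce steps, using word counting as the example. It is plain Python and not the Hadoop API; the function names and the toy driver `run_job` are assumptions made for illustration. In a real cluster the framework, not the developer, handles scheduling, grouping by key and moving the work to the nodes that hold the data.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    """Toy driver standing in for the framework (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:                     # scan through the data
        for key, value in map_words(line):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_counts(k, v) for k, v in groups.items())

result = run_job(["big data", "Big Fast Data"])
print(result)  # {'big': 2, 'data': 2, 'fast': 1}
```

Only `map_words` and `reduce_counts` are the developer's part; everything in `run_job` is what the framework would do at scale.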

Big Data Infrastructure - HDFS
Immutable file structure
o Read, write, sync
o No random writes
Storage server used for computation
o Move computation to data
Fault tolerant and easy management
o Built-in redundancy
o Tolerates disk and node failure
o Automatically manages addition/removal of disks
o One operator per 8k nodes
Not a SAN, but high-bandwidth network access to data via Ethernet
Typically used to solve problems not feasible with traditional systems:
o Large storage capacity: > 100 PB raw
o Large I/O and computational bandwidth: > 4k nodes/cluster
o Scale by adding commodity hardware
o MR cluster

Oracle Big Data System

EMC Big Picture

Storage Infrastructure
Storage Efficiency
o Virtualization: Mapping P>V, VM Management
o Performance: In-Memory DB, Auto-Tiering (SSD/HDD)
o Cost Reduction: Thin Provisioning, De-Duplication
o Availability: RAID/Auto-Discover HA, Snapshots, CDP, Cloning, DRS
o Security: Encryption/DLP
Service Efficiency
o Storage-as-a-Service: Service Catalogs by Workload etc.
o Policy Infrastructure: Service Level Attributes
o Service Measurements: Performance Analytics (IOPS/Response Time, Bandwidth)
o Automation: Unified SAN/NAS Protocols, Auto-learning workload forensics, Provisioning to match workloads, Assured auto recovery

Problem: seeks are expensive
CPU & transfer speed, RAM & disk size double every 18-24 months
Seek time is nearly constant (improves only ~5%/year)
Time to read an entire drive is growing
Scalable computing must go at transfer rate
Example: updating a terabyte DB, given: 10 MB/s transfer, 10 ms/seek, 100 B/entry (10 billion entries), 10 kB/page (1 billion pages)
Updating 1% of entries (100 million) takes:
o 1000 days with random B-Tree updates
o 100 days with batched B-Tree updates
o 1 day with sort & merge
To process 100 TB datasets:
o on 1 node: scanning @ 50 MB/s = 23 days
o on 1000 node cluster: scanning @ 50 MB/s = 33 min; MTBF = 1 day
Need a framework for distribution: efficient, reliable, easy to use
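The scanning figures above follow directly from dividing dataset size by aggregate transfer rate. The sketch below reproduces them, assuming decimal units and a dataset split evenly across nodes; the function and constant names are illustrative. It also computes one sequential pass over the 1 TB database at 10 MB/s, which comes out at a little over a day and is of the same order as the sort & merge figure. The B-Tree figures depend on assumptions the slide does not state and are not reproduced here.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12   # decimal units (assumption)
MB = 10**6

def scan_seconds(dataset_bytes, bytes_per_second, nodes=1):
    """Time to stream a dataset once, split evenly across nodes."""
    return dataset_bytes / (bytes_per_second * nodes)

one_node = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)
one_pass = scan_seconds(1 * TB, 10 * MB)

print(f"1 node:     {one_node / SECONDS_PER_DAY:.1f} days")    # 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")               # 33.3 minutes
print(f"1 TB pass:  {one_pass / SECONDS_PER_DAY:.2f} days")    # 1.16 days
```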

New Data and Management Economics

Storage Infrastructure Big Data Targets Value Potential of Using Big Data by Data Intensive Verticals

Key Takeaways
Big data is creating a paradigm shift in the IT industry
o Leverage the opportunity to optimize your computing infrastructure after due diligence in the selection of vendors/products, industry testing and interoperability.
Optimize Big Data analytics for query response time vs. number of users
o Improve query response time for a given number of users (IOPS), or serve more users for a given query response time.
Select automated storage management software
o Data forensics and tiered placement
o Every workload has a unique I/O access signature.
o Historical performance data for a LUN can identify performance skews.
Optimize infrastructure to meet the needs of applications/SLAs
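As one way to picture how historical performance data can expose skew and drive tiered placement, the sketch below picks the smallest set of extents that together served a chosen share of a LUN's past I/O. The data, the extent ids, the 80% threshold and the function name are all hypothetical; real storage management software will use its own metrics and policies.

```python
def hot_extents(io_counts, io_share=0.8):
    """Return the busiest extents that together served `io_share` of all I/O.

    io_counts maps a (hypothetical) extent id to its historical I/O count.
    """
    total = sum(io_counts.values())
    chosen, covered = [], 0
    # Walk extents from busiest to quietest until the share is covered.
    ranked = sorted(io_counts.items(), key=lambda kv: kv[1], reverse=True)
    for extent, count in ranked:
        if covered >= io_share * total:
            break
        chosen.append(extent)
        covered += count
    return chosen

# Made-up history for one LUN: two extents receive most of the I/O.
history = {"e0": 5000, "e1": 3500, "e2": 700, "e3": 500, "e4": 300}
ssd_tier = hot_extents(history)
print(ssd_tier)  # ['e0', 'e1']
```

In this made-up history, two of five extents account for 85% of the I/O, so they would be the candidates for the faster tier.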