Big Data and the Analytic Race. Copyright 2012, SAS Institute Inc. All rights reserved.



What is Big Data?
Big Data: when the volume, velocity and variety of data exceed an organization's storage or compute capacity for accurate and timely decision-making.

The Data Explosion Vast Quantities of Data Are Being Generated Like Never Before

How Big is Big?
[Figure: data size, today versus the future, along four dimensions: volume, variety, velocity and value.]

Big Data Examples
o Each Swipe of a Credit Card
o Each Banking Transaction
o Insurance Claims / Telematics Data / Voice Transcription
o Mobile Devices, Signal and Location Data
o Each Cell Phone Ping, Call, Text; Each RFID Tag
o Each Web Site, Each Blog
o Yahoo Alone Has 200 Petabytes of Data!
They Are All Producing More and More Data.

Just One Small Example: Smart Meters
o Before: Meters Read Once a Month
o Now: Every 15 Minutes is Typical
o 3,000 Times as Much Data
For a Utility With One Million Customers it's Like an IT Shop Tracking CPU for One Million Servers.
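As a rough check of the "3,000 times" figure, the ratio of 15-minute reads to a single monthly read can be worked out directly. This is only a sketch: the 31-day month and the function name are assumptions, not part of the original slides.

```python
def readings_per_month(interval_minutes: int, days_in_month: int = 31) -> int:
    """Number of meter reads in one month at a fixed read interval."""
    minutes_per_day = 24 * 60
    reads_per_day = minutes_per_day // interval_minutes
    return reads_per_day * days_in_month


# One read per month before smart meters; one every 15 minutes after.
reads_before = 1
reads_after = readings_per_month(15)

# Growth in data volume per customer (close to the slide's 3,000x).
growth_factor = reads_after // reads_before

# Total reads per month for a utility with one million customers.
total_reads = reads_after * 1_000_000

print(growth_factor, total_reads)
```

With a 30-day month the factor is 2,880 instead, so "3,000 times" is a reasonable round number either way.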

Smart Grid
But Wait! There's More!
o New Smart Grid Initiative
o Control Down to the Appliance Level
So More Like Tracking 10 Million Servers.

So What?
So There Is Lots of Data Out There. Why Do We Care? What Is The Big Deal About Big Data?

An Analytical Gold Mine
o Tens If Not Hundreds of Millions of Dollars!
o Actionable Information is Buried Deep in the Mountains of Data
o Information About Customers Can Help Retain Them and Help Them Spend More
o Then There is FRAUD

Fraud
o Annual Losses From Credit Fraud Estimated At $48 Billion in 2008!
o Reducing This Fraud by Just 1% Saves $480 Million / Year!
o One Large Bank Estimated Their Losses at $6.4 Billion in One Year

Government Fraud
o Annual Medicare Fraud Estimated Between $60 Billion and $90 Billion per Year
o Reducing Medicare Fraud by Just 1% Saves $600-$900 Million / Year!

The Problem
The Problem Is: How Do You Mine All This DATA?
o Too Big To Store and Use
o Too Slow To Analyze
Or At Least That Is How It Used To Be

TRENDS IN BIG DATA, STORAGE, HADOOP & IN-MEMORY TECHNOLOGY
[Chart: cost per gigabyte and cost per terabyte, 2009 versus today, for Hadoop, Microsoft PDW, Oracle, Greenplum, Teradata and Vertica.]
Cost of Storage, Memory, Computing
o In 2000 a GB of Disk Was $17; Today < $0.07
o In 2000 a GB of RAM Was $1,800; Today < $1
o In 2009 a TB of RDBMS Was $70K; Today < $20K
o 10-Terabyte Files Are Now Reasonable to Move In-Memory

TRENDS IN BIG DATA, STORAGE, HADOOP & IN-MEMORY TECHNOLOGY (continued)
[Chart: cost per gigabyte and cost per terabyte, 2009 versus today, for Hadoop, Microsoft PDW, Oracle, Greenplum, Teradata and Vertica.]
Major Players & Hadoop, 2011 to 2012
o Greenplum MapR (May 11)
o IBM Big Insights (May 11)
o Microsoft and Hadoop (Oct 11)
o SAP Sybase IQ & Hadoop (November 11)
o Oracle & Cloudera Appliance (Jan 12)
o Teradata Partners w. Hortonworks (Feb 12)
o SAS LASR Server on Hadoop (Mar 12)
In-Memory Technology
o SAS HP Solutions
o SAS LASR In-memory Server
o SAP HANA
o Oracle Exalytics
IT needs to be involved in setting up these MPP environments.

Types of Big Data Processing
o Grid
o Clustered Databases
o In-Database Analytics
o In-Memory Analytics

Grid
o Spreading one or more workloads over multiple, possibly heterogeneous servers
o Scheduling, prioritization, redundancy, flexibility
o Shared access to data
o Some large jobs run in parallel on multiple servers
o Mainframe Sysplex

Clustered Database
o Spreads large datasets across many servers
o Spreads database processing across many servers
o Analytical database, not transactional: reads/writes massive amounts of data at once, not individual records
o Examples: Teradata, EMC Greenplum, Hadoop

In-Database
o Regular SQL DBs like Oracle and DB2
o Moving computing into the database instead of pulling the data out
o Useful for scoring massive amounts of data for models

In-Memory
o Now take that distributed data with distributed processing and snap it into memory
o Processing was fast before but now is crazy fast!
o Tens of terabytes can be handled in memory!
» 100 servers with 96 GB each is nearly 10 TB
» Commodity hardware is not nearly as expensive as before
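The "nearly 10 TB" figure is simple arithmetic on the cluster size. The sketch below reproduces it and, as an illustration only, estimates how many such servers a given in-memory data size would need. The function names are hypothetical, and it ignores memory used by the operating system and the analytics software itself.

```python
import math


def cluster_memory_tb(servers: int, gb_per_server: int, binary: bool = False) -> float:
    """Aggregate RAM of a cluster in TB (decimal) or TiB (binary)."""
    gb_per_tb = 1024 if binary else 1000
    return servers * gb_per_server / gb_per_tb


def servers_needed(data_tb: float, gb_per_server: int) -> int:
    """Smallest number of servers whose combined RAM holds data_tb (decimal TB)."""
    return math.ceil(data_tb * 1000 / gb_per_server)


print(cluster_memory_tb(100, 96))               # decimal terabytes
print(cluster_memory_tb(100, 96, binary=True))  # binary tebibytes
print(servers_needed(10, 96))                   # servers for a full 10 TB
```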

In-Memory Results
Business Problem | Data Size and Analysis | Before | In-Memory
Probability of Loan Default | 1 billion rows of data; regression analysis | 11 to 20 hours depending on hardware configuration | Less than 54 seconds
Optimize Response to Marketing Campaign across multiple channels | 100 million rows of historical contact information; 15 million customers; 900 offers; 20 offers per customer; many business rules | 2.5 to 5 hours | Less than 90 seconds
Calculate Credit Risk Exposure across entire bank | 10s of millions of rows of customer data; regression analysis | 167 hours (a week) | 84 seconds
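To put the before and after timings quoted above on a common scale, the sketch below converts each pair into a speedup factor. The short problem labels and the function name are illustrative. Because two of the in-memory times are quoted as "less than", the corresponding factors are lower bounds.

```python
SECONDS_PER_HOUR = 3600


def speedup(before_hours: float, after_seconds: float) -> float:
    """How many times faster the in-memory run is than the original run."""
    return before_hours * SECONDS_PER_HOUR / after_seconds


# (problem, before low hours, before high hours, in-memory seconds)
RESULTS = [
    ("Probability of loan default", 11, 20, 54),
    ("Marketing campaign optimization", 2.5, 5, 90),
    ("Credit risk exposure", 167, 167, 84),
]

for name, low_hours, high_hours, after_seconds in RESULTS:
    low = speedup(low_hours, after_seconds)
    high = speedup(high_hours, after_seconds)
    print(f"{name}: {low:,.0f}x to {high:,.0f}x faster")
```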

In-Memory Difference In Analytics
Bank's Current Process
o 5 hours
o Model lift of 1.6%
o 1 model per day per modeler
o One algorithm (NN)
o 7 iterations of NN training
In-Memory Data Mining
o 3 minutes
o Model lift of 2.5%
o 1 model per 30 minutes, conservatively
o Random Forest, SVM, Logistic and other challenger methods
o More complex network
o 5000 iterations in 70 minutes

Apache Hadoop

WHAT IS THE SCOOP ON HADOOP?
What exactly is Hadoop?
o Think of it as an infinitely expandable filing cabinet that has the ability to help you summarize what is stored in it
o Can store any kind of data in it
o When it gets filled up, just buy more drawers
o It has some nifty built-in space organizers!
o Hadoop is partitioned into compartments (called clusters) that can be used for analysis
o It has its own set of languages
o Basic versions are free to download and use
o Different vendors offer their own custom versions

WHAT IS HADOOP?
o Open Source Software that allows for the distributed processing of large data sets across clusters of commodity computers
o It isn't a database; it is a file system with a parallel programming model.

WHAT IS THE SCOOP ON HADOOP?
Big Internet Companies Are Using Hadoop
o Origins in Google's MapReduce and Google File System (GFS)
o Yahoo Is Largest Contributor
o Amazon Has Its Version (Elastic MapReduce)
o Also Used by eBay, IBM, SAP, Twitter, Netflix, LinkedIn, Apple, AOL, HP, Intuit, Microsoft and SAS, As Well As Many Others
o In 2011 Facebook Had a 30 Petabyte Hadoop Cluster / Yahoo over 200 PB

KEY COMPONENTS OF HADOOP
HDFS
o Stores petabytes of data reliably
o Simple: Just a bunch of disks ~ no RAID
o Reliable and Redundant ~ expect server failure
» Doesn't slow down or lose data even as hardware fails
o Open Source, So Other File Systems Can Be Used
MapReduce
o Allows huge distributed computations
o Batch processing centric
» Hence its great simplicity and scalability, not a fit for all use cases
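The MapReduce model itself can be illustrated without a cluster. The sketch below is a single-process word count in plain Python. It is not Hadoop's API, and all function names are illustrative. Each inner list stands in for one input split that a separate mapper would process, and the grouping step stands in for the shuffle between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(splits: List[List[str]]) -> Dict[str, int]:
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count([["big data big"], ["data race"]])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold each split, which is why the model scales well for batch work and fits poorly for low-latency queries.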

OTHER COMPONENTS OF HADOOP
Hadoop-Related Components
o Pig: Programming Language To Simplify Creation of MapReduce Programs
» Grunt interactive shell
o Hive: SQL-Like Front-End to MapReduce
» Data still stored as sequential files, not database
o HBase: Database Built on HDFS
» Real-time random read/write
» Linearly scalable
» Ironically not SQL
o ZooKeeper: Centralized Service for Distributed Applications

HADOOP ECOSYSTEM & LINGO
[Diagram: the Hadoop stack by layer, distinguishing Core Apache Hadoop from Related Apache Projects.]
o Management: HMS
o Coordination: ZooKeeper
o Programming Languages: Pig (Data Flow), Hive (SQL)
o Computation: MapReduce (Distributed Programming Framework)
o Table Storage: HCatalog (Meta Data), HBase (Columnar Storage)
o Object Storage: HDFS (Hadoop Distributed File System)

WHAT YOU NEED TO KNOW ABOUT HADOOP
o RDBMS Databases = Connectors & adaptors to Hadoop (e.g. Oracle, SAP)
o IBM = Big Insights / Big Sheets
o SAS/Access to Hadoop
o SAS/EDI Server 4.4 provides Hadoop Transformations for Data Integration
o SAS/Metadata & Lineage provide governance of Hadoop data
o SAS Proc Hadoop enables users to intermix MR & Pig in-line with SAS code
o R & Revolution = Bunch of MR Packages for Hadoop
» RHIPE: interface between R and Hadoop
» RHIVE: connects R to Hive (similar to SAS/Access)
» RHadoop: a collection of three R packages
o Mahout = Data Mining for Hadoop
» Main use today ~ Recommender engines (e.g. Amazon)

Managing Hadoop

Where Does IT Fit In?
Business Needs IT To Manage Big Data Systems
Possibly Hundreds or Even Thousands of Nodes In One Big Data Cluster
o Yahoo has 42,000 Hadoop Nodes
o Spread Over 20 Hadoop Clusters
o Holding 200 Petabytes of Data

Management of the Hadoop Hive/Cluster
o How is IT going to manage the workload being dispersed within the Hive?
o Which Hive/cluster components are limiting capacity/performance?
o Where are the bottlenecks or room for increased utilization?
o Are any Nodes/Slaves not used at all or hardly engaged?
o Resource statistics and metrics: where is it all going to come from, and can we get a 3D view?

Operations Monitoring and Reporting
Introducing Ganglia
o Scalable Distributed Monitoring System
o Targeted at monitoring clusters and grids
o View Live or Historic Statistics
o Multicast-based Listen/Announce protocol
o Leverages widely used technologies such as XML for data representation, XDR for compact portable data transport, RRDtool for data storage and visualization
o http://ganglia.sourceforge.net or http://www.ganglia.info

Ganglia Architecture
[Diagram: a web client connects to the Apache web frontend, which reads from GMETAD. The GMETAD daemons poll the GMOND nodes in Cluster 1, Cluster 2 and Cluster 3, with failover polling paths to other nodes in each cluster.]

Ganglia Web Frontend

Ganglia Gmond: Metric Gathering Agent
o Built-in metrics
» Various CPU, Network I/O, Disk I/O and Memory
o Extensible
» Gmetric: out-of-process utility capable of invoking command line based metric gathering scripts
» Loadable modules capable of gathering multiple metrics or using advanced metric gathering APIs
o Built on the Apache Portable Runtime
» Supports Linux, FreeBSD, Solaris and more

Gmond: Metric Gathering Agent (continued)
o Automatic discovery of nodes
» Adding a node does not require configuration file changes
» Each node is configured independently
» Each node has the ability to listen to and/or talk on the multicast channel
» Can be configured for unicast connections if needed
» Heartbeat metric determines the up/down status
o Thread pools
» Multicast listeners: listen for metric data from other nodes in the same cluster
» Data export listeners: listen for client requests for cluster metric data
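A client of the data export listener receives the cluster state as XML, so a report can start from a plain XML parse. The sketch below is illustrative rather than authoritative. The sample document, host names and metric values are made up, the element and attribute names (HOST, METRIC, TN, TMAX) follow the commonly seen gmond output, and the rule "a host is down when the time since its last report exceeds four times the expected interval" is an assumed threshold, not something stated in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML a gmond data export listener might return.
SAMPLE_XML = """
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="hadoop-demo">
    <HOST NAME="node1" IP="10.0.0.1" TN="5" TMAX="20">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_free" VAL="812344" TYPE="float" UNITS="KB"/>
    </HOST>
    <HOST NAME="node2" IP="10.0.0.2" TN="300" TMAX="20">
      <METRIC NAME="load_one" VAL="0.00" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def summarize_hosts(xml_text: str, stale_factor: int = 4) -> dict:
    """Return {host: {"up": bool, "metrics": {name: value}}} from gmond-style XML.

    TN is read as seconds since the host last reported and TMAX as the longest
    expected gap between reports; stale_factor is an assumed tolerance.
    """
    root = ET.fromstring(xml_text.strip())
    hosts = {}
    for host in root.iter("HOST"):
        tn = int(host.get("TN"))
        tmax = int(host.get("TMAX"))
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        hosts[host.get("NAME")] = {"up": tn <= tmax * stale_factor, "metrics": metrics}
    return hosts


hosts = summarize_hosts(SAMPLE_XML)
for name, info in hosts.items():
    print(name, "up" if info["up"] else "down", info["metrics"])
```

In a live setup the XML would be read from the gmond port over TCP instead of from a string; the parsing step stays the same.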

The Hadoop Datamart

Case Study: SAS ITRM / Hadoop
Hadoop Based Environment
o Windows: ITRM Server (Metadata, Mid-Tier, Client)
o LINUX: Hadoop Master Node
o LINUX: Hadoop Slave/Nodes
SAS/ITRM Custom Adapter to Integrate Hardware Metrics with Hadoop Performance Information


Case Study: Hadoop / Ganglia Adapter for SAS/IT Resource Management
Comprehensive Management of the Hadoop Environment with a Field ITRM Adapter Providing:
o Analysis
o Reporting
o Metrics reporting from Ganglia
Hadoop Based Environment
o Windows: ITRM Server (Metadata, Mid-Tier, Client)
o LINUX: Hadoop Master Node
o LINUX: Hadoop Slave/Nodes
SAS/ITRM Custom Adapter to Integrate Hardware Metrics with Hadoop Performance Information
o ITRM adapter for Hadoop logs
o ITRM Datamart based on Hadoop data
o Integrated LINUX/UNIX OS performance metrics: SAR or Ganglia
o Analysis routines for both memory and storage based Hadoop environments
o Reports for both engineer and senior management

Accessing RRD Files: RRDtool Adapter
o The RRDtool adapter is a new adapter for ITRM 3.3
o Reads any data from an RRD that has been created using the RRDtool software
o Creates a Stage Table based on the contents of the user's RRDs
o Reads the data even if it has been consolidated
o Will read a single round-robin database, or will read all round-robin databases in a directory
o If multiple round-robin databases are read, the data will be combined into a single staging table
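The idea of combining several round-robin databases into one staging table can be sketched as below. This is not the ITRM adapter and does not call RRDtool. The in-memory dictionaries are a hypothetical stand-in for data already fetched from each RRD, and the file names, metric names and row layout are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical data already read from two RRD files: a start time (epoch
# seconds), a fixed step, and one value per step (None marks an unknown sample).
DATABASES: Dict[str, dict] = {
    "node1_cpu.rrd": {"metric": "cpu_user", "start": 1000, "step": 300,
                      "values": [10.0, None, 12.5]},
    "node2_cpu.rrd": {"metric": "cpu_user", "start": 1000, "step": 300,
                      "values": [20.0, 21.0]},
}


def build_staging_table(databases: Dict[str, dict]) -> List[dict]:
    """Flatten every database into rows of one table, tagged with their source."""
    rows = []
    for source in sorted(databases):
        db = databases[source]
        for index, value in enumerate(db["values"]):
            if value is None:
                continue  # skip unknown samples rather than inventing a value
            rows.append({
                "source": source,
                "metric": db["metric"],
                "timestamp": db["start"] + index * db["step"],
                "value": value,
            })
    # Order by time, then source, so rows from different files interleave.
    rows.sort(key=lambda row: (row["timestamp"], row["source"]))
    return rows


staging = build_staging_table(DATABASES)
for row in staging:
    print(row)
```

Keeping the source file as a column preserves which node each sample came from once the rows are merged.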

Map & Reduce Tasks

HDFS Data Load
o Grid1/Slave: 4 Core
o Grid3/Master: 8 Core


Analytics: From Zero to Insight

TRADITIONAL ANALYTICS LIFECYCLE
[Diagram: a cycle of eight stages with the roles responsible for them.]
Stages
o Identify / Formulate Problem
o Data Preparation
o Data Exploration
o Transform & Select
o Build Model
o Validate Model
o Deploy Model
o Evaluate / Monitor Results
Roles
o BUSINESS MANAGER: Domain Expert, Makes Decisions, Evaluates Processes and ROI
o BUSINESS ANALYST: Data Exploration, Data Visualization, Report Creation
o IT SYSTEMS / MANAGEMENT: Model Validation, Model Deployment, Model Monitoring, Data Preparation
o DATA MINER / STATISTICIAN: Exploratory Analysis, Descriptive Segmentation, Predictive Modeling

PREDICTIVE ANALYTICS TECHNIQUES - EXAMPLES
o Statistics: A lot of math in the tools available for analytics; methods applied to business problems.
o Data Mining Models: Which products are customers likely to buy? Which workers are likely to quit/resign/be fired?
o Text Models: What are people saying about my products and services? Can I detect emerging issues from customer feedback or service claims?
o Forecasting Models: How many products will be sold this year, next year? How does this break down into each product over the next 3 months, 6 months?
o Operations Research (PRESCRIPTIVE): What is the least cost route for transporting goods from warehouses to final destinations?

EXTERNAL VIEWPOINT CHALLENGES IN ANALYTICS ADOPTION Source: The Current State of Business Analytics: Where Do We Go From Here? Prepared by Bloomberg Businessweek Research Services, 2011

WHAT SHOULD I DO NEXT?
5 STAGES OF ANALYTIC MATURITY
o STAGE 1, ANALYTICALLY IMPAIRED: Limited interest in analytics by senior management
o STAGE 2, LOCALIZED ANALYTICS: Line-of-business management are driving analytics momentum
o STAGE 3, ANALYTICAL ASPIRATIONS: Senior executives are committed to analytics; resources are aligned to create a broad analytics capability
o STAGE 4, ANALYTICAL COMPANY: Enterprise-wide analytics capability is under development as a corporate priority
o STAGE 5, ANALYTICAL COMPETITOR: Organization is routinely reaping the benefits of enterprise-wide analytics; focus on continuous improvement
Modernization Analytic Assessment
Source: Davenport, Thomas and Harris, Jeanne. Competing on Analytics: The New Science of Winning
Positioning HPA

High Performance Analytic Trends
o Commercial Support and Acceptance for R and Open Source software
o HPA offerings (Oracle, SAS, IBM, Revolution Analytics, SAP (HANA), RapidMiner, Apache Mahout)
o Partnerships (ex: Tableau and Cloudera, Revolution Analytics and Cloudera or Netezza, SAS & Teradata, SAS & EMC Greenplum, Teradata and Alpine Miner, Microsoft and Hortonworks)
o Everything to Everyone
» Data Visualization vendors become Big Data vendors
» Hardware and appliance makers become Analytics experts
» Product stacks become blurry: BI vs. Data Viz vs. Big Data vs. Analytics
o FREE Software (R, Hadoop)

High Performance Analytics Vendors
o IBM (InfoSphere Big Insights, InfoSphere Streams, SPSS, Cognos, Netezza)
o SAS (High Performance Analytics / In-Memory Visual Analytics / Access to Hadoop)
o Oracle (Exalytics, Advanced Analytics, OBIEE, Business Applications)
o SAP (HANA, Business Objects, Business Applications)
o Microsoft (Microsoft SQL Server 2012, PowerPivot)
o MicroStrategy (MicroStrategy 9.x)
o QlikTech (QlikView 10)
o Tableau (Server and Desktop Edition)
o Tibco (Spotfire Professional Edition and Server)
o Revolution Analytics (RevoScale R, Netezza, Cloudera)

References
o APR (http://apr.apache.org/)
o libconfuse (http://www.nongnu.org/confuse/)
o expat (http://expat.sourceforge.net/)
o pkg-config (http://www.freedesktop.org/wiki/software/pkg-config)
o python (http://www.python.org/)
o PCRE (http://www.pcre.org/)
o RRDtool (http://oss.oetiker.ch/rrdtool/)
o Hadoop: The Definitive Guide, 3rd Edition
o SAS High Performance Analytics and Visual Analytics (http://www.sas.com/reg/gen/corp/1909596?gclid=cowh-8D8srECFUXc4Aod9wMAXw)

Questions