Big Data and Advanced Analytics Technologies and Use Cases"



Similar documents
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

The Future of Data Management

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Peninsula Strategy. Creating Strategy and Implementing Change

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Data Technologies Compared June 2014

Big Data and Its Impact on the Data Warehousing Architecture

Native Connectivity to Big Data Sources in MSTR 10

Big Data Multi-Platform Analytics (Hadoop, NoSQL, Graph, Analytical Database)

TECHNOLOGY TRANSFER PRESENTS MIKE FERGUSON JUNE 3-4, 2015 JUNE 5, 2015 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

Modernizing Your Data Warehouse for Hadoop

WHAT S NEW IN SAS 9.4

TRANSFORM BIG DATA INTO ACTIONABLE INFORMATION

Big Data and Data Science: Behind the Buzz Words

2015 Ironside Group, Inc. 2

Il mondo dei DB Cambia : Tecnologie e opportunita`

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

The Future of Data Management with Hadoop and the Enterprise Data Hub

Big Data Analytics - Accelerated. stream-horizon.com

Architectures for Big Data Analytics A database perspective

Next-Generation Cloud Analytics with Amazon Redshift

Tap into Hadoop and Other No SQL Sources

Please give me your feedback

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Intro to BI. Mul0- dimensional Analysis

BIG DATA TRENDS AND TECHNOLOGIES

In-memory computing with SAP HANA

Introducing Oracle Exalytics In-Memory Machine

BIG DATA APPLIANCES. July 23, TDWI. R Sathyanarayana. Enterprise Information Management & Analytics Practice EMC Consulting

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

The Inside Scoop on Hadoop

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

How To Scale Out Of A Nosql Database

Luncheon Webinar Series May 13, 2013

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

TECHNOLOGY TRANSFER PRESENTS MIKE FERGUSON BIG DATA MULTI-PLATFORM JUNE 25-27, 2014 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Bringing the Power of SAS to Hadoop. White Paper

Integrating Salesforce Using Talend Integration Cloud

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Technology Innovations for Enhanced Database Management and Advanced BI

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Bringing Big Data to People

IBM Big Data Platform

EMC/Greenplum Driving the Future of Data Warehousing and Analytics

Big Data Architectures. Tom Cahill, Vice President Worldwide Channels, Jaspersoft

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

WHITE PAPER. Data Migration and Access in a Cloud Computing Environment INTELLIGENT BUSINESS STRATEGIES

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

A Big Data Storage Architecture for the Second Wave David Sunny Sundstrom Principle Product Director, Storage Oracle

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Performance and Scalability Overview

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Cost-Effective Business Intelligence with Red Hat and Open Source

SAP Real-time Data Platform. April 2013

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

I/O Considerations in Big Data Analytics

Getting Started Practical Input For Your Roadmap

Einsatzfelder von IBM PureData Systems und Ihre Vorteile.

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

HDP Enabling the Modern Data Architecture

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

TECHNOLOGY TRANSFER PRESENTS OCTOBER OCTOBER RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

<Insert Picture Here> Big Data

Big Data Success Step 1: Get the Technology Right

Proact whitepaper on Big Data

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Ganzheitliches Datenmanagement

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Welcome to The Future of Analytics In Action IBM Corporation

Data processing goes big

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Open Source Business Intelligence Intro

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Reference Architecture, Requirements, Gaps, Roles

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Market Overview: Big Data Integration

WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING

Big Data and the SAP Data Platform Including SAP HANA and Apache Hadoop Balaji Krishna, SAP HANA Product Management

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

TRAINING PROGRAM ON BIGDATA/HADOOP

Extend your analytic capabilities with SAP Predictive Analysis

HDP Hadoop From concept to deployment.

System Architecture. In-Memory Database

So What s the Big Deal?

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Transcription:

Big Data and Advanced Analytics Technologies and Use Cases" Colin White President, BI Research DAMA Portland February 2013" Agenda There is considerable interest at present on the topic of big data. Much of the discussion about this topic, however, is focused on the technology supporting big data, rather than on how analytics generated from big data can be leveraged for business benefit. One of the most exciting aspects of big data technology is that it allows organizations to support advanced analytic workloads and applications that were not previously possible for cost or performance reasons. New and evolving big data solutions provide significant business benefits because they help remove these cost and performance barriers. The objectives of this presentation are to discuss the benefits of big data and to present use cases and case studies that demonstrate the value of advanced analytics. It also explains how the existing data warehousing environment can be extended to support big data solutions. Topics that will be covered include: Review the history and evolution of big data and advanced analytics Explain the role of the data scientist in developing advanced analytics Look at the technologies that support big data Explain how the existing data warehousing environment can be extended to support big data and advanced analytics Discuss big data use cases and the benefits they bring to the business 2

The Evolution of Digital Data First OLTP systems Early decision support products First commercial RDBMSs Early data warehousing Big data & advanced analytics 2012 Sabre 84,000 txs/day Increasing Data Volumes Sabre 60,000 txs/sec 25 TB EDW 3 Data Growth: Choose an Analyst! 4

Data Growth: Multi-Structured Data Definition: data that has unknown, ill-defined or overlapping schemas Machine generated data, e.g., sensor data, system logs Internal/external web content including social computing data Text, document and XML data Graph, map and multi-media data Volume increasing faster than structured data Usually not integrated into a data warehouse Increasing number of analytical techniques to extract useful information from this data This information can be used to extend traditional predictive models and analytics 5 Data Growth: Big Data Big data technologies apply to all types of digital data not just multi-structured data Big is a relative term and is different for each organization and application What you do with big data and how you use it for business benefit should be the main consideration analytics play a key role here 6

The Value of Data: IBM 2012 Study 7 The Value of Data: IBM 2012 Study 8

The Changing World of BI Analytics Advanced Analytics Improved analytic tools and techniques for statistical and predictive analytics New tools for exploring and visualizing new varieties of data Operational intelligence with embedded BI services and BI automation Data Management Analytic relational database systems that offer improved price/performance and libraries of analytic functions In-memory computing for high performance Non-relational systems such as Hadoop for handling new types of data Stream processing/cep systems for analyzing in-motion data 9 Advanced Analytics Example: SC Digest 2012 10

Advanced Analytics Example: SC Digest 2012 1. Descriptive analytics using historical data to describe the business. This is usually associated with Business Intelligence (BI) or visibility systems. In supply chain, you use descriptive analytics to better understand your historical demand patterns, to understand how product flows through your supply chain, and to understand when a shipment might be late. 2. Predictive analytics using data to predict trends and patterns. This is commonly associated with statistics. In the supply chain, you use predictive analytics to forecast future demand or to forecast the price of fuel. 3. Prescriptive analytics using data to suggest the optimal solution. This is commonly associated with optimization. In the supply chain, you use prescriptive analytics to set your inventory levels, schedule your plants, or route your trucks. Advanced Analytics in Supply Chain, Dr. Michael Watson, Supply Chain Digest, November 2012 11 The Role of the Data Scientist: CITO Interviews Data scientists turn big data into big value, delivering products that delight users, and insight that informs business decisions. Strong analytical skills are a given: above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Daniel Tunkelang, Principal Data Scientist, LinkedIn A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product. Hilary Mason, Chief Scientist at bitly... someone who has the both the engineering skills to acquire and manage large data sets, and also has the statistician s skills to extract value from the large data sets and present that data to a large audience. John Rauser, Principal Engineer, Amazon.com Source: citoresearch.com/content/growing-your-own-data-scientists 12

Data Science Skills Requirements Business domain subject matter expert with strong analytical skills Creativity and a good communications Knowledgeable in statistics, machine learning and data visualization Able to develop data analysis solutions using modeling/analysis methods and languages, such as MapReduce, R, SAS, etc. Adept at data engineering, including discovering and mashing/blending large amounts of data Modeling & analysis skills Business expertise Data engineering skills Is this one person or a team of specialists? 13 Data Science: Further Reading 14

What Then is Big Data? Represents analytic and data management solutions that could not previously be supported because of: Technology limitations poor performance, inadequate analytic capabilities, etc. High hardware and software costs Incomplete or limited data for generating the required solutions Set of overlapping technologies that enable customers to deploy analytic systems optimized to suite specific business needs and workloads Optimization may involve improving performance, reducing costs, enabling new types of data to be analyzed, etc. 15 Big Data Application Examples Source: Microsoft 16

Big Data and Data Life Cycle Management Data Management and Analytic Performance Capacity planning Managing data warehouse growth Analytic performance management and optimization Service level agreements Data Governance Security: user access, encryption, masking, etc. Quality: governed/ungoverned data Backup and recovery Archiving and retention: historical analysis, compliance 17 The Impact of Big Data on the Data Life Cycle Need fast time to value to quickly gain business benefits from big data Impractical to use traditional EDW approach for all analytic solutions Extend existing data warehousing environment to support big data and accommodate data growth Need high performance solutions for supporting big data analytic workloads One-size fits all data management is no longer viable Match technologies and costs to business needs and analytic workloads Need improved data governance to handle big data No longer practical to rigidly control and govern all forms of data implement different levels of governance based on security, compliance and quality needs Determine data archiving and policies based on the possible future need to analyze historical data and data compliance requirements 18

The Extended (or Logical) Data Warehouse machine data (sensor events, web logs, ) Embedded* BI*services* Opera&onal* systems* Enterprise* data*warehouse* Data* marts* structured operational data Data* cubes* Real-time analytics Stream* processing/* CEP*system* filtered data Analy&c*RDBMS*or* Non>rela&onal*system* Optimized analytic hardware/software platform structured & multi-structured data from other internal & external sources Analytics accelerator Data hub Investigative computing platform Built-for-purpose LOB application (near-real-time analytics or hybrid analytical/operational processing) IMPROVE EXTEND 19 Beyond the EDW: Optimized Platforms 20

Optimized Analytic Platforms: Variables Analytics Required to Meet Business Needs Complexity reporting, OLAP or advanced analytics Agility latency of data, analytics, decisions, recommendations and actions Workload mix complexity of overall analytic workload; concurrent data modification Data Required to Meet Business Needs Volume amount of data to be managed Velocity rate of data generation or change Variety types of data to be managed Complexity number of data sources and relationships; quality and structure of data 21 Optimized Platforms: Analytic RDBMSs Extend traditional RDBMSs with features designed specifically for analytic processing and new analytic techniques Hardware exploitation Parallel computing New data types New storage structures Data compression Support for hybrid storage Intelligent workload management In-memory data In-memory analytics In-database aggregation & analytics 22

Analytic RDBMSs: Hardware Exploitation Faster processors Multi-core processors Intelligent hardware 64-bit memory spaces Large-capacity disk drives Fast hard-disk and solid-state drives Hybrid storage configurations Scale-up/out parallel processing configurations Lower-cost hardware (blades, clusters) Reduced power and cooling requirements Packaged hardware/software appliances 23 Analytic RDBMSs: New Storage Structures DBMS vendors are enabling new physical storage structures to improve performance, reduce storage requirements and support new types of analyses Examples: compressed columnar, XML, time-series, multi-media These enhancements and their implementation vary by vendor It is important to recognize that physical storage structures should be independent of the logical data model and the data manipulation language (DML) True for the relational model and SQL Often not true for non-relational systems SQL application Relational mapping layer (optimizer) Data manager Storage subsystem 24

Analytic RDBMSs: Data Storage Options Large-capacity hard-disk drives (HDD) More economical, less reliable, slower performance, e.g., SATA drives in white-box H/W Often short-stroked to improve performance High-performance hard-disk drives (HDD) More expensive, more reliable, better performance, e.g., enterprise SAS drives Solid-state drives (SSDs) High and consistent performance Better reliability and more energy efficient Distinguish between commodity SSDs and enterprise SSDs Dynamic RAM (DRAM) Best performance - eliminates I/O overheads Use by in-memory computing systems 25 Memory versus Storage Memory Data is directly addressable by a CPU via a memory bus Eliminates I/O overhead and provides fast access to data Types of solid-state memory: Storage! Processor cache(s) very fast, volatile data! Dynamic RAM fast (nanoseconds), volatile data Data is addressable via a device interconnect or network protocol Several types of storage for persisting data:! Commodity HDD: high capacity, less reliable, low cost (e.g., SATA)! Enterprise HDD: more reliable, higher cost (e.g., SAS)! NAND flash memory devices: fast, very reliable, high cost (e.g., PC SSD, enterprise SSD, flash storage array, hybrid SSD/HDD) 26

What is In-Memory Computing? A workload where all the data being processed is stored in a computer memory that is directly addressable via the CPU s memory bus Provides high-speed performance for OLTP and BI workloads by eliminating I/O to storage devices Especially beneficial for interactive and iterative BI analytic workloads Several types of BI-related in-memory computing 27 The Changing World of BI Analytics Advanced Analytics Improved analytic tools and techniques for statistical and predictive analytics New tools for exploring and visualizing new varieties of data Operational intelligence with embedded BI and BI automation Big Data Management Analytic relational database systems that offer improved price/performance and libraries of analytic functions Non-relational systems such as Hadoop for handling new types of data Stream processing/cep systems for analyzing in-motion data In-memory analytics In-memory data 28

Why In-Memory Computing for BI Analytics? Benefits Technology answer: Improved speed and performance, e.g., quickly run complex analyses on the fly Business answer: What if you could do...? e.g., real-time fraud detection Considerations Types of in-memory data and in-memory analytics and their benefits Relationship to in-database processing, e.g., in-database aggregation and indatabase analytics 29 In-Memory Data Analytic server or application One or more nodes Relational DBMS server In-memory data RAM Virtual cubes Pinned tables Cache/bufferpool In-memory database Local RDBMS database HDD & NAND flash devices Detailed & pre-aggregated data HDD & NAND flash devices & RAM On-storage data Remote or SAN data 30

In-Memory Database Systems: Vendor Examples Relational DBMSs EXASOL EXASolution IBM soliddb and Informix Warehouse Accelerator Kognitio Analytical Platform Microsoft xvelocity and Hekaton Oracle TimesTen SAP HANA and Sybase ASE VoltDB Many vendors also support pinned tables Non-Relational DBMSs Memcached Multi-Dimensional DBMSs IBM Cognos TM1 In-Memory Data Grids Cloud platforms from Amazon, Google, IBM, Microsoft, VMware, etc. 31 In-Memory Data: Important to Note 32

In-Database Technologies In-Database Aggregation Some RDBMSs support pre-aggregation to enhance BI performance Should be transparent to user optimizer decides when to use aggregate Various names materialized views, materialized query tables, etc. In-Database Analytics Brings the processing to the data rather than the data to the processing Consists primarily of predefined analytic functions created by RDBMS vendor, third-party vendor, open source community, user developed 33 In-Database Analytic Functions Analytical functions stored in an RDBMS offer several benefits Users (e.g., data scientists) only need to understand what a function does and how to use it - they do not need to know how to develop the functions Functions can exploit the parallel processing capabilities of an RDBMS moves the processing to the data rather than the data to the processing Important to understand the level of parallel processing and how a function is run, e.g., external to the RDBMS, in RDBMS protected memory, etc. Several approaches to using in-database functions RDBMS built-in functions (arithmetic, string, date, statistical functions) Functions provided by a 3rd-party vendor, e.g., FuzzyLogix Open source functions, e.g., Apache Mahout, R Development options for creating user-defined functions (scripting language, Java, C++, SQL MapReduce, etc.) 34

Analytic RDBMSs: Vendor Examples Traditional RDBMS Products IBM DB2: PureData for Operational Analytics IBM Informix: Ultimate Warehouse Edition (with Warehouse Accelerator) Microsoft SQL Server: Parallel Data Warehouse Oracle Database: Exadata SAP Sybase ASE and IQ Teradata Database: Active EDW, DW Appliance, Extreme Data Appliance, etc. Other Solutions EMC Greenplum Database and Distributed Computing Appliance HP Vertica Analytics Platform IBM Netezza: PureData for Analytics, DB2 Analytics Accelerator Kognitio Analytical Platform Oracle Exalytics (Oracle TimesTen & Oracle Essbase) ParAccel Analytic Platform SAP HANA Teradata Aster Database: MapReduce Platform, Big Data Analytics Appliance InfoBright, MySQL, PostgreSQL, etc. 35 Data Warehouse DBMSs: Gartner 2013 MQ 36

Optimized Platforms: Non-Relational Systems - 1 Several Internet companies developed their own non-relational (NoSQL or NewSQL) systems to support extreme data volumes Google example: Google file system, MapReduce, BigTable, BigQuery Main goal was the processing of large volumes of multi-structured data Several of these developments found their way into the open source community Non-relational systems are not new, but modern versions are often open source Deployed on low-cost white-box hardware in a large-scale distributed computing environment Several types of systems & data stores Key industry focus area is Hadoop 37 Optimized Platforms: Non-Relational Systems - 2 Many types of products, APIs and languages Volume Complexity Key/Value Pair Column Family Document Graph Can handle varieties of data and processing that are difficult to support using a traditional RDBMS 38

Optimized Platforms: Non-Relational Systems - 2 Many types of products, APIs and languages Volume Complexity Key/Value Pair Column Family Document Graph Hadoop HDFS Amazon Dynamo Redis Riak Oracle NoSQL DB Google BigTable HBase Cassandra MS Azure Tables CouchDB MongoDB Neo4J Google Freebase 39 A framework for running applications on a large hardware cluster built of commodity hardware. wiki.apache.org/hadoop/ Provides a distributed file system (HDFS) that stores data across the nodes of the cluster to provide high performance Includes a programming model called MapReduce (MR) where the processing is divided into small fragments of work that can be executed on any node in the cluster Hive and Pig are high-level languages for MR development Related components include HBase, Sqoop, HCatalog, Flume, Storm, Mahout, Impala, etc. Major distributions come from Apache, Cloudera, Hortonworks and MapR HCatalog Hive Pig MapReduce (MR) Sqoop HBase Hadoop Distributed File System (HDFS) 40

Components 41 The Hadoop Ecosystem Source: Hortonworks 42

Data Management: Hadoop Option Distributions Apache: Hadoop, HBase, Hive, Pig, Sqoop, Cassandra, Mahout Cloudera: Enterprise Free (CDH), Enterprise Core, RTD for HBase, RTQ for Impala HortonWorks Data Platform (includes Talend) MapR: M3 (free), M5, M7 Other solutions EMC Greenplum: HD, Isilon NAS for HD, HD Distributed Computing Appliance (DCA)* Hadapt Adaptive Analytical Platform* HP AppSystem for Apache Hadoop IBM InfoSphere BigInsights Basic and Enterprise Editions Microsoft HDInsight Server & Service Oracle Big Data Appliance SAS High-Performance Analytics Server Teradata Aster Big Data Analytics Appliance* Hive HCatalog Pig MapReduce (MR) Hadoop Distributed File System (HDFS) Impala HBase * Hybrid Hadoop/RDBMS system 43 Data Management: Relational vs Non-Relational Given the number of options and a fast changing marketplace comparisons are difficult Focus is on analytic RDBMSs versus Hadoop HDFS DBMS versus file system, which is an apples to oranges comparison At a high-level, an analytic DBMS is suited to complex interactive workloads and Hadoop HDFS for batch processing of multi-structured From an Hadoop perspective, HBase is becoming more important, but lack of SQL support is an inhibitor Hive support for HBase is in development but it still uses batch Map/ Reduce (MR) Cloudera is developing Impala which supports Hive SQL syntax but eliminates MR supports both HDFS and HBase Workload suitability and performance are important, but development and administration effort, and tools support are also key considerations 44

Data Management: Language Considerations 45 Optimized Platforms: Stream Processing/CEP streaming data (e.g., web or sensor events) Stream* processing/* CEP*engine* Fraud detection Network/smart grid optimization Equipment failure prediction models & rules Analy&c* applica&ons* analytics filtered data Analy&c* RDBMS* analytics filtered data Analy&c* applica&ons* Non>rela&onal* system* models & rules Analy&c* applica&ons* analytics rules & models Embedded*BI* Operational systems services* analytics, models & rules from EDW Operational systems 46

Optimized Platforms: Cloud Option Two approaches: Virtual machine image may be provided by the user, a DBMS vendor, or a cloud vendor (e.g., Amazon) Database as a Service (DBaaS) offered by a public cloud vendor Relational DBMS DBaaS Amazon Redshift (ParAccel) and Relational Database Service (MySQL, Oracle, MS SQL Server) Google Cloud SQL (MySQL) and BigQuery Service HP Cloud Relational Database for MySQL Kognitio Cloud Microsoft Azure SQL Database Oracle Cloud Non-Relational DBMS DBaaS Amazon DynamoDB, SimpleDB, ElastiCache Microsoft Azure Tables and Blob storage SalesForce database.com 47 Data Management in the Cloud Example Netflix s on-premises IT infrastructure was too fragile and the traditional operations model didn't respond fast enough to business needs Rapidly growing and highly variable data-center requirements Inability to automate data-center operations As a result the company migrated from an on-premises environment to an Amazon Web Services infrastructure It evaluated how the new environment would affect the IT infrastructure and redesigned applications as appropriate Spreads its processing across many different Amazon data centers and regions to enhance reliability and availability Different service environments are randomly taken offline to confirm that the environment can continue operating in the face of a resource failure Conclusion: Netflix changed its approach because it recognized that the future of its business required a different way of doing things Source: Netflix (slideshare.net/adrianco/netflix-in-the-cloud-at-sv-forum) and CIO Magazine 48

Summary: Big Data Benefits Traditional Decision-Making Environment (determine and analyze current business situation) Integrated data sources Structured data Aggregated and detailed data (with limits) Relational EDW with at rest data Dimensional cubes/marts with at rest data One-size fits all data management Reporting and OLAP Dashboards and scorecards Structured navigation (drill, slice/dice) Humans interpret results, patterns and trends Manual analyses, decisions and actions Big Data Extensions (provide more complete answers, predict future business situations, investigate new business opportunities) Virtualized and blended data sources Multi-structured data Large volumes of detailed data (no limits) Non-relational stores with at rest data Streaming/CEP systems with in motion data Flexible & optimized data management Advanced analytic functions & predictive models Sophisticated visualization of large result sets Flexible exploration of large result sets Sophisticated trend and pattern analysis Analytics & model-driven recommendations & actions 49 Choosing the Right Solution Organizations will likely use multiple analytic solutions and data management systems the challenge is deciding which to use when and how to interconnect the systems 50

Use Cases and Application Examples Use Case Application Example Real-Time Monitoring & Analytics In-line fraud detection to reduce financial losses caused by stolen credit cards Near-Real-Time Analytics Next best customer offer to the channel to increase customer satisfaction & reduce churn Data Integration Hub Analytics Accelerator New LOB Analytic Application Collect and manage all sales-related detailed data (POS, web, supply chain) for down stream analysis Offload & boost the performance of selected financial analyses to increase satisfaction/retention of key clients Manage & monitor spot buying on web advertising exchanges Investigative Computing Platform Evaluate the effectiveness of different social computing channels 51 Software Selection: Some Key Options Data management BI Access Integrate Manage Manage Collaborate Analyze Publish Decide Act Data access - languages & APIs - connectors - data virtualization - search Data integration - modeling - profiling - cleansing - transforming - loading Data manager - relational - non-relational - stream processing/ CEP - in-memory - cloud-based Reporting - batch - interactive Multidimensional analysis - ROLAP - MOLAP Advanced analytics - data mining - machine learning - advanced functions - data exploration - data visualization - operational intelligence 52

Use Cases and Technologies Use Case Stream Processing/ CEP system Embedded BI Services Enterprise Data Warehouse Analytic Relational DBMS Hadoop System Real-Time Monitoring & Analytics Near-Real-Time Analytics Data Integration Hub Analytics Accelerator New LOB Analytic Application Investigative Computing Platform 53 Example Telco Provider: Real-Time Embedded BI IVR Chat Session Web Email Mobile Apps Call Voice Center SMS 2 DM receives list of candidate marketing offers from EMM 3 Optionally EMM calls out to SPSS to help determine candidate offers 4 5 Decision Management determines NBA from: Marketing offers (EMM) Service Problems Billing Information Location Service Issue Issue Resolution Dispute Satisfaction Account Management Advice Self Service Channel Match Agent Match etc. 1 Demographic (DB, surveys) Request for Next Best Action (NBA) from channel Next Best Action delivered to the customer through the appropriate channel Business Rules Optimization Big Data Platform Predictive Analytics Hadoop Core Database Text Analytics Interactions (Call center, Web) Decision Services Stream Computing Entity Analytics Data Warehouse Behavioral (Orders, Payments) Enterprise Marketing Management Cross-channel Campaign Management Real Time Marketing Information Integration & Governance Enterprise Content Management Attitudinal (Surveys, Social / CCI) Source: IBM 54

Example $43 billion retail organization with over 4,000 stores (Sears and Kmart) Numerous legacy systems with applications written in COBOL and Assembler (over 100 million lines) Running out of capacity but at the current cost of $3K-$7K per MIP per year another solution had to be found Requirements: Cost effectively manage increasing data volume Reduce the number of data warehouses and ETL jobs Reduce analytical processing times and provide intra-day analytics Capture and store all detailed transaction data (POS data, web clicks, supply chain events, etc.) for analysis Limit changes to existing user interfaces Primary source: Presentation by Dr. Phillip Shelley (CTO Sears Holdings and CEO MetaScale) at the Hadoop Summit, June 2012 55 Example Solution: Hadoop Data Hub and Analytics Accelerator Enhanced pricing application Issue: only about 10% of the sales data is in the EDW; pricing models were taking taking 8 weeks to setup and run Hadoop MapReduce solution analyzes price elasticity based on 100% of the sales data Pricing models can now be run weekly (or daily if required) Improved customized offers to loyal customers Issue: existing system was not scalable; only a small subset of the data could be analyzed Replaced 6,000 COBOL application with 400 lines of Pig and Java UDFs; implemented in 6 weeks Application can now be run multiple times per day per store per line item per customer reduces impact of competitors such as Amazon 56

Example Solution: Hadoop Data Hub and Analytics Accelerator /cont. Reduce time to run batch BI applications Existing mainframe window of 3.5 hours was becoming insufficient to run 64 batch pricing jobs against 500 million rows of data Batch jobs were rewritten in Pig and run on data FTP d to Hadoop from the mainframe and results FTP d back to the mainframe Jobs run 100% faster (run in 8 minutes; FTP is the main overhead) Reduce time to run batch and interactive BI applications Existing batch and interactive BI applications were taking too long to run and could handle only a subset of the data Data from over 50 sources now stored and analyzed on Hadoop Datameer is used for analytics Pig is used for ETL and for creating output for Excel 57 Example Conclusions Pleased with Hadoop s ability to run enterprise workloads enables detailed data to be stored and analytics to be run that were not previously possible Hadoop is only one component of the BI/DW ecosystem and strategy Hadoop requires significant education and implementation effort and is lacking tools for enterprise integration New Sears subsidiary (MetaScale) formed to help enterprises integrate existing systems with Hadoop because 75% of CEOs and CIOs don t even know what Hadoop is 58

Example: International Bank Trading Desk Solution: Analytic RDBMS as an Analytics Accelerator This international bank offers a wide range of services to its over 40 million customers One of the bank s trading desks uses the appliance for an analytic solution that handles the ad hoc analysis of billions of rows of detailed loan/bond data Month-end loading was reduced from days to 2 hours Key customer queries were reduced from 3-4 days to about 7 minutes Appliance was treated as a black box by the IT group for compliance reasons 59 Example Solution: Analytic RDBMS for New LOB Application MediaMath is a leader in the billion dollar display advertising business Provides a platform called TerminalOne that enables ad agencies and large-scale advertisers to identify, bid on, buy, and optimize ad impressions Automatically matches each impression in real time with ads that are meaningful and relevant to users Analyzes upwards of 15 billion ad impressions a day and calculates the fair value of more than 50,000 ads/sec Source: LUMA Partners 60

Example Solution: Hadoop System for Investigative Computing Phase 1 of the project was to use Hadoop and MapReduce to consolidate web site data (from 10 sites) for e-commerce analysis Phase 2 involved providing users with Hive access to the web (and social) data for investigative purposes Demand for access and use of the system grew dramatically People forgot this was an experimental system! Requirements grew: larger cluster, resource management, SLAs, real-time data, metadata catalog Extended Hadoop to support a data hub containing 10 years of detailed data and reduce data stored in existing DW systems Primary source: Presentation by Stephen O Sullivan and Jeremy King at the Hadoop Summit, June 2012 61 Barriers to Success Educating IT and the business about the use cases and business benefits of big data Lack of skills for enabling data science and investigative computing projects Understanding and selecting the components that are required to build and support a big data analytics ecosystem The immaturity of new non-relational systems and the level of IT development and administration resources and skills required for supporting them The amount of data integration and level of data movement required in a big data environment Developing data governance and data retention approaches to support the big data environment Providing business users with a single and seamless user interface 62