TECH NOTE. Hadoop Alone Is Not Big Data




Twenty-one years ago, a year before the first web browser appeared, Walmart's Teradata data warehouse exceeded a terabyte of data and kicked off a revolution in supply-chain analytics. Today Hadoop is doing the same for demand-chain analytics. The question is, will we just add more zeros to our storage capacity this time, or will we learn from our data warehouse infrastructure mistakes? These mistakes include:

- data silos
- organizational silos
- confusing velocity with response time

DATA SILOS

A data silo is a system that has lots of inputs but few outputs. The Wikipedia page for "data warehouse" shows an architecture diagram with operational systems on the left, data marts on the right, and a data vault in the middle, but the third definition of "vault" at Merriam-Webster.com is "a burial chamber." All too often, enterprise data warehouses have become data burial chambers, or perhaps data hospice facilities: places where data goes to die.

To prevent this from happening to Hadoop systems, we need more techniques to get data out of the central data store to people and other systems. A few data marts just aren't sufficient anymore for connecting with development partners, ad tech vendors, and the myriad customer touchpoints available to retailers and brands. Data export techniques should cover a variety of performance characteristics so that the best technique can be used for each use case. Such techniques include:

- good ol' batch FTP of flat files, XML files, and compact binary file formats such as Avro
- publish-subscribe messaging interfaces, AKA enterprise message buses, such as Kafka
- real-time REST APIs built on high-speed databases such as HBase and Voldemort
- OLAP and data visualization user interfaces for business analysts who aren't data scientists, such as Pentaho, Tableau, and Simba for Excel

Let's consider the last two in more detail.

First, real-time means different things to different people. Fifty milliseconds (1/20th of a second) is real-time for stock trading. Google found that an increase of 500 milliseconds (1/2 a second) in page load time decreases traffic 20%, and Amazon found that even a 100 millisecond (1/10th of a second) increase in load time significantly decreases retail website revenue [1]. One-tenth of a second response time is a high bar for APIs to meet. To achieve it at the 95th percentile, retailers need multiple data centers per market so that shoppers always use a data center that is close by, thereby minimizing response times. In short, they need multiple front-end data centers for each Hadoop back-end data center.

Secondly, OLAP and data visualization are part of an exciting industry trend toward the democratization of data, where the goal is to enable people to access required data themselves rather than routing queries through some central analytics department. Nike FuelBand, Fitbit, and 23andMe are examples of this trend in consumer products, and OLAP and data visualization are enabling technologies for business users. Democratization of data holds the promise of preventing another big data warehouse mistake from the past: organizational silos.
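As a rough illustration of the 95th-percentile target mentioned above, the following Python sketch checks a set of API response times against a 100-millisecond budget. The sample latencies, the function names, the 80 ms network penalty for a distant data center, and the choice of the nearest-rank percentile method are all assumptions made for illustration; none of them come from a real measurement or a particular monitoring tool.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies_ms, budget_ms=100, pct=95):
    """True if the pct-th percentile latency is within the budget."""
    return percentile_nearest_rank(latencies_ms, pct) <= budget_ms


# Hypothetical response times in milliseconds (illustrative, not measured data).
nearby_data_center = [42, 48, 51, 55, 58, 60, 63, 67, 70, 72,
                      75, 78, 80, 83, 85, 88, 90, 94, 97, 240]
# Assumption: a distant data center adds 80 ms of network latency to every call.
distant_data_center = [latency + 80 for latency in nearby_data_center]

for name, samples in [("nearby", nearby_data_center),
                      ("distant", distant_data_center)]:
    p95 = percentile_nearest_rank(samples, 95)
    print(f"{name}: p95 = {p95} ms, within budget: {meets_budget(samples)}")
```

Under these assumed numbers, the single slow outlier does not break the nearby data center's 95th-percentile target, while a constant network penalty pushes every request from the distant data center past the budget, which is the argument for serving shoppers from a front-end that is close by.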
[Figure: Back-end Data Center and Front-end Data Center]

[1] John Rauser (Amazon), "The impact of website performance on conversion," June 8, 2004; Greg Linden (Amazon), "Make data useful," http://www.scribd.com/doc/4970486/make-data-useful-by-greg-linden-amazoncom. See also Eric Schurman (Microsoft) and Jake Brutlag (Google), "Performance related changes and their user impact," O'Reilly Velocity, Web Performance and Operations Conference (Velocity 09), June 2009; Philip Dixon (Shopzilla), "Shopzilla's site redo - you get what you measure," O'Reilly Velocity, Web Performance and Operations Conference (Velocity 09), June 2009.

ORGANIZATIONAL SILOS

An organizational silo, like a data silo, has lots of inputs but few outputs: it's a people bottleneck. Too often, if business analysts wanted data they had to go to a central analytics team, wait in line, get the analytics team to understand their need, wait a few days for the results, realize that the results weren't what they thought they'd asked for, and repeat the process until one side gave up. Then, when business analysts complained and asked why on earth it could take so long, analytics just said, "There's a lot of math involved. You wouldn't understand."

Over the past 20 years, that situation has created a kind of analytics aristocracy that's not very useful. If large companies can create such organizational silos with SQL, BI, and SAS, just imagine the kind of silos they'll be able to create with the new technologies: Hadoop, MapReduce, and R. Data democratization is the cure for organizational silos.

VELOCITY VS. RESPONSE TIME

The last data warehouse mistake we can avoid with Hadoop systems is confusing velocity with response time. Consider an analogy. Suppose you're shipping a package from Los Angeles to San Francisco, but because of your shipper's infrastructure, it goes through Memphis. If it takes 12 hours from LA to Memphis (1,800 miles) and 12 hours from Memphis to San Francisco (2,000 miles), that's 3,800 miles in 24 hours, or 158 miles per hour. Pretty fast. However, if you cut out Memphis and go directly from LA to San Francisco (380 miles) in 12 hours, then that's just 32 miles per hour: pretty slow. Yet the slower route gets the package delivered 12 hours earlier. The point is that velocity should be measured from the customer's point of view, not the infrastructure's, since infrastructure only exists to serve the customer.

The following diagram shows what used to be a typical data flow from a customer, through a data warehouse, and then back to the customer, where each of the eight steps was scheduled and run in batch. Even if each link is fast, the whole round trip is rather slow.

[Figure: batch data flow; labels: Customer, Operational System, Operational Data Store, Staging Area, ETL, Data Warehouse, ETL, Data Mart]

With cloud-based Hadoop systems we can simplify this and greatly reduce response time. Data is pushed directly from Hadoop to front-ends for use by real-time APIs, and to data marts for use by business analysts. Rather than updating customer attributes daily, weekly, or quarterly, this architecture enables real-time updates, click by click.

[Figure: Hadoop data flow; labels: Hadoop, FTP, Message Bus, Front-ends, Data Mart]

Hadoop holds immense promise for adding many more zeros to our storage and analytics capacity, and for transforming companies to be more data driven. However, to reach its full potential we should avoid the mistakes of the past. Otherwise, we're in for another twenty years of silos, aristocracies, and inadequate response times, or as aristocrats sometimes say, "different tree, same monkeys."
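To make the arithmetic in the shipping analogy above concrete, here is a small Python sketch. The distances and times are the ones given in the analogy; the function and variable names are illustrative assumptions.

```python
def summarize(legs):
    """Return (total miles, total hours, average mph rounded) for a route."""
    total_miles = sum(miles for _, miles, _ in legs)
    total_hours = sum(hours for _, _, hours in legs)
    return total_miles, total_hours, round(total_miles / total_hours)


# Each leg is (description, miles, hours), taken from the analogy.
via_memphis = [("LA to Memphis", 1800, 12),
               ("Memphis to San Francisco", 2000, 12)]
direct = [("LA to San Francisco", 380, 12)]

for name, legs in [("via Memphis", via_memphis), ("direct", direct)]:
    miles, hours, mph = summarize(legs)
    print(f"{name}: {miles} miles in {hours} hours = {mph} mph")

# Velocity favors the Memphis route; response time favors the direct one.
hours_saved = summarize(via_memphis)[1] - summarize(direct)[1]
print(f"The direct route delivers {hours_saved} hours earlier.")
```

The route with the higher average speed is the one the customer waits longer for, which is why elapsed time from the customer's point of view is the measure that matters.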

ABOUT RICHRELEVANCE

RichRelevance is the global leader in omni-channel personalization. More than 160 international companies use RichRelevance to turn data into actionable insight, which delivers the most relevant experience for consumers as they shop across web, store, and mobile. RichRelevance drives more than one billion decisions every day, and has delivered over $10 billion in attributable sales to its clients, which include Target, Marks & Spencer, and Priceminister. Recently, the company opened its cloud-based platform to allow clients to easily merge disparate data sources and build real-time applications tailored to their specific business needs. RichRelevance is headquartered in San Francisco and serves clients in 40 countries from 9 offices around the globe. For more information, please visit www.richrelevance.com.

© 2014 RichRelevance, Inc.