Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012



Similar documents
Bringing Big Data to People

Implement Hadoop jobs to extract business value from large and varied data sets

HDP Hadoop From concept to deployment.

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Modernizing Your Data Warehouse for Hadoop

HDP Enabling the Modern Data Architecture

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Comprehensive Analytics on the Hortonworks Data Platform

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

#TalendSandbox for Big Data

Business Intelligence for Big Data

Workshop on Hadoop with Big Data

BIG DATA TRENDS AND TECHNOLOGIES

Apache Hadoop: The Big Data Refinery

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop and Map-Reduce. Swati Gore

Agile Business Intelligence Data Lake Architecture

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Please give me your feedback

Cloudera Enterprise Data Hub in Telecom:

White Paper: What You Need To Know About Hadoop

Data Analyst Program- 0 to 100

The Future of Data Management

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Community Driven Apache Hadoop. Apache Hadoop Basics. May Hortonworks Inc.

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Building Scalable Big Data Pipelines

Data Mining in the Swamp

Testing Big data is one of the biggest

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Native Connectivity to Big Data Sources in MSTR 10

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Apache Hadoop: Past, Present, and Future

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Testing 3Vs (Volume, Variety and Velocity) of Big Data

A Modern Data Architecture with Apache Hadoop

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Qsoft Inc

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

The Future of Data Management with Hadoop and the Enterprise Data Hub

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

White Paper: Evaluating Big Data Analytical Capabilities For Government Use

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Virtualizing Apache Hadoop. June, 2012

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Large scale processing using Hadoop. Ján Vaňo

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

BIG DATA HADOOP TRAINING

Open source Google-style large scale data analysis with Hadoop

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

The Inside Scoop on Hadoop

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Certified Big Data and Apache Hadoop Developer VS-1221

Has been into training Big Data Hadoop and MongoDB from more than a year now

Hadoop Development & BI- 0 to 100

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Cost-Effective Business Intelligence with Red Hat and Open Source

Course Outline: Course: Implementing a Data Warehouse with Microsoft SQL Server 2012 Learning Method: Instructor-led Classroom Learning

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

The Evolving Apache Hadoop Eco-System

Introduction to Big Data Training

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data: Making Sense of it all!

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Upcoming Announcements

Tap into Hadoop and Other No SQL Sources

Big Data Infrastructure at Spotify

Big Data Big Data/Data Analytics & Software Development

ITG Software Engineering

Harnessing big data with Hortonworks Data Platform and Red Hat JBoss Data Virtualization

Constructing a Data Lake: Hadoop and Oracle Database United!

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Cisco IT Hadoop Journey

Luncheon Webinar Series May 13, 2013

How To Create A Data Visualization With Apache Spark And Zeppelin

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Information Architecture

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

A Brief Outline on Bigdata Hadoop

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Cisco IT Hadoop Journey

Transcription:

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012

Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team rlancaster@orbitz.com @rob1lancaster Organizer of Chicago Machine Learning Study Group Co-organizer of Chicago Big Data. page 2

Launched in 2001 Over 160 million bookings page 3

Some History page 4

In 2009 The Machine Learning team is formed to improve site performance. For example, improving hotel search results. This required access to large volumes of behavioral data for analysis. Fortunately, the required data was collected in session data stored in web analytics logs. page 5

The Problem The only archive of the required data went back about two weeks. Non-transactional Data (e.g. searches) Transactional data (e.g. bookings) and aggregated Nontransactional data Data Warehouse page 6

Hadoop Provided a Solution Detailed nontransactional data (what every user sees, clicks, etc.) Transactional data (e.g. bookings) and aggregated Nontransactional data Data Warehouse Hadoop page 7

What is Hadoop? Distributed file system and parallel processing platform. Open source Apache project created by Doug Cutting. Modeled on papers published by Google on the Google File System and MapReduce. Intended to run on a cluster of relatively inexpensive machines (aka commodity hardware). Bring processing to the data. page 8

Sqoop & Flume The Hadoop Ecosystem Zookeeper & Oozie Pig Hive HBase MapReduce Hadoop Distributed File System page 9

Deploying Hadoop Enabled Multiple Applications 100.00% 90.00% 80.00% 71.67% 70.00% Queries Searches 60.00% 50.00% 40.00% 31.87% 30.00% 34.30% 20.00% 10.00% 0.00% 2.78% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 page 10

And Useful Analyses page 11

But Brought New Challenges Most of these efforts are driven by development teams. The challenge now is unlocking the value of this data for nontechnical users. Support for Hadoop via traditional BI/reporting tools still meager. page 12

BI Vendors Are Working on Hadoop Integration Both big (relatively) page 13

And small page 14

In 2011& 2012 Big Data team is formed under Business Intelligence team at Orbitz Worldwide. Allows the Big Data team to work more closely with the data warehouse and BI teams. Reflects the importance of big data to the future of the company. Our production cluster has grown 40-fold since it was launched. page 15

A View Shared Beyond Orbitz We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW but that promise is still three to five years from fruition. * *James Kobielus, Forrester Research, Hadoop, Is It Soup Yet? page 16

Two Primary Ways We Use Hadoop to Complement the EDW Extraction and transformation of data for loading into the data warehouse ETL. Off-loading of analysis from the data warehouse. page 17

ETL Example Proposed Processing Raw logs Hadoop Dimensional model page 18

ETL Example: Click Data Processing Previous Processing in Data Warehouse Web Server Web Server Web Servers Logs ETL DW Data Cleansing (Stored procedure) DW Several hours of processing ~20% original data size page 19

ETL Example: Click Data Processing Moving to Hadoop: Removed load from the data warehouse. Facilitated adding additional attributes for processing. Allowed processing to be run more frequently. Web Server Web Server Web Servers Logs HDFS Data Cleansing (MapReduce) DW Processing in Hadoop page 20

Analysis Example: Geo-Targeting Ads Facilitated analysis that allows for more personalized ad content. Allowed marketing team to analyze over a years worth of search data. Provided analysis that was difficult to perform in the data warehouse. page 21

Example Processing Pipeline for Web Analytics Data page 22

Example Use Case: Selection Errors page 23

Use Case Selection Errors: Introduction Multiple points of entry. Multiple paths through site. Goal: tie events together to form picture of customer behavior. page 24

Use Case Selection Errors: Processing page 25

Use Case Selection Errors: Visualization page 26

Example Use Case: Beta Data page 27

Use Case Beta Data: Introduction Hotel Sort Optimization Compare A vs. B Web Analytics Data What user saw. How user behaved Server Log Data Sorting behavior used. page 28

Use Case Beta Data Processing page 29

Use Case Beta Data: Visualization page 30

Example Use Case: RCDC page 31

Use Case RCDC: Introduction Understand and improve cache behavior. Improve coverage Traditionally search 1 page of hotels at a time. Get just enough information to present to consumers. Increase amount of availability information we have when consumer performs a search. Data needed to support needs beyond reporting. page 32

Use Case RCDC: Processing page 33

Use Case RCDC: Visualization page 34

Conclusions Hadoop market is still immature, but growing quickly. Better tools are on the way. Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups. Hadoop won t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure. page 35

Conclusions Work closely with your existing data management teams. Your idea of what constitutes big data might quickly diverge from theirs. The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse. page 36

Thank you! Questions? page 37