SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Similar documents
The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

The Future of Data Management

The Future of Data Management with Hadoop and the Enterprise Data Hub

HDP Hadoop From concept to deployment.

HDP Enabling the Modern Data Architecture

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Information Builders Mission & Value Proposition

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Apache Hadoop: Past, Present, and Future

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Hadoop Ecosystem B Y R A H I M A.

Cloudera Enterprise Data Hub in Telecom:

Talend Big Data. Delivering instant value from all your data. Talend

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Upcoming Announcements

#TalendSandbox for Big Data

Workshop on Hadoop with Big Data

Bringing Big Data to People

ITG Software Engineering

BIG DATA TRENDS AND TECHNOLOGIES

Data Security in Hadoop

Dell In-Memory Appliance for Cloudera Enterprise

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

Big Data and New Paradigms in Information Management. Vladimir Videnovic Institute for Information Management

Qsoft Inc

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Constructing a Data Lake: Hadoop and Oracle Database United!

Why Spark on Hadoop Matters

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Are You Big Data Ready?

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Data Analyst Program- 0 to 100

Cisco IT Hadoop Journey

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Ali Ghodsi Head of PM and Engineering Databricks

Hadoop Trends and Practical Use Cases. April 2014

Please give me your feedback

Hadoop & Spark Using Amazon EMR

Dominik Wagenknecht Accenture

BIG DATA - HADOOP PROFESSIONAL amron

Interactive data analytics drive insights

How to Hadoop Without the Worry: Protecting Big Data at Scale

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Comprehensive Analytics on the Hortonworks Data Platform

Modernizing Your Data Warehouse for Hadoop

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Taming Operations in the Apache Hadoop Ecosystem. Jon Hsieh, Kate Ting, USENIX LISA 14 Nov 14, 2014

Peers Techno log ies Pv t. L td. HADOOP

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

Big Data Explained. An introduction to Big Data Science.

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Native Connectivity to Big Data Sources in MSTR 10

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Deploying Hadoop with Manager

Practical Hadoop by Example

Modern Data Architecture for Predictive Analytics

Hadoop Job Oriented Training Agenda

SAP and Hortonworks Reference Architecture

The Enterprise Data Hub and The Modern Information Architecture

Big Data Management and Security

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Transforming the Telecoms Business using Big Data and Analytics

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Complete Java Classes Hadoop Syllabus Contact No:

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Big Data Big Data/Data Analytics & Software Development

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

MySQL and Hadoop Big Data Integration

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA HADOOP TRAINING

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

More Data in Less Time

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Transcription:

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera

Most FAQ:

Super-Quick Overview!

The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce Sentry Yarn HDFS Flume

The Hadoop Ecosystem Explained! GUI Real Time SQL Free-text Search Workflow Mgmt SQL Proc. Orient ed Query Distributed Parallel Processing Framework (Java fwk) Distributed File System Machine Learning Access Control Resource and Workload Scheduler Process Mgmt KeyValue Store Stream / event-based data ingest

Two Views

#1: Scale Data Processing at Low Cost Do what I usually do, but on a larger set of data Do my complex queries, but within a reasonable time

#2: Break Silos and Ask Bigger Questions What new insights can we achieve by combining siloed data sets? What else can we find by asking questions over new types of data?

Some Typical Use cases

What Organizations Do: Offload ETL Common use case: ETL data processing workload Data volume is growing But fixed time window for data delivery Related side use case Complex queries on the data either take unacceptable time or can t be deployed at all Cost, volume of records involved, response time, or limited data

What Organizations Do: Batch ETL Example: A Network and Storage solution company pro-active support Challenge 600000 phone home machine generated log transmissions needed to be processed every week 40% of the logs need to be transmitted within 18 hours each weekend Expected data growth of ~7TB a month causing SLA bottlenecks! Complex queries taking weeks or not even possible to run Solution Achieved a cost-efficient and linearly scalable storage and data processing solution Can now handle 7TB/month data growth and stay within the 18 hr SLAbound time window Faster and more flexible analytic capabilities Can now correlate disk latency with manufacturer (a 24 billion records report btw) and achieved a 64x query performance improvement (from weeks to hours) Can now run a pattern matching query that would help detect bugs (a 240 billion record query btw!!) TCO freed up budget for other customer-focused projects

Example Architecture Oozie ZooKeeper MapReduce Hive Impala Some BI tool or Dashboard Sqoop HDFS Sqoop

What Organizations Do: Log Processing Common Use Case Too many log types, too high volume, and growing Need for multiple workloads on the same log data Capacity planning Historical load trends in correlation with special activities elsewhere in the org Near real time production issue resolution Anomaly or outlier detection Traditional systems cant easily scale with the load, nor adapt to all the types of data that need to co-exist to answer complex correlation queries

What Organizations Do: Log Processing Example: Global Financial Services Firm anomaly detection Challenge Online trading causing data exponential growth Traditional systems could only handle current load, and it took weeks to process current data loads Could not store more than 1 year of data cost-efficiently Solution Can now store 200-300TB of data and handle a 2-4TB daily ingestion load Uses Impala for real time queries on that data, e.g. a month data scan happens in 4 seconds vs 4 hours.. Monthly reports can now be generated in hours instead of days Saved $30M in IT costs and prevented future growth costs

Example Architecture ZooKeeper Oozie Hue Mahout MapReduce Impala Solr Sqoop HDFS Flume

Example Architecture ZooKeeper Oozie Custom App or Dashboard Hue HBase Impala Solr HDFS Flume

What Organizations Do: Combine Silos Common Use Case Customers seek a 360-view of clients, patients, or customers to provide better services, support, competitive offerings, or marketing Data lives in separate (and sometimes old) silos costly! Maintenance, overlap, access bottlenecks, Some data is impossible to access in a timely manner Traditional systems can t cost-efficiently store all data, handle all data types (and new types added dynamically) and serve the various workloads / clients of the system

What Organizations Do: Combine Silos Example: Global On-line retailer Challenge Need to correlate online/offline data across disparate, costly legacy DWs Detailed data from every cash register at every store over a 10+ year history across 1,000 s product categories (22 subsidiaries?) One data source ~4 weeks to get access to inhibits productivity Solution 250-node cluster of Cloudera + Impala Can now store 1PB over 250 nodes and grow at very low cost Consolidated environment for query and machine learning no data access bottlenecks anymore Able to correlate all customer, product, and sales data for a 360- degree view of their customer

Example Architecture Hue Custom App or BI Tool Oozie ZooKeeper Impala Solr HBase MapReduce Sqoop HDFS Flume

And there are many more. Image processing Suicide prevention / event prediction Product and process improvements Genome sequence processing Hospital treatment patient matching Travel-logistics-path optimization Recommendation engines Clickstream analysis and web experience optimizations

What to Consider Key benefits of moving a workload to Hadoop Linear scale without the extreme price tag Lots of flexibility you can always change your ingest pipelines or data models later with low impact and low cost Ability to combine and analyze previously siloed data sets Opens the door to expand business with new questions cross organizations! Questions to investigate: Make sure to have a validated business use case Does your organization have a need to develop a strategy for handling data growth or a need for combining data sets? What workloads can actually move to Hadoop? Is Hive QL compliant with SQL? What about real time workloads and OLTP? What would be gained that the business side would care about? Clear measurable goals makes life easier! Make sure your organization is prepared What training and support is available? What about supportability and production visibility? How does Hadoop integrate with my environment? Make sure you know what would be required for production in your environment What about Security? PCI compliance? What about production visibility? What about HA and DR?

Summary

What you (Hopefully!) Learned Today

To Learn More 1. Read some good stuff Order the Hadoop Operations book (http://shop.oreilly.com/product/0636920025085.do) and/or the Definitive Guide to Hadoop (http://shop.oreilly.com/product/0636920021773.do) Visit Cloudera s blog: blog.cloudera.com/ 2. Play on your own Cloudera QuickStart VM: https://ccp.cloudera.com/display/support/cloudera+manager+free+edition+dem o+vm View the videos at gethue.com 3. Get help and training Join or send an email to: cdh-user@cloudera.org Visit the Cloudera dev center: cloudera.com/content/dev-center/en/home.html Get training: university.cloudera.com 4. Contact Cloudera eva@cloudera.com On-line contact form: http://cloudera.com/content/cloudera/en/about/contactus/contact-form.html

Quizz: What is the Real Big Data Challenge Technology? Knowledge? People?

Key Take-Away

Transform the Economics of Data Traditional Data Warehouse Add 100 TB = With Cloudera Add 100 TB = TO in incremental spend 1/10th the cost of legacy systems 34

Q&A

36 2012 Cloudera, Inc.

37 Hadoop Distributed File System (HDFS)

MapReduce: A scalable data processing framework

Architecture for Hadoop in the Enterprise