An Introduction to Big Data, Apache Hadoop, and Cloudera




An Introduction to Big Data, Apache Hadoop, and Cloudera. Ian Wrigley, Curriculum Manager, Cloudera

The Motivation for Hadoop

Traditional Large-Scale Computation
Traditionally, computation has been processor-bound: relatively small amounts of data, with a significant amount of complex processing performed on that data. For decades, the primary push was to increase the computing power of a single machine: a faster processor, more RAM.

The Data Explosion
1.8 trillion gigabytes of data was created in 2011. More than 90% of it is unstructured data, spread across approximately 500 quadrillion files, and the quantity doubles every two years. (Chart: gigabytes of data created, in billions, structured vs. unstructured, 2005-2015. Source: IDC, 2011.)

Current Solutions
Current database solutions are designed for structured data, roughly 10% of the total. They are optimized to answer known questions quickly; schemas dictate form and context; they are difficult to adapt to new data types and new questions; and they are expensive at petabyte scale. (Chart: gigabytes of data created, in billions, structured vs. unstructured, 2005-2015.)

Why Use Hadoop?
1. Move beyond rigid legacy frameworks. Hadoop handles any data type, in any quantity: structured or unstructured, schema or no schema, high volume or low volume, and all kinds of analytic applications.
2. Hadoop grows with your business. It is proven at petabyte scale, capacity and performance grow simultaneously, and it leverages commodity hardware to mitigate costs.
3. Hadoop is 100% Apache licensed and open source. There is no vendor lock-in, development happens in the community, and there is a rich ecosystem of related projects.
Hadoop helps you derive the complete value of all your data. It drives revenue by extracting value from data that was previously out of reach, and controls costs by storing data more affordably than any other platform.

The Origins of Hadoop
A timeline from 2002 to 2012: Doug Cutting creates an open source web crawler project; Google publishes its MapReduce and GFS papers; Doug Cutting creates the open source MapReduce and HDFS project that becomes Hadoop; a 4,000-node Hadoop cluster is run in production; Hadoop wins the terabyte sort benchmark; Cloudera releases CDH and Cloudera Enterprise, and launches SQL support for Hadoop.

Core Hadoop: HDFS
Self-healing, high-bandwidth storage. HDFS breaks incoming files into blocks and stores them redundantly across the cluster. (Diagram: a file split into blocks 1-5, with each block replicated on three different nodes.)
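The block-and-replica scheme can be sketched in a few lines of Python. This is an illustrative model only, not the real HDFS placement algorithm (which is rack-aware); the 64 MB block size and replication factor of 3 mirror the HDFS defaults of this era.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default: 64 MB blocks
REPLICATION = 3                # each block is stored on 3 different nodes

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes):
    """Assign each block to REPLICATION distinct nodes, round-robin.
    (Real HDFS placement is rack-aware; this sketch just spreads load.)"""
    placement = {}
    ring = itertools.cycle(nodes)
    for block_id in range(num_blocks):
        placement[block_id] = [next(ring) for _ in range(REPLICATION)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks))  # 4 blocks: 64 + 64 + 64 + 8 MB
print(layout[0])    # ['node1', 'node2', 'node3']
```

If any one node is lost, every block it held still exists on two other nodes, which is what makes the system "self-healing": HDFS simply re-replicates the affected blocks elsewhere.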

Core Hadoop: MapReduce
A framework that processes large jobs in parallel across many nodes and combines the results. (Diagram: the same blocks 1-5 from HDFS, processed in place by MapReduce tasks.)
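The model can be illustrated with word count, the canonical MapReduce example. This Python sketch simulates the map, shuffle/sort, and reduce phases in a single process; on a real cluster, Hadoop distributes each phase across many nodes (and, in this era, jobs were typically written in Java).

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reducer: combine all the values emitted for one key."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result["the"])  # 3
print(result["fox"])  # 2
```

The key property is that `map_phase` runs independently on each block of input and `reduce_phase` runs independently on each key, so both phases parallelize across the cluster with no shared state.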

Hadoop and Databases
You need both.
A relational database is best used for: interactive OLAP analytics (<1 sec), multistep ACID transactions, and 100% SQL compliance.
Hadoop is best used for: structured or unstructured data (flexibility), scalability of storage and compute, and complex data processing.

Typical Datacenter Architecture
(Diagram: an enterprise web site backed by an interactive database. Data is exported and loaded into OLAP systems (Oracle, SAP, ...), which feed business intelligence apps.)

Adding Hadoop To The Mix
(Diagram: Hadoop is added between the interactive database and the OLAP systems (Oracle, SAP, ...). New data flows into Hadoop, which serves dynamic OLAP queries to the business intelligence apps and feeds recommendations and similar results back to the enterprise web site.)

Why Cloudera?

Cloudera leads the market:
In customers and users: organizations in banking, telecommunications, mobile services, defense & intelligence, media and retail depend on Cloudera, more than on all other Hadoop systems combined.
In integrated partners: across hardware, platforms, database and business intelligence (BI), software and services.
In training and certification: developers, administrators and managers trained on six continents since 2009.
In nodes under management.
In open source contributions.
In data science: training for developers, administrators and managers.

Experienced and Proven Across Hundreds of Deployments
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

The Only Vendor With a Complete Solution
Cloudera's Distribution Including Apache Hadoop (CDH): a Big Data storage, processing and analytics platform based on Apache Hadoop; 100% open source. (Storage, computation, access, integration.)
Cloudera Enterprise 4.0 includes Cloudera Manager: an end-to-end management application for the deployment and operation of CDH. (Deployment, configuration, monitoring, diagnostics and reporting.)
Production Support: our team of experts on call to help you meet your Service Level Agreements (SLAs). (Issue resolution, escalation processes, optimization, knowledge base.)
Partner Ecosystem: 250+ partners across hardware, software, platforms and services.
Cloudera University: equipping the Big Data workforce; 12,000+ trained.
Professional Services: use case discovery, pilots, process & team development.

Solving Problems with Hadoop

Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modeling True Risk
Challenge: How much risk exposure does an organization really have with each customer? Data comes from multiple sources and multiple lines of business.
Solution with Hadoop: Source and aggregate disparate data sources to build a complete data picture, e.g. credit card records, call recordings, chat sessions, emails, banking activity. Then structure and analyze: sentiment analysis, graph creation, pattern recognition.
Typical industry: Financial services (banks, insurance companies).

2. Customer Churn Analysis
Challenge: Why is an organization really losing customers? Data on the contributing factors comes from different sources.
Solution with Hadoop: Rapidly build a behavioral model from disparate data sources, then structure and analyze with Hadoop: graph traversal and creation, pattern recognition.
Typical industry: Telecommunications, financial services.

3. Recommendation Engine / Ad Targeting
Challenge: Using user data to predict which products to recommend.
Solution with Hadoop: A batch processing framework that allows execution in parallel over large datasets. Collaborative filtering: collecting taste information from many users and using it to predict what similar users will like.
Typical industry: Ecommerce, manufacturing, retail, advertising.
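A minimal sketch of the collaborative-filtering idea described above. The users, items, and ratings here are invented for illustration, and a production system on Hadoop would typically use a dedicated library (such as Apache Mahout) rather than hand-rolled code like this.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical user -> {item: rating} data, invented for illustration.
ratings = {
    "alice": {"book": 5, "lamp": 3, "mug": 4},
    "bob":   {"book": 5, "lamp": 3, "mug": 4, "desk": 5},
    "carol": {"book": 1, "lamp": 5},
}

def similarity(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = defaultdict(float)
    for other in ratings:
        if other == user:
            continue
        weight = similarity(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] += weight * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # ['desk'] -- bob rates like alice and loves it
```

On a cluster, each piece of this (per-pair similarities, per-user score sums) decomposes naturally into map and reduce steps over the full user base.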

4. Point of Sale Transaction Analysis
Challenge: Analyzing Point of Sale (PoS) data to target promotions and manage operations. Sources are complex, and data volumes grow across chains of stores and other sources.
Solution with Hadoop: A batch processing framework that allows execution in parallel over large datasets. Pattern recognition: optimizing over multiple data sources and using the information to predict demand.
Typical industry: Retail.

5. Analyzing Network Data to Predict Failure
Challenge: Analyzing real-time data series from a network of sensors. Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data.
Solution with Hadoop: Take the computation to the data, and expand from simple scans to more complex data mining. Better understand how the network reacts to fluctuations; discrete anomalies may, in fact, be interconnected. Identify leading indicators of component failure.
Typical industry: Utilities, telecommunications, data centers.
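One simple way to surface "discrete anomalies" in a sensor series is to compare each reading against a rolling mean of the readings before it and flag large deviations. The window size, threshold, and signal below are arbitrary choices for this sketch, not values from the slide.

```python
from collections import deque

def flag_anomalies(readings, window=5, threshold=2.0):
    """Return indices of readings that deviate from the rolling mean of
    the previous `window` readings by more than `threshold` standard
    deviations of that window."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            std = (sum((r - mean) ** 2 for r in recent) / window) ** 0.5
            if std and abs(x - mean) > threshold * std:
                anomalies.append(i)
        recent.append(x)
    return anomalies

# A steady signal with one spike at index 7 -- the kind of discrete
# anomaly that may be a leading indicator of component failure.
signal = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 9.9, 25.0, 10.0, 10.1]
print(flag_anomalies(signal))  # [7]
```

Because each sensor's series can be scanned independently, this kind of computation ships to the data: each node processes the sensor logs stored locally in HDFS.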

6. Threat Analysis / Trade Surveillance
Challenge: Detecting threats in the form of fraudulent activity or attacks. The large data volumes involved make this like looking for a needle in a haystack.
Solution with Hadoop: Parallel processing over huge datasets; pattern recognition to identify anomalies, i.e., threats.
Typical industry: Security, financial services; in general: spam fighting, click fraud.

7. Search Quality
Challenge: Providing meaningful search results in real time.
Solution with Hadoop: Analyzing search attempts in conjunction with structured data. Pattern recognition: the browsing patterns of users performing searches in different categories.
Typical industry: Web, ecommerce.

8. Data Sandbox
Challenge: The data deluge: you don't know what to do with the data or what analysis to run.
Solution with Hadoop: Dump all of the data into an HDFS cluster, use Hadoop to start trying out different analyses on the data, and look for patterns from which to derive value.
Typical industry: Common across all industries.

Orbitz: Major Online Travel Booking Service
Challenge: Orbitz performs millions of searches and transactions daily, which produces hundreds of gigabytes of log data every day. Not all of that data has value (some is logged for historic reasons), but much of it is quite valuable, and Orbitz wants to capture even more data.
Solution with Hadoop: Hadoop provides Orbitz with efficient, economical, scalable, and reliable storage and processing of these large amounts of data, and places no constraints on how the data is processed.

Before Hadoop
Orbitz's data warehouse contained a full archive of all transactions: every booking, refund, cancellation, etc. Non-transactional data was thrown away because it was uneconomical to store. (Diagram: transactional data, e.g. bookings, flows into the data warehouse; non-transactional data, e.g. searches, is discarded.)

After Hadoop
Hadoop was deployed in late 2009/early 2010 to begin collecting this non-transactional data, and Orbitz has been using CDH for that entire period with great success. Much of this non-transactional data is contained in Web analytics logs. (Diagram: transactional data, e.g. bookings, flows into the data warehouse; non-transactional data, e.g. searches, flows into Hadoop.)

What Now?
Access to this non-transactional data enables a number of applications:
Optimizing hotel search, e.g. optimizing hotel ranking to show consumers hotels that more closely match their preferences
User-specific product recommendations
Web page performance tracking
Analyses to optimize search result cache performance
User segment analysis, which can drive personalization
One result drew lots of press coverage in June 2012: the company discovered that people using Macs are willing to spend 30% more on hotels than PC users, so Mac users are now presented with pricier hotels first in the list.

Major National Bank
Background: 100M customers. Relational data: 2.5B records/month (card transactions, home loans, auto loans, etc.), with data volume growing by hundreds of TB/year. Needs to incorporate non-relational data as well: web clicks, check images, voice data.
Uses Hadoop to identify credit risk and fraud, and to proactively manage capital.

Financial Regulatory Body
Stringent data reliability requirements: must store seven years of data, with 850TB collected from every Wall Street trade each year and data volumes growing at 40% per year. Replacing EMC Greenplum + SAN with CDH: the goal is to store data from years two through seven in Hadoop, reaching 5PB in Hadoop by the end of 2013. Cost savings are predicted to be tens of millions of dollars, and application performance testing is showing speed gains of 20x in some cases.

Leading North American Retailer
Storing 400TB of data in a CDH cluster: capture and analysis of data on individual customers and SKUs across 4,000 locations. Using Hadoop for loyalty program analytics and personal pricing, fraud detection, supply chain optimization, marketing and promotions, and locating and pricing overstocked items for clearance.

Digital Media Company
Needs to quickly and reliably process high-volume clickstream and pageview data, and had experienced database bottlenecks and reliability issues. Now using CDH: a cluster of just 20 nodes ingests 75 million clickstream, page view, and user profile events (15GB of data) per day, and processes 430 million records from six million users in 11 minutes. The alternative solution would have required 10x more investment in database software, high-end servers, and developer time.

Leader in Real-Time Advertising Technology
Hundreds of customers need unique views of the data. The company was using Netezza, but it was unable to run more than 2-3 big jobs per day and was too expensive to scale. Now using CDH: processing hundreds of jobs concurrently at 200-300GB/hour per job, ingesting 10TB of data per day, and moving data between CDH, Netezza, and Vertica.

Netflix
Before Hadoop: nightly processing of logs, imported into a database for analysis/BI. As data volume grew, it took more than 24 hours to process and load a day's worth of logs. Today, an hourly Hadoop job processes the logs, making the data available for analysis/BI more quickly. Netflix is currently ingesting approximately 1TB of data per day.

Hadoop as Cheap Storage
Yahoo: before Hadoop, $1 million bought 10TB of storage; with Hadoop, $1 million buys 1PB of storage.
Another large company: before Hadoop, $5 million to store its data in Oracle; with Hadoop, $240K to store the data in HDFS.
Facebook: uses Hadoop as unified storage.
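Taking the quoted figures at face value, the implied cost-per-terabyte ratios work out as follows:

```python
# Cost per TB implied by the figures on the slide (quoted numbers only).
yahoo_before = 1_000_000 / 10    # $1M for 10 TB  -> $100,000 per TB
yahoo_after  = 1_000_000 / 1000  # $1M for 1 PB   -> $1,000 per TB
print(yahoo_before / yahoo_after)  # 100.0x cheaper per TB

other_before = 5_000_000  # storing the data in Oracle
other_after  = 240_000    # storing the same data in HDFS
print(round(other_before / other_after))  # ~21x cheaper
```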

Hadoop Jobs

The Roles People Play: system administrators, developers, analysts, and data stewards.

System Administrators
Required skills: strong Linux administration skills, networking knowledge, an understanding of hardware.
Job responsibilities: install, configure and upgrade Hadoop software; manage hardware components; monitor the cluster; integrate with other systems (e.g., Flume and Sqoop).

Developers
Required skills: strong Java or scripting capabilities, an understanding of MapReduce and algorithms.
Job responsibilities: write, package and deploy MapReduce programs; optimize MapReduce jobs and Hive/Pig programs.

Data Analyst/Business Analyst
Required skills: SQL, an understanding of data analytics/data mining.
Job responsibilities: extract intelligence from the data; write Hive and/or Pig programs.

Data Steward
Required skills: data modeling and ETL, scripting skills.
Job responsibilities: catalog the data (analogous to a librarian for books); manage the data lifecycle and retention; data quality control with SLAs.

Combining Roles
System administrator + data steward is analogous to a DBA.
Required skills: data modeling and ETL, scripting skills, strong Linux administration skills.
Job responsibilities: manage the data lifecycle and retention; data quality control with SLAs; install, configure and upgrade Hadoop software; manage hardware components; monitor the cluster; integrate with other systems (e.g., Flume and Sqoop).

Finding The Right People
Hiring Hadoop experts is hard: strong Hadoop skills are scarce and expensive. Look to Hadoop User Groups, and watch for key words on resumes. For developers: MapReduce, Cloudera Certified Developer for Apache Hadoop (CCDH). For system admins: distributed systems (e.g., Teradata, RedHat Cluster), Linux, Cloudera Certified Administrator for Apache Hadoop (CCAH). Consider cross-training, especially for system administrators and data librarians.

Cloudera's Academic Partnership Program

Cloudera's Academic Partnerships: Overview
Cloudera's Academic Partnerships (CAP) are an essential component of Cloudera's strategy to provide comprehensive Apache Hadoop training to current and future data professionals. The program is designed to be a mutually beneficial relationship: universities are enabled to deliver new and relevant areas of study to their students, and Cloudera is able to help fill the demand for qualified data professionals, helping the market continue its explosive growth. With CDH and Cloudera Manager available for free, plus our curriculum and Virtual Machine, we provide universities the foundation to start experimenting with Hadoop and developing expertise among their students.

Cloudera's Academic Partnerships: Goals
Introduce students to Apache Hadoop. Provide students and instructors with quality course materials and virtual machine images for completing hands-on labs. Grant a 50% discount on certification costs to students associated with the program who are interested in attempting Cloudera's industry-leading Hadoop certification exams (it is highly recommended that they take the class and then attempt the certification exam). Allow academic institutions options to augment their degree program requirements.

Cloudera's Academic Partnerships: Financial Overview
Cloudera does not currently charge Academic Partners for use of the training materials; the program is designed solely to facilitate students' learning of an emerging technology. Our reward is helping the industry grow, and ideally the exposure to Cloudera is a positive one that will be remembered when the students we serve today are making decisions for their businesses tomorrow. Instructors who deliver the Cloudera courses are eligible for a 50% discount on commercial training courses delivered by Cloudera, because we want to make sure the people leading the classes have the skill set to help their students succeed. Normally we provide universities with courses focused on the roles of Hadoop Developer or Administrator.

Ian Wrigley, ian@cloudera.com