Big Data: Making Sense of it all!


Jamie Engesser, E-mail: jamie@hortonworks.com Page 1

Data Driven Business? Facts not Intuition!
"Data-driven decisions are better decisions; it's as simple as that. Using big data enables managers to decide on the basis of evidence rather than intuition. For that reason it has the potential to revolutionize management." Harvard Business Review, October 2012
Page 2

Web giants proved the ROI in data products: applying data science to large amounts of data
- Prediction of click-through rates
- Netflix: 75% of streaming video results from recommendations
- Amazon: 35% of product sales come from product recommendations
Page 3


Big Data: Optimize Outcomes at Scale
- Sports: optimize championships
- Intelligence: optimize detection
- Finance: optimize algorithms
- Advertising: optimize performance
- Fraud: optimize prevention
- Retail / Wholesale: optimize inventory turns
- Manufacturing: optimize supply chains
- Healthcare: optimize patient outcomes
- Education: optimize learning outcomes
- Government: optimize citizen services
Source: Geoffrey Moore, Hadoop Summit 2012 keynote presentation.
Page 6

Analytics started with basic transaction history
- Megabytes (ERP): order detail, trade record, payment record
Axis: increasing data variety and complexity

Then we added customer information
- Gigabytes (CRM): segmentation, customer touches, support contacts, offer details
- Megabytes (ERP): purchase detail, purchase record, payment record
Axis: increasing data variety and complexity

And the web started to impact
- Terabytes (WEB): web logs, A/B testing, behavioral targeting, dynamic pricing, search marketing, affiliate networks, dynamic funnels, offer history
- Gigabytes (CRM): segmentation, customer touches, support contacts, offer details
- Megabytes (ERP): purchase detail, purchase record, payment record
Axis: increasing data variety and complexity

Big Data: Organizational Game Changer
Transactions + Interactions + Observations = BIG DATA
- Petabytes (BIG DATA): mobile web, sentiment, user click stream, SMS/MMS, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors / RFID / devices, business data feeds, external demographics, user generated content, HD video, audio, images, product/service logs
- Terabytes (WEB): web logs, A/B testing, behavioral targeting, dynamic pricing, search marketing, affiliate networks, dynamic funnels, offer history
- Gigabytes (CRM): segmentation, customer touches, support contacts, offer details
- Megabytes (ERP): purchase detail, purchase record, payment record
Axis: increasing data variety and complexity
Page 10

A little history: it's 2005

A Brief History of Apache Hadoop
Timeline axis: 2004, 2006, 2008, 2010, 2012
Milestones: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform (2013)
- 2005: Yahoo! creates team under E14 to work on Hadoop. Focus on INNOVATION
- 2007: Yahoo team extends focus to operations to support multiple projects & growing clusters. Focus on OPERATIONS
- 2011: Hortonworks created to focus on "Enterprise Hadoop". Starts with 24 key Hadoop engineers from Yahoo. STABILITY
Page 16

Apache Hadoop: Center of Big Data Strategy
Open source data management with scale-out storage & distributed processing
Storage: HDFS
- Distributed across nodes
- Natively redundant
- Name node tracks locations
Processing: MapReduce
- Splits a task across processors near the data & assembles results
Self-healing, high-bandwidth clustered storage
Key Characteristics
- Scalable: efficiently store and process petabytes of data; linear scale driven by additional processing and storage
- Reliable: redundant storage; failover across nodes and racks
- Flexible: store all types of data in any format; apply schema on analysis and sharing of the data
- Economical: use commodity hardware; open source software guards against vendor lock-in
Page 17
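The MapReduce idea on this slide (split a task, process each piece, assemble the results) can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names and the two sample input splits are assumptions made for illustration, and a real cluster would run the map step on the nodes that hold each block.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(splits):
    """Run map over every split, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return reduce_phase(shuffle(mapped))

# Two hypothetical input splits, standing in for blocks stored on different nodes.
splits = [
    ["big data big insight"],
    ["data near the data"],
]
counts = word_count(splits)
```

Because each split is mapped independently, the map step can in principle run in parallel wherever the data lives; only the grouped intermediate pairs need to move.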

Hadoop is more? Page 18

Data Services for Full Data Lifecycle
Provide data services to store, process & access data in many ways
- DATA SERVICES: Flume, Sqoop, Pig, Hive, HCatalog, HBase, WebHDFS
- HADOOP CORE: Distributed Storage & Processing
- PLATFORM SERVICES: Enterprise Readiness
Unique Focus Areas:
- Apache HCatalog: metadata services for consistent table access to Hadoop data
- Apache Hive: explore & process Hadoop data via SQL & ODBC-compliant BI tools
- Apache HBase: NoSQL database for Hadoop
- WebHDFS: access Hadoop files via scalable REST API
- Talend Open Studio for Big Data: graphical data integration tools
Page 20
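As a rough illustration of the WebHDFS point, the sketch below only builds request URLs in the `/webhdfs/v1/<path>?op=...` form that WebHDFS uses; it makes no network call. The host name, port, paths and user name are placeholders assumed for the example, and the helper `webhdfs_url` is hypothetical, not part of any Hadoop client library.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an HDFS path and operation (no request is sent)."""
    query = {"op": op}
    query.update(params or {})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"

# Placeholder NameNode address; the HTTP port depends on the cluster configuration.
HOST, PORT = "namenode.example.com", 50070

# List a directory, then open a file as a given (hypothetical) user.
list_url = webhdfs_url(HOST, PORT, "/user/demo", "LISTSTATUS")
open_url = webhdfs_url(HOST, PORT, "/user/demo/events.log", "OPEN",
                       {"user.name": "demo"})
```

Any HTTP client could then issue a GET against these URLs, which is what makes the files reachable from tools that have no Hadoop libraries installed.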

Metadata Service & Table-level Abstractions
Apache HCatalog provides flexible metadata services across tools and external access
- Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)
- Accessibility: share data as tables in and out of HDFS
- Availability: enables flexible, thin-client access via REST API
Diagram: raw Hadoop data (inconsistent, unknown; tool-specific access) versus HCatalog (table access; aligned metadata; REST API)
Shared table and schema management opens the platform
Page 21

Operational Services for Ease of Use
Include complete operational services for productive operations & management
- OPERATIONAL SERVICES: Ambari, Oozie
- DATA SERVICES: Store, Process and Access Data
- HADOOP CORE: Distributed Storage & Processing
- PLATFORM SERVICES: Enterprise Readiness
Unique Focus Area:
- Apache Ambari: provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
Page 22

New Ambari Features (Apache Ambari Dashboard)
- Job Diagnostics: visualize and troubleshoot Hadoop job execution and performance
- Cluster History: view historical job execution & performance
- Instant Insight: view health of Core Hadoop (HDFS, MapReduce) and related projects
- Cluster Navigation: quick link buttons jump into namenode web UI for a server
- REST interface provides external access to Ambari for existing tools. Facilitates integration with Microsoft System Center and Teradata Viewpoint
Page 23

Deployable Across a Range of Options
Hadoop allows you to deploy seamlessly across any deployment option
- Linux & Windows
- Azure, Rackspace & other clouds
- Virtual platforms
- Big data appliances
HORTONWORKS DATA PLATFORM (HDP) stack:
- OPERATIONAL SERVICES: Manage & Operate at Scale
- DATA SERVICES: Store, Process and Access Data
- HADOOP CORE: Distributed Storage & Processing
- PLATFORM SERVICES: Enterprise Readiness
- Deployment targets: OS, Cloud, VM, Appliance
Page 24

Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP): Enterprise Hadoop
- OPERATIONAL SERVICES (Manage & Operate at Scale): Ambari, Oozie
- DATA SERVICES (Store, Process and Access Data): Flume, Sqoop, Pig, Hive, HCatalog, HBase, WebHDFS
- HADOOP CORE (Distributed Storage & Processing): MapReduce, HDFS, YARN (in 2.0)
- PLATFORM SERVICES (Enterprise Readiness): HA, DR, Snapshots, Security
- Deployment targets: OS, Cloud, VM, Appliance
The ONLY 100% open source and complete distribution
Enterprise grade, proven and tested at scale
Ecosystem endorsed to ensure interoperability
Page 25

Where does it fit in the enterprise? Page 26

Existing Data Architecture
- APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos)
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems
- DEV & DATA TOOLS: build & test
- OPERATIONAL TOOLS: manage & monitor
Page 27

Existing Data Architecture
- APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos); ?
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (web logs, email, sensor data, social media)
- DEV & DATA TOOLS: build & test
- OPERATIONAL TOOLS: manage & monitor
Page 28

An Emerging Data Architecture
- APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos)
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (web logs, email, sensor data, social media); mobile data
- DEV & DATA TOOLS: build & test
- OPERATIONAL TOOLS: manage & monitor
Page 29

Interoperating With Your Tools
- APPLICATIONS: Microsoft Applications
- DATA SYSTEMS: traditional repos
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); OLTP, POS systems; New Sources (web logs, email, sensor data, social media); mobile data
- DEV & DATA TOOLS
- OPERATIONAL TOOLS: Viewpoint
Page 30

Data Refinement Page 31

Operational Data Refinery (Refine, Explore, Enrich)
Collect data and apply a known algorithm to it in trusted operational process
1. Capture: capture all data, new and traditional
2. Process: parse, cleanse, apply structure & transform
3. Exchange: push to existing data warehouse for use with existing analytic tools
Diagram layers:
- APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos)
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensor data, social media)
Page 32

Challenges with a Traditional ETL Platform
- Data discarded due to cost and/or performance
- No visibility into transactional data
- Incapable of handling loosely structured data, or only with high complexity
- Doesn't scale linearly
- High license costs
- A lot of time spent understanding source and defining destination data structures
- High latency between data generation and availability
Page 33

HDP Based ETL Platform
- Provides data for use with minimum delay and latency
- Enables real-time capture of source data
- Store raw transactional data
- Store 7+ years of data with no archiving
- Data lineage: store intermediate stages of data
- Becomes a powerful analytics platform
- Support for any type of data: structured/unstructured
- Linearly scalable on commodity hardware
- Massively parallel storage and compute
- Data warehouse can focus less on storage & transformation and more on analytics
Page 34

Key Capability in Hadoop: Late binding
With traditional ETL, structure must be agreed upon far in advance and is difficult to change.
Flow: machine-generated data, web logs, click streams, OLTP -> ETL server (store transformed data) -> data mart / EDW -> client apps
With Hadoop, capture all data and structure it as business needs evolve.
Flow: machine-generated data, web logs, click streams, OLTP -> Hortonworks Data Platform (dynamically apply transformations) -> data mart / EDW -> client apps
Page 35
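Late binding can be sketched in a few lines: keep the raw records untouched and apply a schema only when a question is asked. The event fields, the sample records and the helper `read_with_schema` below are all assumptions made for illustration; on a real cluster this role is played by tools such as Hive or Pig over files in HDFS.

```python
import json

# Raw events captured as-is; the second record carries an extra field ("ref").
RAW_EVENTS = [
    '{"ts": "2013-05-01T10:00:00", "user": "u1", "url": "/home", "ms": "120"}',
    '{"ts": "2013-05-01T10:00:05", "user": "u2", "url": "/cart", "ms": "340", "ref": "ad"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them; missing fields become None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }

# Two different questions, two different schemas, one unchanged raw data set.
latency_schema = {"url": str, "ms": int}
marketing_schema = {"user": str, "ref": str}

latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))
marketing_rows = list(read_with_schema(RAW_EVENTS, marketing_schema))
```

Adding a third analysis later means writing a third schema, not reloading or restructuring the stored data, which is the contrast with the ETL flow described above.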

Data Exploration Page 36

Big Data Exploration & Visualization (Refine, Explore, Enrich)
Collect data and perform iterative investigation for value
1. Capture: capture all data
2. Process: parse, cleanse, apply structure & transform
3. Exchange: explore and visualize with analytics tools supporting Hadoop
Diagram layers:
- APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos)
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensor data, social media)
Page 37

Visualization Tooling
- Robust visualization and business tooling
- Ensures scalability when working with large datasets
- Native Excel support
- Web browser support
- Mobile support
Page 38

Data science is a natural next step after business intelligence
- Business Intelligence: measure & count; simple analytics (dashboards, reports, score-cards)
- Data Science: discovery & prediction; complex analytics; data product
- Discovery: affinity analysis, outlier detection, clustering
- Prediction: recommendation, regression, classification
Axis: value
Page 39

Application Enrichment Page 40

Application Enrichment (Refine, Explore, Enrich)
Collect data, analyze and present salient results for online apps
1. Capture: capture all data
2. Process: parse, cleanse, apply structure & transform
3. Exchange: incorporate data directly into applications
Diagram layers:
- APPLICATIONS: Custom Applications, Enterprise Applications
- DATA SYSTEMS: RDBMS, EDW, MPP (traditional repos); NoSQL
- DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensor data, social media)
Page 41

Key use-cases in Finance/Insurance
- Trading analysis: How do I predict trading trends based on market sentiment? How do I test algorithms against years vs. days of data?
- Fraud detection: Detect illegal credit card activity and alert bank/consumer. Detect illegal insurance claims.
- Customer risk profiling: How likely is this customer to pay back his mortgage? How likely is this customer to get sick?
- Internal fraud detection (compliance): Is this employee accessing financial information they are not allowed to access?
Page 42

HDP Reference Architecture
Diagram labels: Organize / Create Metadata; Publish Event; Signal Data Transformation; Transform & Aggregate; Publish; Exchange; Extract & Load; Explore; Visualize; Analyze; Report
Page 43

From Community to the Enterprise Page 44

What is a Hadoop "Distribution"?
A complementary set of open source technologies that make up a complete data platform
Components: Templeton, WebHDFS, Sqoop, Flume, HCatalog, Pig, Hive, HBase, MapReduce, HDFS, Ambari, Oozie, ZooKeeper, HA
- Tested and pre-packaged to ease installation and usage
- Collects the right versions of the components, which all have different release cycles, and ensures they work together
Page 45

Apache Community Leadership
Cycle: Design & Develop, Test & Patch, Release
Projects: Apache Hadoop, Apache Pig, Apache HBase, Apache Hive, Apache HCatalog, Apache Ambari, other Apache projects
"We have noticed more activity over the last year from Hortonworks engineers on building out Apache Hadoop's more innovative features. These include YARN, Ambari and HCatalog." Jeff Kelly, Wikibon
Apache Software Foundation Guiding Principles
- Release early & often
- Transparency, respect, meritocracy
Key Roles held by Hortonworkers
- VP & PMC Members: Arun Murthy (Hadoop), Daniel Dai (Pig), Mahadev Konar (ZooKeeper)
- Release Managers: Matt Foley (Hadoop 1.x), Arun Murthy (Hadoop 2.x), Ashutosh Chauhan (Hive), Daniel Dai (Pig), Alan Gates (HCatalog), Mahadev Konar (Ambari)
- Committers (We can all be Contributors): 54 across all Hadoop-related projects
Page 46

Hortonworks Process for Enterprise Hadoop
Virtuous cycle when development & fixed issues are done upstream & stable project releases flow downstream
- Upstream Community Projects (Apache Hadoop, Apache Pig, Apache Hive, Apache HBase, Apache HCatalog, Apache Ambari, other Apache projects): Design & Develop, Test & Patch, Release
- Downstream Enterprise Product (Hortonworks Data Platform): Design & Develop, Integrate & Test, Package & Certify, Distribute
- Stable project releases flow downstream; fixed issues flow back upstream
No Lock-in: integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Page 47

Hadoop Evolution Starts at the Core
- Over 1800 contributors: very fast moving and improving
- Driving next generation Hadoop: YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
- 420k+ lines authored since 2006: more than twice nearest contributor
- Deeply integrating w/ ecosystem: enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA); creating deeply engineered solutions (ex. Teradata big data appliance)
- All Apache, NO holdbacks: 100% of code contributed to Apache
Page 48

Where is Enterprise Apache Hadoop Going?
HORTONWORKS DATA PLATFORM (HDP) stack: OPERATIONAL SERVICES, DATA SERVICES, HADOOP CORE, PLATFORM SERVICES
Callouts: Hive (interactive query); HBase (online data); Ambari (manage & operate); data movement; replication, mirroring, snapshots; secure access; biz continuity; manageability, security
In support of Refine, Explore, Enrich
Page 49

Becoming Data Driven Page 50

Path to Becoming Big Data Driven
Key Considerations for a Data Driven Business
1. Large web properties were born this way; you may have to adapt a strategy
2. Start with a project tied to a key objective or KPI. Don't OVER engineer
3. Make sure your Big Data strategy fits your organization and grow it over time
4. Don't do big data just to do big data; you can get lost in all that data
"Simply put, because of big data, managers can measure, and hence know, radically more about their businesses, and directly translate that knowledge into improved decision making & performance." Erik Brynjolfsson and Andrew McAfee

Your Fastest On-ramp to Enterprise Hadoop!
http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
- A free download: http://hortonworks.com/products/hortonworks-sandbox/
- A complete, self-contained virtual machine with Apache Hadoop pre-configured
- A personal, portable and standalone Hadoop environment
- A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
Page 52

Thank You! Questions & Answers Page 53