Big Data Introduction

Big Data Introduction
Ralf Lange, Global ISV & OEM Sales
Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Conventional infrastructure

Map Reduce

In Actuality

What is Map Reduce

Basics Of Hadoop
[Diagram: a Job Tracker, four Task Trackers, the Map Reduce JAR, four Data Nodes, and a Name Node holding in memory the locations of File 1 Piece 1, File 1 Piece 2, and File 1 Piece 3]

Data Loading

Programming Languages
- Normal Hadoop
- PIG
- DataFu
- HCatalog

Management
- ZooKeeper
[Diagram: Process Thread 1, Process Thread 2]

GUIs

Similar to Oracle

Big Data @ Oracle

Oracle Big Data Solution
Phases: Stream, Acquire, Organize, Analyze, Decide
Products shown:
- Oracle Real-Time Decisions
- Endeca Information Discovery
- Oracle BI Foundation Suite
- Oracle Event Processing
- Apache Flume
- Oracle GoldenGate
- Cloudera Hadoop (scalable, low-cost data storage and processing engine)
- Oracle NoSQL Database (scalable key-value store)
- Oracle R Distribution (statistical analysis framework)
- Oracle Database
- Oracle Big Data Connectors
- Oracle Data Integrator
- Oracle Advanced Analytics
- Oracle Spatial & Graph
Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Big Data
- Unstructured Data
- Massive detail data: Store more raw detail data for less cost, while keeping aggregates in the DB
- Big batch jobs: Long-running batch jobs can run in Hadoop to make the most of the DB
- Unifying data sources: Many data marts merged in Hadoop to provide unified views of data

Big Data Hadoop

Hadoop Can Be Confusing

What is Hadoop?

Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Callouts:
- Framework for distributed processing
- Large Data Sets
- Clusters of Computers
- Simple Computing Models
- Highly Available Service

What to Pay Attention To
- Distributed Storage: HDFS
- Parallel Processing Framework: MapReduce
- Higher-Level Languages: Hive, Pig, etc.

HDFS: The Distributed Filesystem
What is it? The petabyte-scale distributed file system at the core of Hadoop.
Benefits:
- Linearly scalable on commodity hardware
- An order of magnitude cheaper per TB
- Designed around schema-on-read
Limitations:
- Low security
- Write-once, read-many model

Interacting with HDFS
NameNodes and DataNodes:
- NameNodes hold the edit log and the organization of the file system (metadata)
- DataNodes store data
Command-line access resembles UNIX filesystems:
- ls (list)
- cat, tail (concatenate or tail file)
- cp, mv (copy or move within HDFS)
- get, put (copy between local file system and HDFS)

HDFS Mechanics
- Suppose we have a large file
- And a set of DataNodes
[Diagram: six DataNodes]

HDFS Mechanics
- The file will be broken up into blocks
- Blocks are stored in multiple locations
- Allows for parallelism and fault-tolerance
- Nodes operate on their local data
[Diagram: six DataNodes]
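The block mechanics above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the 300 MB file, the 128 MB block size, and the replication factor of 3 are illustrative assumptions, and the round-robin placement stands in for whatever placement policy a real cluster applies.

```python
def split_into_blocks(file_size, block_size):
    """Cut a file of file_size bytes into (block_id, length) pieces."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_replicas(blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Round-robin is an assumption made for illustration only; it is not
    the placement policy HDFS itself uses.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id, _ in blocks
    }


def readable_blocks(placement, failed_nodes):
    """Return ids of blocks that still have at least one live replica."""
    return [block_id for block_id, nodes in placement.items()
            if any(node not in failed_nodes for node in nodes)]


MB = 1024 * 1024
datanodes = [f"datanode{i}" for i in range(1, 7)]   # six DataNodes, as on the slide
blocks = split_into_blocks(300 * MB, 128 * MB)      # illustrative sizes
placement = place_replicas(blocks, datanodes, replication=3)

for block_id, nodes in placement.items():
    print(block_id, nodes)
```

Because every block lives on several nodes, `readable_blocks(placement, {"datanode3"})` still returns all three block ids: losing one DataNode costs no data, and any node holding a replica can process that block locally.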

MapReduce: The Parallel Processing Framework
What is it? The parallel processing framework that dominates the Big Data landscape.
Benefits:
- Provides data-local computation
- Fault-tolerant
- Scales just like HDFS
Limitations:
- You are the optimizer
- Quasi-functional model is counterintuitive
- Batch-oriented

MapReduce Mechanics
Suppose 3 face cards are removed from a deck. How do we find which suits are short using MapReduce?

MapReduce Mechanics
Map Phase: Each TaskTracker has some data local to it. Map tasks operate on this local data.
If face_card: emit(suit, card)
[Diagram: four TaskTracker/DataNode pairs]

MapReduce Mechanics
Shuffle/Sort: Intermediate data is shuffled and sorted for delivery to the reduce tasks.
[Diagram: Sort, To Reducers]

MapReduce Mechanics
Reduce Phase: Reducers operate on local data to produce the final result.
emit(key, count(key))
[Diagram: four TaskTrackers]
Result: Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2
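The three phases of the card example can be simulated in plain Python. This is a single-process sketch, not Hadoop code: the function names, the choice of which three face cards are removed, and the way the deck is cut into four splits are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

SUITS = ["Spades", "Hearts", "Diamonds", "Clubs"]
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
FACE_RANKS = {"J", "Q", "K"}


def build_deck(removed):
    """A full deck minus the removed (suit, rank) cards."""
    return [(s, r) for s, r in product(SUITS, RANKS) if (s, r) not in removed]


def map_card(card):
    """Map task: if face_card, emit(suit, card)."""
    suit, rank = card
    if rank in FACE_RANKS:
        yield suit, rank


def shuffle_sort(pairs):
    """Shuffle/sort: group intermediate values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_suit(suit, cards):
    """Reduce task: emit(key, count(key))."""
    return suit, len(cards)


def run_job(splits):
    # Each split stands in for the data local to one TaskTracker/DataNode.
    intermediate = [pair for split in splits
                    for card in split
                    for pair in map_card(card)]
    return dict(reduce_suit(k, v) for k, v in shuffle_sort(intermediate).items())


# Hypothetical choice of the three removed face cards.
removed = {("Hearts", "K"), ("Diamonds", "Q"), ("Clubs", "J")}
deck = build_deck(removed)
splits = [deck[i::4] for i in range(4)]   # four nodes, as on the slides

counts = run_job(splits)
short_suits = sorted(s for s, n in counts.items() if n < len(FACE_RANKS))
print(counts)       # {'Clubs': 2, 'Diamonds': 2, 'Hearts': 2, 'Spades': 3}
print(short_suits)  # ['Clubs', 'Diamonds', 'Hearts']
```

A suit is short when its face-card count falls below three, which matches the result on the slide: only Spades still has all three face cards.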

Hive: A move toward declarative language
What is it? A SQL-like language for Hadoop.
Benefits:
- Abstracts MapReduce code
- Schema-on-read via InputFormat and SerDe
- Provides and preserves metadata
Limitations:
- Not ideal for ad hoc work (slow)
- Subset of SQL-92
- Immature optimizer

Storing a Clickstream
- Storing large amounts of clickstream data is a common use for HDFS
- Individual clicks aren't valuable by themselves
- We'd like to write queries over all clicks

Defining Tables Over HDFS
- Hive allows us to define tables over HDFS directories
- The syntax is simple SQL
- SerDes allow Hive to deserialize data

How Does It Work: Anatomy of a Hive Query
How does Hive execute this query?
SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;

Anatomy of a Hive Query
SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;
1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker
Map task: If face_card: emit(suit, card)
Shuffle
Reduce task: emit(suit, count(suit))
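Steps 2 and 3 can be made concrete with a small Python sketch that runs the same query by hand. It is not what Hive generates: the sample rows, the function names, and the encoding of Jack, Queen, and King as face values 11 to 13 are assumptions chosen to match the predicate face_value > 10.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of the `cards` table (Jack=11, Queen=12, King=13).
cards = [
    {"suit": "Spades", "face_value": 13},
    {"suit": "Spades", "face_value": 12},
    {"suit": "Spades", "face_value": 7},
    {"suit": "Hearts", "face_value": 11},
    {"suit": "Hearts", "face_value": 10},
    {"suit": "Clubs", "face_value": 2},
]


def map_task(row):
    """Predicate (WHERE face_value > 10) and projection (SELECT suit)."""
    if row["face_value"] > 10:
        yield row["suit"], 1


def reduce_task(suit, ones):
    """Aggregation (COUNT(*) ... GROUP BY suit)."""
    return suit, sum(ones)


def run_query(rows):
    pairs = [kv for row in rows for kv in map_task(row)]
    pairs.sort(key=itemgetter(0))                      # shuffle/sort by key
    return [reduce_task(suit, (count for _, count in group))
            for suit, group in groupby(pairs, key=itemgetter(0))]


result = run_query(cards)
print(result)  # [('Hearts', 1), ('Spades', 2)]
```

Clubs is absent from the output because no Clubs row passes the predicate, which is also how the SQL query behaves.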

Using Hadoop To Optimize IT

Big Data and Optimized Operations
- Big Data can handle a lot of heavy lifting
- It's a complement to the database
- Big Data allows access to more detail data for less cost
- We can use Big Data to make the database do more

Optimizing ETL, Saving SLAs
[Diagram labels: Big Data Problem; Long-running batch transformation; Base Table; Load to Oracle; Mission Critical Reporting; Ad Hoc Analysis; Copy/Move Base Table to Hadoop; Long-running batch transformation]

Store More Details For Less
[Diagram labels: Big Data Problem; Reporting Table; Base Table; External Table or Aggregate on Hadoop; Aggregation]

Using Hadoop To Build New Datasets

What Does a Big Data World Look Like?
Example: Truck / Motor Manufacturer
Collections:
- Internal sensors
- Miles per gallon, driving techniques
- Location information
Uses:
- Better tailored servicing plans
- Better targeted marketing
- Offer better finance deals or related options
- More data for R&D
- Sell on to partners

Big Data and Analytics
- Big Data does not make analytics easier
- There is no magic bullet
- Some things work better in a database
- Big Data allows the collection of new datasets
- Big Data allows modeling on a more granular level

No Magic Bullets
- Food monitoring by RFID tags
- Fridge monitors food usage and sell-by dates
- Monitor the complete car
- Better targeted marketing
- There is a gap between the available dataset and the value proposition
- Big Data helps bridge the gap

Some Things Work Better in RDBMS
- Clustering on massive data
- Fine-grained classification
- Dataset construction
- Deploying models on many subgroups
- Time Series Analysis
- Spatial Analysis
- Linear and Nonlinear Modeling
- Interaction with SAS and R

Collecting New Datasets: The Complete Car
[Diagram labels: Big Data Problem; Minute-by-minute MPH; GPS Readings; On-board Vehicle Diagnostics; Trip (Location and Speed); Vehicle Usage Report]
- How does the customer drive?
- Where does the customer drive?
- How do we maximize their value?

More Granular Modeling: Testing Trip Dynamics
[Diagram labels: Analyst; Big Data Problem; New Model for Maintenance Alerts; Test and Summarize On All Engine Readings; Aggregated Test Results]

Fitting Fat Tails: Modeling outlying customers
- Significant value may exist in the tails
[Diagram labels: Analyst; Big Data Problem; Parallelized Locally-weighted Linear Regression; Model for all data]
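The slide names locally-weighted linear regression without showing it, so the following is only a minimal one-dimensional sketch of the general technique, not the implementation behind the slide. The Gaussian kernel, the bandwidth parameter `tau`, the function names, and the sample data are all assumptions. Each query point gets its own weighted least-squares line, so the fits are independent of one another; the thread pool below only illustrates that independence, where a cluster would spread the query points across nodes.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def lwlr_predict(x0, xs, ys, tau):
    """Fit a weighted least-squares line around x0 and evaluate it at x0."""
    # Gaussian kernel: training points near x0 get weights close to 1.
    weights = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(weights)
    if sw == 0:
        raise ValueError("query point is too far from all training data")
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    denom = sw * sxx - sx * sx
    if denom == 0:
        return sy / sw            # degenerate case: weighted mean of y
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x0


def fit_all(queries, xs, ys, tau):
    """Evaluate every query point independently (the parallelizable step)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda q: lwlr_predict(q, xs, ys, tau), queries))


# Hypothetical data: an exactly linear relationship, y = 2x + 1.
xs = [float(x) for x in range(10)]
ys = [2 * x + 1 for x in xs]
queries = [0.5, 4.0, 8.5]
predictions = fit_all(queries, xs, ys, tau=2.0)
print(predictions)
```

On exactly linear data every local fit recovers the same line, which makes the sketch easy to check. On real data the local fits differ from point to point, which is what lets the model follow outlying customers instead of averaging them away.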

Q&A