Big Data Introduction

Basics of Hadoop
[Diagram: a Job Tracker, four Task Trackers and the MapReduce JAR; a Name Node holding, in memory, the list of pieces of File 1 (Piece 1, Piece 2, Piece 3); four Data Nodes holding the numbered blocks.]
Copyright 2012, Oracle and/or its affiliates. All rights

Oracle Big Data Solution
- Stream: Oracle Event Processing, Apache Flume, Oracle GoldenGate
- Acquire: Cloudera Hadoop (scalable, low-cost data storage and processing engine), Oracle NoSQL Database (scalable key-value store), Oracle R Distribution (statistical analysis framework)
- Organize: Oracle Big Data Connectors, Oracle Data Integrator
- Analyze: Oracle Database, Oracle Advanced Analytics, Oracle Spatial & Graph
- Decide: Oracle Real-Time Decisions, Endeca Information Discovery, Oracle BI Foundation Suite
Copyright 2013, Oracle and/or its affiliates. All rights

Big Data
Unstructured Data
- Massive detail data: store more raw detail data for less cost, while keeping aggregates in the DB
- Big batch jobs: long-running batch jobs can run in Hadoop to make the most of the DB
- Unifying data sources: many data marts merged in Hadoop to provide unified views of data

Hadoop
The Apache Framework for distributed processing
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
- Large Data Sets
- Clusters of Computers
- Simple Computing Models
- Highly Available Service

HDFS
The Distributed Filesystem
What is it? The petabyte-scale distributed file system at the core of Hadoop.
Benefits
- Linearly scalable on commodity hardware
- An order of magnitude cheaper per TB
- Designed around schema-on-read
Limitations
- Low security
- Write-once, read-many model

Interacting with HDFS
NameNodes and DataNodes
- NameNodes contain edits and organization
- DataNodes store data
Command-line access resembles UNIX filesystems
- ls (list)
- cat, tail (concatenate or tail file)
- cp, mv (copy or move within HDFS)
- get, put (copy between local file system and HDFS)

HDFS Mechanics
- The file will be broken up into blocks
- Blocks are stored in multiple locations
- Allows for parallelism and fault-tolerance
- Nodes operate on their local data
[Diagram: six DataNodes]
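The block mechanics above can be illustrated with a small in-memory sketch. This is not HDFS code: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption of this sketch, chosen only
    because it is easy to follow.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(any(n != failed_node for n in replicas) for replicas in placement.values())


# Hypothetical cluster of six DataNodes, matching the diagram.
nodes = [f"datanode{i}" for i in range(1, 7)]
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(len(blocks), nodes, replication=3)
```

Because every block lives on several nodes, losing any single node leaves the file readable, and different nodes can work on different blocks at the same time.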

MapReduce
The Parallel Processing Framework
What is it? The parallel processing framework that dominates the Big Data landscape.
Benefits
- Provides data-local computation
- Fault-tolerant
- Scales just like HDFS
Limitations
- You are the optimizer
- Quasi-functional model is counterintuitive
- Batch-oriented

MapReduce Mechanics
Map Phase:
- Each TaskTracker has some data local to it.
- Map tasks operate on this local data.
- if face_card: emit(suit, card)
[Diagram: four TaskTracker/DataNode pairs]
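A minimal sketch of the map phase, written as plain Python rather than against the Hadoop API. The node names, the sample cards, and the choice of J, Q and K as face cards are assumptions for illustration.

```python
FACE_RANKS = {"J", "Q", "K"}  # assumption: jack, queen and king count as face cards


def map_task(local_cards):
    """Emit (suit, card) for every face card in this node's local data."""
    for rank, suit in local_cards:
        if rank in FACE_RANKS:
            yield suit, (rank, suit)


# Hypothetical data held locally by two TaskTracker/DataNode pairs.
node_data = {
    "node1": [("K", "Spades"), ("7", "Hearts"), ("Q", "Hearts")],
    "node2": [("2", "Clubs"), ("J", "Spades"), ("J", "Diamonds")],
}

# Each node runs the same map function over its own cards only.
map_output = {node: list(map_task(cards)) for node, cards in node_data.items()}
```

No node needs to see another node's cards during this phase, which is what makes the map step data-local.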

MapReduce Mechanics
Reduce Phase:
- Reducers operate on local data to produce the final result
- emit(key, count(key))
[Diagram: four TaskTrackers]
Result:
- Spades: 3
- Hearts: 2
- Diamonds: 2
- Clubs: 2
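The reduce phase can be sketched the same way. The emitted pairs below are made-up sample data, chosen so that the counts match the result on the slide; the `shuffle` helper stands in for the framework's grouping step and is not a Hadoop call.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group mapper output by key so each reducer sees all values for its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Emit (key, count) for one key."""
    return key, len(values)


# Hypothetical (suit, card) pairs emitted by the map tasks.
emitted = [
    ("Spades", "K"), ("Hearts", "Q"), ("Spades", "J"),
    ("Diamonds", "J"), ("Clubs", "K"), ("Hearts", "K"),
    ("Spades", "Q"), ("Diamonds", "Q"), ("Clubs", "J"),
]

result = dict(reduce_task(suit, cards) for suit, cards in shuffle(emitted).items())
```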

Hive
A move toward declarative language
What is it? A SQL-like language for Hadoop.
Benefits
- Abstracts MapReduce code
- Schema-on-read via InputFormat and SerDe
- Provides and preserves metadata
Limitations
- Not ideal for ad hoc work (slow)
- Subset of SQL-92
- Immature optimizer

Storing a Clickstream
- Storing large amounts of clickstream data is a common use for HDFS
- Individual clicks aren't valuable by themselves
- We'd like to write queries over all clicks

Defining Tables Over HDFS
- Hive allows us to define tables over HDFS directories
- The syntax is simple SQL
- SerDes allow Hive to deserialize data
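To make the schema-on-read idea concrete, here is a rough Python analogy of what a SerDe does: the raw lines are stored untouched, and a structure is imposed only when they are read. The tab-delimited layout and the field names (`timestamp`, `user_id`, `url`) are hypothetical, and this is not Hive code.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Click(NamedTuple):
    """Hypothetical schema applied at read time."""
    timestamp: str
    user_id: str
    url: str


def deserialize(line: str) -> Click:
    """Turn one raw tab-delimited line into a typed record."""
    timestamp, user_id, url = line.rstrip("\n").split("\t")
    return Click(timestamp, user_id, url)


def scan(raw_lines: Iterable[str]) -> Iterator[Click]:
    """Read every non-empty raw line and apply the schema on the fly."""
    for line in raw_lines:
        if line.strip():
            yield deserialize(line)


# Stand-in for the files in an HDFS directory (made-up sample data).
raw_file = [
    "2012-01-01T10:00:00\tu1\t/home\n",
    "2012-01-01T10:00:05\tu2\t/products\n",
    "2012-01-01T10:00:09\tu1\t/cart\n",
]

# A query over all clicks: how many clicks per user?
clicks_per_user = Counter(click.user_id for click in scan(raw_file))
```

Changing the schema here means changing `deserialize`, not rewriting the stored data.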

How Does It Work
Anatomy of a Hive Query
How does Hive execute this query?
    SELECT suit, COUNT(*)
    FROM cards
    WHERE face_value > 10
    GROUP BY suit;

Anatomy of a Hive Query
    SELECT suit, COUNT(*)
    FROM cards
    WHERE face_value > 10
    GROUP BY suit;
1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker
Map task: if face_card: emit(suit, card)
Shuffle
Reduce task: emit(suit, count(suit))
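The translation described in the steps above can be checked on a toy scale. The sketch below runs the same query in an in-memory SQLite database (standing in for Hive, which is an assumption of this example) and compares it with a hand-written map and reduce. The 52-card table with face values 1 to 13 is also assumed.

```python
import sqlite3
from collections import defaultdict

SUITS = ("Spades", "Hearts", "Diamonds", "Clubs")
# Assumed table contents: every suit with face values 1..13.
cards = [(suit, value) for suit in SUITS for value in range(1, 14)]

# Declarative version: the query from the slide, run in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (suit TEXT, face_value INTEGER)")
conn.executemany("INSERT INTO cards VALUES (?, ?)", cards)
sql_result = dict(conn.execute(
    "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit"
))
conn.close()


def map_task(rows):
    """Projection and predicate: keep suit, filter on face_value > 10."""
    for suit, face_value in rows:
        if face_value > 10:
            yield suit, face_value


def reduce_task(pairs):
    """Aggregation: group by suit (the shuffle) and count each group."""
    groups = defaultdict(list)
    for suit, face_value in pairs:
        groups[suit].append(face_value)
    return {suit: len(values) for suit, values in groups.items()}


mr_result = reduce_task(map_task(cards))
```

Both paths give the same counts per suit; the difference is that in the MapReduce version the author of the job decides how the work is split.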

Big Data and Optimized Operations
- Big Data can handle a lot of heavy lifting
- It's a complement to the database
- Big Data allows access to more detail data for less
- We can use Big Data to make the database do more

Optimizing ETL, Saving SLAs
[Diagram labels: Big Data Problem; Long-running batch transformation; Base Table; Load to Oracle; Mission Critical Reporting; Ad Hoc Analysis; Copy/Move Base Table to Hadoop; Long-running batch transformation]

What Does a Big Data World Look Like?
Truck / Motor Manufacturer
Collections
- Internal sensors
- Miles per gallon, driving techniques
- Location information
Uses
- Better tailored servicing plans
- Better targeted marketing
- Offer better finance deals or related options
- More data for R&D
- Sell on to partners

Big Data and Analytics
- Big Data does not make analytics easier
- There is no magic bullet
- Some things work better in a database
- Big Data allows the collection of new datasets
- Big Data allows modeling on a more granular level

No Magic Bullets
- Food monitoring by RFID tags
- Fridge monitors food usage and sell-by dates
- Monitor the complete car
- Better targeted marketing
There is a gap between
- The available dataset
- The value proposition
Big Data helps bridge the gap

Some Things Work Better in RDBMS
- Clustering on massive data
- Fine-grained classification
- Dataset construction
- Deploying models on many subgroups
- Time series analysis
- Spatial analysis
- Linear and nonlinear modeling
- Interaction with SAS and R

Collecting New Datasets
The Complete Car
[Diagram labels: Big Data Problem; Minute-by-minute MPH; GPS Readings; On-board Vehicle Diagnostics; Trip (Location and Speed); Vehicle Usage Report]
- How does the customer drive?
- Where does the customer drive?
- How do we maximize their value?

More Granular Modeling
Testing Trip Dynamics
[Diagram labels: Analyst; Big Data Problem; New Model for Maintenance Alerts; Test and Summarize On All Engine Readings; Aggregated Test Results]

Fitting Fat Tails
Modeling outlying customers
- Significant value may exist in the tails
[Diagram labels: Analyst; Big Data Problem; Parallelized Locally-weighted Linear Regression; Model for all data]
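For readers unfamiliar with locally-weighted linear regression, here is a minimal single-variable sketch in plain Python. The Gaussian kernel, the `bandwidth` parameter, and the sample data are assumptions of this example; the slide does not say which kernel or implementation is used.

```python
import math


def lwlr_predict(x0, xs, ys, bandwidth):
    """Predict y at x0 from a straight line fitted with weights centred on x0.

    Points near x0 get weights close to 1 and distant points get weights
    close to 0 (Gaussian kernel, an assumption of this sketch).
    """
    weights = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x0


# Made-up data: a straight-line body (y = 2x + 1) with a steep tail at x = 8, 9.
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs[:8]] + [40.0, 60.0]

# Each query point is fitted independently of the others, so in a cluster
# setting these calls could run in parallel on different nodes.
predictions = {x0: lwlr_predict(x0, xs, ys, bandwidth=0.5) for x0 in (2.0, 9.0)}
```

A single global line would be pulled off course by the tail points; the local fits follow the body and the tail separately.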