Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights
Conventional infrastructure
Map Reduce
In Actuality
What is Map Reduce
Basics Of Hadoop
[Diagram: a Job Tracker, four Task Trackers, the Map Reduce JAR, four Data Nodes, and a Name Node that keeps in memory the locations of File 1 pieces 1 to 3]
Data Loading
Programming Languages
- Normal Hadoop
- Pig
- DataFu
- HCatalog
Management
- ZooKeeper
[Diagram: Process Thread 1, Process Thread 2]
GUIs
Similar to Oracle
Big Data @ Oracle
Oracle Big Data Solution
Stages: Stream, Acquire, Organize, Analyze, Decide
Products shown:
- Oracle Event Processing
- Apache Flume
- Oracle GoldenGate
- Cloudera Hadoop (scalable, low-cost data storage and processing engine)
- Oracle NoSQL Database (scalable key-value store)
- Oracle R Distribution (statistical analysis framework)
- Oracle Big Data Connectors
- Oracle Data Integrator
- Oracle Database
- Oracle Advanced Analytics
- Oracle Spatial & Graph
- Oracle Real-Time Decisions
- Endeca Information Discovery
- Oracle BI Foundation Suite
Big Data
- Unstructured data
- Massive detail data: store more raw detail data for less cost, while keeping aggregates in the DB
- Big batch jobs: long-running batch jobs can run in Hadoop to make the most of the DB
- Unifying data sources: many data marts merged in Hadoop to provide unified views of data
Big Data Hadoop
Hadoop Can Be Confusing
What is Hadoop?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Key points: framework for distributed processing; large data sets; clusters of computers; simple computing models; highly available service
What to Pay Attention To
- Distributed storage: HDFS
- Parallel processing framework: MapReduce
- Higher-level languages: Hive, Pig, etc.
HDFS: The Distributed Filesystem
What is it?
- The petabyte-scale distributed file system at the core of Hadoop.
Benefits
- Linearly scalable on commodity hardware
- An order of magnitude cheaper per TB
- Designed around schema-on-read
Limitations
- Low security
- Write-once, read-many model
Interacting with HDFS
NameNodes and DataNodes
- NameNodes hold the edit log and the file system organization (metadata)
- DataNodes store data
Command-line access resembles UNIX filesystems
- ls (list)
- cat, tail (concatenate or tail file)
- cp, mv (copy or move within HDFS)
- get, put (copy between local file system and HDFS)
HDFS Mechanics
- Suppose we have a large file
- And a set of DataNodes
[Diagram: six DataNodes]
HDFS Mechanics
- The file will be broken up into blocks
- Blocks are stored in multiple locations
- Allows for parallelism and fault-tolerance
- Nodes operate on their local data
[Diagram: six DataNodes]
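The block splitting and replication described above can be sketched in a few lines of Python. This is a toy model only: the function name `place_blocks`, the small block size, the replication factor of 3, and the round-robin placement are assumptions made for illustration, and real HDFS placement is more involved.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into blocks and assign each block to several distinct nodes.

    Toy round-robin placement (an assumption for illustration only).
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes, wrapping around, so replicas are always distinct.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


nodes = [f"DataNode{i}" for i in range(1, 7)]  # six nodes, as on the slide
layout = place_blocks(file_size=1000, block_size=128, datanodes=nodes)
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Because every block lives on more than one node, losing a single DataNode in this model still leaves each block readable, and different nodes can work on different blocks at the same time.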
MapReduce: The Parallel Processing Framework
What is it?
- The parallel processing framework that dominates the Big Data landscape.
Benefits
- Provides data-local computation
- Fault-tolerant
- Scales just like HDFS
Limitations
- You are the optimizer
- Quasi-functional model is counterintuitive
- Batch-oriented
MapReduce Mechanics
Suppose 3 face cards are removed from a deck of cards. How do we find which suits are short using MapReduce?
MapReduce Mechanics
Map Phase: Each TaskTracker has some data local to it. Map tasks operate on this local data.
    if face_card: emit(suit, card)
[Diagram: four TaskTracker/DataNode pairs]
MapReduce Mechanics
Shuffle/Sort: Intermediate data is shuffled and sorted for delivery to the reduce tasks.
[Diagram: Sort, then To Reducers]
MapReduce Mechanics
Reduce Phase: Reducers operate on local data to produce the final result.
    emit(key, count(key))
[Diagram: four TaskTrackers]
Result: Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2
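The three phases of the card example can be imitated in plain Python, as a single-process sketch rather than real Hadoop code. The choice of which three face cards are missing, the four-way split of the deck, and the helper names `map_task`, `shuffle`, and `reduce_task` are assumptions for illustration.

```python
from collections import defaultdict

SUITS = ["Spades", "Hearts", "Diamonds", "Clubs"]
RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
FACE_RANKS = {"J", "Q", "K"}


def map_task(cards):
    """Map: emit (suit, card) for every face card in this node's local split."""
    for rank, suit in cards:
        if rank in FACE_RANKS:
            yield suit, (rank, suit)


def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce: emit (key, count of values for that key)."""
    return key, len(values)


deck = [(rank, suit) for suit in SUITS for rank in RANKS]
# Assumed for illustration: which three face cards were removed.
removed = {("J", "Hearts"), ("Q", "Diamonds"), ("K", "Clubs")}
deck = [card for card in deck if card not in removed]

splits = [deck[i::4] for i in range(4)]  # pretend each split is local to one node
intermediate = [pair for split in splits for pair in map_task(split)]
counts = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
short_suits = sorted(suit for suit, n in counts.items() if n < 3)
print(counts)
print(short_suits)
```

One caveat of this sketch: a suit that lost all three face cards would emit nothing in the map phase and so would not appear in `counts` at all.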
Hive: A move toward declarative language
What is it?
- A SQL-like language for Hadoop.
Benefits
- Abstracts MapReduce code
- Schema-on-read via InputFormat and SerDe
- Provides and preserves metadata
Limitations
- Not ideal for ad hoc work (slow)
- Subset of SQL-92
- Immature optimizer
Storing a Clickstream
- Storing large amounts of clickstream data is a common use for HDFS
- Individual clicks aren't valuable by themselves
- We'd like to write queries over all clicks
Defining Tables Over HDFS
- Hive allows us to define tables over HDFS directories
- The syntax is simple SQL
- SerDes allow Hive to deserialize data
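The idea of deserializing raw files at read time can be illustrated with a small Python sketch. It only mimics the role a SerDe plays and is not Hive's API: the tab-delimited click format, the column names, and the `deserialize` helper are all hypothetical.

```python
from collections import Counter

# Raw click lines as they might sit in a directory (hypothetical format).
RAW_CLICKS = [
    "2013-01-05T10:00:01\tu1\t/home",
    "2013-01-05T10:00:07\tu2\t/products",
    "2013-01-05T10:01:15\tu1\t/products",
]

# The "table definition": column names and types, applied only when reading.
SCHEMA = [("ts", str), ("user_id", str), ("url", str)]


def deserialize(line, schema=SCHEMA, delimiter="\t"):
    """Turn one raw text line into a typed row according to the schema."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = [deserialize(line) for line in RAW_CLICKS]
clicks_per_url = Counter(row["url"] for row in rows)
print(clicks_per_url)
```

The stored data never changes here; a different schema or delimiter could be applied to the same lines later, which is the point of schema-on-read.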
How Does It Work: Anatomy of a Hive Query
How does Hive execute this query?
    SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;
Anatomy of a Hive Query
    SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;
1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker
Map task:    if face_card: emit(suit, card)
Reduce task: emit(suit, count(suit))
[Diagram: Map tasks feed Reduce tasks through a Shuffle step]
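To see that the predicate-to-map and aggregation-to-reduce translation gives the same answer as the SQL, the query can be run side by side with a hand-written map and reduce. This sketch uses Python's built-in `sqlite3` as a stand-in for Hive, which is an assumption for illustration; so are the full 52-card table and the encoding of face cards as `face_value` 11 to 13.

```python
import sqlite3
from collections import Counter

SUITS = ["Spades", "Hearts", "Diamonds", "Clubs"]
# Assumed encoding: face_value 1..13, with 11, 12, 13 as the face cards.
rows = [(suit, value) for suit in SUITS for value in range(1, 14)]

# Declarative version: the query from the slide, run on an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (suit TEXT, face_value INTEGER)")
conn.executemany("INSERT INTO cards VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit"
    ).fetchall()
)
conn.close()


def map_task(records):
    """Predicate (WHERE) and projection: keep face cards, emit the grouping key."""
    for suit, face_value in records:
        if face_value > 10:
            yield suit, 1


def reduce_task(pairs):
    """Aggregation (COUNT(*) ... GROUP BY): sum the emitted ones per suit."""
    counts = Counter()
    for suit, one in pairs:
        counts[suit] += one
    return dict(counts)


mr_result = reduce_task(map_task(rows))
print(sql_result == mr_result)
```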
Using Hadoop To Optimize IT
Big Data and Optimized Operations
- Big Data can handle a lot of heavy lifting
- It's a complement to the database
- Big Data allows access to more detail data for less cost
- We can use Big Data to make the database do more
Optimizing ETL, Saving SLAs
[Diagram, "Problem" panel: Base Table, Long-running batch transformation, Mission Critical Reporting, Ad Hoc Analysis]
[Diagram, "Big Data" panel: Copy/Move Base Table to Hadoop, Long-running batch transformation, Load to Oracle]
Store More Details For Less
[Diagram labels: Problem; Big Data; Base Table; Aggregation; Reporting Table; External Table or Aggregate on Hadoop]
Using Hadoop To Build New Datasets
What Does a Big Data World Look Like?
Truck / Motor Manufacturer
Collections
- Internal sensors
- Miles per gallon, driving techniques
- Location information
Uses
- Better tailored servicing plans
- Better targeted marketing
- Offer better finance deals or related options
- More data for R&D
- Sell on to partners
Big Data and Analytics
- Big Data does not make analytics easier
- There is no magic bullet
- Some things work better in a database
- Big Data allows the collection of new datasets
- Big Data allows modeling on a more granular level
No Magic Bullets
- Food monitoring by RFID tags: fridge monitors food usage and sell-by dates
- Monitor the complete car: better targeted marketing
- There is a gap between the available dataset and the value proposition
- Big Data helps bridge the gap
Some Things Work Better in RDBMS
- Clustering on massive data
- Fine-grained classification
- Dataset construction
- Deploying models on many subgroups
- Time series analysis
- Spatial analysis
- Linear and nonlinear modeling
- Interaction with SAS and R
Collecting New Datasets: The Complete Car
[Diagram labels: Problem; Big Data; Minute-by-minute MPH; GPS Readings; On-board Vehicle Diagnostics; Trip (Location and Speed); Vehicle Usage Report]
- How does the customer drive?
- Where does the customer drive?
- How do we maximize their value?
More Granular Modeling: Testing Trip Dynamics
[Diagram labels: Analyst; Problem; Big Data; New Model for Maintenance Alerts; Test and Summarize On All Engine Readings; Aggregated Test Results]
Fitting Fat Tails: Modeling outlying customers
- Significant value may exist in the tails
[Diagram labels: Analyst; Problem; Big Data; Parallelized Locally-weighted Linear Regression; Model for all data]
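The slide names locally-weighted linear regression without giving details, so the following is only a minimal one-dimensional sketch of the general technique, not the implementation behind the slide. The Gaussian kernel, the bandwidth parameter `tau`, the function name `lwlr_predict`, and the sample data are all assumptions.

```python
import math


def lwlr_predict(x0, xs, ys, tau=1.0):
    """Predict y at x0 from a straight line fitted with weights centred on x0.

    Points near x0 get weight close to 1, distant points close to 0
    (Gaussian kernel with bandwidth tau, an assumed choice).
    """
    weights = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    cov_xy = sum(
        w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys)
    )
    if var_x == 0:
        return mean_y  # all weight on a single x value: no slope to estimate
    slope = cov_xy / var_x  # weighted least-squares slope
    return mean_y + slope * (x0 - mean_x)


# Made-up data: a linear bulk with two unusually large values in the tail.
xs = list(range(10))
ys = [2 * x + 1 for x in range(8)] + [40, 60]

# Each query point is fitted independently of the others.
predictions = list(map(lambda q: lwlr_predict(q, xs, ys), [2.0, 5.0, 9.0]))
print(predictions)
```

Since every query point gets its own independent fit, the predictions can be computed in parallel, and the local weighting lets the tail be modeled without distorting the fit for the bulk of the data.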
Q&A