Connecting Hadoop with Oracle Database
Sharon Stephen, Senior Curriculum Developer, Server Technologies Curriculum
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
- Introduction to Oracle Big Data
- Phases of processing Oracle Big Data
- Overview of the Oracle Big Data Connectors
- Methods of connecting Hadoop to an Oracle Database
- Summary
What Is Big Data?
Big data is defined as voluminous unstructured data from many different sources, such as:
- Social networks
- Banking and financial services
- E-commerce services
- Medical records
- Weblogs
- Web-centric services
- Internet search indexes
- Scientific searches
- Document searches
Big Data: Big Goals
- Discover intelligent data
- Understand e-commerce behavior
- Derive sentiment
- Support interactions
Why Oracle For Big Data?
Oracle Big Data: Four-Phase Solution
1. Acquire
2. Organize
3. Analyze
4. Decide
Oracle's Big Data Solution
- Stream/Acquire: Apache Flume, Oracle GoldenGate, Cloudera Hadoop, Oracle NoSQL Database
- Organize: Oracle Big Data Connectors, Oracle Data Integrator
- Analyze: Oracle Database, Oracle Advanced Analytics, Oracle R Distribution, Oracle Spatial & Graph
- Decide: Oracle Real-Time Decisions, Endeca Information Discovery, Oracle BI Foundation Suite, Oracle Event Processing
Apache Hadoop
Apache Hadoop is a framework for executing applications on large clusters built of commodity hardware. It operates by batch processing. Hadoop consists of two components:
- Hadoop Distributed File System (HDFS), for storage
- MapReduce, for parallel processing
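The map, shuffle/sort, and reduce stages can be illustrated by analogy with a plain Unix pipeline. This sketch is not part of Hadoop; it simply mirrors the stages on one machine: tr plays the mapper (emit one record per word), sort plays the shuffle/sort (identical keys come together), and uniq -c plays the reducer (aggregate each key group).

```shell
# Word count, MapReduce-style, as a Unix pipeline:
#   map:          emit one word per line
#   shuffle/sort: group identical keys together
#   reduce:       count each key group
echo "big data big cluster" | tr ' ' '\n' | sort | uniq -c
# prints each distinct word with its count (e.g. "2 big")
```

Hadoop performs the same three stages, but distributed across the nodes of a cluster and against data stored in HDFS.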
Oracle Big Data Connectors
- Facilitate access to data stored in a Hadoop cluster
- Can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware
Usage Scenarios
- Bulk loading of large volumes of data. Example: historical data; daily uploads of data gathered during the day
- Loading at regular frequency. Example: 24/7 monitoring of log feeds
- Loading at irregular frequency. Example: monitoring of sensor feeds
- Accessing data files in place on HDFS
Installation Details
You can download the connectors from either of the following locations:
- Oracle Technology Network: http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
- Oracle Software Delivery Cloud: https://edelivery.oracle.com/
Oracle Big Data Connectors (licensed together):
- Oracle Loader for Hadoop
- Oracle SQL Connector for HDFS
- Oracle Data Integrator Application Adapter for Hadoop
- Oracle R Connector for Hadoop
- Oracle XQuery for Hadoop
Oracle Loader for Hadoop
- Efficient, high-performance loader for fast movement of data from any Hadoop cluster into a table in Oracle Database
- Allows you to use Hadoop MapReduce processing
- Partitions the data and transforms it into an Oracle-ready format
- Can be operated in two modes: online mode and offline mode
Data Samples
JSON files, sensor data, machine logs, Apache weblogs, Twitter feeds. For example, JSON activity records:
{"custid":1046915,"movieid":null,"genreid":null,"time":"2012-07-01:00:33:18","recommended":null,"activity":9}
{"custid":1144051,"movieid":768,"genreid":9,"time":"2012-07-01:00:33:39","recommended":"n","activity":6}
{"custid":1264225,"movieid":null,"genreid":null,"time":"2012-07-01:00:34:01","recommended":null,"activity":8}
{"custid":1085645,"movieid":null,"genreid":null,"time":"2012-07-01:00:34:18","recommended":null,"activity":8}
{"custid":1098368,"movieid":null,"genreid":null,"time":"2012-07-01:00:34:28","recommended":null,"activity":8}
{"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:35:09","recommended":"y","activity":11,"price":3.99}
{"custid":1156900,"movieid":20352,"genreid":14,"time":"2012-07-01:00:35:12","recommended":"n","activity":7}
{"custid":1336404,"movieid":null,"genreid":null,"time":"2012-07-01:00:35:27","recommended":null,"activity":9}
{"custid":1022288,"movieid":null,"genreid":null,"time":"2012-07-01:00:35:38","recommended":null,"activity":8}
{"custid":1129727,"movieid":1105903,"genreid":11,"time":"2012-07-01:00:36:08","recommended":"n","activity":1,"rating":3}
{"custid":1305981,"movieid":null,"genreid":null,"time":"2012-07-01:00:36:27","recommended":null,"activity":8}
Online and Offline Modes
Oracle Loader for Hadoop partitions, sorts, and converts the data into Oracle data types on Hadoop (in the map, shuffle/sort, and reduce stages), then connects to the database from the reducer nodes and loads into database partitions in parallel.
Choosing Output Mode in OLH

Output mode                        | Use case characteristics
Online load with JDBC              | Simplest; can load into nonpartitioned tables
Online load with Direct Path       | Fast online load for partitioned tables
Offline load with Data Pump files  | Fastest load method, for external tables
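Whichever output mode is chosen, an OLH job is submitted like any other MapReduce job. The sketch below shows the common invocation pattern; the jar path and driver class name follow Oracle's OLH documentation (verify them against your installed release), and olh_config.xml is a hypothetical configuration file that names the input, the target table, the database connection, and the output mode. It runs only on a configured Hadoop cluster, so treat it as a pattern, not a copy-paste command.

```shell
# Sketch only: submit an Oracle Loader for Hadoop job.
# olh_config.xml is a hypothetical config file selecting JDBC,
# Direct Path, or Data Pump file output.
hadoop jar $OLH_HOME/jlib/oraloader.jar \
    oracle.hadoop.loader.OraLoader \
    -conf olh_config.xml
```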
OLH: Advantages
OLH offloads database server processing to Hadoop by:
- Converting the input data to final database format
- Computing the table partition for a row
- Sorting rows by primary key within a table partition
- Generating binary Data Pump files
- Balancing partition groups across reducers
Oracle SQL Connector for HDFS
Oracle SQL Connector for HDFS (OSCH) facilitates read access from HDFS to Oracle Database using external tables. It uses the ORACLE_LOADER access driver. It enables you to:
- Access big data without loading the data
- Access the data stored in HDFS files
- Access CSV (comma-separated value) files and Data Pump files generated by Oracle Loader for Hadoop
- Load data extracted and transformed by Oracle Data Integrator
Oracle SQL Connector for HDFS: Architecture
OSCH gives Oracle Database direct SQL access to data on HDFS through an external table view: queries run against the external table, and the data can be queried in place or imported into the database. The database reaches HDFS through an HDFS client (over InfiniBand, where available).
OSCH: Three Simple Steps
1. Create an external table.
2. Run the OSCH utility to publish HDFS content to the external table.
3. Access and load into the database using SQL.

> hadoop jar \
      $ODCH_HOME/jlib/orahdfs.jar \
      oracle.hadoop.hdfs.exttab.ExternalTable \
      -conf MyConf.xml \
      -publish
OSCH: Features
- Access and analyze data in place on HDFS via external tables
- Query and join data on HDFS with database-resident data
- Load into the database using SQL (if required)
- Automatic load balancing to maximize performance
- DML operations are not allowed on external tables, and indexes cannot be created on them
- Data files can be text files or Oracle Data Pump files
- Parallelism is controlled by the external table definition
- Data files are grouped to distribute the load evenly across PQ slaves
Choosing Connectors
- Oracle Loader for Hadoop: load into Oracle Database. Input data can be delimited text, data from Hive tables (Oracle-supported input formats), or any other format, for example, binary format (by creating your own input format).
- Oracle SQL Connector for HDFS: access directly from Hive or HDFS, or load into Oracle Database. Input data can be delimited text or Oracle Data Pump files only.
ODI Application Adapter for Hadoop
- A Big Data Connector that allows data integration developers to easily integrate and transform data within Hadoop using Oracle Data Integrator
- Has preconfigured ODI knowledge modules
Oracle R Connector for Hadoop
Oracle R Connector for Hadoop (ORCH) is an R package that provides an interface between the local R environment, Oracle Database, and Cloudera CDH. Using simple R functions, you can:
- Sample data in HDFS
- Copy data between Oracle Database and HDFS
- Schedule R programs to execute as MapReduce jobs
- Return the results to Oracle Database or to your laptop
Word Count: Example Without ORCH

Mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String s = value.toString();
        // Split on non-word characters and emit (word, 1) for each word
        for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}

Reducer:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the counts emitted for this word by all mappers
        int wordCount = 0;
        while (values.hasNext()) {
            IntWritable value = values.next();
            wordCount += value.get();
        }
        output.collect(key, new IntWritable(wordCount));
    }
}
Word Count: Example with ORCH

input <- hdfs.put(corpus)
wordcount <- function(input, output = NULL, pattern = " ") {
  res <- hadoop.exec(dfs.id = input,
    mapper = function(k, v) {
      lapply(strsplit(x = v, split = pattern)[[1]],
             function(w) orch.keyval(w, 1)[[1]])
    },
    reducer = function(k, vv) {
      orch.keyval(k, sum(unlist(vv)))
    },
    config = new("mapred.config",
      job.name      = "wordcount",
      map.output    = data.frame(key = 0,  val = ''),
      reduce.output = data.frame(key = '', val = 0))
  )
  res
}
NEW: Oracle XQuery for Hadoop
OXH is a transformation engine for Big Data: queries written in the XQuery language are compiled into a MapReduce execution plan and executed on the worker nodes against data in HDFS. OXH integrates with Oracle Data Integrator, Oracle Loader for Hadoop, and Oracle Database across the Acquire, Organize, and Analyze phases.

Example query:
for $ln in text:collection()
let $f := tokenize($ln)
where $f[1] = 'x'
return text:put($f[2])
Performance
- 15 TB/hour
- 25 times faster than third-party products
- Reduced database CPU usage in comparison
Connectors: Engineered to Leverage All Data
- ODI Application Adapter for Hadoop
- Oracle Loader for Hadoop
- Oracle SQL Connector for HDFS (formerly Oracle Direct Connector for HDFS)
- Oracle R Connector for Hadoop
- Oracle XQuery for Hadoop
Summary
- Oracle Big Data Connectors are products for high-speed loading from Hadoop to Oracle Database
- Cover a range of use cases
- Support several input sources
- Flexible, easy to use, developed and supported by Oracle
- The fastest load option loads at 15 TB/hour
Further References
- http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html
- http://www.oracle.com/us/products/database/exadata/overview/index.html
- http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
- https://blogs.oracle.com/bigdataconnectors/
Questions Mail your queries to: sharon.stephen@oracle.com