Integrating Apache Spark with Oracle NoSQL Database


An Oracle White Paper, January 2016

Introduction

In recent times, big data analytics has become a key enabler of business success. It requires performant data stores and high-level analytics frameworks that provide the right abstractions. The low-latency, highly scalable storage provided by Oracle NoSQL Database [1], coupled with the high-performance analytics provided by Apache Spark [2], makes an attractive combination for big data analytics. In this paper we take distributed log analysis as a simple analytics use case and explain how to store log data in Oracle NoSQL Database and perform analytics on it using Apache Spark. The following sections discuss:

- The high-level architectures of Oracle NoSQL Database and Apache Spark, and the integration between them
- The log analysis use case
- How to implement the use case using Oracle NoSQL Database and Apache Spark

Oracle NoSQL Database

Oracle NoSQL Database is a highly scalable, highly available, fault-tolerant, always-on distributed key-value database that can be deployed on low-cost commodity hardware in a scale-out manner. Its use cases include distributed web-scale applications, real-time event processing, mobile data management, time-series and sensor data management, and online gaming. Oracle NoSQL Database offers the features common to typical NoSQL products, such as elasticity, eventually consistent transactions, multiple data centers, secondary indexes, security, and schema flexibility. Its key differentiators include ACID transactions, online rolling upgrade, streaming large-object support, engineered systems, and integration with Oracle technology. The architecture of Oracle NoSQL Database is shown in Figure 1.

Apache Spark

Apache Spark is a powerful open source, general-purpose cluster computing engine for performing high-speed, sophisticated analytics.
Some of its key features include:

- Speed: Spark provides high-speed analytics by harnessing its advanced Directed Acyclic Graph (DAG) execution engine, which supports cyclic data flow and in-memory computing.
- Ease of use: Spark has easy-to-use APIs in Java, Scala, Python, and R for working on large datasets, and offers over 80 high-level operators for performing actions and transformations.

[1] http://www.oracle.com/us/products/database/nosql/overview/index.html
[2] http://spark.apache.org/

- Generality: Spark supports a rich set of higher-level tools for SQL and structured data processing, machine learning, graph processing, and streaming. These libraries can be combined seamlessly in the same application to create complex workflows.
- Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, S3, and any Hadoop data source.

The architecture of Apache Spark is shown in Figure 2.

Figure 1. Architecture of Oracle NoSQL Database

Figure 2. High-level architecture of Apache Spark: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), and GraphX (graph processing) sit on top of Spark Core, which runs on the standalone scheduler, YARN, or Mesos.

Resilient Distributed Datasets (RDDs)

Spark is built around the concept of Resilient Distributed Datasets (RDDs) [3]. An RDD is a lazily evaluated, read-only collection of records. It can be transformed into another RDD by operations such as map, filter, and join; actions such as count compute and output results. Most importantly, Spark offers control over how RDDs are persisted: they can be cached in memory, stored on disk, replicated across machines, and so on. These caching abstractions make Spark a compelling framework for applications that operate on a dataset repeatedly, such as iterative machine learning and interactive data exploration and mining. Spark has been used in a broad range of analytics applications, including in-memory analytics, traffic modeling, social network spam analysis, social network analysis, fraud detection, recommendations, log processing, predictive intelligence, customer segmentation, sentiment analysis, and IoT analytics.

At a high level, every Spark application consists of a driver program that launches parallel operations on a cluster. Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster. Once you have a SparkContext, you can use it to build RDDs and then apply a series of operations to them. To run these operations, driver programs typically manage a number of nodes called executors. Figure 3 shows a Spark application using two executors.

Figure 3. An example Spark application using two executors: the driver program's SparkContext dispatches tasks to an executor running on each worker node.

Integrating Spark with Oracle NoSQL Database

Spark allows data to be loaded from a Hadoop file, and this mechanism can be generalized by extending Hadoop's InputFormat class.
Any datastore that implements Hadoop's InputFormat specification can therefore act as a data source for Spark. Oracle NoSQL Database provides the TableInputFormat class, which extends Hadoop's InputFormat class, thereby allowing data to be loaded from Oracle NoSQL Database.

[3] https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
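The contract a data source must satisfy is small. The following pseudocode sketch outlines Hadoop's InputFormat interface; getSplits and createRecordReader are the actual method names from the Hadoop mapreduce API, while the rest is schematic:

```
abstract class InputFormat<K, V>
    // Partition the source data into independent chunks;
    // Spark creates one RDD partition per split
    getSplits(context) -> list of InputSplit

    // Return a reader that iterates over the (key, value)
    // records of a single split
    createRecordReader(split, context) -> RecordReader<K, V>
```

By implementing these two methods over its shards and tables, Oracle NoSQL Database's TableInputFormat lets Spark read table rows in parallel just as it would read blocks of an HDFS file.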

Spark's newAPIHadoopRDD method lets you specify the class that handles InputFormat processing, along with the classes for keys and values. In addition, a configuration object can be supplied for use by the InputFormat class. For more information, see the section Log Analysis Using Spark.

Log Analysis Use Case

The log analyzer program allows keyword-based search of the logs and categorizes the search results into classes for easier log comprehension. It also supports filtering on the fields of the log entries, such as the time period, the host on which the log entry occurred, and the severity level of each entry. For this use case we have taken the logs produced by an Oracle NoSQL Database cluster while running a performance test. Entries are logged on all nodes of the cluster and have the following format:

    <timestamp> <log severity> <host> <log message>

Some entries, such as Java exception traces, do not conform to this format; all such content is ignored.

Log Processing and Storage

Log files from all the nodes are collected into a single place before processing. For each file, the log entries are parsed into fields and pushed into Oracle NoSQL Database. The definition of the NoSQL table is quite simple:

    CREATE TABLE Log (
        id INTEGER,
        tsmillis LONG,
        loglevel STRING,
        host STRING,
        message STRING,
        PRIMARY KEY (id)
    )

The tsmillis field stores the timestamp as the number of milliseconds since the Unix epoch, which allows logs to be filtered by time period (for example, the past hour, day, or week). The remaining field definitions are self-explanatory. Logs are pushed into Oracle NoSQL Database using the following command:

    java loganalyzer.PushLogs <log_directory>

Log Analysis Using Spark

Once the log entries are in the database, Spark is used to analyze them in the following steps:

1. The driver program creates a SparkContext object as shown below.
    SparkConf sparkConf = new SparkConf()
        .setAppName("SparkTableInput")
        .setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);

The above code specifies that Spark runs locally, using two worker threads.

2. Read the log data from Oracle NoSQL Database into Spark as shown below:

    Configuration hconf = new Configuration();
    hconf.set("oracle.kv.kvstore", STORE_NAME);
    hconf.set("oracle.kv.tableName", TABLE_NAME);
    hconf.set("oracle.kv.hosts", KVHOST);
    JavaPairRDD<PrimaryKey, Row> jrdd = sc.newAPIHadoopRDD(hconf,
        TableInputFormat.class, PrimaryKey.class, Row.class);

The classes PrimaryKey, Row, and TableInputFormat are provided by Oracle NoSQL Database. PrimaryKey and Row are the key and value classes, respectively. TableInputFormat handles read access from Oracle NoSQL Database: it reads the configuration parameters kvstore, tableName, and hosts to connect to the database. A successful execution of this method returns an RDD.

3. Specify filters to the tool on the command line. The usage is as follows:

    Usage: java loganalyzer.LogAnalyzer [options]
    options:
      --for-the-past hour|day|week   process logs for the past given time period
      --host <hostname>              process logs for this host
      --level INFO|SEVERE|WARNING    log level to consider
      --search <keyword>             search keyword

If filtering is requested, the filter method is applied to the RDD with an appropriate filter function. For example, to filter entries by log level:

    JavaRDD<Row> levelData = logData.filter(new Function<Row, Boolean>() {
        @Override
        public Boolean call(Row row) {
            String level = row.get("loglevel").asString().get();
            return fLevel.equals(level);
        }
    });

The call method takes a row as its argument and returns true if the row's loglevel matches the required level. This method is executed on every row, and filter returns a new RDD containing only the matching rows.

4. Categorize the logs for easier comprehension, once the filtering is complete.
See the following example log entries:

    2015-12-23 12:03:04 SEVERE host1 Process 1234 crashed
    2015-12-23 12:05:06 SEVERE host2 Process 4567 crashed
    2015-12-23 12:07:08 SEVERE host3 Process 8910 crashed
    2015-12-23 12:08:09 SEVERE host4 Process 1112 crashed
    2015-12-23 12:09:10 SEVERE host1 Commit of transaction -1112 failed
    2015-12-23 12:10:11 SEVERE host1 Commit of transaction -1314 failed
    2015-12-23 12:11:12 SEVERE host2 Commit of transaction -1516 failed

While analyzing logs, the user may only want to know how many process crashes or transaction failures occurred, without caring about individual process IDs. A simple categorization can be achieved by replacing the numerical values in the log messages. (In a real application, more complex analyses are possible.) After performing this simple categorization, the output looks as follows:

    4: Process ? crashed
    3: Commit of transaction -? failed
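The transformation behind this output, stripping runs of digits and counting the distinct messages that remain, can be sketched without Spark in plain Java. The class name and sample data below are illustrative; the regular expression is the same one used in the categorization code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategorizeSketch {

    // Replace every run of digits with '?' and count each distinct
    // resulting message, mirroring the Spark map + countByValue pipeline
    static Map<String, Long> categorize(List<String> messages) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String m : messages) {
            counts.merge(m.replaceAll("\\d+", "?"), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Process 1234 crashed",
                "Process 4567 crashed",
                "Commit of transaction -1112 failed");
        categorize(sample).forEach((msg, n) -> System.out.println(n + ": " + msg));
        // Prints:
        // 2: Process ? crashed
        // 1: Commit of transaction -? failed
    }
}
```

In the Spark version that follows, the per-message replacement runs inside a map transformation distributed across executors, and the counting is done by countByValue.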

The process IDs and transaction numbers have been replaced with a '?' character, and the number of occurrences of each type of message is counted and printed in the first column. The output above shows that there were four process crashes and three transaction failures, which makes the logs easier to comprehend. The code for message categorization is as follows:

    // Pull the messages only from the rows and replace integers with '?'
    JavaRDD<String> noIntMesgs = logData.map(new Function<Row, String>() {
        @Override
        public String call(Row row) {
            return row.get("message").asString().get().replaceAll("\\d+", "?");
        }
    });

    // Compute distinct counts
    Map<String, Long> counts = noIntMesgs.countByValue();

Using the map method, the log messages are transformed by replacing their numerical values with '?'; countByValue then counts the occurrences of each unique message.

Conclusion

In this paper, we used a log analysis use case to explain how to perform analytics in Spark on data stored in Oracle NoSQL Database. Storing data from Spark into Oracle NoSQL Database is a feature we are working on, and it will be part of a future release. The code for LogAnalyzer is available in the spark-nosql directory of the GitHub repository https://github.com/ashutoshsnaik/oraclenosqldb.git

Oracle Corporation, World Headquarters: 500 Oracle Parkway, Redwood Shores, CA 94065, USA
Worldwide Inquiries: Phone +1.650.506.7000, Fax +1.650.506.7200
blogs.oracle.com/oracle | facebook.com/oracle | twitter.com/oracle | oracle.com

Copyright © 2016, Oracle and/or its affiliates. All rights reserved. This document is provided for information purposes only, and its contents are subject to change without notice. It is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. Oracle specifically disclaims any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without prior written permission. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Integrating Apache Spark with Oracle NoSQL Database, January 2016
Authors: Ravi Prakash Putchala, Ashutosh Naik, Ashok Joshi, Michael Schulman