Integrate Master Data with Big Data using Oracle Table Access for Hadoop
|
|
|
- Silvia Jackson
- 10 years ago
- Views:
Transcription
1 Integrate Master Data with Big Data using Oracle Table Access for Hadoop Kuassi Mensah Oracle Corporation Redwood Shores, CA, USA Keywords: Hadoop, BigData, Hive SQL, Spark SQL, HCatalog, StorageHandler Introduction The goal of Big Data Analytics is to furnish actionable information to help business decisions making. Example: Which of our products got a rating of four stars or higher, on social media in the last quarter? However, producing actionable information requires both Big Data (i.e., facts, usually stored in HDFS or NoSQL DBMS) as well as Master Data (i.e., products, customers and so on, usually stored in Oracle database). Accessing and integrating Master Data in Big Data analytics requires either an ETL copy or direct access from Hadoop or Spark. Oracle Table Access for Hadoop 1 and Spark (OTA4H) turns remote Oracle database tables into Hadoop or Spark data sources. How does OTA4H works? Does it scale? How reliable and secure? This paper discusses the key features and benefits. Oracle Table Access for Hadoop and Spark (OTA4H) The Hadoop 2.x architecture furnishes a set of interfaces and abstraction layers which decouple the compute engines (Hive SQL, Spark, Big data SQL, Impala, Pig, MapReduce, etc), the cluster management (YARN), the computing resources (CPU, memory), and the data sources (HDFS, NoSQL, external tables). OTA4H is an Oracle Big Data Appliance feature which implements Apache Hadoop StorageHandler interface (InputFormat, SerDe) and allows direct, parallel, fast, consistent and secure access to Master Data in Oracle database using Hadoop/Spark query engines (Hive SQL, Spark SQL, and so on), as well as Hadoop/Spark APIs (Pig, MapReduce, and so on). Example Let HiveTab be a local Hive table and OracleTab an external table referring to a remote Oracle database table. The following Hive query 2 will perform a local JOIN between rows from HiveTab and selected 3 rows from Oracle, then return the firstname, lastname and the bonus of employees who have a salary greater than and received a bonus greater than Query= SELECT HiveTab.First_Name, HiveTab.Last_Name, OracleTab.bonus FROM HiveTab join OracleTab on (HiveTab.Emp_ID=OracleTab.Emp_ID) WHERE salary > and bonus > 7000; Here is the sequence of actions upon a Hadoop or Spark JOIN query or API call. 1. OTA4H receives parts of the execution plan related to the Oracle table. 1 Just released as part of Oracle s Big Data Appliance 4.3 release. 2 A simple JOIN operation, not a typical Big Data analytical query. 3 To be more precise, the result set of an Oracle database sub-query.
2 2. It dynamically generates a number of database splits using the specified split pattern (discussed later). To ensure the consistency of the result set, all splits are tagged with the same Oracle database s System Commit Number a.k.a. SCN. 3. An Oracle SQL sub-query is generated for each split using, preferably, a secure connections (Kerberos, SSL, Oracle Wallet), and enforcing predicate pushdown, columns projection pushdown, or partition pruning, when appropriate. 4. Each split is processed by a Hadoop/Spark task (often a Map task with HiveSQL) ). The consistency is enforced during the reading/scanning of the Oracle table, using the SCN captured during the split generation. 5. The matching rows from all tasks are returned to the query coordinator on Hadoop/Spark which performs the JOIN with rows from the local table. Running Hive SQL with OTA4H The demo scripts, Hadoop data, and Oracle database tables are available as part of BigDataLite 4.3 VirtualBox appliance. Configuration HIVE_AUXPATH="/opt/oracle/ota4h/jlib/ojdbc7.jar:/opt/oracle/ota4h/jlib/osh.jar:/opt /oracle/ota4h/jlib/ucp.jar" hive_init.hql add jar /opt/oracle/ota4h/jlib/osh.jar; add jar /opt/oracle/ota4h/jlib/ojdbc7.jar add jar /opt/oracle/ota4h/jlib/ucp.jar Invocation/Execution hive --auxpath ${HIVE_AUXPATH} -i hive_init.hql -e "${QUERY}" Running Spark SQL with OTA4H The compatibility of Spark SQL with Hive metastore allows running Spark against Hive tables (therefore external tables), using the same HiveQL queries. We are looking into further Spark integration with the Oracle database including SChemaRDD, DataFrames, and so on. Configuration: here are the steps for running Spark SQL against Oracle database tables. 1. Add ojdbc7.jar and osh.jar to CLASSPATH in /usr/lib/spark/bin/computeclasspath.sh CLASSPATH="$CLASSPATH:/opt/oracle/ota4h/lib/osh.jar" CLASSPATH="$CLASSPATH:/opt/oracle/ota4h/lib/ojdbc7.jar" 2. Edit SPARK_HOME in /usr/lib/spark/conf/spark-env.sh export SPARK_HOME=/usr/lib/spark:/etc/hive/conf 3. Specify additional environment variables in /usr/lib/spark/conf/spark-env.sh. export DEFAULT_HADOOP=/usr/lib/hadoop export DEFAULT_HIVE=/usr/lib/hive export DEFAULT_HADOOP_CONF=/etc/hadoop/conf
3 export DEFAULT_HIVE_CONF=/etc/hive/conf export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP} export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-${HADOOP_HOME}/../hadoop-hdfs} export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-${HADOOP_HOME}/../hadoopmapreduce} export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-${HADOOP_HOME}/../hadoop-yarn} export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$DEFAULT_HADOOP_CONF} export HIVE_CONF_DIR=${HIVE_CONF_DIR:-$DEFAULT_HIVE_CONF} CLASSPATH="$CLASSPATH:$HIVE_CONF_DIR" CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR" if [ "x"!= "x$yarn_conf_dir" ]; then CLASSPATH="$CLASSPATH:$YARN_CONF_DIR" fi 4. Make sure all Hadoop libs are added CLASSPATH="$CLASSPATH:$HADOOP_HOME/client/*" CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/*" CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)" Invocation/Execution /usr/lib/spark/bin/spark-sql -e "${QUERY}" Key OTA4H Features and Benefits Beyond implementing the StorageHandler and InputFormat interfaces for the Oracle database, OTA4H furnishes unique optimizations in the areas of performance, scalability, high-availability and consistency. Performance and Scalability In order to fully exploit the capacity of Hadoop or Spark clusters and Oracle database servers, OTA4H furnishes the following optimizations.» Splitter patterns. As of this release, OTA4H furnishes four split patterns: SINGLE_SPLITTER, ROW_SPLITTER, BLOCK_SPLITTER, and PARTITION_SPLITTER; these allow dividing the execution of the Oracle sub-query into several tasks which can be run in parallel. Each of these split patterns is fully described in the OTA4H doc (a chapter in BDA 4.3 doc).» Optimized JDBC driver. OTA4H is packaged with a special JDBC driver which is instrumented to create Hadoop Writable types i.e.,text and Date from SQL types -- without materializing intermediate Java objects.» Connection caching. For each query, OTA4H connects to the Oracle database to compute the number of splits, checks whether partition pruning is possible, and so on. Each Hadoop or Spark task executes in a separate JVM but all tasks are not fired simultaneously; OTA4H retains and reuses those JVMs for the remaining tasks of the same query, thereby optimizing the overall query response time.» Integration with Database Resident Connection Pooling (DRCP). Oracle DBAs may control the number of concurrent connections originating from Hadoop or Spark, using an RDBMS-side pool and setting a max connection limit. As documented, the JDBC URL specified in the external table s DDL must refer to DRCP through the :POOLED suffix in short JDBC URLs or (SERVER=POOLED) in long JDBC URLs.
4 » Column projection pushdown. By default 4, Hadoop or Spark queries retrieve only the specified columns from the Oracle database table. OTA4H enforces such projection during the rewriting of the sub-queries for each split.» Predicate pushdown. By default 5, OTA4H incorporates the predicates in the WHERE clause when generating sub-queries for each split thereby ensuring that only the needed rows are retrieved from Oracle.» Partition Pruning. When an external table is associated with a partitioned Oracle table using the PARTITION_SPLITTER, if the query can be resolved using only the relevant partitions, in other words, if predicate is a function of the partition key tasks. OTA4H generates splits only for the relevant partitions. The irrelevant partitions are skipped and not accessed. When the predicate does not refer to the Oracle table partitioning key, OTA4H generates as many tasks as partitions in the table. Security OTA4H ensures secure access to Oracle database through JDBC authentication, encryption and data integrity, and the JVM system properties.» Simple and strong authentication. The value of oracle.hcat.osh.authentication property in the external table declaration i.e., SIMPLE, KERBEROS, WALLET 6 specifies the authentication type. Example: 'oracle.hcat.osh.authentication' = 'KERBEROS'. For Oracle wallets, the OraclePKI provider is used and OTA4H ships the required jars including oraclepki.jar, osdt_cert.jar and osdt_core.jar. Depending on the type of authentication, additional properties must be set (see the OTA4H chapter in the BDA doc for more details).» Encryption and Integrity. OTA4H enforces all oracle.net.* and oracle.jdbc.* table properties related to encryption and data integrity. The JDBC-Thin Driver Support for Encryption and Integrity section in the Oracle JDBC developer s guide 7 has more details.» JVM system properties, Hive/Hadoop/Spark environment variables or configuration properties must also be set. Example: java.security.krb5.conf must be set using HADOOP_OPTS or mapred.child.java.opts in mapred-site.xml for child JVMs or - Djava.security.krb5.conf=<path for kerberos config>; Hive read the value of KRB5_CONFIG.» Cloudera Manager may also be used to set all these properties. High-Availability RAC awareness. In RAC database environments, OTA4H inherits RAC support in Oracle JDBC. The Oracle database listener will route the Hadoop or Spark tasks connection requests to the available nodes either randomly or based on runtime load balancing (if enabled by the DBA), provided the RAC-aware service name is specified in the JDBC URL. 4 hive.io.file.read.all.columns connection property set to false by default. 5 As of Hive-1.1, hive.optimize.ppd configuration property is set to true by default. 6 For this release, only WALLET SSL is supported; other SSL certificate containers including JKS, PKCS12, and so on will be supported in future releases. 7
5 Consistency The maximum number of splits generated by OTA4H is controlled by oracle.hcat.osh.maxsplits however, the number of Hadoop or Spark tasks connecting simultaneously may be limited on the database-side by the DBA (DRCP s maxconnections) or on Hadoop/Spark side by mapred.tasktracker.map.tasks.maximum in conf/mapred-site.xml or spark.dynamicallocation.enabled. In all cases, OTA4H ensures that the Hadoop or Spark tasks execute all the splits of the query as of the same Oracle database s System Commit Number (SCN). Conclusion Big Data Analytics requires the integration of Master Data with Big Data. Oracle Table Access for Hadoop and Spark (OTA4H), a new feature of Oracle Big Data Appliance allows direct access to Oracle tables by turning them into Hadoop or Spark data-sources. OTA4H is a fast, scalable, secure, and reliable implementation of Apache StorageHandler and InputFormat interfaces for the Oracle database. Kuassi Mensah Oracle Corporation 400 Oracle Parkway 94065, Redwood Shores, CA USA Phone: Fax: [email protected] db360.blogspot.com, and
Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle
Oracle Big Data SQL Architectural Deep Dive Dan McClary, Ph.D. Big Data Product Management Oracle Copyright 2014, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Big Data SQL and Query Franchising
Big Data SQL and Query Franchising An Architecture for Query Beyond Hadoop Dan McClary, Ph.D. Big Data Product Management Oracle Copyright 2014, Oracle and/or its affiliates. All rights reserved. Safe Harbor
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
Qsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Oracle Big Data Fundamentals Ed 1 NEW
Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big
Lofan Abrams Data Services for Big Data Session # 2987
Lofan Abrams Data Services for Big Data Session # 2987 Big Data Are you ready for blast-off? Big Data, for better or worse: 90% of world s data generated over last two years. ScienceDaily, ScienceDaily
Peers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Integrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
Native Connectivity to Big Data Sources in MSTR 10
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Complete Java Classes Hadoop Syllabus Contact No: 8888022204
1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What
Hadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Seamless Access from Oracle Database to Your Big Data
Seamless Access from Oracle Database to Your Big Data Brian Macdonald Big Data and Analytics Specialist Oracle Enterprise Architect September 24, 2015 Agenda Hadoop and SQL access methods What is Oracle
Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
How To Manage Big Data In A Microsoft Cloud (Hadoop)
Oracle Database 12c and the Future of Data Warehousing in the Era of Big Data George Lumpkin Data Warehousing Neil Mendelson Big Data & Advanced AnalyEcs Vice Presidents Server Technologies September 29,
How to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1
How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta [email protected] Radu Chilom [email protected] Buzzwords Berlin - 2015 Big data analytics / machine
Cloudera Backup and Disaster Recovery
Cloudera Backup and Disaster Recovery Important Note: Cloudera Manager 4 and CDH 4 have reached End of Maintenance (EOM) on August 9, 2015. Cloudera will not support or provide patches for any of the Cloudera
OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)
Use Data from a Hadoop Cluster with Oracle Database Hands-On Lab Lab Structure Acronyms: OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) All files are
Deep Quick-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead Oracle Openworld 2014, San Francisco
Deep Quick-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead Oracle Openworld 2014, San Francisco About the Speaker Mark Rittman, Co-Founder of Rittman Mead
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
Leveraging the Power of SOLR with SPARK Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015 Welcome Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Apache Sentry. Prasad Mujumdar [email protected] [email protected]
Apache Sentry Prasad Mujumdar [email protected] [email protected] Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture
Oracle Database 12c Plug In. Switch On. Get SMART.
Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.
Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.
Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Integration of Apache Hive and HBase
Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 About Me User and committer of Hadoop since 2007 Contributor to Apache Hadoop, HBase, Hive and Gora Joined
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine
Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 3.0 Please note: This appliance is for testing and educational purposes only; it is unsupported and not
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy
Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics
Big Data Introduction
Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected]
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected] Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A
Upcoming Announcements
Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC [email protected] Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within
Data storing and data access
Data storing and data access Plan Basic Java API for HBase demo Bulk data loading Hands-on Distributed storage for user files SQL on nosql Summary Basic Java API for HBase import org.apache.hadoop.hbase.*
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
Hive Interview Questions
HADOOPEXAM LEARNING RESOURCES Hive Interview Questions www.hadoopexam.com Please visit www.hadoopexam.com for various resources for BigData/Hadoop/Cassandra/MongoDB/Node.js/Scala etc. 1 Professional Training
Cloudera Backup and Disaster Recovery
Cloudera Backup and Disaster Recovery Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
and Hadoop Technology
SAS and Hadoop Technology Overview SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS and Hadoop Technology: Overview. Cary, NC: SAS Institute
Oracle Big Data, In-memory, and Exadata - One Database Engine to Rule Them All Dr.-Ing. Holger Friedrich
Oracle Big Data, In-memory, and Exadata - One Database Engine to Rule Them All Dr.-Ing. Holger Friedrich Agenda Introduction Old Times Exadata Big Data Oracle In-Memory Headquarters Conclusions 2 sumit
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Big Data Management and Security
Big Data Management and Security Audit Concerns and Business Risks Tami Frankenfield Sr. Director, Analytics and Enterprise Data Mercury Insurance What is Big Data? Velocity + Volume + Variety = Value
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
Supported Platforms. HP Vertica Analytic Database. Software Version: 7.0.x
HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 5/7/2014 Legal Notices Warranty The only warranties for HP products and services are set forth in the express warranty statements
Trafodion Operational SQL-on-Hadoop
Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL
Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.
Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. Agenda Goals; user view of Impala Impala performance Impala internals Comparing Impala to other systems Impala Overview:
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Using RDBMS, NoSQL or Hadoop?
Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest
SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide
SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration
Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Bringing Intergalactic Data Speak (a.k.a.: SQL) to Hadoop Martin Willcox [@willcoxmnk], Director Big Data Centre of Excellence (Teradata
Bringing Intergalactic Data Speak (a.k.a.: SQL) to Hadoop Martin Willcox [@willcoxmnk], Director Big Data Centre of Excellence (Teradata International) 4 th June 2015 Agenda A (very!) short history of
Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp
Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3)
GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION
GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.
Copyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions
docs.hortonworks.com
docs.hortonworks.com Hortonworks Data Platform: Hive Performance Tuning Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
ITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
Big Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based
Integrating with Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x
HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 5/18/2016 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the
Hadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
American International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
BIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here
Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here JusIn Erickson Senior Product Manager, Cloudera Speaker Name or Subhead Goes Here May 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12 Agenda
This is a brief tutorial that explains the basics of Spark SQL programming.
About the Tutorial Apache Spark is a lightning-fast cluster computing designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
