Processing Big Data With SQL on Hadoop. Jens Albrecht (jens.albrecht@th-nuernberg.de)

1 Processing Big Data With SQL on Hadoop Jens Albrecht

2 Why SQL for Big Data?
- Mature technology
- Broad knowledge available
- Powerful query language
- High interactive performance
- Many third-party tools for data analysis and visualization
- Flexible data structures: semi-structured data, changing schemas
- Self-service: data integration on-the-fly
- Scalability: analysis, integration, volume
- (Relatively) low cost: commodity hardware, open source

Prof. Dr. Jens Albrecht, SQL on Hadoop

3 Why SQL for Big Data?
[Diagram labels: Extended DWH, Data Lake, Agility]

4 Hadoop-SQL Integration
[Overview diagram: four integration approaches]
- Hive (native Hadoop): HiveQL on MR / Tez over HDFS
- Pure Hadoop SQL engines: distributed SQL engine over HDFS
- Format-agnostic SQL engines: distributed SQL engine over HDFS, Hive, NoSQL, RDBMS
- RDBMS with Hadoop access: relational tables plus Hadoop

6 Hive
[Diagram labels: Hadoop, Meta Store, Hive, HiveQL, Execution, SerDe, MapReduce, HDFS/HBase]

General
- Developed initially by Facebook
- SQL processing for HDFS and HBase
- Table definitions in the Hive Meta Store
- Generation of MapReduce code
- Schema-on-read via SerDe

Advantages
- Mature part of every Hadoop distribution
- Simple setup
- Java API for UDFs
- Usage of many data formats via SerDe

Disadvantages
- Batch-oriented, slow

7 Schema-on-Write vs. Schema-on-Read
- Relational database (schema-on-write): multi-structured source data -> ETL -> relational DBMS -> SQL
- Big data processing (schema-on-read): multi-structured source data -> load as-is -> Hadoop -> schema mapped to original files -> SQL

8 Schema-on-Read: Hive & CSV

    CREATE TABLE gps_data (
      userid INT,
      deviceid INT,
      longitude STRING,
      latitude STRING,
      utctime STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- load data = copy
    LOAD DATA LOCAL INPATH 'new_data/gps.dat'
    OVERWRITE INTO TABLE gps_data;

    SELECT COUNT(*) FROM gps_data;

Sample data: N E 10/01/2014@10:00:00UTC
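The schema-on-read idea behind this table definition can be sketched in a few lines of plain Python. This is only an illustration, not how Hive works internally: the raw tab-delimited text is kept unchanged, and column names and types are applied while reading. The sample values in `RAW` and the helper name `read_with_schema` are made up for this sketch.

```python
import csv
import io

# Hypothetical raw file content: tab-delimited, stored as-is, no schema attached.
RAW = (
    "1\t42\t11.0767 E\t49.4521 N\t10/01/2014@10:00:00UTC\n"
    "2\t43\t11.0800 E\t49.4500 N\t10/01/2014@10:05:00UTC\n"
)

# The schema lives outside the data, similar to a table definition in a metastore.
SCHEMA = [
    ("userid", int),
    ("deviceid", int),
    ("longitude", str),
    ("latitude", str),
    ("utctime", str),
]


def read_with_schema(text, schema):
    """Apply column names and types while reading; the raw text stays untouched."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW, SCHEMA))
row_count = len(rows)  # roughly what SELECT COUNT(*) FROM gps_data would report
```

Changing `SCHEMA` changes how the same bytes are interpreted, which is the practical difference from schema-on-write, where data is converted once during loading.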

9 Schema-on-Read: Hive & Regexp

    CREATE TABLE weblog (
      host STRING, identity STRING, user STRING, time STRING,
      request STRING, status STRING, size STRING,
      referer STRING, agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
    )
    STORED AS TEXTFILE;
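To see what such a regular expression does to a single log line, the same mapping can be tried in Python. The sketch assumes the pattern contains alternation bars (`|`) between the alternatives, as in the commonly published Hive weblog example, and it drops the extra string escaping that the table definition needs. The sample log line and the helper name `parse_line` are made up for illustration.

```python
import re

# Pattern for Apache-style access logs; each capture group becomes one column.
# Assumption: alternatives are separated by "|" as in the usual Hive example.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Map one log line to a dict of columns, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line in combined log format.
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08"'
)
row = parse_line(sample)
```

All columns stay strings, matching the `STRING` types in the table definition; quoted fields keep their surrounding quotes.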

10 Schema-on-Read: Hive & JSON
[Figure: JSON data and its relational mapping in Hive]
Source:

11 Hive on Tez (Stinger)
[Diagram labels: Hadoop, Meta Store, Hive, HiveQL, Execution, SerDe, MapReduce, Tez / YARN, HDFS/HBase]

Stinger Initiative
- "Make Hive 100x faster"
- Finished with Hive 0.13 (April 2014)
- Replace MapReduce with Tez
- Native columnar data format (ORC)

Stinger.Next
- Phase 1: Hive 0.14 (November 2014): ACID transactions
- Phase 2 (Q2 2015): sub-second queries with LLAP, machine learning integration
- Phase 3 (Q4 2015): SQL:2011 analytic functions, materialized views

13 Pure Hadoop SQL Engines
[Diagram: distributed SQL engine with SQL query coordination and local agents on the data files in HDFS/HBase]

Approach
- Distributed, parallel SQL engine
- Often uses Hive metadata
- Support for optimized data formats
- Hadoop as mandatory basis

Advantages and Disadvantages
- Significantly faster than Hive
- Low latency through dedicated engine
- Operator pipelining and result caching

Differentiation of solutions
- Supported SQL functionality
- Point querying
- Cost-based optimizer / performance
- Transaction support

14 Pure Hadoop SQL Engines
[Diagram: distributed SQL engine with SQL query coordination, local agents, and data files on HDFS/HBase; labeled product: Big Insights]

15 Pure Hadoop SQL Engines
Example: IBM Big SQL
Source: IBM

16 Apache Spark & Spark SQL
[Diagram labels: Schema RDD, Spark SQL, SQL, Execution, SerDe, Apache Spark, HDFS/HBase]

General
- SQL engine based on Spark
- Data access via DataFrames (formerly SchemaRDD)
- In-memory columnar format
- HDFS / HBase as file format

Advantages
- Spark as general-purpose parallel computing framework
- Support for Hive extensions such as UDFs and SerDes, and for Hive metadata

Disadvantages
- Not yet fully mature
- Not yet as fast as competitors

17 Apache Spark
Distributed in-memory computing framework
- Data caching
- General framework for all kinds of SQL and non-SQL analytics
- Support for out-of-the-box libraries as well as Java, Python and Scala in the same engine and for the same data
- New data sources API allows writing plugins for non-Hadoop sources

[Stack diagram: Spark SQL, Spark Streaming, Machine Learning (MLlib), Graph Computation (GraphX) on the Spark Execution Engine; ZooKeeper, Hadoop YARN (optional), HDFS]

18 Apache Spark
Resilient Distributed Datasets (RDDs)

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))               // transformation
    errors.persist()

    // Return the time fields of errors mentioning HDFS,
    // assuming time is field number 3 in a TSV file
    val hdfs_errors = errors.filter(_.contains("HDFS"))            // transformation
    val time_fields = hdfs_errors.map(_.split("\t")(3)).collect()  // action

Source:

19 Spark Transformations

Transformations: create a new RDD from an existing one
- Lazy evaluation: results are not materialized
- Much more than map and reduce
- map, filter, sample, groupByKey, sortByKey, reduceByKey, union, pipe, repartition, join, leftOuterJoin, rightOuterJoin

Actions: return a value or dataset to the calling program
- reduce, collect, count, first, take(n), saveAsTextFile
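The difference between lazy transformations and actions can be imitated with Python generators. This is an analogy only and does not use the Spark API: each generator wraps the previous one without reading any data, and only the final `list(...)` call, standing in for an action such as `collect`, pulls lines through the chain. The log lines and the `evaluated` bookkeeping list are made up for this sketch.

```python
evaluated = []  # records which input lines were actually read


def read_lines(lines):
    """Source that notes every line it hands out."""
    for line in lines:
        evaluated.append(line)
        yield line


# Hypothetical tab-separated log lines; the time is field number 3.
log = [
    "ERROR\tHDFS\tdisk full\t10:00",
    "INFO\tYARN\tstarted\t10:01",
    "ERROR\tHDFS\ttimeout\t10:02",
]

# "Transformations": each step only wraps the previous one; nothing runs yet.
lines = read_lines(log)
errors = (l for l in lines if l.startswith("ERROR"))
hdfs_errors = (l for l in errors if "HDFS" in l)
time_fields = (l.split("\t")[3] for l in hdfs_errors)

touched_before_action = len(evaluated)  # no line has been read so far

# "Action": materializing the result pulls data through the whole chain.
result = list(time_fields)
touched_after_action = len(evaluated)
```

Unlike an RDD, a generator can be consumed only once and is neither distributed nor fault tolerant, so the sketch covers the evaluation order and nothing more.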

20 Adding Schema to RDDs
Source:

22 SQL Engine with Pluggable Storage
[Diagram: distributed SQL engine with SQL query coordination and local agents; connector plugins for HDFS, Hive, JSON, Parquet, Cassandra, MySQL, Oracle]

Approach
- Distributed, parallel SQL engine
- Often uses Hive metadata
- Support for optimized file formats
- Hadoop mandatory as basis

General advantages
- Significantly faster than Hive
- Low latency by avoiding MapReduce
- Operator pipelining and caching
- Scalable

General disadvantages
- Usually no transaction support

23 Google Dremel
[Figure: Dremel]

24 Apache Drill
Source:

Self-service data integration
- No metadata repository required
- Dynamic schema discovery: metadata is extracted automatically for data sources such as RDBMS and Hive (comprehensive), HBase (partial) or files (on-the-fly)
- Utilizes self-describing data formats (Parquet, JSON, Avro)
- SQL DDL can be used to create metadata explicitly

ANSI SQL plus flexible data model
- Fully ANSI-compliant SQL
- DrQL with SQL extensions for nested data structures (like JSON)

25 Apache Drill: SQL for JSON

    select name, flatten(fillings) as f
    from dfs.users.`/donuts.json`
    where f.cal < 300;

Source:
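What `flatten` does to a nested array can be sketched in plain Python: every element of `fillings` becomes its own output row, paired with the parent's `name`, and the filter is then applied per row. The field names `name`, `fillings` and `cal` come from the query; the JSON content and the helper name `flatten_fillings` are assumptions made for illustration.

```python
import json

# Hypothetical content for donuts.json; the values are made up.
DONUTS_JSON = """
[
  {"name": "classic", "fillings": [{"name": "sugar", "cal": 500},
                                   {"name": "cream", "cal": 250}]},
  {"name": "berry",   "fillings": [{"name": "jam", "cal": 120}]},
  {"name": "plain",   "fillings": []}
]
"""


def flatten_fillings(donuts):
    """Emit one row per element of the nested fillings array."""
    for donut in donuts:
        for filling in donut["fillings"]:
            yield {"name": donut["name"], "f": filling}


donuts = json.loads(DONUTS_JSON)

# Equivalent of: where f.cal < 300
rows = [row for row in flatten_fillings(donuts) if row["f"]["cal"] < 300]
```

In this sketch a record with an empty array produces no rows at all.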

26 Apache Drill: SQL for Heterogeneous Data Formats
JSON, CSV, ORC, Parquet, HBase tables, MongoDB

    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid and errorlevel > 5
    order by count(*);
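The query joins log files with a JSON profile file on `uid`. The same kind of cross-format join can be sketched with the Python standard library, reading one source as CSV and the other as JSON. The file contents below, the `message` column, and the choice of CSV for the logs are assumptions for illustration; only `uid`, `name` and `errorlevel` appear in the query.

```python
import csv
import io
import json

# Hypothetical inputs: a CSV log file and a JSON profile file sharing a uid key.
LOGS_CSV = (
    "uid,errorlevel,message\n"
    "1,7,disk full\n"
    "2,3,slow query\n"
    "1,9,node lost\n"
)
PROFILES_JSON = '[{"uid": 1, "name": "alice"}, {"uid": 2, "name": "bob"}]'

logs = list(csv.DictReader(io.StringIO(LOGS_CSV)))
users_by_uid = {user["uid"]: user for user in json.loads(PROFILES_JSON)}

# Join on uid and keep only severe log entries (errorlevel > 5).
joined = [
    {"name": users_by_uid[int(log["uid"])]["name"], "message": log["message"]}
    for log in logs
    if int(log["errorlevel"]) > 5 and int(log["uid"]) in users_by_uid
]
```

CSV values arrive as strings and need explicit casts, while the JSON values already carry types. A format-agnostic engine has to reconcile this kind of difference when it combines sources.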

28 External Tables in HDFS
[Diagram labels: RDBMS, Logical Mapping, HDFS]

    CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB (
      PROD_ID        NUMBER(6),
      CUST_ID        NUMBER,
      TIME_ID        DATE,
      CHANNEL_ID     CHAR(1),
      PROMO_ID       NUMBER(6),
      QUANTITY_SOLD  NUMBER(3),
      AMOUNT_SOLD    NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY SALES_EXT_DIR
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        ( PROD_ID DECIMAL EXTERNAL, )
        PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
      )
      LOCATION ( 'file_sales_1', 'file_sales_2', 'files_sales_3')
    );

29 RDBMS with Hadoop Integration
[Diagram labels: Parallel Database, SQL Query Coordination, External Tables, MR Loader, Map Reduce, HDFS/HBase, Relational Tables]

Approach
- Towards genuine integration of Hadoop into RDBMS
- Utilize Hadoop's computational power
- Cost-based choice

Advantages
- Easiest way to use Hadoop as data source
- Combined access to traditional and new data sources

Disadvantages
- Cost
- Limited data sources
- Vendor lock-in

30 RDBMS with Hadoop Integration
[Diagram labels: External Tables, HDFS/HBase, Parallel Database, SQL Query Coordination, MR Loader, Map Reduce, Relational Tables]

Products
- Microsoft PolyBase (part of MS Analytics Platform)
- Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)

Use Cases
- Extension of traditional BI system
- Data-lake scenario with RDBMS as primary system and Hadoop for mass data
- Mix of analytic and transactional load

31 Oracle Big Data SQL
- Integrates Hive metadata
- Allows hybrid queries
- Includes Hadoop and NoSQL in relational queries
- Uses Exadata Smart Scan technology
Source:

32 Hadoop-SQL Integration
[Overview diagram with product labels: Stinger (Hive, native Hadoop), Big Insights (pure Hadoop SQL engines)]

33 > Hadoop File Formats

34 File Formats for Big Data

Text formats
- High storage usage -> bad scan performance
- Low compression -> bad scan performance

Dedicated formats (e.g. DB internal)
- Not open, no interoperability

Requirements for Big Data
- Interoperability
- Low storage / good compression
- High performance
- Flexible schema

A schema for a file format? Query performance for a file format?
Big Data often has a nested structure and multiple schema variants and versions.

35 Motivation for File Formats
Source:

36 Considerations for File Formats

Query tools
- None
- Frameworks like MapReduce, Spark, Cascading
- Query engines like Pig, Hive, Impala

Schema versioning
- Schema present? If so, can it change?

Splittability / partitioning
- Can files be split for distributed processing?
- Example: CSV: yes, XML: partial, MP4: no

Block compression
- Can blocks be independently compressed and distributed?
- Block compression is a prerequisite for partition compression!

File size
- Size in bytes and number of files?
- Hadoop likes big, splittable files!
- Lots of small files cost performance

Load profile
- Write performance
- Filter operations
- Reading of single columns
- Full scans

Source:

37 Example Row Format: Avro

Schema specification
- Internally stored in binary format
- Self-describing files

    record Person {
      string username;
      union { null, long } favouritenumber;
      array<string> interests;
    }

Reader vs. writer schema
- Allows different "views" on files

[Diagram: on write, Avro data is stored together with the writer schema; on read, the Avro parser applies resolution rules between writer schema and reader schema]
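How a reader schema can give a different "view" on the same stored record may be sketched as follows. This is a strongly simplified imitation of Avro's schema resolution and does not use an Avro library: fields known to both schemas are copied, fields that only the reader knows are filled from a default, and fields that only the writer knows are dropped. The `country` field, its default value, and the helper name `resolve` are assumptions for illustration.

```python
MISSING = object()  # marker for "this reader field has no default"


def resolve(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            resolved[name] = record[name]      # field known to both schemas
        elif default is not MISSING:
            resolved[name] = default           # reader-only field: use its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved                            # writer-only fields are dropped


# Writer schema: the fields of the Person record above.
writer_fields = ["username", "favouritenumber", "interests"]

# Hypothetical reader schema: ignores favouritenumber, adds country with a default.
reader_fields = {"username": MISSING, "interests": MISSING, "country": "unknown"}

stored = {"username": "alice", "favouritenumber": 7, "interests": ["hiking"]}
view = resolve(stored, writer_fields, reader_fields)
```

Real Avro resolution also checks and promotes types and matches fields by alias; the sketch only covers the field-level rules.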

38 Example Column Format: Parquet

Columnar formats in general
- Trade faster reads for slower writes
- Very good compression

Parquet files
- Hybrid partitioning: sets of records in blocks, columnar within blocks
- Zone maps per block as a kind of index (min/max values per column)

Image:
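The combination of blocks, columnar storage within blocks, and min/max zone maps can be illustrated with a small in-memory sketch. It does not use the Parquet format or any Parquet library; the block size of three rows, the column names and the helper functions are made up. The point is that a filter can skip a whole block when the block's minimum already rules out a match.

```python
def make_block(rows):
    """Store a set of rows column by column and record min/max per column."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    zone_map = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "zone_map": zone_map}


# Hypothetical table with two columns, split into blocks of three rows.
calories = [100, 150, 120, 400, 450, 500, 90, 310]
rows = [{"id": i, "cal": c} for i, c in enumerate(calories)]
blocks = [make_block(rows[i:i + 3]) for i in range(0, len(rows), 3)]


def scan_less_than(blocks, column, limit):
    """Return matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        lowest, _highest = block["zone_map"][column]
        if lowest >= limit:
            continue  # no value in this block can match: skip it entirely
        blocks_read += 1
        hits.extend(v for v in block["columns"][column] if v < limit)
    return hits, blocks_read


hits, blocks_read = scan_less_than(blocks, "cal", 300)
```

Only the `cal` column of the remaining blocks is touched; the `id` column is never read, which is the second benefit of a columnar layout.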

39 Databases as Lego Construction Kit!?
- Traditional monolithic RDBMS: SQL processor, data dictionary, distributed execution, storage management
- Hadoop DB building blocks: SQL processor; MapReduce, Spark; CSV, Seq, Avro, JSON, ORC, Parquet

- Generic execution engine
- Metadata sharing in Hive repository or self-describing file formats
- Operator push-down to intelligent file interfaces

40 > Summary

41 Considerations for SQL on Hadoop Solutions

SQL functionality
- Coverage of SQL standard
- User-defined functions
- Transactional safety

Performance and stability
- Multi-user workloads
- Efficiency of joins and aggregations (I/O problems? Size limits?)

Supported data and storage formats
- Logical format: relational, JSON, none
- Physical formats: CSV, Parquet, ORC, Avro

Intelligent storage plugins / data federation
- Access to various data sources beyond Hadoop
- Pushdown predicates, access selected columns only

Support for your Hadoop distribution

42 Hadoop vs. SQL
[Diagram labels: Hadoop, SQL, Technologies]
- Supplement traditional RDBMS
- Extend traditional RDBMS
- Develop new RDBMS

Hadoop and SQL move closely together. The SQL universe gets wider. Database systems become open and modular.



More information

Spark: Making Big Data Interactive & Real-Time

Spark: Making Big Data Interactive & Real-Time Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING

More information
