Moving From Hadoop to Spark

1 + Moving From Hadoop to Spark Sujee Maniyam Founder / Bay Area ACM meetup ( )

2 + HI, Featured in Hadoop Weekly #109

3 + About Me : Sujee Maniyam
- 15+ years of software development experience
- Consulting & training in Big Data
- Author
  - Hadoop Illuminated (open source book)
  - HBase Design Patterns (coming soon)
- Open source contributor (including HBase)
- Founder / organizer of Big Data Guru meetup
- Contact :

4 + Hadoop in 20 Seconds
- The Big Data platform
- Very well field tested
- Scales to petabytes of data
- MapReduce : batch-oriented compute

5 + Hadoop Eco System Real Time Batch ElephantScale.com, 2014

6 + Hadoop Ecosystem
- HDFS : provides distributed storage
- MapReduce : provides distributed computing
- Pig : high-level MapReduce
- Hive : SQL layer over Hadoop
- HBase : NoSQL storage for real-time queries

7 + Spark in 20 Seconds
- Fast & expressive cluster computing engine
- Compatible with Hadoop
- Came out of Berkeley AMP Lab
- Now an Apache project
- Version 1.2 just released (Dec 2014)
"First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" (stratio.com)

8 + Spark Eco-System
- Spark SQL : schema / SQL
- Spark Streaming : real time
- MLlib : machine learning
- GraphX : graph processing
- Spark Core
- Cluster managers : standalone, YARN, Mesos

9 + Hypo-meter :)

10 + Spark Job Trends

11 + Spark Benchmarks Source : stratio.com

12 + Spark Code / Activity Source : stratio.com

13 + Timeline : Hadoop & Spark

14 + Hadoop Vs. Spark Hadoop Spark Source :

15 + Comparison With Hadoop
Hadoop:
- Distributed storage + distributed compute
- MapReduce framework
- Usually data on disk (HDFS)
- Not ideal for iterative work
- Batch process
Spark:
- Distributed compute only
- Generalized computation
- On disk / in memory
- Great at iterative workloads (machine learning, etc.)
  - Up to 2x - 10x faster for data on disk
  - Up to 100x faster for data in memory
- Compact code
- Java, Python, Scala supported
- Shell for ad-hoc exploration

16 + Hadoop + YARN : Universal OS for Distributed Compute
- Applications : batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
- Cluster management : YARN
- Storage : HDFS

17 + Spark Is Better Fit for Iterative Workloads

18 + Spark Programming Model
- More generic than MapReduce

19 + Is Spark Replacing Hadoop?
- Spark runs on Hadoop / YARN
  - Complementary
- Spark programming model is more flexible than MapReduce
- Spark is really great if data fits in memory (a few hundred gigs)
- Spark is storage agnostic (see next slide)

20 + Spark & Pluggable Storage
- Spark (compute engine)
- Storage : HDFS, Amazon S3, Cassandra, ???

21 + Spark & Hadoop
Use Case | Other | Spark
Batch processing | Hadoop's MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying | Hadoop : Hive | Spark SQL
Stream processing / real-time processing | Storm, Kafka | Spark Streaming
Machine learning | Mahout | Spark MLlib
Real-time lookups | NoSQL (HBase, Cassandra, etc.) | No Spark component, but Spark can query data in NoSQL stores

22 + Hadoop & Spark Future???

23 + Why Move From Hadoop to Spark?
- Spark is easier than Hadoop
- Friendlier for data scientists / analysts
  - Interactive shell
  - Fast development cycles
  - Ad-hoc exploration
- API supports multiple languages
  - Java, Scala, Python
- Great for small (gigs) to medium (100s of gigs) data

24 + Spark : Unified Stack
- Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need!
  - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Image: buymeposters.com

25 + Migrating From Hadoop → Spark
Functionality | Hadoop | Spark
Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
SQL querying | Hive | Spark SQL
ETL work flow | Pig | Spork (Pig on Spark); mix of Spark SQL, etc.
Machine learning | Mahout | MLlib
NoSQL DB | HBase | ???

26 + Moving From Hadoop → Spark
1. Data size
2. File System
3. SQL
4. ETL
5. Machine Learning

27 + Hadoop To Spark [diagram: Real Time / Batch, annotated "Spark can help"]

28 + Big Data

29 + Data Size : You Don't Have Big Data

30 + 1) Data Size (T-shirt sizing) [figure: data-size scale running < few G, 10 G, G +, 1 TB, TB +, PB +, with Spark at the small end and Hadoop at the large end] Image credit : blog.trumpi.co.za

31 + 1) Data Size
- Lot of Spark adoption at SMALL - MEDIUM scale
  - Good fit
  - Data might fit in memory!!
  - Hadoop may be overkill
- Applications
  - Iterative workloads (machine learning, etc.)
  - Streaming
- Hadoop is still the preferred platform for TB+ data

32 + Next : 2) File System

33 + 2) File System
- Hadoop = storage + compute; Spark = compute only, so Spark needs a distributed FS
- File system choices for Spark
  - HDFS : Hadoop Distributed File System
    - Reliable
    - Good performance (data locality)
    - Field tested for PB of data
  - S3 : Amazon
    - Reliable cloud storage
    - Huge scale
  - NFS : Network File System (shared FS across machines)

34 + Spark File Systems

35 + File Systems For Spark
 | HDFS | NFS | Amazon S3
Data locality | High (best) | Local enough | None (ok)
Throughput | High (best) | Medium (good) | Low (ok)
Latency | Low (best) | Low | High
Reliability | Very High (replicated) | Low | Very High
Cost | Varies | Varies | $30 / TB / Month

36 + File System Throughput Comparison (HDFS Vs. S3)
- Data : 10G+ (11.3 G)
  - Each file : ~1+ G (x 10)
  - 400 million records total
  - Partition size : 128 M
  - On HDFS & S3
- Cluster :
  - 8 nodes on Amazon m3.xlarge (4 CPU, 15 G mem, 40 G SSD)
  - Hadoop cluster, latest Hortonworks HDP v2.2
  - Spark : on same 8 nodes, standalone, v1.2

37 + File System Throughput Comparison (HDFS Vs. S3)
    val hdfs = sc.textFile("hdfs:/// /10G/")
    val s3 = sc.textFile("s3n:// /10G/")
    // count # records
    hdfs.count()
    s3.count()
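The comparison above boils down to timing one `count()` per storage backend. A minimal sketch of how that could be measured follows. It assumes Spark 1.x is on the classpath and the job is launched through spark-submit; the object name `CountTiming`, the helper `timed`, and the idea of passing input URIs as program arguments are illustrative and not taken from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountTiming {
  // Run a block and return (result, elapsed milliseconds).
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: one or more input URIs,
    // for example an hdfs:// path and an s3n:// path.
    // The master is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("count-timing"))
    try {
      for (path <- args) {
        // count() is an action, so the whole data set is read here.
        val (n, ms) = timed(sc.textFile(path).count())
        println(s"$path: $n records in $ms ms")
      }
    } finally sc.stop()
  }
}
```

Each path is read from scratch, so the elapsed time reflects storage throughput rather than any caching effect.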

38 + HDFS Vs. S3

39 + HDFS Vs. S3 (lower is better)

40 + HDFS Vs. S3 Conclusions
HDFS:
- Data locality → much higher throughput
- Need to maintain a Hadoop cluster
- Large data sets (TB+)
S3:
- Data is streamed → lower throughput
- No Hadoop cluster to maintain → convenient
- Good use case:
  - Smallish data sets (few gigs)
  - Load once, cache and re-use

41 + Next : 3) SQL

42 + 3) SQL in Hadoop / Spark
 | Hadoop | Spark
Engine | Hive | Spark SQL
Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
Scale | Petabytes | Terabytes?
Interoperability | | Can read Hive tables or stand-alone data
Formats | CSV, JSON, Parquet | CSV, JSON, Parquet

43 + SQL In Hadoop / Spark
- Input : billing records / CDR
  - Columns : Timestamp (milliseconds), Customer_id (String), Resource_id (Int), Qty (Int), Cost (Int)
  - Sample rows (resource, qty, cost) : Phone, 10, 10c; SMS, 1, 4c; Data, 3M, 5c
- Query : find top-10 customers
- Data set
  - 10G+ data
  - 400 million records
  - CSV format

44 + SQL In Hadoop / Spark
- Hive table:
    CREATE EXTERNAL TABLE billing (
      ts BIGINT, customer_id INT, resource_id INT, qty INT, cost INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs location';
- Hive query (simple aggregate):
    SELECT customer_id, SUM(cost) AS total FROM billing GROUP BY customer_id ORDER BY total DESC LIMIT 10;

45 + Hive Query Results

46 + Spark + Hive Table
- Spark code to access the Hive table:
    import org.apache.spark.sql.hive.HiveContext
    val hiveCtx = new HiveContext(sc)
    val top10 = hiveCtx.sql("select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10")
    top10.collect()
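For comparison, the same top-10 aggregation can be written directly against RDDs, without a Hive table. The sketch below is an illustration, not code from the talk: it assumes comma-separated lines in the column order of the `billing` table (ts, customer_id, resource_id, qty, cost), and the names `Top10Customers`, `parse`, and `topN` are made up for this example. The `main` method runs on a three-line in-memory sample with a local master so the logic can be tried without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, required on Spark 1.2
import org.apache.spark.rdd.RDD

object Top10Customers {
  // Parse "ts,customer_id,resource_id,qty,cost" into (customer_id, cost).
  def parse(line: String): (Int, Long) = {
    val f = line.split(',')
    (f(1).trim.toInt, f(4).trim.toLong)
  }

  // RDD counterpart of:
  //   SELECT customer_id, SUM(cost) AS total ... GROUP BY customer_id
  //   ORDER BY total DESC LIMIT n
  def topN(lines: RDD[String], n: Int): Array[(Int, Long)] =
    lines.map(parse)
      .reduceByKey(_ + _)                            // sum cost per customer
      .top(n)(Ordering.by[(Int, Long), Long](_._2))  // n largest totals, descending

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("top10").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Tiny made-up sample; a real run would use sc.textFile(...) instead.
      val sample = sc.parallelize(Seq("1,100,1,10,10", "2,200,2,1,4", "3,100,3,3,5"))
      topN(sample, 10).foreach(println)
    } finally sc.stop()
  }
}
```

Using `top` avoids a full sort of all customers: each partition keeps only its own n largest totals before they are merged on the driver.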

47 + Spark SQL Vs. Hive Fast on same HDFS data!

48 + SQL In Hadoop / Spark : Conclusions
- Spark can readily query Hive tables
  - Speed!
  - Great for exploring / trying out
  - Fast iterative development
- Spark can load data natively
  - CSV
  - JSON (schema automatically inferred)
  - Parquet (schema automatically inferred)

49 + Next : 4) ETL In Hadoop / Spark

50 + ETL? Data 1 Data 2 (clean) Data 4 Data 3

51 + 4) ETL on Hadoop / Spark
 | Hadoop | Spark
ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
Pig | High-level ETL workflow | Spork : Pig on Spark
Cascading | High level | Spark-scalding

52 + ETL On Hadoop / Spark
- Pig
  - High-level, expressive data flow language (Pig Latin)
  - Easier to program than Java MapReduce
  - Used for ETL (data cleanup / data prep)
  - Spork : run Pig on Spark (as simple as $ pig -x spark ..)
  - https://github.com/sigmoidanalytics/spork
- Cascading
  - High-level data flow declarations
  - Many sources (Cassandra / Accumulo / Solr)
  - Spark-Scalding
  - https://github.com/tresata/spark-scalding

53 + ETL On Hadoop / Spark : Conclusions
- Try Spork or spark-scalding
  - Code re-use
  - Not re-writing from scratch
- Program RDDs directly
  - More flexible
  - Multiple language support : Scala / Java / Python
  - Simpler / faster in some cases

54 + 5) Machine Learning : Hadoop / Spark
 | Hadoop | Spark
Tool | Mahout | MLlib
API | Java | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No | YES
 | Efforts to port Mahout into Spark | Lots of momentum!

55 + Spark Is Better Fit for Iterative Workloads

56 + Spark Caching!
- Reading data from a remote FS (S3) can be slow
- For small / medium data ( s of GB) use caching
  - Pay read penalty once
  - Cache
  - Then very high speed computes (in memory)
- Recommended for iterative workloads
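The "pay the read penalty once" pattern above can be sketched as follows. This is a minimal illustration under stated assumptions: Spark 1.x on the classpath, the job launched through spark-submit, and an input URI passed as the first program argument. The names `CacheDemo` and `countTwice` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CacheDemo {
  // Count the same RDD twice; returns (recordCount, firstMillis, secondMillis).
  def countTwice(rdd: RDD[String]): (Long, Long, Long) = {
    val cached = rdd.cache()   // lazy: only marks the RDD for in-memory storage
    val t0 = System.nanoTime()
    val n = cached.count()     // first action reads the source and fills the cache
    val t1 = System.nanoTime()
    cached.count()             // second action can be served from memory
    val t2 = System.nanoTime()
    (n, (t1 - t0) / 1000000L, (t2 - t1) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // args(0) is a hypothetical input URI supplied by the caller.
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))
    try {
      val (n, first, second) = countTwice(sc.textFile(args(0)))
      println(s"$n records; first count $first ms, second count $second ms")
    } finally sc.stop()
  }
}
```

The second count is only faster when the cached partitions actually fit in executor memory; partitions that do not fit are recomputed from the source.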

57 + Caching Demo!

58 + Caching Results Cached!

59 + Spark Caching
- Caching is pretty effective (small / medium data sets)
- Cached data cannot be shared across applications (each application executes in its own sandbox)

60 + Sharing Cached Data
- 1) Spark Job Server
  - Multiplexer
  - All requests are executed through the same context
  - Provides a web-service interface
- 2) Tachyon
  - Distributed in-memory file system
  - Memory is the new disk!
  - Out of AMP Lab, Berkeley
  - Early stages (very promising)

61 + Spark Job Server

62 + Spark Job Server
- Open sourced from Ooyala
- "Spark as a Service" : simple REST interface to launch jobs
- Sub-second latency!
- Pre-load jars for even faster spin-up
- Share cached RDDs across requests (NamedRDD)
    App1 : ctx.saverdd("my cached rdd", rdd1)
    App2 : RDD rdd2 = ctx.loadrdd("my cached rdd")
- https://github.com/spark-jobserver/spark-jobserver

63 + Tachyon + Spark

64 + Next : New Big Data Applications With Spark

65 + Big Data Applications : Now
- Analysis is done in batch mode (minutes / hours)
- Final results are stored in a real-time data store like Cassandra / HBase
- These results are displayed in a dashboard / web UI
- Doing interactive analysis????
  - Need special BI tools

66 + With Spark
- Load data set (gigabytes) from S3 and cache it (one time)
- Super fast (sub-second) queries on the data
- Response time : seconds (just like a web app!)

67 + Lessons Learned
- Build sophisticated apps!
- Web response time (few seconds)!!
- In-depth analytics
- Leverage existing libraries in Java / Scala / Python
- Data analytics as a service

68 + Final Thoughts
- Already on Hadoop?
  - Try Spark side-by-side
  - Process some data in HDFS
  - Try Spark SQL for Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose NFS or S3 file system
- Take advantage of caching
  - Iterative loads
  - Spark Job Server
  - Tachyon
- Build a new class of big / medium data apps

69 + Thanks! Sujee Maniyam Expert consulting & training in Big Data (Now offering Spark training)


More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

III Big Data Technologies

III Big Data Technologies III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Comprehensive Analytics on the Hortonworks Data Platform

Comprehensive Analytics on the Hortonworks Data Platform Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia Compliments of Learning Spark LIGHTNING-FAST DATA ANALYTICS Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia Bring Your Big Data to Life Big Data Integration and Analytics Learn how to power

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information

Write Once, Run Anywhere Pat McDonough

Write Once, Run Anywhere Pat McDonough Write Once, Run Anywhere Pat McDonough Write Once, Run Anywhere Write Once, Run Anywhere You Might Have Heard This Before! Java, According to Wikipedia Java, According to Wikipedia Java is a computer programming

More information

Big Data Research in the AMPLab: BDAS and Beyond

Big Data Research in the AMPLab: BDAS and Beyond Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned

More information

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai jason.dai@intel.com Intel Software and Services Group

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai jason.dai@intel.com Intel Software and Services Group Real-Time Analytical Processing (RTAP) Using the Spark Stack Jason Dai jason.dai@intel.com Intel Software and Services Group Project Overview Research & open source projects initiated by AMPLab in UC Berkeley

More information

Introduction to Spark

Introduction to Spark Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful

More information

Data Security in Hadoop

Data Security in Hadoop Data Security in Hadoop Eric Mizell Director, Solution Engineering Page 1 What is Data Security? Data Security for Hadoop allows you to administer a singular policy for authentication of users, authorize

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData HPE Vertica & Hadoop Tapping Innovation to Turbocharge Your Big Data #SeizeTheData The HPE Vertica portfolio One Vertica Engine running on Cloud, Bare Metal, or Hadoop Data Nodes HPE Vertica OnDemand &

More information

Brave New World: Hadoop vs. Spark

Brave New World: Hadoop vs. Spark Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,

More information

Apache Spark and Distributed Programming

Apache Spark and Distributed Programming Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache

More information

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC, Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC, Bellevue, WA Legal disclaimer The information in this

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Data Services Advisory

Data Services Advisory Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains

More information

Apache Spark. Christopher Homa. October 11, Apache Spark is an open source cluster computing framework.

Apache Spark. Christopher Homa. October 11, Apache Spark is an open source cluster computing framework. Apache Spark Christopher Homa October 11, 2016 Overview Apache Spark is an open source cluster computing framework. Initially developed at UC Berkeley s AMPLab in 2009, Spark was donated to Apache and

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache

More information

Streaming items through a cluster with Spark Streaming

Streaming items through a cluster with Spark Streaming Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member

More information

Distributed Computing" with Open-Source Software

Distributed Computing with Open-Source Software Distributed Computing" with Open-Source Software Reza Zadeh Presented at Infosys OSSmosis Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use

More information

xpaaerns on Spark, Shark, Tachyon and Mesos

xpaaerns on Spark, Shark, Tachyon and Mesos xpaaerns on Spark, Shark, Tachyon and Mesos Spark Summit 2014 Claudiu Barbura Sr. Director of Engineering A>geo Agenda xpa&erns Architecture From Hadoop to BDAS & our contribu

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information