Moving From Hadoop to Spark
1 + Moving From Hadoop to Spark
Sujee Maniyam, Founder / sujee@elephantscale.com
Bay Area ACM meetup
2 + HI, Featured in Hadoop Weekly #109
3 + About Me: Sujee Maniyam
- 15+ years of software development experience
- Consulting & training in Big Data
- Author
  - "Hadoop Illuminated" open source book
  - "HBase Design Patterns" (coming soon)
- Open source contributor (including HBase)
- Founder / organizer of the Big Data Guru meetup
- Contact: sujee@elephantscale.com
4 + Hadoop in 20 Seconds
- The Big Data platform
- Very well field tested
- Scales to petabytes of data
- MapReduce: batch-oriented compute
5 + Hadoop Ecosystem (diagram: real-time and batch components) ElephantScale.com, 2014
6 + Hadoop Ecosystem
- HDFS: provides distributed storage
- MapReduce: provides distributed computing
- Pig: high-level MapReduce
- Hive: SQL layer over Hadoop
- HBase: NoSQL storage for real-time queries
7 + Spark in 20 Seconds
- Fast & expressive cluster computing engine
- Compatible with Hadoop
- Came out of the Berkeley AMP Lab
- Now an Apache project
- Version 1.2 just released (Dec 2014)
"First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" (stratio.com)
8 + Spark Ecosystem
- Spark SQL: schema / SQL
- Spark Streaming: real time
- MLlib: machine learning
- GraphX: graph processing
- Spark Core
- Cluster managers: standalone, YARN, Mesos
9 + Hype-o-meter :-)
10 + Spark Job Trends
11 + Spark Benchmarks Source : stratio.com
12 + Spark Code / Activity Source : stratio.com
13 + Timeline : Hadoop & Spark
14 + Hadoop Vs. Spark (figure comparing Hadoop and Spark)
15 + Comparison With Hadoop
Hadoop
- Distributed storage + distributed compute
- MapReduce framework
- Usually data on disk (HDFS)
- Not ideal for iterative work
- Batch processing
Spark
- Distributed compute only
- Generalized computation
- On disk / in memory
- Great at iterative workloads (machine learning, etc.)
  - Up to 2x - 10x faster for data on disk
  - Up to 100x faster for data in memory
- Compact code
- Java, Python, Scala supported
- Shell for ad hoc exploration
16 + Hadoop + YARN: Universal OS for Distributed Compute
- Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
- Cluster management: YARN
- Storage: HDFS
17 + Spark Is Better Fit for Iterative Workloads
18 + Spark Programming Model
- More generic than MapReduce
19 + Is Spark Replacing Hadoop?
- Spark runs on Hadoop / YARN
  - Complementary
- Spark's programming model is more flexible than MapReduce
- Spark is really great if the data fits in memory (a few hundred GB)
- Spark is storage agnostic (see next slide)
20 + Spark & Pluggable Storage
- Spark (compute engine) on top of pluggable storage: HDFS, Amazon S3, Cassandra, ???
21 + Spark & Hadoop
Use case: other tools vs. Spark
- Batch processing: Hadoop's MapReduce (Java, Pig, Hive) vs. Spark RDDs (Java / Scala / Python)
- SQL querying: Hadoop Hive vs. Spark SQL
- Stream processing / real-time processing: Storm, Kafka vs. Spark Streaming
- Machine learning: Mahout vs. Spark MLlib
- Real-time lookups: NoSQL (HBase, Cassandra, etc.) vs. no Spark component, but Spark can query data in NoSQL stores
22 + Hadoop & Spark Future???
23 + Why Move From Hadoop to Spark?
- Spark is easier than Hadoop
  - Friendlier for data scientists / analysts
- Interactive shell
  - Fast development cycles
  - Ad hoc exploration
- API supports multiple languages
  - Java, Scala, Python
- Great for small (GBs) to medium (100s of GBs) data
24 + Spark: Unified Stack
- Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need!
  - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Image: buymeposters.com
25 + Migrating From Hadoop → Spark
Functionality: Hadoop vs. Spark
- Distributed storage: HDFS vs. cloud storage like Amazon S3, or NFS mounts
- SQL querying: Hive vs. Spark SQL
- ETL workflow: Pig vs. Spork (Pig on Spark), or a mix of Spark SQL, etc.
- Machine learning: Mahout vs. MLlib
- NoSQL DB: HBase vs. ???
26 + Moving From Hadoop → Spark
1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning
27 + Hadoop To Spark (ecosystem diagram with real-time and batch components: Spark can help)
28 + Big Data
29 + Data Size: You Don't Have Big Data
30 + 1) Data Size (T-shirt sizing)
(figure: a size scale from under a few GB, through 10 GB, 1 TB and TB+, up to PB+, with Spark at the small end and Hadoop at the large end)
Image credit: blog.trumpi.co.za
31 + 1) Data Size
- Lots of Spark adoption at small to medium scale
  - Good fit
  - Data might fit in memory!
  - Hadoop may be overkill
- Applications
  - Iterative workloads (machine learning, etc.)
  - Streaming
- Hadoop is still the preferred platform for TB+ data
32 + Next : 2) File System
33 + 2) File System
- Hadoop = storage + compute; Spark = compute only, so Spark needs a distributed FS
- File system choices for Spark
  - HDFS (Hadoop Distributed File System)
    - Reliable
    - Good performance (data locality)
    - Field tested for PBs of data
  - S3 (Amazon)
    - Reliable cloud storage
    - Huge scale
  - NFS (Network File System): shared FS across machines
34 + Spark File Systems
35 + File Systems For Spark
Property: HDFS / NFS / Amazon S3
- Data locality: High (best) / Local enough / None (ok)
- Throughput: High (best) / Medium (good) / Low (ok)
- Latency: Low (best) / Low / High
- Reliability: Very high (replicated) / Low / Very high
- Cost: Varies / Varies / $30 / TB / month
36 + File System Throughput Comparison (HDFS Vs. S3)
- Data: 10 GB+ (11.3 GB)
  - Each file: ~1+ GB (x 10)
  - 400 million records total
  - Partition size: 128 MB
  - On HDFS & S3
- Cluster
  - 8 nodes on Amazon m3.xlarge (4 CPUs, 15 GB memory, 40 GB SSD)
  - Hadoop cluster: latest Hortonworks HDP v2.2
  - Spark: on the same 8 nodes, standalone, v1.2
37 + File System Throughput Comparison (HDFS Vs. S3)
    val hdfs = sc.textFile("hdfs:/// /10G/")
    val s3 = sc.textFile("s3n:// /10G/")
    // count the number of records in each copy of the data set
    hdfs.count()
    s3.count()
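To compare the two counts, each call has to be timed. The following is a minimal sketch in plain Scala, not the measurement code used for the talk: the `Timing` object and its `timed` helper are hypothetical names introduced here, and the helper measures wall-clock time for a single run only.

```scala
object Timing {
  // Hypothetical helper: evaluates `body` once and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

In the Spark shell, `Timing.timed { hdfs.count() }` and `Timing.timed { s3.count() }` would then return each record count alongside the time it took, assuming `hdfs` and `s3` are defined as on the slide.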
38 + HDFS Vs. S3
39 + HDFS Vs. S3 (lower is better)
40 + HDFS Vs. S3 Conclusions
HDFS
- Data locality → much higher throughput
- Need to maintain a Hadoop cluster
- Large data sets (TB+)
S3
- Data is streamed → lower throughput
- No Hadoop cluster to maintain → convenient
- Good use case:
  - Smallish data sets (a few GB)
  - Load once, cache and reuse
41 + Next : 3) SQL
42 + 3) SQL in Hadoop / Spark
Property: Hadoop vs. Spark
- Engine: Hive vs. Spark SQL
- Language: HiveQL vs. HiveQL, or RDD programming in Java / Python / Scala
- Scale: petabytes vs. terabytes?
- Interoperability: Spark SQL can read Hive tables or standalone data
- Formats: CSV, JSON, Parquet vs. CSV, JSON, Parquet
43 + SQL In Hadoop / Spark
- Input: billing records / CDR
  - Columns: Timestamp (milliseconds), Customer_id (String), Resource_id (Int), Qty (Int), Cost (Int)
  - Sample values (resource, qty, cost): Phone, 10, 10c; SMS, 1, 4c; Data, 3M, 5c
- Query: find the top 10 customers
- Data set
  - 10 GB+ data
  - 400 million records
  - CSV format
44 + SQL In Hadoop / Spark
- Hive table:
    CREATE EXTERNAL TABLE billing (
        ts BIGINT,
        customer_id INT,
        resource_id INT,
        qty INT,
        cost INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs location';
- Hive query (simple aggregate):
    SELECT customer_id, SUM(cost) AS total
    FROM billing
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10;
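To make clear what the aggregate computes, here is a minimal sketch of the same group, sum, sort and limit steps on a local Scala collection. It is an illustration only, not how Hive or Spark executes the query: the `TopCustomers` object and `Billing` case class are hypothetical names, and breaking ties by customer id is an assumption added here, since the query itself leaves tie order unspecified.

```scala
object TopCustomers {
  // Hypothetical record type; fields mirror the columns of the billing table.
  case class Billing(ts: Long, customerId: Int, resourceId: Int, qty: Int, cost: Int)

  // Sum cost per customer, order by total descending, keep the first n.
  // Totals are accumulated as Long so large sums do not overflow Int.
  def topN(records: Seq[Billing], n: Int): Seq[(Int, Long)] =
    records
      .groupBy(_.customerId)
      .map { case (id, rs) => (id, rs.map(_.cost.toLong).sum) }
      .toSeq
      // Assumption: ties on total are broken by ascending customer id.
      .sortBy { case (id, total) => (-total, id) }
      .take(n)
}
```

Calling `TopCustomers.topN(records, 10)` corresponds to the `LIMIT 10` query, with the whole data set held in local memory rather than distributed.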
45 + Hive Query Results
46 + Spark + Hive Table
- Spark code to access the Hive table:
    import org.apache.spark.sql.hive.HiveContext
    // build a HiveContext on top of the shell's SparkContext (sc)
    val hiveCtx = new HiveContext(sc)
    val top10 = hiveCtx.sql("select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10")
    top10.collect()
47 + Spark SQL Vs. Hive Fast on same HDFS data!
48 + SQL In Hadoop / Spark: Conclusions
- Spark can readily query Hive tables
  - Speed!
  - Great for exploring / trying out
  - Fast iterative development
- Spark can load data natively
  - CSV
  - JSON (schema automatically inferred)
  - Parquet (schema automatically inferred)
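Unlike JSON and Parquet, the list above does not mention schema inference for CSV, so a typed record is usually built by hand. The following is a minimal sketch of that parsing step in plain Scala. The `BillingCsv` object, the `Billing` case class and the five-column layout (taken from the Hive table definition) are assumptions for illustration, and quoted fields containing commas are not handled.

```scala
import scala.util.Try

object BillingCsv {
  // Hypothetical record type; column order follows the billing table.
  case class Billing(ts: Long, customerId: Int, resourceId: Int, qty: Int, cost: Int)

  // Parse one comma-separated line. Returns None when the line does not
  // have exactly five fields or when a field is not a valid number.
  def parse(line: String): Option[Billing] =
    line.split(",", -1).map(_.trim) match {
      case Array(ts, cust, res, qty, cost) =>
        Try(Billing(ts.toLong, cust.toInt, res.toInt, qty.toInt, cost.toInt)).toOption
      case _ => None
    }
}
```

In the Spark shell, a function like this would typically be applied to each line returned by `sc.textFile`, dropping the lines that fail to parse.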
49 + Next : 4) ETL In Hadoop / Spark
50 + ETL? (diagram with nodes Data 1, Data 2 (clean), Data 3, Data 4)
51 + 4) ETL on Hadoop / Spark
Item: Hadoop vs. Spark
- ETL tools: Pig, Cascading, Oozie vs. native RDD programming (Scala, Java, Python)
- Pig: high-level ETL workflow vs. Spork (Pig on Spark)
- Cascading: high level vs. Spark-scalding
52 + ETL On Hadoop / Spark
- Pig
  - High-level, expressive data flow language (Pig Latin)
  - Easier to program than Java MapReduce
  - Used for ETL (data cleanup / data prep)
  - Spork: run Pig on Spark (as simple as $ pig -x spark ..)
- Cascading
  - High-level data flow declarations
  - Many sources (Cassandra / Accumulo / Solr)
  - Spark-Scalding
53 + ETL On Hadoop / Spark: Conclusions
- Try Spork or spark-scalding
  - Code reuse
  - No rewriting from scratch
- Program RDDs directly
  - More flexible
  - Multiple language support: Scala / Java / Python
  - Simpler / faster in some cases
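As a rough illustration of what "program RDDs directly" looks like for data cleanup, here is a minimal sketch of a cleanup step written as a chain of `map`, `filter` and `distinct` calls. It runs on a local Scala `Seq` so it can be tried without a cluster; the `CleanupEtl` object and the specific cleaning rules (trim, drop blanks and `#` comments, lower-case, collapse whitespace) are assumptions chosen for the example.

```scala
object CleanupEtl {
  // Hypothetical cleanup rules for a simple data-prep step.
  def clean(lines: Seq[String]): Seq[String] =
    lines
      .map(_.trim)                                   // strip surrounding whitespace
      .filter(l => l.nonEmpty && !l.startsWith("#")) // drop blanks and comments
      .map(_.toLowerCase.replaceAll("\\s+", " "))    // normalize case and spacing
      .distinct                                      // remove duplicate rows
}
```

Spark RDDs offer operations with the same names (`map`, `filter`, `distinct`), so a pipeline of this shape carries over to an RDD of lines with little change.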
54 + 5) Machine Learning: Hadoop / Spark
Property: Hadoop vs. Spark
- Tool: Mahout vs. MLlib
- API: Java vs. Java / Scala / Python
- Iterative algorithms: slower vs. very fast (in memory)
- In-memory processing: no (there are efforts to port Mahout to Spark) vs. yes (lots of momentum!)
55 + Spark Is Better Fit for Iterative Workloads
56 + Spark Caching!
- Reading data from a remote FS (S3) can be slow
- For small / medium data (GBs), use caching
  - Pay the read penalty once
  - Cache
  - Then very high speed computes (in memory)
- Recommended for iterative workloads
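The "pay the read penalty once" idea can be shown with a small local analogy. This sketch is plain Scala and does not use Spark: the `CachingSketch` object and its `Source` class are hypothetical stand-ins for a slow remote read, with a counter that records how often the data is actually loaded.

```scala
object CachingSketch {
  // Hypothetical stand-in for a slow remote data source.
  final class Source[A](load: () => Seq[A]) {
    // Number of times the underlying data has been loaded.
    var loads = 0

    // Uncached access: every call reloads the data.
    def read(): Seq[A] = { loads += 1; load() }

    // Cached access: loaded on first use, then reused from memory.
    lazy val cached: Seq[A] = read()
  }
}
```

In Spark the corresponding step is calling `cache()` on an RDD: the first action reads the data and keeps it in memory, and later actions on the same RDD reuse the cached copy.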
57 + Caching Demo!
58 + Caching Results Cached!
59 + Spark Caching
- Caching is pretty effective (small / medium data sets)
- Cached data cannot be shared across applications (each application executes in its own sandbox)
60 + Sharing Cached Data
- 1) Spark Job Server
  - Multiplexer
  - All requests are executed through the same context
  - Provides a web-service interface
- 2) Tachyon
  - Distributed in-memory file system
  - Memory is the new disk!
  - Out of the AMP Lab, Berkeley
  - Early stages (very promising)
61 + Spark Job Server
62 + Spark Job Server
- Open sourced by Ooyala
- "Spark as a Service": simple REST interface to launch jobs
- Sub-second latency!
- Pre-load jars for even faster spin-up
- Share cached RDDs across requests (NamedRDD), in pseudocode:
    App1: ctx.saverdd("my cached rdd", rdd1)
    App2: RDD rdd2 = ctx.loadrdd("my cached rdd")
63 + Tachyon + Spark
64 + Next : New Big Data Applications With Spark
65 + Big Data Applications: Now
- Analysis is done in batch mode (minutes / hours)
- Final results are stored in a real-time data store like Cassandra / HBase
- These results are displayed in a dashboard / web UI
- Doing interactive analysis?
  - Needs special BI tools
66 + With Spark
- Load the data set (gigabytes) from S3 and cache it (one time)
- Super fast (sub-second) queries on the data
- Response time: seconds (just like a web app!)
67 + Lessons Learned
- Build sophisticated apps!
- Web response time (a few seconds)!
- In-depth analytics
  - Leverage existing libraries in Java / Scala / Python
- "Data analytics as a service"
68 + Final Thoughts
- Already on Hadoop?
  - Try Spark side-by-side
  - Process some data in HDFS
  - Try Spark SQL for Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose NFS or S3 as the file system
- Take advantage of caching
  - Iterative loads
  - Spark Job Server
  - Tachyon
- Build a new class of big / medium data apps
69 + Thanks! Sujee Maniyam Expert consulting & training in Big Data (Now offering Spark training)
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHow To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationAli Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
More informationUpcoming Announcements
Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC jmarkham@hortonworks.com Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationNext-Gen Big Data Analytics using the Spark stack
Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our
More informationHow Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
More informationWhy Spark on Hadoop Matters
Why Spark on Hadoop Matters MC Srivas, CTO and Founder, MapR Technologies Apache Spark Summit - July 1, 2014 1 MapR Overview Top Ranked Exponential Growth 500+ Customers Cloud Leaders 3X bookings Q1 13
More informationApache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
More informationCS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
More informationBig Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
More informationSpark: Making Big Data Interactive & Real-Time
Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency
More informationDistributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us
DATA INTELLIGENCE FOR ALL Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us Christopher Nguyen, PhD Co-Founder & CEO Agenda 1. Challenges & Motivation 2. DDF Overview 3. DDF Design
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationSelf-service BI for big data applications using Apache Drill
Self-service BI for big data applications using Apache Drill 2015 MapR Technologies 2015 MapR Technologies 1 Data Is Doubling Every Two Years Unstructured data will account for more than 80% of the data
More informationConquering Big Data with BDAS (Berkeley Data Analytics)
UC BERKELEY Conquering Big Data with BDAS (Berkeley Data Analytics) Ion Stoica UC Berkeley / Databricks / Conviva Extracting Value from Big Data Insights, diagnosis, e.g.,» Why is user engagement dropping?»
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationDeveloping Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationSelf-service BI for big data applications using Apache Drill
Self-service BI for big data applications using Apache Drill 2015 MapR Technologies 2015 MapR Technologies 1 Management - MCS MapR Data Platform for Hadoop and NoSQL APACHE HADOOP AND OSS ECOSYSTEM Batch
More informationAnalytics on Spark & Shark @Yahoo
Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013 Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment
More informationHDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationHadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
More informationBeyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationBig Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationBig Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth
MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager steve.gonzales@thinkbiganalytics.com
More informationHDP Enabling the Modern Data Architecture
HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,
More informationInformation Builders Mission & Value Proposition
Value 10/06/2015 2015 MapR Technologies 2015 MapR Technologies 1 Information Builders Mission & Value Proposition Economies of Scale & Increasing Returns (Note: Not to be confused with diminishing returns
More informationFast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationSpark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
More informationGAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION
GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.
More informationBig Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
More informationNative Connectivity to Big Data Sources in MSTR 10
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
More informationLearning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
Compliments of Learning Spark LIGHTNING-FAST DATA ANALYTICS Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia Bring Your Big Data to Life Big Data Integration and Analytics Learn how to power
More informationHortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015
Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015 We Do Hadoop Fall 2014 Page 1 HDP delivers a comprehensive data management platform GOVERNANCE Hortonworks Data Platform
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationBig Data Open Source Stack vs. Traditional Stack for BI and Analytics
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at spoozhikala@stratapps.com.
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationCSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus
More informationTE's Analytics on Hadoop and SAP HANA Using SAP Vora
TE's Analytics on Hadoop and SAP HANA Using SAP Vora Naveen Narra Senior Manager TE Connectivity Santha Kumar Rajendran Enterprise Data Architect TE Balaji Krishna - Director, SAP HANA Product Mgmt. -
More informationThis is a brief tutorial that explains the basics of Spark SQL programming.
About the Tutorial Apache Spark is a lightning-fast cluster computing designed for fast computation. It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationNear Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationHADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
More informationUnified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
More informationComprehensive Analytics on the Hortonworks Data Platform
Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationReal-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai jason.dai@intel.com Intel Software and Services Group
Real-Time Analytical Processing (RTAP) Using the Spark Stack Jason Dai jason.dai@intel.com Intel Software and Services Group Project Overview Research & open source projects initiated by AMPLab in UC Berkeley
More informationArchitectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationDominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationBig Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationApache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
More informationSurvey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects
Introduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
SQL on NoSQL (and all of the data) With Apache Drill
SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of
HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData
HPE Vertica & Hadoop Tapping Innovation to Turbocharge Your Big Data #SeizeTheData The HPE Vertica portfolio One Vertica Engine running on Cloud, Bare Metal, or Hadoop Data Nodes HPE Vertica OnDemand &
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,
Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC, Bellevue, WA Legal disclaimer The information in this
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Data Security in Hadoop
Data Security in Hadoop Eric Mizell Director, Solution Engineering Page 1 What is Data Security? Data Security for Hadoop allows you to administer a singular policy for authentication of users, authorize
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Data Services Advisory
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
Streaming items through a cluster with Spark Streaming
Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member
Big Data Analytics Platform @ Nokia
Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform
Write Once, Run Anywhere Pat McDonough
Write Once, Run Anywhere Pat McDonough Write Once, Run Anywhere Write Once, Run Anywhere You Might Have Heard This Before! Java, According to Wikipedia Java, According to Wikipedia Java is a computer programming
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
An Open Source Memory-Centric Distributed Storage System
An Open Source Memory-Centric Distributed Storage System Haoyuan Li, Tachyon Nexus haoyuan@tachyonnexus.com September 30, 2015 @ Strata and Hadoop World NYC 2015 Outline Open Source Introduction to Tachyon
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
Lecture 10: HBase. Claudia Hauff (Web Information Systems) ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 10: HBase Claudia Hauff (Web Information Systems) ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
Big Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Spark Application Carousel. Spark Summit East 2015
Spark Application Carousel Spark Summit East 2015 About Today s Talk About Me: Vida Ha - Solutions Engineer at Databricks. Goal: For beginning/early intermediate Spark Developers. Motivate you to start
Communicating with the Elephant in the Data Center
Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline
Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera
Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,
xPatterns on Spark, Shark, Tachyon and Mesos
xPatterns on Spark, Shark, Tachyon and Mesos Spark Summit 2014 Claudiu Barbura Sr. Director of Engineering Atigeo Agenda xPatterns Architecture From Hadoop to BDAS & our contribu
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web