Next-Gen Big Data Analytics using the Spark stack

Similar documents
Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Hadoop Applications on High Performance Computing. Devaraj Kavali

Extended Attributes and Transparent Encryption in Apache Hadoop

Moving From Hadoop to Spark

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Unified Big Data Analytics Pipeline. 连 城

Significantly Speed up real world big data Applications using Apache Spark

How To Create A Data Visualization With Apache Spark And Zeppelin

Conquering Big Data with BDAS (Berkeley Data Analytics)

Dell* In-Memory Appliance for Cloudera* Enterprise

An Open Source Memory-Centric Distributed Storage System

Unified Big Data Processing with Apache Spark. Matei

Beyond Hadoop with Apache Spark and BDAS

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

The Internet of Things and Big Data: Intro

Big Data Processing. Patrick Wendell Databricks

Fast, Low-Overhead Encryption for Apache Hadoop*

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

Big Data Research in the AMPLab: BDAS and Beyond

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Hadoop & Spark Using Amazon EMR

Analytics on Spark &

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hetero Streams Library 1.0

Big Data and Industrial Internet

Spark: Making Big Data Interactive & Real-Time

Tachyon: A Reliable Memory Centric Storage for Big Data Analytics

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Architectures for massive data management

Dell In-Memory Appliance for Cloudera Enterprise

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Accelerating Business Intelligence with Large-Scale System Memory

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

How Companies are! Using Spark

Accelerating Business Intelligence with Large-Scale System Memory

Hadoop Ecosystem B Y R A H I M A.

What s next for the Berkeley Data Analytics Stack?

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

HiBench Introduction. Carson Wang Software & Services Group

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software

Intel Media SDK Library Distribution and Dispatching Process

Ali Ghodsi Head of PM and Engineering Databricks

CS 294: Big Data System Research: Trends and Challenges

Why Spark on Hadoop Matters

Oracle Big Data SQL Technical Update

Real Time Data Processing using Spark Streaming

Big Data Analytics - Accelerated. stream-horizon.com

The Case for Rack Scale Architecture

Enterprise Operational SQL on Hadoop Trafodion Overview

Page Modification Logging for Virtual Machine Monitor White Paper

Hur hanterar vi utmaningar inom området - Big Data. Jan Östling Enterprise Technologies Intel Corporation, NER

Shark Installation Guide Week 3 Report. Ankush Arora

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Implement Hadoop jobs to extract business value from large and varied data sets

Processing NGS Data with Hadoop-BAM and SeqPig

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Intel Platform and Big Data: Making big data work for you.

YARN Apache Hadoop Next Generation Compute Platform

Tachyon: Reliable File Sharing at Memory- Speed Across Cluster Frameworks

GridGain In- Memory Data Fabric: UlCmate Speed and Scale for TransacCons and AnalyCcs

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

This is a brief tutorial that explains the basics of Spark SQL programming.

Brave New World: Hadoop vs. Spark

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Intel Service Assurance Administrator. Product Overview

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Cloud Computing. Big Data. High Performance Computing

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

From Spark to Ignition:

Big Data Course Highlights

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Intel Unite. User Guide

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Conquering Big Data with Apache Spark

Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage

The Berkeley Data Analytics Stack: Present and Future

Reference Architecture, Requirements, Gaps, Roles

The Future of Data Management

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Upcoming Announcements

Putting Apache Kafka to Use!

Transcription:

Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel

Agenda Overview Apache Spark stack Next-gen big data analytics Our vision and efforts Highlights of our work on Spark stack Reliability of Spark Streaming SQL processing on Spark Spark Stream-SQL Tachyon hierarchical storage Analytics & SparkR 2

Next-Gen Big Data Analytics Volume Massive scale & exponential growth Variety Multi-structured, diverse sources & inconsistent schemas Value Simple (SQL) descriptive analytics Complex (non-sql) predictive analytics Velocity Interactive the speed of thought Streaming/online drinking from the firehose The Spark stack 3

Project Overview Research & open source projects initiated by AMPLab in UC Berkeley BDAS: Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/) Intel closely collaborating with AMPLab & community on open source development Started in 2012 when still a research project Intel among top 3 Spark contributors Multiple Intel committers since the start of the project Many critical contributions Netty based shuffle, FairScheduler, metrics system, yarn-client mode, Continuous collaborations with AMPLab on new innovations in BDAS Major industry partners for Tachyon, GraphX & other machine learning efforts 4

Next Generations of Big Data Analytics Next-gen big data analytics paradigm Data captured & processed in a (semi-) real-time / streaming fashion Data mined using SQL queries as well as large-scale advanced analytics (e.g., machine learning, graph analysis, etc.) Interactive and iterative computations leveraging distributed in-memory data cache Events / Logs Messaging / Queue Stream Processing In-Memory Store Low Latency Query Engine Ad-hoc, interactive query & OLAP / BI Persistent Storage NoSQL Data Warehouse Low Latency Processing Engine Iterative machine learning and data mining 5

Next-Gen Big Data Analytics using the Spark stack Build next-gen Big Data analytics with the Spark stack Analysis on streaming data: real-time decisions Interactive analysis: faster decisions Advanced analytics: better decisions Events / Logs Kafka Spark Streaming RDD Cache Spark (Spark-SQL, Hive on Spark, etc.) Ad-hoc, interactive query & OLAP / BI HDFS NoSQL Data Warehouse Spark, (MLlib, GraphX, etc.) Iterative machine learning and data mining For more details, see our talk Real-Time Analytical Processing (RTAP) using Spark and Shark at Strata Conference + Hadoop World NYC 2013 6

Agenda Overview Apache Spark stack Next-gen big data analytics Our vision and efforts Highlights of our work on Spark stack Reliability of Spark Streaming SQL processing on Spark Spark Stream-SQL Tachyon hierarchical storage Analytics & SparkR 7

Spark Streaming Spark Streaming: discrete stream (DStream) Run streaming computation as a series of very small, deterministic (mini-batch) Spark jobs As frequent as ~1/2 second Better fault tolerance, straggler handling & state consistency High reliability of Spark Streaming Collaborations between Databricks, Cloudera and Intel Improving Spark driver reliability through WAL https://issues.apache.org/jira/browse/spark-3129 Improving Kafka receiver reliability by better managing Kafka offset tacking & WAL operations https://issues.apache.org/jira/browse/spark-4062 input time = 0-1: input time = 1-2: input stream Spark (mini-batch) job state / output state / output stream 8

SQL Processing on Spark: Hive on Spark Hive on Spark Spark as a new execution engine for Hive Combine the strength of Hive and Spark Support full Hive feature set Utilize Spark as the powerful execution engine, greatly boost SQL performance Smooth migration for existing Hive users Hive M/R TEZ Spark Community efforts https://issues.apache.org/jira/browse/hive-7292 Collaborations among Cloudera, Intel, MapR, Databricks, IBM, etc. Current plan Merge back to Hive trunk in February 2015 Community Beta release in 1H 2015 9

SQL Processing on Spark: Spark SQL Spark SQL Structured data analysis using SQL queries on Spark Hive tables, Parquet files, etc. Integration with analytics pipelines Intel s efforts Functionalities (e.g., HiveQL compatibility) Data types (Timestamp, Date, Union, etc.) GroupingSet/ROLLUP/CUBE (https://issues.apache.org/jira/browse/spark-2663) Performance (esp. complex queries) Join performance (https://issues.apache.org/jira/browse/spark-2211) Aggregation performance (https://issues.apache.org/jira/browse/spark-4366) Hive UDF performance (https://issues.apache.org/jira/browse/spark-4093) 10

Spark Stream-SQL: SQL + Streaming Spark Stream-SQL Processing and analyzing input data stream (potentially combined with history/reference data) using SQL queries Built on top of Spark Streaming and Spark SQL frameworks Events / Logs Kafka Spark Streaming RDD Cache Spark-SQL Ad-hoc, interactive query & OLAP / BI HDFS NoSQL Data Warehouse Spark, (MLLib, GraphX, etc.) Iterative machine learning and data mining CREATE STREAM IF NOT EXISTS people_stream1 (name STRING, age INT) STORED AS LOCATION kafka:// ; CREATE STREAM IF NOT EXISTS people_stream2 (name STRING, zipcode INT) STORED AS LOCATION kafka:// ; SELECT zipcode, AVG(age) FROM people_stream1 JOIN people_stream2 ON people_stream1.name = people_stream2.name GROUPBY zipcode;

Spark Stream-SQL: SQL + Streaming Open sourced under Apache 2.0 license https://github.com/intel-spark/stream-sql Developer preview (based on Spark 0.7) available Currently under active development An update based on latest Spark version will be available soon Many more features & optimizations are being added Plan to contribute back to the main Spark project Welcome Collaboration! 12

Tachyon Hierarchical Storage Tachyon Distributed, reliable in-memory file system unifying different underlying storage systems (Source: http://www.slideshare.net/haoyuanli/tachyon20141121ampcamp5-41881671) Spark MapRe duce Spark H2O GraphX Impala SQL Memory across different servers organized as a cache pool for memory-speed data sharing HDFS S3 Gluster FS Orange FS NFS Ceph Tachyon hierarchical storage The cache pool manages multiple storage tiers for memory-speed data sharing Efficient support for flash/ssd, cloud, and/ or HPC environment Currently under active development T A C H Y O N Spark Server Ramdisk Local SSD Local HDD MapReduce Server Ramdisk Local SSD Local HDD Spark Streaming Cache Pool HDFS Server Ramdisk Local SSD Local HDD MLlib... https://tachyon.atlassian.net/browse/tachyon-33 Will be available soon in Tachyon 0.6 release https://github.com/amplab/tachyon 13

Analytics & SparkR SparkR Distributed R frontend for analytics on Spark RDD Distributed list in R Alpha developer release from UC Berkeley https://github.com/amplab-extras/sparkr-pkg Our efforts/plans Complete SparkR RDD APIs Analytics performance improvements Better integrations of SparkR and Spark analytics frameworks MLlib, GraphX, etc. Distributed DataFrame support Leveraging Spark SQL (SchmeRDD, optimizer, etc.) 14

Summary Driving next generations of Big Data Analytics with Spark Stack Analysis on streaming data: real-time decisions Interactive analysis: faster decisions Advanced analytics: better decisions Partnering with open source community BDAS (Berkeley Data Analytics Stack) Apache Spark project Tachyon project SparkR project Welcome Collaboration! 15

Notices and Disclaimers Copyright 2014 Intel Corporation. Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from publish. 16