Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai jason.dai@intel.com Intel Software and Services Group



Similar documents
Next-Gen Big Data Analytics using the Spark stack

Significantly Speed up real world big data Applications using Apache Spark

Real-Time Big Data Analytics SAP HANA with the Intel Distribution for Apache Hadoop software

Fast, Low-Overhead Encryption for Apache Hadoop*

Moving From Hadoop to Spark

Accelerating Business Intelligence with Large-Scale System Memory

Cloud based Holdfast Electronic Sports Game Platform

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

Intel Service Assurance Administrator. Product Overview

Analytics on Spark &

Accelerating Business Intelligence with Large-Scale System Memory

Unified Batch & Stream Processing Platform

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Extended Attributes and Transparent Encryption in Apache Hadoop

Configuring RAID for Optimal Performance

Beyond Hadoop with Apache Spark and BDAS

Putting Apache Kafka to Use!

Intel Media SDK Library Distribution and Dispatching Process

How To Create A Data Visualization With Apache Spark And Zeppelin

Hadoop Ecosystem B Y R A H I M A.

Unified Big Data Processing with Apache Spark. Matei

Native Connectivity to Big Data Sources in MSTR 10

Intel Ethernet and Configuring Single Root I/O Virtualization (SR-IOV) on Microsoft* Windows* Server 2012 Hyper-V. Technical Brief v1.

SQL Server 2012 Performance White Paper

Ali Ghodsi Head of PM and Engineering Databricks

Intel X38 Express Chipset Memory Technology and Configuration Guide

What s next for the Berkeley Data Analytics Stack?

COSBench: A benchmark Tool for Cloud Object Storage Services. Jiangang.Duan@intel.com

Intel Platform and Big Data: Making big data work for you.

Hadoop & Spark Using Amazon EMR

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

HiBench Introduction. Carson Wang Software & Services Group

High Performance Computing and Big Data: The coming wave.

Intel Solid-State Drive Pro 2500 Series Opal* Compatibility Guide

The Internet of Things and Big Data: Intro

Unified Big Data Analytics Pipeline. 连 城

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Intel SSD 520 Series Specification Update

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Benefits of Intel Matrix Storage Technology

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Intel Core i5 processor 520E CPU Embedded Application Power Guideline Addendum January 2011

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Data Research in the AMPLab: BDAS and Beyond

Big Data Analytics - Accelerated. stream-horizon.com

Testing Big data is one of the biggest

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Dell* In-Memory Appliance for Cloudera* Enterprise

Different NFV/SDN Solutions for Telecoms and Enterprise Cloud

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Spark: Making Big Data Interactive & Real-Time

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Big Data for Big Science. Bernard Doering Business Development, EMEA Big Data Software

Oracle Big Data SQL Technical Update

Intel and Qihoo 360 Internet Portal Datacenter - Big Data Storage Optimization Case Study

* * * Intel RealSense SDK Architecture

Managing large clusters resources

Intel Matrix Storage Console

System Event Log (SEL) Viewer User Guide

The Future of Data Management

Intel Desktop Board D945GCPE

Intelligent Business Operations

Big Data Analytics - Accelerated. stream-horizon.com

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CSE-E5430 Scalable Cloud Computing Lecture 11

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Comprehensive Analytics on the Hortonworks Data Platform

Cloud Service Brokerage Case Study. Health Insurance Association Launches a Security and Integration Cloud Service Brokerage

Intel Desktop Board D945GCPE Specification Update

Big Data Analytics with IBM Cognos BI Dynamic Query IBM Redbooks Solution Guide

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Real-Time Big Data Analytics for the Enterprise

Online Courses. Version 9 Comprehensive Series. What's New Series

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Roadmap for Transforming Intel s Business with Advanced Analytics

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

An Oracle White Paper October Oracle: Big Data for the Enterprise

ORACLE BUSINESS INTELLIGENCE, ORACLE DATABASE, AND EXADATA INTEGRATION

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Dell In-Memory Appliance for Cloudera Enterprise

VNF & Performance: A practical approach

iscsi Quick-Connect Guide for Red Hat Linux

Intel RAID RS25 Series Performance

An Oracle White Paper June Oracle: Big Data for the Enterprise

How Companies are! Using Spark

Trafodion Operational SQL-on-Hadoop

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Large scale processing using Hadoop. Ján Vaňo

Reference Architecture, Requirements, Gaps, Roles

Transcription:

Real-Time Analytical Processing (RTAP) Using the Spark Stack Jason Dai jason.dai@intel.com Intel Software and Services Group

Project Overview Research & open source projects initiated by AMPLab in UC Berkeley BDAS: Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/) Intel closely collaborating with AMPLab & the community on open source development Apache incubation since June 2013 The most active cluster data processing engine after Hadoop MapReduce Intel partnering with several big websites Building next-gen big data analytics using the Spark stack Real-time analytical processing (RTAP) E.g., Alibaba, Baidu iqiyi, Youku, etc. 2

Next-Gen Big Data Analytics Volume Massive scale & exponential growth Variety Multi-structured, diverse sources & inconsistent schemas Value Simple (SQL) descriptive analytics Complex (non-sql) predictive analytics Velocity Interactive the speed of thought Streaming/online drinking from the firehose 3

Next-Gen Big Data Analytics Volume Massive scale & exponential growth Variety Multi-structured, diverse sources & inconsistent schemas Value Simple (SQL) descriptive analytics Complex (non-sql) predictive analytics Velocity Interactive the speed of thought Streaming/online drinking from the firehose The Spark stack 4

Real-Time Analytical Processing (RTAP) A vision for next-gen big data analytics Data captured & processed in a (semi) streaming/online fashion Real-time & history data combined and mined interactively and/or iteratively Complex OLAP / BI in interactive fashion Iterative, complex machine learning & graph analysis Predominantly memory-based computation Events / Logs Messaging / Queue Stream Processing In-Memory Store Persistent Storage Online Analysis / Dashboard Low Latency Processing Engine Ad-hoc, interactive OLAP / BI NoSQL Data Warehouse Iterative, complex machine learning & graph analysis 5

Real-World Use Case #1 (Semi) Real-Time Log Aggregation & Analysis Logs continuously collected & streamed in Through queuing/messaging systems Incoming logs processed in a (semi) streaming fashion Aggregations for different time periods, demographics, etc. Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion Monitoring, alerting, etc. 6

(Semi) Real-Time Log Aggregation: Mini- Batch Jobs Log Collector Kafka Client Log Collector Kafka Client Server Kafka Spark Mesos Server Kafka Spark Mesos Server Kafka Spark Mesos Architecture Log collected and published to Kafka continuously Kafka cluster collocated with Spark Cluster Independent Spark applications launched at regular interval A series of mini-batch apps running in fine-grained mode on Mesos Process newly received data in Kafka (e.g., aggregation by keys) & write results out In production in the user s environment today 10s of seconds latency RDBMS 7 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

(Semi) Real-Time Log Aggregation: Mini- Batch Jobs Design decisions Kafka and Spark servers collocated A distinct Spark task for each Kafka partition, usually reading from local file system cache Individual Spark applications launched periodically Apps logging current (Kafka partition) offsets in ZooKeeper Data of failed batch simply discarded (other strategies possible) Spark running in fine-grained mode on Mesos Allow the next batch to start even if the current one is late Performance Input (log collection) bottlenecked by network bandwidth (1GbE) Processing (log aggregation) bottlenecked by CPUs Easily scaled to several seconds 8 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

(Semi) Real-Time Log Aggregation: Spark Streaming Log Collectors Kafka Cluster Spark Cluster RDBMS Implications Better streaming framework support Complex (e.g., stateful) analysis, fault-tolerance, etc. Kafka & Spark not collocated DStream retrieves logs in background (over network) and caches blocks in memory Memory tuning to reduce GC is critical spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) Storage level (MEMORY_ONLY_SER2) Lowe latency (several seconds) No startup overhead (reusing SparkContext) 9 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Real-World Use Case #2 Complex, Interactive OLAP / BI Significant speedup of ad-hoc, complex OLAP / BI Spark/Shark cluster runs alongside Hadoop/Hive clusters Directly query on Hadoop/Hive data (interactively) No ETL, no storage overhead, etc. In-memory, real-time queries Data of interest loaded into memory create table XYZ tblproperties ("shark.cache" = "true") as select... Typically frontended by lightweight UIs Data visualization and exploration (e.g., drill down / up) 10

Time Series Analysis: Unique Event Occurrence Computing unique event occurrence across the time range Input time series <TimeStamp, ObjectId, EventId,...> E.g., watch of a particular video by a specific user E.g., transactions of a particular stock by a specific account Output: <ObjectId, TimeRange, Unique Event#, Unique Event( 2)#,, Unique Event( n)#> E.g., accumulated unique event# for each day in a week (staring from Monday) 1 2 7 Implementation 2-level aggregation using Spark Specialized partitioning, general execution graph, in-memory cached data Speedup from 20+ hours to several minutes 11 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Time Series Analysis: Unique Event Occurrence Input time series TimeStamp ObjectId EventId Partitioned by (ObjectId, EventId) Aggregation (shuffle) Potentially cached in mem ObjectId EventId Day (1, 2,,7) 1 ObjectId EventId Day (1, 2,,7) Count 12

Time Series Analysis: Unique Event Occurrence Input time series TimeStamp ObjectId EventId Partitioned by (ObjectId, EventId) Aggregation (shuffle) Potentially cached in mem Day Day ObjectId EventId 1 ObjectId EventId Count (1, 2,,7) (1, 2,,7) ObjectId EventId TimeRage [1, D] Accumulated Count ObjectId TimeRage [1, D] 1 1 (if count 2) 0 otherwise 1 (if count n) 0 otherwise Aggregation (shuffle) ObjectId TimeRage [1, D] Unique Event# Unique Event( 2)# Unique Event( n)# Partitioned by ObjectId 13

Real-World Use Case #3 Complex Machine Learning & Graph Analysis Algorithm: complex math operations Mostly matrix based Multiplication, factorization, etc. Sometime graph-based E.g., sparse matrix Iterative computations Matrix (graph) cached in memory across iterations 14

Graph Analysis: N-Degree Association N-degree association in the graph Computing associations between two vertices that are n-hop away E.g., friends of friend Graph-parallel implementation Bagel (Pregel on Spark) and GraphX Memory optimizations for efficient graph caching critical Speedup from 20+ minutes to <2 minutes Weight 1 (u, v) = edge(u, v) (0, 1) Weight n (u, v) = Weight n 1( u, x) Weight 1 (x, v) x v 15 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) 16

Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) u Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w 17

Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u Weight(u, x) = D(x, u) D x,u in Messages State[u] = list of Weight(x, u) (for current top K weights to vertex w) Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) u Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w v w 18

Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u Weight(u, x) = D(x, u) D x,u in Messages State[u] = list of Weight(x, u) (for current top K weights to vertex w) Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) u Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w v w 19

Contributions by Intel Netty based shuffle for Spark FairScheduler for Spark Spark job log files Metrics system for Spark Configurations system for Spark Spark shell on YARN Spark (standalone mode) integration with security Hadoop Byte code generation for Shark Co-partitioned join in Shark... 20

Summary The Spark stack: lightning-fast analytics over Hadoop data Active communities and early adopters evolving Apache incubator project A growing stack: SQL, streaming, graph-parallel, machine learning, Work with us on next-gen big data analytics using the Spark stack Interactive and in-memory OLAP / BI Complex machine learning & graph analysis Near real-time, streaming processing And many more! 21

Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. All dates provided are subject to change without notice. Intel and Intel logoare trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2013, Intel Corporation. All rights reserved. 22