xPatterns on Spark, Shark, Tachyon and Mesos

xPatterns on Spark, Shark, Tachyon and Mesos
Spark Summit 2014
Claudiu Barbura, Sr. Director of Engineering, Atigeo

Agenda
- xPatterns Architecture
- From Hadoop to BDAS & our contributions
- Lessons learned with Spark: from 0.8.0 to 0.9.1
- Demo: xPatterns APIs and GUIs
- Q & A
- Ingestion (EL)
- Transformation (T)
- Jaws Http SharkServer (warehouse explorer)
- Export to NoSql API (data publishing)
- xPatterns monitoring and instrumentation (Demo)

Hadoop to BDAS
- Hadoop MR -> Spark: core + GraphX
- Hive -> Shark: CLI + SharkServer2 + ... Jaws!
- No resource manager -> Mesos: Spark Job Servers, Jaws, SharkServer2, Hadoop, Aurora
- No cache -> Tachyon: sharing data between contexts, satellite cluster file system, faster for long-running queries, GC friendlier, survives JVM crashes
- Hadoop distro dashboards -> Ganglia: + Nagios & Graphite

BDAS to BDAS++
- Jaws, the xPatterns http Shark server, open sourcing today! http://github.com/atigeo/http-shark-server
- Spark Job Server: multiple contexts in the same JVM, job submission in Java + Scala, Mesos framework starvation bug fixed (detailed Tech Blog link at http://xpatterns.com/sparksummit)
- *SchedulerBackend update, job cancellation in Mesos fine-grained mode, 0.9.0 patches (shuffle spill, Mesos fine-grained)
- Databricks certified!

Spark 0.8.0 to 1.0
- 0.8.0 - first POC, lots of OOM
- 0.8.1 - first production deployment, still lots of OOM
  - 20 billion healthcare records, 200 TB of compressed HDFS data
  - Hadoop MR: 100 m1.xlarge (4c x 15 GB); BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM (map & reducer side)
  - Perf gains of 4x to 40x, required individual dataset and query fine-tuning
  - Mixed Hive & Shark workloads where it made sense
  - Daily processing reduced from 14 hours to 1.5 hours!
- 0.9.0 - fixes many of the problems, but still requires patches! (spill & Mesos fine-grained)
- 1.0 upgrade in progress, Jaws being migrated to Spark SQL
- Tuning: set mapreduce.job.reduces=, set shark.column.compress=true, spark.default.parallelism=, spark.storage.memoryFraction=0.3, spark.shuffle.memoryFraction=0.6, spark.shuffle.consolidateFiles=true, spark.shuffle.spill=false/true (see the configuration sketch below)
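
The sketch below shows one way the Spark-side properties above might be applied programmatically (Spark 0.9/1.0 era SparkConf). The values are the ones from the slide; the app name, master and the parallelism value are placeholders, and this is an assumption about usage rather than the exact xPatterns setup. The Shark/MapReduce settings (shark.column.compress, mapreduce.job.reduces) would be issued as SET statements in the Shark CLI instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Sketch only: applying the slide's tuning values via SparkConf.
    val conf = new SparkConf()
      .setAppName("xpatterns-warehouse-job")        // illustrative name, not from the slide
      .setMaster("local[*]")                        // placeholder; the deck runs on Mesos
      .set("spark.default.parallelism", "400")      // slide leaves the value blank; 400 is a placeholder
      .set("spark.storage.memoryFraction", "0.3")
      .set("spark.shuffle.memoryFraction", "0.6")
      .set("spark.shuffle.consolidateFiles", "true")
      .set("spark.shuffle.spill", "false")          // toggled between false and true per workload
    val sc = new SparkContext(conf)
    // Shark-side settings would be issued as SET statements in the Shark CLI,
    // not through SparkConf.
    sc.stop()
  }
}
```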

Distributed Data Ingestion API & GUI
- Highly available, scalable and resilient distributed download tool exposed through a RESTful API & GUI
- Supports encryption/decryption, compression/decompression, automatic backup & restore (AWS S3) and geo-failover (HDFS and S3 in both us-east and us-west EC2 regions)
- Supports multiple input sources: sftp, S3 and 450+ sources through Talend integration
- Configurable throughput (number of parallel Spark processors, in both fine-grained and coarse-grained Mesos modes)
- File transfer log and file transition state history for auditing purposes (pluggable persistence model, Cassandra/HDFS), configurable alerts, reports
- Ingest + Backup: download + decompression + HDFS persistence + encryption + S3 upload
- Restore: S3 download + decryption + decompression + HDFS persistence
- Geo-failover: backup on S3 us-east + restore from S3 us-east into west-coast HDFS + backup on S3 us-west
- Ingestion jobs can be resumed from any stage after failure (once the # of Spark task retries is exhausted); see the sketch below
- Logs, job status and progress pushed asynchronously to the GUI through web sockets
- Http streaming API exposed for high-throughput push-model ingestion (ingestion into Kafka pub-sub, batch Spark job for transfer into HDFS)
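
To make the stage-resume behavior concrete, here is a minimal, hypothetical sketch of the Ingest + Backup stage sequence and how a job could resume from the first incomplete stage after task retries are exhausted. The stage names mirror the slide; the types and method names are illustrative, not the xPatterns ingestion API.

```scala
object IngestionResumeSketch {
  // Ingest + Backup stages exactly as listed on the slide.
  val ingestAndBackup: Seq[String] =
    Seq("download", "decompression", "hdfs-persistence", "encryption", "s3-upload")

  // Resume from the first stage that has not completed yet.
  def resumeFrom(completed: Set[String]): Seq[String] =
    ingestAndBackup.dropWhile(completed.contains)

  def main(args: Array[String]): Unit = {
    // A job that failed after persisting to HDFS picks up at encryption.
    val remaining = resumeFrom(Set("download", "decompression", "hdfs-persistence"))
    println(remaining.mkString(" -> "))   // prints: encryption -> s3-upload
  }
}
```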

T-Component API & GUI
- Data transformation component for building a data pipeline with monitoring and quality gates
- Exposes all of Oozie's action types and adds Spark (Java & Scala) and Shark (QL) stages
- Uses our own Spark JobServer (multiple Spark contexts in the same JVM!)
- Spark stage is required to run code that accepts an xPatterns-managed Spark context (coarse-grained or fine-grained) as a parameter (see the sketch below)
- DAG and job execution info persisted in the Hive Metastore
- Exposes a full API for job, stage and resource management and for scheduled pipeline execution
- Logs, job status and progress pushed asynchronously to the GUI through web sockets
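
A hypothetical sketch of what a Spark stage could look like given the one requirement stated above: user code receives the xPatterns-managed SparkContext as a parameter and never creates or stops it. The trait name, signature and job body are illustrative assumptions, not the actual xPatterns Spark JobServer contract.

```scala
import org.apache.spark.SparkContext

// Illustrative contract: the pipeline hands a managed (coarse- or fine-grained)
// SparkContext to the stage.
trait XPatternsSparkStage {
  def run(managedContext: SparkContext, params: Map[String, String]): Unit
}

// Example stage: count records in an input path supplied by the pipeline.
class RecordCountStage extends XPatternsSparkStage {
  override def run(managedContext: SparkContext, params: Map[String, String]): Unit = {
    val input = params.getOrElse("input", "hdfs:///data/example")  // placeholder path
    val count = managedContext.textFile(input).count()
    println(s"record count for $input: $count")
  }
}
```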

T-Component DAG executed by Oozie
- Spark and Shark stages executed through ssh actions
- Spark stage sent to the Spark JobServer
- Shark stage executed through the Shark CLI for now (SharkServer2 in the future)
- Support for a PySpark stage coming soon

Jaws REST SharkServer & GUI
- Jaws: a highly scalable and resilient RESTful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries and return persisted results (automatically limited in size or paged), execution logs and job information (Cassandra or HDFS persisted); a client sketch follows below
- Jaws can be load balanced for higher availability and scalability, and it fuels a web-based GUI that is integrated into the xPatterns Management Console (Warehouse Explorer)
- Jaws exposes configuration options for fine-tuning Spark & Shark performance and for running against a stand-alone Spark deployment, with or without Tachyon as an in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager
- The Shark editor provides analysts and data scientists with a view into the warehouse through a metadata explorer, a query editor with intelligent features like auto-complete, a results viewer, a logs viewer and historical queries, asynchronously retrieving persisted results, logs and query information for both running and historical queries
- Web-style pagination and query cancellation; spray.io http layer (REST on Akka)
- Open sourced at the Summit! http://github.com/atigeo/http-shark-server
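
Because Jaws is asynchronous, clients follow a submit-then-poll pattern; the sketch below illustrates it with plain HTTP calls. The host, endpoint paths, parameters and table name are assumptions made up for illustration only; the real contract is documented in the open-sourced repository linked above.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object JawsClientSketch {
  private val jawsUrl = "http://localhost:8181"   // assumed host and port

  private def post(path: String, body: String): String = {
    val conn = new URL(jawsUrl + path).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(body.getBytes("UTF-8"))
    out.close()
    Source.fromInputStream(conn.getInputStream).mkString
  }

  private def get(path: String): String = Source.fromURL(jawsUrl + path).mkString

  def main(args: Array[String]): Unit = {
    // 1. Submit a Shark query asynchronously; the server returns a query id immediately.
    val queryId = post("/run", "SELECT count(*) FROM warehouse.claims")   // illustrative endpoint & table
    // 2. Poll logs/status, then page through the persisted results.
    println(get(s"/logs?queryId=$queryId"))
    println(get(s"/results?queryId=$queryId&offset=0&limit=100"))
  }
}
```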

Jaws REST SharkServer & GUI

Export to NoSql API
- Datasets in the warehouse need to be exposed to high-throughput, low-latency, real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse.
- The Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra column family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring); instrumentation dashboard embedded, with logs, progress and instrumentation events pushed through SSE.
- Data modeling is driven by the read access patterns provided by an application engineer building dashboards and visualizations: lookup key, columns (record fields to read), paging, sorting, filtering (see the configuration sketch below).
- The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-replicated) that uses the underlying generated Cassandra data model and fuels the data in the dashboards.
- Configuration API provided for creating export jobs and executing them (ad hoc or scheduled). Logs, job status and progress pushed asynchronously to the GUI through web sockets.
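
A small, hypothetical sketch of what a read-access-pattern-driven export configuration could capture, following the description above. The field names and example values are illustrative, not the xPatterns Configuration API.

```scala
// Illustrative only: the shape of an export job driven by the application's read patterns.
case class ExportJobConfig(
  sourceTable: String,               // Shark/Hive table (or data mart) in the warehouse
  targetColumnFamily: String,        // Cassandra column family generated by the Exporter
  lookupKey: Seq[String],            // key the application reads by
  columns: Seq[String],              // record fields exposed through the REST endpoint
  sortBy: Option[String] = None,     // drives ordering/paging in the generated data model
  pageSize: Int = 100,               // paging for dashboard reads
  parallelSparkProcessors: Int = 8   // configurable throughput against the Cassandra ring
)

object ExportJobConfigExample extends App {
  val job = ExportJobConfig(
    sourceTable = "warehouse.claims_mart",           // placeholder names throughout
    targetColumnFamily = "claims_by_member",
    lookupKey = Seq("member_id"),
    columns = Seq("claim_id", "provider", "amount"),
    sortBy = Some("claim_date")
  )
  println(job)
}
```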

Mesos/Spark cluster

Cassandra multi-DC ring write latency

Nagios monitoring

Coming soon
- Export to Semantic Search API (SolrCloud/Lucene)
- PySpark Job Server
- PySpark <-> Shark/Tachyon interop (either)
- PySpark <-> Spark SQL (1.0) interop (or)
- Parquet columnar storage for warehouse data

We need your feedback! Be the first to test new features, get updates, and give feedback by signing up at http://xpatterns.com/sparksummit

2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.