Apache Spark and the future of big data applications. Eric Baldeschwieler

Who is Eric14?
- Big data veteran (since 1996)
- Databricks Tech Advisor
- Twitter handle: @jeric14
- Previously:
  - CTO/CEO of Hortonworks
  - Yahoo - VP Hadoop Engineering
  - Inktomi Web Search

Last Spark Summit
- IMO Apache Spark is the most exciting thing happening in big data today
- The potential:
  - Spark - the lingua franca for data science
  - Spark and Hadoop - great together
- So how are we doing?

Spark is now bundled with Hadoop
- All major Hadoop distributions include Spark
- Other big data solutions too!

Spark is in use for data science

Great progress! Time to declare victory? Nope, we're only getting started.

Making Spark great for Data Science

Increase focus on ETL
- Science needs data, in the right place and format
  - Teams are porting Hadoop ETL languages to Spark
  - Better job scheduling tools
- ETL workloads are different
  - Scale and throughput
  - Spark 1.0 a big step forward for these workloads
  - 1000 node Spark clusters
  - Petabyte scale jobs
- Let's build benchmarks and iterate
  - https://github.com/cascading/copa/wiki

More stuff
- R bindings!!
- Add features to ease/accelerate code sharing
- SparkSQL needs to be extended to run against more data stores, including object stores
- Deep learning and other algo support
  - Trade off completeness for speed
  - More communication primitives?
- Developer basics
  - Profiling & debugging, error reporting & logging

Data science drives new applications (some history)

Hadoop at Yahoo! Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

Case study: Yahoo! homepage
- Science Hadoop cluster: machine learning on user behavior to build ever better categorization models (weekly)
- Production Hadoop cluster: identify user interests using the categorization models and produce serving maps of users to interests (every 5 minutes)
- Serving systems: build customized home pages with the latest data for engaged users (thousands / second)
Yahoo 2011

Big data application model
- Interactive layer
  - Web & App Servers (ApacheD, Tomcat ...)
  - Serving Store (Cassandra, MySQL ...)
- Streaming layer
  - Message Bus (Kafka ...)
  - Streaming Engine (Spark, Storm ...)
- Batch layer
  - Batch Compute Framework (Spark, MapReduce ...)
  - Batch Storage (HDFS, S3 ...)
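The slide only lists the components in each layer. As a rough illustration of one common way the three layers can cooperate, here is a minimal plain-Python sketch. Everything in it (the page-view events, `batch_view`, `StreamingView`, `serve`) is a hypothetical stand-in and uses no Spark, Kafka, or Cassandra APIs: the batch layer recomputes a view from full history, the streaming layer keeps incremental counts for recent events, and the interactive layer answers queries by merging the two.

```python
from collections import Counter


def batch_view(events):
    """Batch layer (assumed): recompute counts from the full event history."""
    return Counter(event["page"] for event in events)


class StreamingView:
    """Streaming layer (assumed): incremental counts for events newer than the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["page"]] += 1


def serve(page, batch, streaming):
    """Interactive layer (assumed): answer a query by merging the batch and streaming views."""
    return batch[page] + streaming.counts[page]


# Hypothetical usage: two historical views of "home", then one more arrives on the stream.
history = [{"page": "home"}, {"page": "news"}, {"page": "home"}]
batch = batch_view(history)
stream = StreamingView()
stream.on_event({"page": "home"})
print(serve("home", batch, stream))
```

In a real deployment each function would be a separate system (for example, a batch job over HDFS or S3, a streaming job reading from a message bus, and a serving store behind web servers); the sketch only shows the division of labor.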

Spark applications today

3rd party applications: Spark Apps, Spark Distros

Classes of applications
- Custom solutions
  - Internal apps
- Enterprise data tooling
  - ETL, BI/Query
- Data science tooling
  - Analytics & ML
  - Collaboration & Reporting
- Vertical specific applications
  - financial, healthcare, IC marketing, retail, gaming
(Diagram: interactive layer with Web & App Servers and Serving Store; stream layer with Message Bus and Streaming Engine; batch layer with Batch Compute Framework and Batch Storage.)

Why so much activity?
- Spark's speed allows compelling interactivity
- Interactive API eases development
- Spark runs well in many environments
  - Cloud, Hadoop/YARN, Cassandra, ...
- Broad open source community support

Improving Spark for Applications

Build an Open Certification Suite
- Goals of certification for an application
  - Write an app once and run anywhere
  - Apps continue running after platform upgrades
- Value of an open suite to user community
  - Contributors can validate their specific requirements
  - Drives widest possible ecosystem for applications
- Open source community does the right thing
  - Backwards compatibility has been a challenge in the big data ecosystem
  - Sharing code that defines correct compatibility a win
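To make "sharing code that defines correct compatibility" concrete, here is a minimal plain-Python sketch of what a shared certification check could look like. The `word_count` platform interface, the two checks, and both example platforms are hypothetical and exist only for illustration; the talk does not describe an actual suite or its API. The idea shown is simply that the same checks run unchanged against every platform or version.

```python
def check_word_count(platform):
    """The platform must count words across lines."""
    return platform.word_count(["a b", "b"]) == {"a": 1, "b": 2}


def check_empty_input(platform):
    """The platform must return an empty result for empty input."""
    return platform.word_count([]) == {}


# The shared suite: contributors add a check for each requirement they care about.
CHECKS = [check_word_count, check_empty_input]


def certify(platform):
    """Run every check against a platform; a raised exception counts as a failure."""
    results = {}
    for check in CHECKS:
        try:
            results[check.__name__] = bool(check(platform))
        except Exception:
            results[check.__name__] = False
    return results


class ReferencePlatform:
    """Hypothetical platform that behaves as the checks expect."""

    def word_count(self, lines):
        counts = {}
        for line in lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts


class BrokenUpgrade(ReferencePlatform):
    """Hypothetical upgraded platform that regressed on empty input."""

    def word_count(self, lines):
        if not lines:
            raise ValueError("empty input is no longer accepted")
        return super().word_count(lines)


print(certify(ReferencePlatform()))
print(certify(BrokenUpgrade()))
```

Running the same `CHECKS` before and after an upgrade is what would catch the regression in `BrokenUpgrade` before an application does.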

Tachyon / More storage APIs
- Better multi-tenancy for Spark
  - Keep objects in RAM between jobs / users
  - Have a system wide view of RAM requirements
- Provide storage portability
  - Spark can run against S3, HDFS, Cassandra
  - Can we make this transparent to applications?
- Better use of tiered storage
  - RAM, SSD and disk
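As a rough illustration of the tiered storage idea (RAM, SSD and disk), here is a minimal plain-Python sketch. It is not Tachyon's API: `Tier` and `TieredStore` are hypothetical names, each tier is just an in-memory dictionary with an assumed item capacity, and the policy (read promotes to the fastest tier, least recently used item is demoted one tier down) is one simple choice among many.

```python
from collections import OrderedDict


class Tier:
    """One storage tier with a fixed item capacity (hypothetical stand-in for RAM, SSD, or disk)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used item first


class TieredStore:
    """Keeps hot items in the fastest tier and demotes cold items to slower tiers."""

    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest first

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the item is dropped
        tier = self.tiers[level]
        tier.items[key] = value
        tier.items.move_to_end(key)
        if len(tier.items) > tier.capacity:
            # Demote the least recently used item to the next slower tier.
            old_key, old_value = tier.items.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers:
            tier.items.pop(key, None)  # remove any stale copy
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.tiers:
            if key in tier.items:
                value = tier.items.pop(key)
                self._insert(0, key, value)  # promote on read
                return value
        raise KeyError(key)

    def where(self, key):
        """Return the name of the tier currently holding the key, or None."""
        for tier in self.tiers:
            if key in tier.items:
                return tier.name
        return None


# Hypothetical usage with tiny capacities so demotion is easy to see.
store = TieredStore([Tier("ram", 1), Tier("ssd", 1), Tier("disk", 10)])
store.put("a", 1)
store.put("b", 2)
print(store.where("a"), store.where("b"))
```

The application only calls `put` and `get`; which tier holds an item is the store's concern, which is the kind of transparency the slide asks for.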

IMO @ApacheSpark is the most exciting thing happening in big data today. Enjoy the show! - @jeric14