How Companies are! Using Spark

Similar documents

How To Create A Data Visualization With Apache Spark And Zeppelin

Unified Big Data Processing with Apache Spark. Matei

Beyond Hadoop with Apache Spark and BDAS

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Ali Ghodsi Head of PM and Engineering Databricks

Spark: Making Big Data Interactive & Real-Time

Unified Big Data Analytics Pipeline. 连城

How To Handle Big Data With A Data Scientist

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Moving From Hadoop to Spark

Analytics on Spark &

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Conquering Big Data with BDAS (Berkeley Data Analytics)

Spark and the Big Data Library

Hadoop Ecosystem B Y R A H I M A.

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

HDP Hadoop From concept to deployment.

Scaling Out With Apache Spark. DTL Meeting Slides based on

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

The Internet of Things and Big Data: Intro

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Hadoop in the Enterprise

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Why Spark on Hadoop Matters

Dell In-Memory Appliance for Cloudera Enterprise

Upcoming Announcements

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Native Connectivity to Big Data Sources in MSTR 10

KNIME & Avira, or how I ve learned to love Big Data

Comprehensive Analytics on the Hortonworks Data Platform

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Large scale processing using Hadoop. Ján Vaňo

Hybrid Software Architectures for Big

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Big Data Analytics Hadoop and Spark

Workshop on Hadoop with Big Data

Apache Flink Next-gen data analysis. Kostas

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Brave New World: Hadoop vs. Spark

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Shark Installation Guide Week 3 Report. Ankush Arora

Databricks. A Primer

CSE-E5430 Scalable Cloud Computing Lecture 11

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Databricks. A Primer

From Spark to Ignition:

Sujee Maniyam, ElephantScale

Chase Wu New Jersey Ins0tute of Technology

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Architectures for massive data management

Case Study : 3 different hadoop cluster deployments

Hadoop. Sunday, November 25, 12

BIG DATA TRENDS AND TECHNOLOGIES

A Brief Introduction to Apache Tez

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data on Microsoft Platform

the missing log collector Treasure Data, Inc. Muga Nishizawa

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Big Data for Investment Research Management

Big Data Explained. An introduction to Big Data Science.

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

YARN Apache Hadoop Next Generation Compute Platform

Hadoop Big Data for Processing Data and Performing Workload

Hadoop & Spark Using Amazon EMR

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Interactive data analytics drive insights

Managing large clusters resources

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

How To Scale Out Of A Nosql Database

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Understanding the Value of In-Memory in the IT Landscape

Hadoop implementation of MapReduce computational model. Ján Vaňo

Self-service BI for big data applications using Apache Drill

Big Data and Industrial Internet

Berkeley Data Analytics Stack:! Experience and Lesson Learned

Information Builders Mission & Value Proposition

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

The Future of Data Management

White Paper: What You Need To Know About Hadoop

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

GS Big Data Platform

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

HDP Enabling the Modern Data Architecture

Cost-Effective Business Intelligence with Red Hat and Open Source

Transcription:

How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia

History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made it 10-20x cheaper to store large datasets Broadly available from multiple vendors

Implication Big data storage is becoming commoditized, so how will organizations get an edge? What matters now is what you can do with the data.

Two Factors Speed: how quickly can you go from data to decisions? Sophistication: can you run the best algorithms on the data? These factors have usually required separate,! non-commodity tools

Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Response time (s) 100 80 60 40 20 0 SQL performance 90 Hive Spark (disk) Spark (RAM) 18 1.1

Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms Shark SQL Spark Streaming! real-time MLlib machine learning GraphX graph Apache Spark

Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms 150! Contributors in past year Fully open source: one of most! active projects in big data 100! 50! Giraph! Storm! Tez! 0!

Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms 150! Contributors in past year Fully open source: one of most! active projects in big data 100! 50! commodity Hadoop 0! clusters Giraph! Spark brings top-end data analysis to! Storm! Tez!

Spark Use Cases

1. Yahoo! Personalization Yahoo! properties are highly personalized to maximize relevance Reaction must be fast, as stories, etc change in time Best algorithms are highly sophisticated

1. Yahoo! Personalization Example challenge: relevance of news stories!!! Relevance models must be updated throughout the day

1. Yahoo! Personalization Spark at Yahoo!» Runs in Hadoop YARN to use existing data & clusters Result: pilot for stream ads» 120 lines in Scala, compared to 15K in C++» 30 min to run on 100 million samples Hadoop:! Batch Processing Spark: Iterative Processing YARN: Resource Manager Storage: HDFS, HBase, etc Major contributor on YARN! support, scalability, operations

2. Yahoo! Ad Analytics Yahoo! Ads wanted interactive BI on terabytes of data Chose Shark (Hive on Spark) to provide this through standard Hive server API + Tableau Result: interactive-speed queries! on terabytes from Tableau Major contributor on columnar compression, statistics, JDBC Large Hadoop Cluster Hadoop MR! (Pig, Hive, MR) Satellite! Shark Cluster YARN Spark Historical DW (HDFS) Satellite! Shark Cluster

3. Conviva Real-Time Video Optimization Conviva manages 4+ billion video streams per month Dynamically selects sources to optimize quality Time is critical: 1 second buffering = lost viewers

3. Conviva Real-Time Video Optimization Using Spark Streaming, Conviva learns network conditions in real time Results fed directly to video players to optimize streams Spark Node Spark Node Spark Node Spark Node Storage Layer Spark Node System running in production Decision Maker Decision Maker Decision Maker

4. ClearStory Data:! Multi-source, Fast-cycle Analysis Same-day results from data updating at disparate sources Dozens of disparate sources converged in seconds/minutes Data Sources ClearStory Platform ClearStory Application Harmonization Data Inference & Profiling In-Memory Data Units Visualization Collaboration clearstorydata.com-

4. ClearStory Data:! Multi-source, Fast-cycle Analysis

Get Started Download and resources: spark.incubator.apache.org Free video tutorials: spark-summit.org/2013 Commercial support: +

Conclusion Big data will be standard: everyone will have it Organizations will gain an edge through speed of action and sophistication of analysis Apache Spark brings these to Hadoop clusters