1 How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia
2 History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made it 10-20x cheaper to store large datasets Broadly available from multiple vendors
3 Implication Big data storage is becoming commoditized, so how will organizations get an edge? What matters now is what you can do with the data.
4 Two Factors Speed: how quickly can you go from data to decisions? Sophistication: can you run the best algorithms on the data? These factors have usually required separate,! non-commodity tools
5 Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Response time (s) SQL performance 90 Hive Spark (disk) Spark (RAM)
6 Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms Shark SQL Spark Streaming! real-time MLlib machine learning GraphX graph Apache Spark
7 Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms 150! Contributors in past year Fully open source: one of most! active projects in big data 100! 50! Giraph! Storm! Tez! 0!
8 Apache Spark A compute engine for Hadoop data that is: Fast: up to 100x faster than MapReduce Sophisticated: can run today s! most advanced algorithms 150! Contributors in past year Fully open source: one of most! active projects in big data 100! 50! commodity Hadoop 0! clusters Giraph! Spark brings top-end data analysis to! Storm! Tez!
9 Spark Use Cases
10 1. Yahoo! Personalization Yahoo! properties are highly personalized to maximize relevance Reaction must be fast, as stories, etc change in time Best algorithms are highly sophisticated
11 1. Yahoo! Personalization Example challenge: relevance of news stories!!! Relevance models must be updated throughout the day
12 1. Yahoo! Personalization Spark at Yahoo!» Runs in Hadoop YARN to use existing data & clusters Result: pilot for stream ads» 120 lines in Scala, compared to 15K in C++» 30 min to run on 100 million samples Hadoop:! Batch Processing Spark: Iterative Processing YARN: Resource Manager Storage: HDFS, HBase, etc Major contributor on YARN! support, scalability, operations
13 2. Yahoo! Ad Analytics Yahoo! Ads wanted interactive BI on terabytes of data Chose Shark (Hive on Spark) to provide this through standard Hive server API + Tableau Result: interactive-speed queries! on terabytes from Tableau Major contributor on columnar compression, statistics, JDBC Large Hadoop Cluster Hadoop MR! (Pig, Hive, MR) Satellite! Shark Cluster YARN Spark Historical DW (HDFS) Satellite! Shark Cluster
14 3. Conviva Real-Time Video Optimization Conviva manages 4+ billion video streams per month Dynamically selects sources to optimize quality Time is critical: 1 second buffering = lost viewers
15 3. Conviva Real-Time Video Optimization Using Spark Streaming, Conviva learns network conditions in real time Results fed directly to video players to optimize streams Spark Node Spark Node Spark Node Spark Node Storage Layer Spark Node System running in production Decision Maker Decision Maker Decision Maker
16 4. ClearStory Data:! Multi-source, Fast-cycle Analysis Same-day results from data updating at disparate sources Dozens of disparate sources converged in seconds/minutes Data Sources ClearStory Platform ClearStory Application Harmonization Data Inference & Profiling In-Memory Data Units Visualization Collaboration clearstorydata.com-
White Paper Intel Reference Architecture Big Data Analytics Predictive Analytics and Interactive Queries on Big Data WRITERS Moty Fania, Principal Engineer, Big Data/Advanced Analytics, Intel IT Parviz
QLIKVIEW AND BIG DATA A QlikView Technology White Paper July 2012 victa.nl firstname.lastname@example.org +31 74 2915208 Introduction There is an incredible amount of interest in the topic of Big Data at present: for many
QLIKVIEW AND BIG DATA: HAVE IT YOUR WAY A QlikView White Paper November 2012 qlikview.com Table of Contents Executive Summary 3 Introduction 3 The Two Sides of Big Data Analytics 3 How Big Data Flows from
white paper Capitalizing on Big Data Analytics for the Insurance Industry A Simplified, Automated Approach to Big Data Applications: StackIQ Enterprise Data Management and Monitoring. Abstract Contents
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
White Paper The Business Analyst s Guide to Hadoop Get Ready, Get Set, and Go: A Three-Step Guide to Implementing Hadoop-based Analytics By Alteryx and Hortonworks (T)here is considerable evidence that
Change the world with data. We ll show you how. strataconf.com Sep 25 27, 2013 Boston, MA Oct 28 30, 2013 New York, NY Nov 11 13, 2013 London, England 2013 O Reilly Media, Inc. O Reilly logo is a registered
White Paper Intel Distribution for Apache Hadoop* Big Data Real-Time Big Data Analytics for the Enterprise SAP HANA* and the Intel Distribution for Apache Hadoop* Software Executive Summary Companies are
PARC and SAP Co-innovation: High-performance Graph Analytics for Big Data Powered by SAP HANA Harnessing the combined power of SAP HANA and PARC s HiperGraph graph analytics technology for real-time insights
BIG DATA TECHNOLOGIES, USE CASES, AND RESEARCH ISSUES Il-Yeol Song, Ph.D. College of Computing & Informatics Drexel University Philadelphia, PA 19104 ACM SAC 2015 April 14, 2015 Salamanca, Spain Source:
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
White Paper Data Warehouse Optimization with Hadoop A Big Data Reference Architecture Using Informatica and Cloudera Technologies This document contains Confidential, Proprietary and Trade Secret Information
WHITE PAPER Intel Distribution for Apache Hadoop* Software Financial Services Big Data Analytics Better Performance for Big Data One of the Largest Banks in Italy Speeds Processing for Big Data Analytics
Using Data Mining and Machine Learning in Retail Omeid Seide Senior Manager, Big Data Solutions Sears Holdings Bharat Prasad Big Data Solution Architect Sears Holdings Over a Century of Innovation A Fortune
WHITEPAPER OPEN MODERN DATA ARCHITECTURE FOR FINANCIAL SERVICES RISK MANAGEMENT A top-tier global bank s end-of-day risk analysis jobs didn t complete in time for the next start of trading day. To solve
Big Data: Challenges and Opportunities Roberto V. Zicari Goethe University Frankfurt This is Big Data. Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos,
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org
The little elephant driving Big Data Despite the funny-sounding name, Hadoop is a serious enterprise software suite that drives Big Data Hadoop enables the storage and processing of very large databases
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
COULD VS. SHOULD: BALANCING BIG DATA AND ANALYTICS TECHNOLOGY The business world is abuzz with the potential of data. In fact, most businesses have so much data that it is difficult for them to process
3 Big Data: Challenges and Opportunities Roberto V. Zicari Contents Introduction... 104 The Story as it is Told from the Business Perspective... 104 The Story as it is Told from the Technology Perspective...