Spark Application Carousel. Spark Summit East 2015




About Today's Talk. About Me: Vida Ha - Solutions Engineer at Databricks. Goal: For beginning/early-intermediate Spark developers. Motivate you to start writing more apps in Spark. Share some tips I've learned along the way. 2

Today's Applications Covered: Web Log Analysis - Basic Data Pipeline; Wikipedia Dataset - Machine Learning; Facebook API - Graph Algorithms. 3

Application 1: Web Log Analysis 4

Web Logs Why? Most organizations have web log data. Dataset is too expensive to store in a database. Awesome, easy way to learn Spark! What? Standard Apache Access Logs. Web logs flow in each day from a web server. 5

Reading in Log Files

access_logs = (sc.textFile(dbfs_sample_logs_folder)
               # Call parse_apache_log_line on each line.
               .map(parse_apache_log_line)
               # Cache the objects in memory.
               .cache())

# Call an action on the RDD to actually populate the cache.
access_logs.count()
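The code above assumes a parse_apache_log_line helper that the slides never show. A minimal sketch of what such a parser might look like, assuming standard Apache Common Log Format and illustrative field names (this is not the talk's actual code):

import re
from pyspark.sql import Row

# Matches Apache Common Log Format, e.g.:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
APACHE_LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)')

def parse_apache_log_line(line):
    match = APACHE_LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("Could not parse log line: %s" % line)
    return Row(ipAddress=match.group(1),
               endpoint=match.group(6),
               responseCode=int(match.group(8)),
               # A '-' in the size field means no content was returned.
               contentSize=0 if match.group(9) == '-' else int(match.group(9)))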

Calculate Content Size Stats

content_sizes = (access_logs
                 .map(lambda row: row.contentSize)
                 # Cache, since we run multiple queries over this RDD.
                 .cache())

average_content_size = content_sizes.reduce(lambda x, y: x + y) / content_sizes.count()
min_content_size = content_sizes.min()
max_content_size = content_sizes.max()

Frequent IP Addresses - Key/Value Pairs

ip_addresses_rdd = (access_logs
                    .map(lambda log: (log.ipAddress, 1))
                    .reduceByKey(lambda x, y: x + y)
                    .filter(lambda s: s[1] > n)
                    .map(lambda s: Row(ip_address=s[0], count=s[1])))

# Alternately, could just collect() the values.
sqlContext.createDataFrame(ip_addresses_rdd).registerTempTable("ip_addresses")

Other Statistics to Compute: Response Code Count. Top Endpoints & Distribution. ...and more. A great way to learn various Spark transformations and actions and how to chain them together. * BUT Spark SQL makes this much easier! 9
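As one example of chaining a transformation with an action, a response code count could look like the sketch below (it assumes the responseCode field from the hypothetical parser above):

# Count how many requests returned each HTTP response code.
response_code_counts = (access_logs
                        .map(lambda log: (log.responseCode, 1))
                        .reduceByKey(lambda x, y: x + y)
                        .collect())

for code, count in sorted(response_code_counts):
    print("%d: %d" % (code, count))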

Better: Register Logs as a Spark SQL Table

sqlContext.sql("""
  CREATE EXTERNAL TABLE access_logs (
    ipaddress STRING,
    contentsize INT
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = '^(\\S+) (\\S+) (\\S+)'
  )
  LOCATION '/tmp/sample_logs'
""")

Content Sizes with Spark SQL

sqlContext.sql("""
  SELECT
    (SUM(contentsize) / COUNT(*)),  -- Average
    MIN(contentsize),
    MAX(contentsize)
  FROM access_logs
""")

Frequent IP Addresses with Spark SQL

sqlContext.sql("""
  SELECT ipaddress, COUNT(*) AS total
  FROM access_logs
  GROUP BY ipaddress
  HAVING total > N
""")

Tip: Use Partitioning. Only analyze files from the days you care about.

sqlContext.sql("""
  ALTER TABLE access_logs
  ADD PARTITION (date='20150318')
  LOCATION '/logs/2015/3/18'
""")

If your data rolls between days, perhaps those few missed logs don't matter. 13

Tip: Define Last-N-Days Tables for Caching. Create another table with a similar format. Only register partitions for the last N days. Each night: uncache the table, update the partition definitions, and recache:

sqlContext.sql("CACHE TABLE access_logs_last_7_days")
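A sketch of that nightly refresh, assuming the table name and log locations used earlier on this page (the talk does not show this code):

from datetime import date, timedelta

def refresh_last_7_days(sqlContext):
    # Drop the cached copy before changing the partition definitions.
    sqlContext.sql("UNCACHE TABLE access_logs_last_7_days")
    # Re-register partitions for the trailing week; stale partitions would be
    # dropped in the same pass (paths follow the layout assumed above).
    for days_ago in range(7):
        day = date.today() - timedelta(days=days_ago)
        sqlContext.sql(
            "ALTER TABLE access_logs_last_7_days "
            "ADD IF NOT EXISTS PARTITION (date='%s') "
            "LOCATION '/logs/%d/%d/%d'"
            % (day.strftime('%Y%m%d'), day.year, day.month, day.day))
    # Recache so interactive queries stay fast.
    sqlContext.sql("CACHE TABLE access_logs_last_7_days")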

Tip: Monitor the Pipeline with Spark SQL. Detect if your batch jobs are taking too long. Programmatically create a temp table with stats from one run.

sqlContext.sql("CREATE TABLE IF NOT EXISTS pipelineStats (runStart INT, runDuration INT)")
sqlContext.sql("INSERT INTO TABLE pipelineStats SELECT runStart, runDuration FROM oneRun LIMIT 1")

Coalesce the table from time to time. 15
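With those stats accumulating, a simple query can flag slow runs; the threshold and the assumption that runDuration is in seconds are illustrative:

# Flag any run that took longer than an hour (assuming runDuration is in seconds).
slow_runs = sqlContext.sql("""
  SELECT runStart, runDuration
  FROM pipelineStats
  WHERE runDuration > 3600
  ORDER BY runStart DESC
""").collect()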

Demo: Web Log Analysis 16

Application: Wikipedia 17

Tip: Use Spark to Parallelize Downloading. Wikipedia can be downloaded as one giant file, or you can download the 27 parts.

val articlesRDD = sc.parallelize(articlesToRetrieve.toList, 4)
val retrieveInPartitions = (iter: Iterator[String]) => {
  iter.map(article => retrieveArticleAndWriteToS3(article))
}
val fetchedArticles = articlesRDD.mapPartitions(retrieveInPartitions).collect()

Processing XML Data. Excessively large (> 1 GB) compressed XML data is hard to process: it is not easily splittable. Solution: break it into text files where there is one XML element per line. 19
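A minimal PySpark sketch of working with such a file, assuming one complete XML element per line and an illustrative path and field name:

import xml.etree.ElementTree as ET

# Each line holds one complete <page>...</page> element (assumed layout).
pages = sc.textFile("/tmp/wikipedia_pages_one_per_line.txt")

def extract_title(line):
    return ET.fromstring(line).findtext("title")

# Now the dataset is splittable and each element can be parsed independently.
pages.map(extract_title).take(5)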

ETL-ing Your Data with Spark. Use an XML parser to pull out fields of interest in the XML document. Save the data in Parquet file format for faster querying. Register the Parquet files as a Spark SQL table, since there is a clearly defined schema. 20
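One way that ETL step might look in PySpark; the schema, paths, and XML field names are assumptions for illustration, not the talk's code:

import xml.etree.ElementTree as ET
from pyspark.sql import Row

pages = sc.textFile("/tmp/wikipedia_pages_one_per_line.txt")

# Pull the fields of interest out of each XML element.
articles = pages.map(lambda line: ET.fromstring(line)).map(
    lambda page: Row(title=page.findtext("title"),
                     text=page.findtext("revision/text")))

# Save as Parquet for faster, columnar querying.
sqlContext.createDataFrame(articles).write.parquet("/tmp/wikipedia_articles.parquet")

# Register the Parquet files as a Spark SQL table, since the schema is well defined.
sqlContext.read.parquet("/tmp/wikipedia_articles.parquet").registerTempTable("wikipedia")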

Using Spark for Fast Data Exploration CACHE the dataset for faster querying. Interactive programming experience. Use a mix of Python or Scala combined with SQL to analyze the dataset. 21
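For example, continuing with the hypothetical wikipedia table registered above, caching and mixing SQL with Python could look like:

# Keep the table in memory so repeated exploratory queries are fast.
sqlContext.cacheTable("wikipedia")

# Mix SQL with Python: pull a small result into the driver and keep digging.
longest = sqlContext.sql("""
  SELECT title, LENGTH(text) AS chars
  FROM wikipedia
  ORDER BY chars DESC
  LIMIT 10
""").collect()

for row in longest:
    print("%s: %d" % (row.title, row.chars))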

Tip: Use MLlib to Learn from the Dataset. Wikipedia articles are a rich set of data for the English language. Word2Vec is a simple algorithm for learning synonyms and can be applied to the Wikipedia articles. Try out your favorite ML/NLP algorithms! 22
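A hedged sketch of running MLlib's Word2Vec over the article text; the tokenization is deliberately naive and the wikipedia table is the hypothetical one registered earlier:

from pyspark.mllib.feature import Word2Vec

# Very naive tokenization: lowercase and split on whitespace.
corpus = (sqlContext.sql("SELECT text FROM wikipedia")
          .rdd.map(lambda row: row.text.lower().split()))

model = Word2Vec().setVectorSize(100).setMinCount(5).fit(corpus)

# Ask the model for words that appear in similar contexts.
for word, similarity in model.findSynonyms("computer", 5):
    print("%s: %f" % (word, similarity))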

Demo: Wikipedia App 23

Application: Facebook API 24

Tip: Use Spark to Scrape Facebook Data. Use Spark to make requests to the Facebook API for friends of friends in parallel. NOTE: The latest Facebook API will only show friends that have also enabled the app. If you build a Facebook app and get more users to accept it, you can build a more complete picture of the social graph! 25
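A rough sketch of fanning those requests out with Spark; the Graph API endpoint, API version, and token handling here are assumptions to be checked against Facebook's documentation:

import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # supplied by your Facebook app

def fetch_friends(user_id):
    # One Graph API request per user; returns (user_id, friend_id) edges.
    resp = requests.get(
        "https://graph.facebook.com/v2.2/%s/friends" % user_id,
        params={"access_token": ACCESS_TOKEN})
    return [(user_id, friend["id"]) for friend in resp.json().get("data", [])]

seed_friend_ids = ["123", "456"]  # placeholder ids taken from my own friends list

# Distribute the ids so each partition makes its requests in parallel.
edges = (sc.parallelize(seed_friend_ids, 8)
         .flatMap(fetch_friends)
         .distinct()
         .collect())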

Tip: Use GraphX to Learn on the Data. Use the PageRank algorithm to determine who's the most popular**. Output user data: Facebook user id to name. Output edges: user id to user id. ** In this case it's my friends, so I'm clearly the most popular. 26

Demo: Facebook API 27

Conclusion I hope this talk has inspired you to want to write Spark applications on your favorite dataset. Hacking (and making mistakes) is the best way to learn. If you want to walk through some examples, see the Databricks Spark Reference Applications: https://github.com/databricks/referenceapps 28

THE END