@Scalding. https://github.com/twitter/scalding. Based on talk by Oscar Boykin / Twitter

Similar documents

The Internet of Things and Big Data: Intro

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Other Map-Reduce (ish) Frameworks. William Cohen

HIGH PERFORMANCE BIG DATA ANALYTICS

DANIEL EKLUND UNDERSTANDING BIG DATA AND THE HADOOP TECHNOLOGIES NOVEMBER 2-3, 2015 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

ITG Software Engineering

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Hadoop: The Definitive Guide

Fast Data in the Era of Big Data: Twitter s Real-

Unified Big Data Analytics Pipeline. 连 城

Parquet. Columnar storage for the people

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data for the JVM developer. Costin Leau,

Productivity frameworks in big data image processing computations - creating photographic mosaics with Hadoop and Scalding

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Implement Hadoop jobs to extract business value from large and varied data sets

Shark Installation Guide Week 3 Report. Ankush Arora

Scaling Out With Apache Spark. DTL Meeting Slides based on

I/O Considerations in Big Data Analytics

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

COURSE CONTENT Big Data and Hadoop Training

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Hadoop: The Definitive Guide

Big Data Frameworks: Scala and Spark Tutorial

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Big Data Course Highlights

Advanced Big Data Analytics with R and Hadoop

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Introduction to Spark

Apache Flink Next-gen data analysis. Kostas

Unified Batch & Stream Processing Platform

Unlocking the True Value of Hadoop with Open Data Science

Big Data Analytics! Architectures, Algorithms and Applications! Part #3: Analytics Platform

Real-time Big Data Analytics with Storm

Map-Reduce for Machine Learning on Multicore

Moving From Hadoop to Spark

MapReduce with Apache Hadoop Analysing Big Data

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Case Study : 3 different hadoop cluster deployments

HADOOP. Revised 10/19/2015

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Unified Big Data Processing with Apache Spark. Matei

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Peers Techno log ies Pv t. L td. HADOOP

Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

Introduction to DISC and Hadoop

ANALYTICS CENTER LEARNING PROGRAM

Big Data Pipeline and Analytics Platform

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

Introduc8on to Apache Spark

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Spark and the Big Data Library

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, Seth Ladd

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

How Companies are! Using Spark

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

Architectures for Big Data Analytics A database perspective

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

Integrating Apache Spark with an Enterprise Data Warehouse

Querying MongoDB without programming using FUNQL

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Cloudera Certified Developer for Apache Hadoop

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Big Data Technology Pig: Query Language atop Map-Reduce

Lecture 10 - Functional programming: Hadoop and MapReduce

Big Data Research in Berlin BBDC and Apache Flink

Architectures for massive data management

Hadoop Ecosystem B Y R A H I M A.

Complete Java Classes Hadoop Syllabus Contact No:

Making big data simple with Databricks

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

BIG DATA - HADOOP PROFESSIONAL amron

Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses

Computational Mathematics with Python

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

CSE 373: Data Structure & Algorithms Lecture 25: Programming Languages. Nicki Dell Spring 2014

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations

Big Data Analytics. Lucas Rego Drumond

Transcription:

@Scalding https://github.com/twitter/scalding Based on talk by Oscar Boykin / Twitter

What is Scalding? Why Scala for Map/Reduce? How is it used at Twitter? What s next for Scalding?

Yep, we re counting words: Scalding jobs subclass Job

Yep, we re counting words: Logic is in the constructor

Yep, we re counting words: Functions can be called or defined inline

Scalding Model Source objects read and write data (from HDFS, DBs, MemCache, etc...) Pipes represent the flows of the data in the job. You can think of Pipe as a distributed list.

Yep, we re counting words: Read and Write data through Source objects

Yep, we re counting words: Data is modeled as streams of named Tuples (of objects)

Why Scala The scala language has a lot of built-in features that make domain-specific languages easy to implement. Map/Reduce is already within the functional paradigm. Scala s collection API covers almost all usual use cases.

Word Co-occurrence

Word Co-occurrence We can use standard scala containers

Word Co-occurrence We can do real logic in the mapper without external UDFs.

Word Co-occurrence Generalized plus handles lists/sets/maps and can be customized (implement Monoid[T])

GroupBuilder: enabling parallel reductions groupby takes a function that mutates a GroupBuilder. GroupBuilder adds fields which are reductions of (potentially different) inputs. On the left, we add 7 fields.

scald.rb driver script that compiles the job and runs it locally or transfers and runs remotely.

Most functions in the API have very close analogs in scala.collection.iterable.

Cascading is the java library that handles most of the map/reduce planning for scalding. has years of production use. is used, tested, and optimized by many teams (Concurrent Inc., DSLs in Scala, Clojure, Python @Twitter. Ruby at Etsy). has a (very fast) local mode that runs without Hadoop. flow planner designed to be portable (cascading on Spark? Storm?)

mapreducemap We abstract Cascading s map-side aggregation ability with a function called mapreducemap. If only mapreducemaps are called, map-side aggregation works. If a foldleft is called (which cannot be done map-side), scalding falls back to pushing everything to the reducers.

Most Reductions are mapreducemap

Optimized Joins mapside join is called joinwithtiny. Implements left or inner join with a very small pipe. blockjoin: deals with data skew by replicating the data (useful for walking the Twitter follower graph, where everyone follows Gaga/Bieber/Obama). coming: combine the above to dynamically set replication on a per key basis: only Gaga is replicated, and just the right amount.

Scalding @Twitter Revenue quality team (ads targeting, market insight, click-prediction, trafficquality) uses scalding for all our work. Scala engineers throughout the company use it (i.e. storage, platform). More than 60 in-production scalding jobs, more than 200 ad-hoc jobs. Not our only tool: Pig, PyCascading, Cascalog, Hive are also used.

Example: finding similarity A simple recommendation algorithm is cosine similarity. Represent user-tweet interaction as a vector, then find the users whose vectors point in directions near the user in question. We ve developed a Matrix library on top of scalding to make this easy.

Cosine Similarity Matrices are strongly typed.

Cosine Similarity Col,Row types (Int,Int) can be anything comparable. Strings are useful for text indices.

Cosine Similarity Value (Double) can be anything with a Ring[T] (plus/times)

Cosine Similarity Operator overloading gives intuitive code.

Matrix in foreground, map/reduce behind With this syntax, we can focus on logic, not how to map linear algebra to Hadoop

Example uses: Do random-walks on the following graph. Matrix power iteration until convergence:(m * m * m * m). Dimensionality reduction of follower graph (Matrix product by a lower dimensional projection matrix). Triangle counting: (M*M*M).trace / 3

What is next? Improve the logical flow planning (reorder commuting filters/projections before maps, etc...). Improve Matrix flow planning to narrow the gap to hand optimized code.

That s it. follow and mention: @scalding @posco pull reqs: http://github.com/twitter/scalding