From Distributed Systems to Data Science. William C. Benton Red Hat Emerging Technology


About me At Red Hat: scheduling, configuration management, RPC, Fedora, data engineering, data science. Before Red Hat: programming language research.

HISTORY and MYTHOLOGY

MapReduce (2004)

MapReduce (2004) "a b" "c e" "a b" "d a" "d b"

MapReduce (2004) (a, 1) (b, 1) (c, 1) (e, 1) (a, 1) (b, 1) (d, 1) (a, 1) (d, 1) (b, 1)

MapReduce (2004) (a, 1) (a, 1) (a, 1) (b, 1) (b, 1) (b, 1) (c, 1) (d, 1) (d, 1) (e, 1)

MapReduce (2004) (a, 3) (b, 3) (c, 1) (d, 2) (e, 1)
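To make the data flow on these slides concrete, here is a minimal sketch using ordinary Scala collections. It is an illustration only, not a MapReduce or Hadoop API; the input records are the ones from the slides above.

// Map phase: emit a (word, 1) pair for every word in every record.
val records = Seq("a b", "c e", "a b", "d a", "d b")
val mapped = records.flatMap(line => line.split(" ").map(word => (word, 1)))
// (a,1) (b,1) (c,1) (e,1) (a,1) (b,1) (d,1) (a,1) (d,1) (b,1)

// Shuffle: group the pairs by key, as the framework does between map and reduce.
val grouped = mapped.groupBy { case (word, _) => word }

// Reduce phase: sum the counts for each word.
val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// a -> 3, b -> 3, c -> 1, d -> 2, e -> 1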

Hadoop (2005)

Hadoop (2005) HDFS

Hadoop (2005) Hadoop MapReduce

Hadoop (2005) Hadoop MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emit a (word, 1) pair for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure the job and submit it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Facts (motivated by Hadoop) Data federation is the biggest problem; solve it with distributed storage. You need to be able to scale out to many nodes to handle real-world data analytics problems. Your network and disks probably aren't fast enough.

"at least two analytics production clusters (at Microsoft and Yahoo!) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB." Appuswamy et al., "Nobody ever got fired for buying a cluster." (Microsoft Research Technical Report)

"Contrary to our expectations, CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%." Ousterhout et al., "Making Sense of Performance in Data Analytics Frameworks." (USENIX NSDI '15)

Facts, revised for reality Data federation is absolutely a problem, but you have enormous flexibility as to where you solve it. Most real-world analytics workloads will fit in memory. Your network and disks probably aren't the problem.

DATA PROCESSING with ELASTIC RESOURCES

Instead of...

Instead of all of the Hadoop MapReduce WordCount code above, word count in Spark looks like this:

val file = spark.textFile("file://...")

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("file://...")

Unlike Hadoop, contemporary data frameworks like Apache Spark and Apache Flink provide unified abstractions and rich standard libraries for machine learning, structured query with planning, and graph processing.
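As a minimal, hedged sketch of what one of those richer abstractions looks like (Spark 1.x Scala API; the SparkContext name sc and the placeholder input path are assumptions, not from the talk), the word counts computed above can become a DataFrame and be queried with SQL:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc (e.g., in spark-shell).
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val counts = sc.textFile("file://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Structured query with planning: the RDD becomes a DataFrame with named
// columns, registered as a temporary table and queried through Spark SQL.
val countsDF = counts.toDF("word", "total")
countsDF.registerTempTable("counts")
sqlContext.sql("SELECT word, total FROM counts ORDER BY total DESC LIMIT 10").show()

The same counts could equally feed Spark's machine learning or graph libraries without leaving the framework.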

DATA SCIENCE at RED HAT

Some of our recent projects Analyzing open-source community health (Benton, Spark Summit '15) Customer segmentation with NLP (Nowling, Spark Summit East '16) Understanding benchmark results (Feddema, Apache: Big Data '16) Machine role analysis (Erlandson, Apache: Big Data '16)

...and a not-so-recent one Analyzing endurance-sports data (Benton, Spark Summit '14) [slide shows a map with US 12, Waunakee, Middleton, Verona Road, and Verona]

KEY TAKEAWAYS for your DATA SCIENCE INITIATIVES

When choosing a compute model, consider realistic working-set sizes rather than archive storage volume. (You may need petascale storage, but you probably don't even need terascale compute.)

Many real-world workloads benefit more from scale-up than scale-out. Choose a compute framework that lets you exploit both kinds of scale.

Don't choose a compute model that forces you into unrealistic assumptions about persistent storage or requires dedicated resources.

Make sure that your data lifecycle facilitates generating reproducible results. Archive all unmodified source data and automate your processing and analysis workflows.

Predictive models are great, but predictions are only part of the story. Whenever possible, use human-interpretable models that help you understand something new about your problem.

THANKS! @willb willb@redhat.com https://chapeau.freevariable.com