Big Data for the JVM developer Costin Leau, Elasticsearch @costinl
Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system
Data Landscape
Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Enterprise Data Trends Unstructured data: no predefined model, often doesn't fit in an RDBMS. Pre-aggregated data: computed during data collection (counters, running averages)
Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU
Value of Data Value from data exceeds hardware & software costs. US retail: 60+% increase in net margin possible; 0.5-1.0% annual productivity growth
Big Data Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. A subjective and moving target: big data in many sectors today ranges from tens of TB to multiple PB
(Big) Data Pipeline
Big Data Pipeline (diagram): Real-time streams feed Real-Time Processing (S4, Storm) and Analytics/ETL; a real-time structured database (HBase, GemFire, Cassandra) and Big SQL (Greenplum, AsterData, etc.) sit alongside Batch Processing over unstructured data (HDFS)
Data Pipeline (diagram): Collect, Transform, Ingest; RT Analysis and Batch Analysis; Distribute, Use. Stages: unstructured data in a big data filesystem (HDFS); RT and interactive processing (HBase, Cassandra, Elasticsearch); batch processing (Hadoop); SQL / Big SQL; data grids; data analytics and data presentation
Taming Big Data
JVM as the platform Portable Fast Secure Rich eco-system Massive adoption in the enterprise
JVM as the platform Map Reduce Framework (M/R) Hadoop Distributed File System (HDFS)
Storage - HDFS Distributed Scalable Portable Data Aware Commodity hardware Unstructured Data (HDFS)
Computation Map/Reduce
Counting Words aka Hello World
Computation Map/Reduce

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
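For readers without a cluster handy, the same map, shuffle, and reduce phases can be sketched in plain Java with no Hadoop dependencies. This is only a conceptual illustration; the class and method names are illustrative, not Hadoop APIs:

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    // "map" phase: emit a (word, 1) pair for every token in a line
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "shuffle" + "reduce": group the pairs by key and sum the values
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                    .flatMap(LocalWordCount::map)
                    .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello jvm")));
    }
}
```

The framework's value is that it runs this exact pattern partitioned across many machines, with the shuffle (grouping by key) handled between the map and reduce stages.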
Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper /bin/cat \ -reducer /bin/wc
Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper parseline.py \ -reducer /bin/wc
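Streaming mappers and reducers are simply executables that read lines from stdin and write tab-separated key/value records to stdout, so they can be written in any language, including Java. A minimal tokenizing mapper might look like this (a sketch; the class name is illustrative):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingMapper {
    // Turn one input line into "word<TAB>1" records, the
    // key/value format Hadoop Streaming expects on stdout
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.append(word).append('\t').append(1).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.print(mapLine(line));
        }
    }
}
```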
Cascading Mid-level abstraction on top of M/R Hides M/R plumbing through building blocks Handles process planning and scheduling JVM based (Java, Clojure* and Scala*) * External projects to Cascading
Cascading Counting Words

Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Scalding Counting Words

package com.twitter.scalding.examples
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
Cascalog Counting Words

(ns count-words.core
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

(defmapcatop split [^String sentence]
  (.split sentence "\\s+"))

(defn wordcount-query [src]
  (<- [?word ?count]
      (src ?textline)
      (split ?textline :> ?word)
      (c/count ?count)))
Apache Pig High-level abstraction on top of M/R Procedural ETL scripting language Extensible (Java, Python, Ruby or Groovy)
Apache Pig

input_lines = LOAD '/tmp/books' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words';
Apache Hive SQL-like abstraction on top of M/R Allows basic ETL Extensible (Java, Python, Ruby or Groovy)
Counting Words Hive

-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;
-- split each line into words, then count each word
SELECT word, count(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) ltable AS word
GROUP BY word;
Eco-system Oozie HBase Mahout Spring for Apache Hadoop Kafka Elasticsearch Storm
Big Data Pipeline (diagram, revisited): Real-time streams feed Real-Time Processing (S4, Storm) and Analytics/ETL; a real-time structured database (HBase, GemFire, Cassandra) and Big SQL (Greenplum, AsterData, etc.) sit alongside Batch Processing over unstructured data (HDFS)
Wrapping up
Wrap-up Rich eco-system: a variety of tools/frameworks/solutions, and they all run on the JVM (both good & bad). Be agile: start small and grow organically. Iterate over your design a lot. Focus on the data, not the tools
Hvala! (Thank you!) @costinl