Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

1 Big Data for the JVM developer Costin Leau, Elasticsearch @costinl

2 Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system

3 Data Landscape

4 Data Trends

5 Enterprise Data Trends

6 Enterprise Data Trends

7 Enterprise Data Trends

8 Enterprise Data Trends Unstructured data: no predefined model; often doesn't fit in RDBMS. Pre-aggregated data: computed during data collection; counters, running averages.
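Pre-aggregation during collection can be sketched in plain Java. This is only an illustration: the class name `RunningStats` and its methods are hypothetical and do not come from the slides. It shows that a counter and a running average can be updated per event without keeping the raw events.

```java
// Hypothetical sketch: pre-aggregates values as they are collected,
// keeping only a counter and a running average instead of the raw events.
class RunningStats {
    private long count;      // number of values seen so far
    private double average;  // running average of those values

    // Folds one new value into the aggregate using the incremental mean update.
    void add(double value) {
        count++;
        average += (value - average) / count;
    }

    long count() {
        return count;
    }

    double average() {
        return average;
    }

    public static void main(String[] args) {
        RunningStats latency = new RunningStats();
        for (double ms : new double[] {10, 20, 30, 40}) {
            latency.add(ms);
        }
        System.out.println(latency.count() + " events, average " + latency.average());
    }
}
```

The incremental update avoids storing a running sum that could grow without bound; a plain sum divided by the count would work equally well for small inputs.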

9 Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU

10 Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU
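The "halving every 18 months" rule can be made concrete with a small calculation. The sketch below assumes a smooth exponential decline, which is an idealisation the slides do not state; `CostTrend` and `projectedCost` are hypothetical names.

```java
// Hypothetical helper illustrating "hardware cost halving every 18 months".
class CostTrend {
    static final double HALVING_PERIOD_MONTHS = 18.0;

    // Projected cost after the given number of months,
    // assuming a smooth exponential decline (an idealisation).
    static double projectedCost(double initialCost, double months) {
        return initialCost * Math.pow(0.5, months / HALVING_PERIOD_MONTHS);
    }

    public static void main(String[] args) {
        // Start from the $1k/CPU commodity figure and project three halving periods.
        for (int months = 18; months <= 54; months += 18) {
            System.out.printf("after %d months: $%.2f per CPU%n",
                    months, projectedCost(1000.0, months));
        }
    }
}
```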

11 Value of Data Value from Data Exceeds Hardware & Software costs US retail 60+% increase in net margin possible % annual productivity growth

12 Big Data Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. A subjective and moving target. Big data in many sectors today ranges from tens of TB to multiple PB.

13 (Big) Data Pipeline

14 Big Data Pipeline Real Time Streams Real-Time Processing (S4, Storm) Analytics ETL Real Time Structured Database (HBase, GemFire, Cassandra) Big SQL (Greenplum, AsterData, etc.) Batch Processing Unstructured Data (HDFS)

15 Data Pipeline Collect Transform RT Analysis Ingest Batch Analysis Distribute Use Unstructured Data in Big Data Filesystem RT Processing HDFS Collect Interactive Processing HBase Cassandra Elasticsearch Transform SQL BIG SQL Data Grid

16 Data Pipeline Collect Transform RT Analysis Ingest Batch Analysis Distribute Use Data Presentation Data Analytics Interactive Processing Batch Processing (Hadoop) HBase Cassandra Elasticsearch SQL BIG SQL Data Grids Unstructured Data in Big Data Filesystem HDFS

17 Taming Big Data

18 JVM as the platform Portable Fast Secure Rich eco-system Massive adoption in the enterprise

19 JVM as the platform Map Reduce Framework (M/R) Hadoop Distributed File System (HDFS)

20 Storage - HDFS Distributed Scalable Portable Data Aware Commodity hardware Unstructured Data (HDFS)

21 Computation Map/Reduce

22 Counting Words aka Hello World

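23 Computation Map/Reduce

// Each public class goes in its own file, with the imports it needs.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Sums the counts emitted for each word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}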
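What happens between the mapper and the reducer can be illustrated without a cluster. The following is a minimal in-memory sketch, not Hadoop code: the class `LocalWordCount` and its method names are hypothetical, and the shuffle step only imitates the framework's group-by-key behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of the map, shuffle and reduce phases for word count.
// This does not use Hadoop; all names here are hypothetical.
class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs all three phases over the given lines and returns word -> count.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or not to be", "to do")));
    }
}
```

In a real job the framework performs the shuffle across machines; only the map and reduce functions are user code.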

24 Hadoop Streaming

$HADOOP_HOME/bin/hadoop jar \
    hadoop-streaming.jar \
    -input myinputdirs \
    -output myoutputdir \
    -mapper /bin/cat \
    -reducer /bin/wc

25 Hadoop Streaming

$HADOOP_HOME/bin/hadoop jar \
    hadoop-streaming.jar \
    -input myinputdirs \
    -output myoutputdir \
    -mapper parseline.py \
    -reducer /bin/wc
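A streaming mapper is any executable that reads lines from standard input and writes tab-separated key/value lines to standard output. The slides do not show the contents of `parseline.py`, so the sketch below is only a hypothetical stand-in, written in Java, that emits one `word<TAB>1` line per token; the class name `StreamingWordMapper` is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a streaming mapper script:
// reads text lines and writes one "word<TAB>1" line per token.
class StreamingWordMapper {

    static void run(Reader in, PrintStream out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {      // blank lines yield one empty token
                        out.println(word + "\t1");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // As a streaming mapper: stdin in, stdout out.
        run(new InputStreamReader(System.in, StandardCharsets.UTF_8), System.out);
    }
}
```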

26 Cascading Mid-level abstraction on top of M/R Hides M/R plumbing through building blocks Handles process planning and scheduling JVM based (Java, Clojure* and Scala*) * External projects to Cascading

27 Cascading Counting Words

// Source: read the input as text, one "line" field per tuple.
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);

// Sink: write (word, count) tuples, replacing any previous output.
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");

// Split each line into words, emitting one "word" tuple per regex match.
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);

// Group by word and count the tuples in each group.
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
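The word regex in the Cascading example is dense, so it may help to see what it matches. The sketch below uses only `java.util.regex`, not Cascading; `WordRegexDemo` is a hypothetical name. It lists the matches of the same pattern in a line, on the assumption that these are the tokens the regex-based function would emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Shows, with java.util.regex only, which tokens the word regex matches.
// This is not Cascading code; the class name is hypothetical.
class WordRegexDemo {
    // A run of non-space characters that starts and ends on a Unicode letter (\pL).
    static final Pattern WORD =
            Pattern.compile("(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)");

    // Returns every match of the word pattern in the given line, in order.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("the quick, brown fox"));
    }
}
```

The lookarounds trim leading and trailing punctuation from each token, so `quick,` is reported as `quick`.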

28 Scalding Counting Words

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Lowercase the text, strip punctuation, and split on whitespace.
  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}

29 Cascalog Counting Words

(ns count-words.core
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

(defmapcatop split [^String sentence]
  (.split sentence "\\s+"))

(defn wordcount-query [src]
  (<- [?word ?count]
      (src ?textline)
      (split ?textline :> ?word)
      (c/count ?count)))

30 Apache Pig High-level abstraction on top of M/R Procedural ETL scripting language Extensible (Java, Python, Ruby or Groovy)

31 Apache Pig

input_lines = LOAD '/tmp/books' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words';
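For JVM developers, the dataflow of the Pig script maps closely onto Java Streams. The following is an in-memory analogue, not Pig: `PigStyleWordCount` is a hypothetical name, whitespace splitting stands in for `TOKENIZE`, and the alphabetical tie-break is an addition that the script does not specify.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory Java Streams analogue of the Pig script's dataflow; not Pig itself.
class PigStyleWordCount {

    static Map<String, Long> orderedWordCount(List<String> inputLines) {
        Map<String, Long> wordCount = inputLines.stream()
                // FLATTEN(TOKENIZE(line)): one word per row (whitespace split assumed)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // FILTER words BY word MATCHES '\\w+'
                .filter(word -> word.matches("\\w+"))
                // GROUP filtered_words BY word, then COUNT per group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // ORDER word_count BY count DESC; ties broken alphabetically (an assumption)
        return wordCount.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(orderedWordCount(List.of("a b a", "b a c !!")));
    }
}
```

The difference in practice is scale: the Streams version runs in one JVM, while Pig compiles the same steps into M/R jobs over HDFS data.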

32 Apache Hive SQL-like abstraction on top of M/R Allows basic ETL Extensible (Java, Python, Ruby or Groovy)

33 Counting Words Hive

-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

-- create a virtual view that splits the lines
SELECT word, count(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) ltable AS word
GROUP BY word;

34 Eco-system Oozie HBase Mahout Spring for Apache Hadoop Kafka Elasticsearch Storm

35 Big Data Pipeline Real Time Streams Real-Time Processing (S4, Storm) Analytics ETL Real Time Structured Database (HBase, GemFire, Cassandra) Big SQL (Greenplum, AsterData, etc.) Batch Processing Unstructured Data (HDFS)

36 Wrapping up

37 Wrap-up Rich eco-system: a variety of tools/frameworks/solutions, and they all run on the JVM (both good & bad). Be agile: start small and grow organically. Iterate over your design a lot. Focus on data, not the tools.
