Big Data for the JVM developer Costin Leau, Elasticsearch @costinl
Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system
Data Landscape
Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Enterprise Data Trends Unstructured data: no predefined model, often doesn't fit in an RDBMS. Pre-aggregated data: computed during data collection (counters, running averages)
Cost Trends Big Iron: $40k/CPU Hardware cost halving every 18 months Commodity Cluster: $1k/CPU
Value of Data Value from data exceeds hardware & software costs. US retail: 60+% increase in net margin possible; 0.5-1.0% annual productivity growth
Big Data Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. A subjective and moving target: big data in many sectors today ranges from tens of TB to multiple PB
(Big) Data Pipeline
Big Data Pipeline (diagram): Real-time streams feed Real-Time Processing (S4, Storm) and Analytics/ETL; a real-time structured database (HBase, GemFire, Cassandra) and Big SQL (Greenplum, AsterData, etc.) sit alongside Batch Processing over unstructured data (HDFS)
Data Pipeline (diagram): Collect, Transform, Ingest; RT Analysis and Batch Analysis; Distribute, Use. Stages: unstructured data in a big data filesystem (HDFS); RT and interactive processing (HBase, Cassandra, Elasticsearch); batch processing (Hadoop); SQL / Big SQL; data grids; data analytics and data presentation
Taming Big Data
JVM as the platform Portable Fast Secure Rich eco-system Massive adoption in the enterprise
JVM as the platform Map Reduce Framework (M/R) Hadoop Distributed File System (HDFS)
Storage - HDFS Distributed Scalable Portable Data Aware Commodity hardware Unstructured Data (HDFS)
Computation Map/Reduce
Counting Words aka Hello World
Computation Map/Reduce

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
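For readers without a cluster handy, the same map, shuffle, and reduce phases can be sketched in plain Java with no Hadoop dependencies. This is only a conceptual illustration; the class and method names are illustrative, not Hadoop APIs:

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    // "map" phase: emit a (word, 1) pair for every token in a line
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "shuffle" + "reduce": group the pairs by key and sum the values
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                    .flatMap(LocalWordCount::map)
                    .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello jvm")));
    }
}
```

The framework's value is that it runs this exact pattern partitioned across many machines, with the shuffle (grouping by key) handled between the map and reduce stages.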
Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper /bin/cat \ -reducer /bin/wc
Hadoop Streaming $HADOOP_HOME/bin/hadoop jar \ hadoop-streaming.jar \ -input myinputdirs \ -output myoutputdir \ -mapper parseline.py \ -reducer /bin/wc
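Streaming mappers and reducers are simply executables that read lines from stdin and write tab-separated key/value records to stdout, so they can be written in any language, including Java. A minimal tokenizing mapper might look like this (a sketch; the class name is illustrative):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingMapper {
    // Turn one input line into "word<TAB>1" records, the
    // key/value format Hadoop Streaming expects on stdout
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.append(word).append('\t').append(1).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.print(mapLine(line));
        }
    }
}
```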
Cascading Mid-level abstraction on top of M/R Hides M/R plumbing through building blocks Handles process planning and scheduling JVM based (Java, Clojure* and Scala*) * External projects to Cascading
Cascading Counting Words

Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Scalding Counting Words

package com.twitter.scalding.examples
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
Cascalog Counting Words

(ns count-words.core
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

(defmapcatop split [^String sentence]
  (.split sentence "\\s+"))

(defn wordcount-query [src]
  (<- [?word ?count]
      (src ?textline)
      (split ?textline :> ?word)
      (c/count ?count)))
Apache Pig High-level abstraction on top of M/R Procedural ETL scripting language Extensible (Java, Python, Ruby or Groovy)
Apache Pig

input_lines = LOAD '/tmp/books' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words';
Apache Hive SQL-like abstraction on top of M/R Allows basic ETL Extensible (Java, Python, Ruby or Groovy)
Counting Words Hive

-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;
-- split each line into words, then count each word
SELECT word, count(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) ltable AS word
GROUP BY word;
Eco-system Oozie HBase Mahout Spring for Apache Hadoop Kafka Elasticsearch Storm
Big Data Pipeline (diagram, revisited): Real-time streams feed Real-Time Processing (S4, Storm) and Analytics/ETL; a real-time structured database (HBase, GemFire, Cassandra) and Big SQL (Greenplum, AsterData, etc.) sit alongside Batch Processing over unstructured data (HDFS)
Wrapping up
Wrap-up Rich eco-system: a variety of tools/frameworks/solutions, and they all run on the JVM (both good & bad). Be agile: start small and grow organically. Iterate over your design a lot. Focus on the data, not the tools
Hvala! (Thank you!) @costinl