Building Big Data Pipelines using OSS Costin Leau Staff Engineer VMware @CostinL
Costin Leau Speaker Bio: Spring committer since 2006; Spring Framework (JPA, @Bean, cache abstraction); Spring OSGi/Dynamic Modules, OSGi Blueprint spec; Spring Data (GemFire, Redis, Hadoop)
Data Landscape
Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Enterprise Data Trends
Enterprise Data Trends: Unstructured data (no predefined model, often doesn't fit well in an RDBMS); Pre-aggregated data (computed during data collection: counters, running averages)
Cost Trends: Big Iron, $40k/CPU. Hardware cost halving every 18 months. Commodity Cluster, $1k/CPU.
The Value of Data: Value from data exceeds hardware & software costs. Value in connecting data sets, e.g. grouping e-commerce users by user agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3
Big Data: Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. A subjective and moving target; big data in many sectors today ranges from tens of TB to multiple PB.
(Big) Data Pipelines
A Holistic View of a Big Data System: ETL, Real-Time Streams, Real-Time Processing (S4, Storm), Real-Time Structured Database (HBase, GemFire, Cassandra), Analytics, Big SQL (Greenplum, Aster Data, etc.), Unstructured Data (HDFS), Batch Processing
Big Data problems == Integration problems. Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use. Real-world big data solutions require workflow across systems; workflow for big data processing is an integration problem and shares the core components of a classic integration workflow. Big data solutions need to integrate with existing data and apps. Event-driven vs. batch workflows. No silver bullet (Michael Stonebraker: "One Size Fits All": An Idea Whose Time Has Come and Gone). Spring projects can provide the foundation for Big Data workflows.
Taming Big Data
Hadoop as a Big Data Platform Map Reduce Framework (MapRed) Hadoop Distributed File System (HDFS)
Spring for Hadoop - Goals: Hadoop has a poor out-of-the-box programming model; applications are generally a collection of scripts calling command-line apps. Spring simplifies developing Hadoop applications by providing a familiar and consistent programming and configuration model, across a wide range of use cases (HDFS usage, data analysis with MR/Pig/Hive/Cascading, workflow, event streams, integration), allowing you to start small and grow.
Relationship with other Spring projects: Spring Batch (on- and off-Hadoop workflows); Spring Integration (event-driven applications, Enterprise Integration Patterns); Spring Data (Redis, MongoDB, Neo4j, GemFire); Spring Framework (web, messaging applications); Spring for Apache Hadoop (simplify Hadoop programming)
Capabilities: Spring + Hadoop. Declarative configuration: create, configure, and parameterize Hadoop connectivity and all job types; environment profiles to easily move from dev to qa to prod. Developer productivity: create well-formed applications, not spaghetti script applications; simplify the HDFS and FsShell API with support for JVM scripting; runner classes for MR/Pig/Hive/Cascading for small workflows; helper Template classes for Pig/Hive/HBase.
Core Hadoop
Core Map Reduce idea
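For readers new to the model, here is a minimal single-JVM sketch of the map / group-by-key / reduce idea in plain Java (an illustration only, not the Hadoop API; the class name and sample data are hypothetical). The Hadoop code on the next slide implements the same shape, but distributed across a cluster.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spring for hadoop", "hadoop map reduce");

        // "map": split each line into (word, 1) pairs
        // "shuffle": group the pairs by key (the word) -- Hadoop does this between the phases
        // "reduce": sum the values of each group
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}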
Counting Words M/R

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Counting Words Configuring M/R

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Running Hadoop Jars (WordCount 1.0)

Vanilla Hadoop:
bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output

SHDP:
<hdp:configuration/>
<hdp:jar-runner id="wordcount" jar="hadoop-examples.jar">
  <hdp:arg value="wordcount"/>
  <hdp:arg value="/wc/input"/>
  <hdp:arg value="/wc/output"/>
</hdp:jar-runner>
Running Hadoop Tools (WordCount 2.0)

Vanilla Hadoop:
bin/hadoop jar -conf myhadoop-site.xml -D ignorecase=true wordcount.jar org.myorg.WordCount /wc/input /wc/output

SHDP:
<hdp:configuration resources="myhadoop-site.xml"/>
<hdp:tool-runner id="wc" jar="wordcount.jar">
  <hdp:arg value="/wc/input"/>
  <hdp:arg value="/wc/output"/>
  ignorecase=true
</hdp:tool-runner>
Configuring Hadoop

applicationContext.xml:
<context:property-placeholder location="hadoop-dev.properties"/>
<hdp:configuration>
  fs.default.name=${hd.fs}
</hdp:configuration>
<hdp:job id="word-count-job"
  input-path="${input.path}" output-path="${output.path}" jar="myjob.jar"
  mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
  reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

hadoop-dev.properties:
input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
Running a Streaming Job

Vanilla Hadoop:
bin/hadoop jar hadoop-streaming.jar \
  -input /wc/input -output /wc/output \
  -mapper /bin/cat -reducer /bin/wc \
  -files stopwords.txt

SHDP:
<context:property-placeholder location="hadoop-${env}.properties"/>
<hdp:streaming id="wc"
  input-path="${input}" output-path="${output}"
  mapper="${cat}" reducer="${wc}"
  files="classpath:stopwords.txt">
</hdp:streaming>

hadoop-dev.properties:
input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

hadoop-qa.properties:
input.path=/gutenberg/input/
output.path=/gutenberg/word/
hd.fs=hdfs://darwin:9000

env=dev  java -jar SpringLauncher.jar applicationContext.xml
env=qa   java -jar SpringLauncher.jar applicationContext.xml
Word Count Injecting Jobs: Use dependency injection to obtain a reference to the Hadoop Job, perform additional runtime configuration, and submit.

public class WordService {
  @Inject
  private Job mapReduceJob;

  public void processWords() {
    mapReduceJob.submit();
  }
}
HDFS and Hadoop Shell as APIs: Has all bin/hadoop fs commands through FsShell (mkdir, chmod, test, ...)

class MyScript {
  @Autowired FsShell fsh;

  @PostConstruct
  void init() {
    String outputDir = "/data/output";
    if (fsh.test(outputDir)) {
      fsh.rmr(outputDir);
    }
  }
}
HDFS and FsShell as APIs: Excellent for JVM scripting

init-files.groovy:
// use the shell (made available under variable fsh)
if (!fsh.test(inputDir)) {
  fsh.mkdir(inputDir);
  fsh.copyFromLocal(sourceFile, inputDir);
  fsh.chmod(700, inputDir)
}
if (fsh.test(outputDir)) {
  fsh.rmr(outputDir)
}
HDFS and FsShell as APIs

appctx.xml:
<hdp:script id="init-script" language="groovy">
  <hdp:property name="inputDir" value="${input}"/>
  <hdp:property name="outputDir" value="${output}"/>
  <hdp:property name="sourceFile" value="${source}"/>
  // use the shell (made available under variable fsh)
  if (!fsh.test(inputDir)) {
    fsh.mkdir(inputDir);
    fsh.copyFromLocal(sourceFile, inputDir);
    fsh.chmod(700, inputDir)
  }
  if (fsh.test(outputDir)) {
    fsh.rmr(outputDir)
  }
</hdp:script>
Counting Words - Pig

input_lines = LOAD '/tmp/books' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words';
Pig

Vanilla Pig:
pig -x mapreduce wordcount.pig
pig wordcount.pig -P pig.properties -p pig.exec.nocombiner=true

SHDP (creates a PigServer, optionally executes the script on startup):
<pig-factory job-name="wc" properties-location="pig.properties">
  pig.exec.nocombiner=true
  <script location="wordcount.pig">
    <arguments>ignorecase=true</arguments>
  </script>
</pig-factory>
PigRunner A small pig workflow

@Scheduled(cron = "0 0 12 * * ?")
public void process() {
  pigRunner.call();
}
PigTemplate - Configuration
PigTemplate Programmatic Use

public class PigPasswordRepository implements PasswordRepository {
  private PigTemplate pigTemplate;
  private String pigScript = "classpath:password-analysis.pig";

  public void processPasswordFile(String inputFile) {
    String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
    Properties scriptParameters = new Properties();
    scriptParameters.put("inputDir", inputFile);
    scriptParameters.put("outputDir", outputDir);
    pigTemplate.executeScript(pigScript, scriptParameters);
  }
  //...
}
Counting Words Hive

-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;
-- create a virtual view that splits the lines
SELECT word, count(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;
Vanilla Hive: Command-line or JDBC based
Hive w/ SHDP: Create a Hive JDBC client and use it with Spring's JdbcTemplate

<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>
<bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
  c:driver-ref="hive-driver" c:url="${hive.url}"/>
<bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
  c:data-source-ref="hive-ds"/>

Reuse Spring's rich ResultSet-to-POJO mapping features:

public long count() {
  return jdbcTemplate.queryForLong("select count(*) from " + tableName);
}

List<Password> result = jdbcTemplate.query("select * from passwords",
  new ResultSetExtractor<List<Password>>() {
    public List<Password> extractData(ResultSet rs) throws SQLException {
      // extract data from result set
    }
  });
Vanilla Hive - Thrift: HiveClient is not thread-safe and throws checked exceptions

public long count() {
  HiveClient hiveClient = createHiveClient();
  try {
    hiveClient.execute("select count(*) from " + tableName);
    return Long.parseLong(hiveClient.fetchOne());
  // checked exceptions
  } catch (HiveServerException ex) {
    throw translateException(ex);
  } catch (org.apache.thrift.TException tex) {
    throw translateException(tex);
  } finally {
    try {
      hiveClient.shutdown();
    } catch (org.apache.thrift.TException tex) {
      logger.debug("Unexpected exception on shutting down HiveClient", tex);
    }
  }
}

protected HiveClient createHiveClient() {
  TSocket transport = new TSocket(host, port, timeout);
  HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
  try {
    transport.open();
  } catch (TTransportException e) {
    throw translateException(e);
  }
  return hive;
}
SHDP Hive

Easy client configuration:
<hive-client-factory host="${hive.host}" port="${hive.port}"/>
<hive-template id="hiveTemplate"/>

Can create an embedded Hive server instance:
<hive-server auto-startup="true" port="${hive.port}"/>

Declarative usage:
<hive-runner run-at-startup="true">
  <hdp:script>
    DROP TABLE IF EXISTS ${wc.table};
  </hdp:script>
  <hdp:script location="word-count.q"/>
</hive-runner>
SHDP - HiveTemplate (Thrift)

One-liners to execute queries:
@Repository
public class HiveTemplatePasswordRepository implements PasswordRepository {
  private @Value("${hive.table}") String tableName;
  private @Autowired HiveOperations hiveTemplate;

  @Override
  public Long count() {
    return hiveTemplate.queryForLong("select count(*) from " + tableName);
  }
}

One-liners for executing scripts:
Properties scriptParameters = new Properties();
scriptParameters.put("inputDir", inputFile);
scriptParameters.put("outputDir", outputDir);
hiveTemplate.query("classpath:hive-analysis.q", scriptParameters);
Cascading Counting Words

Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);

Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Cascading: Based on Spring's type-safe @Configuration

<bean class="wordcount.cascading.CascadingConfig"/>
<bean id="cascade" class="org.springframework.data.hadoop.cascading.HadoopFlowFactoryBean"
  p:configuration-ref="hadoopConfiguration" p:tails-ref="countPipe"/>
<hdp:configuration/>
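As a rough illustration of what the type-safe @Configuration class referenced above might contain, here is a sketch that exposes the word-count pipe assembly from the previous slide as the countPipe bean the factory bean points at; treat the exact wiring as an assumption rather than the talk's actual code.

import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CascadingConfig {

    // The pipe assembly from the previous slide, exposed as the "countPipe"
    // bean referenced by p:tails-ref above.
    @Bean
    public Pipe countPipe() {
        Pipe assembly = new Pipe("wordcount");
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        assembly = new Each(assembly, new Fields("line"), new RegexGenerator(new Fields("word"), regex));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));
        return assembly;
    }
}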
HBase

Bootstrap HBase configuration from the Hadoop Configuration:
<hdp:configuration/>
<hdp:hbase-configuration delete-connection="true"/>
<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate"
  p:configuration-ref="hbaseConfiguration"/>

Template usage:
public List<User> findAll() {
  return hbaseTemplate.find(tableName, "cfInfo", new RowMapper<User>() {
    @Override
    public User mapRow(Result result, int rowNum) throws Exception {
      return new User(Bytes.toString(result.getValue(CF_INFO, qUser)),
          Bytes.toString(result.getValue(CF_INFO, qEmail)),
          Bytes.toString(result.getValue(CF_INFO, qPassword)));
    }
  });
}
Batch Workflows
On Hadoop Workflows: Reuse the same infrastructure for Hadoop-based workflows (e.g. HDFS -> Pig -> MR -> Hive -> HDFS); a step can be any Hadoop job.
Capabilities: Spring + Hadoop + Batch. Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use. Spring Batch for file/DB/NoSQL driven applications. Collect: process local files. Transform: scripting or Java code to transform and enrich. RT Analysis: N/A. Ingest: (batch/aggregate) write to HDFS, or split/filter. Batch Analysis: orchestrate Hadoop steps in a workflow. Distribute: copy data out of HDFS to structured storage. JMX enabled, along with a REST interface for job control.
Spring Batch Configuration

<job id="job1">
  <step id="import" next="wordcount">
    <tasklet ref="import-tasklet"/>
  </step>
  <step id="wordcount" next="pig">
    <tasklet ref="wordcount-tasklet"/>
  </step>
  <step id="pig" next="parallel">
    <tasklet ref="pig-tasklet"/>
  </step>
  <split id="parallel" next="hdfs">
    <flow>
      <step id="mrStep">
        <tasklet ref="mr-tasklet"/>
      </step>
    </flow>
    <flow>
      <step id="hive">
        <tasklet ref="hive-tasklet"/>
      </step>
    </flow>
  </split>
  <step id="hdfs">
    <tasklet ref="hdfs-tasklet"/>
  </step>
</job>
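The tasklets referenced above come from SHDP's batch support. Purely as an illustration of what such a step boils down to, here is a hand-rolled sketch of a Spring Batch Tasklet that submits an injected MapReduce job and waits for it; this is a simplified assumption, not the project's actual tasklet implementation.

import org.apache.hadoop.mapreduce.Job;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class MapReduceTasklet implements Tasklet {

    private final Job job;

    public MapReduceTasklet(Job job) {
        this.job = job;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // run the Hadoop job synchronously; a failed job fails the batch step
        boolean success = job.waitForCompletion(true);
        if (!success) {
            throw new IllegalStateException("Hadoop job failed: " + job.getJobName());
        }
        return RepeatStatus.FINISHED;
    }
}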
Spring Batch Configuration Additional configuration behind the graph Reuse previous Hadoop job definitions
Spring Batch Admin
Event Driven Applications
Capabilities: Spring + Hadoop + EAI. Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use. Big data solutions need to integrate with existing data and apps, and share the core components of a classic integration workflow. Spring Integration for event-driven applications. Collect: single-node or distributed data collection (tcp/jms/rabbit). Transform: scripting or Java code to transform and enrich. RT Analysis: connectivity to multiple analysis techniques. Ingest: write to HDFS, split/filter the data stream to other stores. JMX enabled, plus a control bus for starting/stopping individual components.
Spring Integration Polling Log File: Poll a directory for files (files are rolled over every 10 minutes), copy files to a staging area, copy files to HDFS, and use an aggregator to wait for 10 files within a 20-minute interval before launching the MR job (see the release-strategy sketch below).
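The original flow was configured with Spring Integration XML; as a sketch of the "10 files or 20 minutes" aggregation rule, here is a hypothetical custom ReleaseStrategy (class name and thresholds are assumptions).

import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;

// Hypothetical release strategy for the aggregator: release the batch of file
// messages once 10 have arrived, or once the group has been open for 20 minutes.
public class TenFilesOrTwentyMinutesReleaseStrategy implements ReleaseStrategy {

    private static final int FILE_THRESHOLD = 10;
    private static final long MAX_AGE_MILLIS = 20 * 60 * 1000L;

    @Override
    public boolean canRelease(MessageGroup group) {
        boolean enoughFiles = group.size() >= FILE_THRESHOLD;
        boolean oldEnough = System.currentTimeMillis() - group.getTimestamp() >= MAX_AGE_MILLIS;
        return enoughFiles || oldEnough;
    }
}

In practice the time-based part also needs the aggregator's group timeout or a message-group reaper configured, so the release condition is evaluated even when no new message arrives.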
Spring Integration Syslog to HDFS: Use the tcp/udp syslog adapter; a transformer categorizes messages; route to specific channels based on category; one route leads to an HDFS write, with filtered data stored in Redis (see the HDFS writer sketch below).
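As a sketch of the HDFS leg of this flow, here is a hypothetical POJO that a Spring Integration service activator could invoke to write each categorized syslog line to HDFS using the plain FileSystem API; the class name, one-file-per-event naming, and lack of batching are deliberate simplifications, not the talk's actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSyslogWriter {

    private final FileSystem fs;
    private final Path baseDir;

    public HdfsSyslogWriter(Configuration hadoopConfiguration, String directory) throws IOException {
        this.fs = FileSystem.get(hadoopConfiguration);
        this.baseDir = new Path(directory);
    }

    // Invoked by a service activator for each syslog message payload.
    public void write(String syslogLine) throws IOException {
        // one small file per event keeps the sketch simple; a real pipeline would
        // buffer events and roll files to avoid creating many small HDFS files
        Path file = new Path(baseDir, "event-" + System.nanoTime() + ".log");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes(syslogLine + "\n");
        }
    }
}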
Spring Integration Multi-node Syslog: Spread log collection across multiple machines; use TCP adapters to forward events across machines (other middleware can be used). Reusable flows: break the flow at a channel boundary and insert an inbound/outbound adapter pair.
Resources: Prepping for GA, feedback welcome. Project Page: springsource.org/spring-data/hadoop. Source Code: github.com/springsource/spring-hadoop. Books.
Q&A @CostinL