Pattern: an open source project for migrating predictive models from SAS, etc., onto Hadoop. Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid

1 Pattern: an open source project for migrating predictive models from SAS, etc., onto Hadoop. Paco Nathan, Concurrent, Inc., San Francisco, CA

2 Agenda
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
[Figure: workflow diagram with nodes employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

3 Cascading: origins
API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products.
Wensel was following the Nutch open source project, where Hadoop started.
Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce, a potential blocker for leveraging new open source technology.

4 Cascading: functional programming
Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as Main Street Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
  leverages the JVM and Java-based tools without any need to create new languages
  allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters

5 Cascading: definitions
a pattern language for Enterprise Data Workflows
simple to build, easy to test, robust in production
design principles ensure best practices at scale
[Figure: architecture diagram: Customers, Web App, Support, Logs, Cache, customer profile DBs, Customer Prefs, Modeling (PMML), Analytics Cubes, and Reporting connected to a Data Workflow on a Hadoop Cluster through source, sink, and trap taps]

6 Cascading: usage
Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
ASL 2 license, GitHub src, 5+ yrs production use, multiple Enterprise verticals
[Figure: Enterprise data workflow architecture diagram]

7 Cascading: integrations
partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
serialization: Avro, Thrift, Kryo, JSON, etc.
topologies: Apache Hadoop, tuple spaces, local mode
[Figure: Enterprise data workflow architecture diagram]

8 Cascading: deployments
case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.
use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

9 Cascading: deployments (same case studies and use cases as the previous slide)
workflow abstraction addresses: staffing bottleneck; system integration; operational complexity; test-driven development

11 Enterprise Data Workflows
Let's consider a strawman architecture for an example app: at the front end, LOB use cases drive demand for apps
[Figure: Enterprise data workflow architecture diagram]

12 Enterprise Data Workflows
Same example, in the back office: organizations have substantial investments in people, infrastructure, process
[Figure: Enterprise data workflow architecture diagram]

13 Enterprise Data Workflows
Same example, the heavy lifting: Main Street firms are migrating workflows to Hadoop, for cost savings and scale-out
[Figure: Enterprise data workflow architecture diagram]

14 Cascading workflows: taps
taps integrate other data frameworks, as tuple streams
these are plumbing endpoints in the pattern language
sources (inputs), sinks (outputs), traps (exceptions)
text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
data serialization: Avro, Thrift, Kryo, JSON, etc.
extend a new kind of tap in just a few lines of Java
schema and provenance get derived from analysis of the taps
[Figure: Enterprise data workflow architecture diagram]

15 Cascading workflows: taps
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
highlighted: source and sink taps for TSV data in HDFS

16 Cascading workflows: topologies
topologies execute workflows on clusters
flow planner is like a compiler for queries:
  Hadoop (MapReduce jobs)
  local mode (dev/test or special config)
  in-memory data grids (real-time)
flow planner can be extended to support other topologies
blend flows in different topologies into the same app, for example, batch (Hadoop) + transactions (IMDG)
[Figure: Enterprise data workflow architecture diagram]

17 Cascading workflows: topologies
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
highlighted: flow planner for Apache Hadoop topology

18 Cascading workflows: test-driven development
assert patterns (regex) on the tuple streams
adjust assert levels, like log4j levels
trap edge cases as data exceptions
TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
redirect traps in production to Ops, QA, Support, Audit, etc.
[Figure: Enterprise data workflow architecture diagram]
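
The trap idea on this slide can be illustrated in a few lines of plain Java. This is only a sketch under stated assumptions: `TrapSketch`, `Result`, and `assertMatches` are hypothetical names, not part of the Cascading API, and the sample order tuples are made up. It shows tuples that fail a regex assertion being diverted to a trap instead of stopping the flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only; the names here are hypothetical
// and are NOT part of the Cascading API.
final class TrapSketch {

    /** Output of one stage: tuples that passed, and tuples diverted to the trap. */
    static final class Result {
        final List<String[]> passed = new ArrayList<>();
        final List<String[]> trapped = new ArrayList<>();
    }

    /** Asserts that one column of every tuple matches the regex; failures go to the trap. */
    static Result assertMatches(List<String[]> tuples, int column, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Result result = new Result();
        for (String[] tuple : tuples) {
            boolean ok = column < tuple.length
                && tuple[column] != null
                && pattern.matcher(tuple[column]).matches();
            // a failed assertion is treated as a data exception, not a crash
            (ok ? result.passed : result.trapped).add(tuple);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> orders = new ArrayList<>();
        orders.add(new String[] {"1001", "19.99"});
        orders.add(new String[] {"1002", "n/a"});  // edge case: not a number
        orders.add(new String[] {"1003"});         // edge case: missing field

        Result result = assertMatches(orders, 1, "\\d+\\.\\d{2}");
        System.out.println("passed=" + result.passed.size()
            + " trapped=" + result.trapped.size());
    }
}
```

In this sketch the trapped list plays the role of the trap tap: it can be inspected during development and handed to Ops, QA, or Support in production.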

19 Workflow Abstraction: pattern language
Cascading uses a plumbing metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
In formal terms, this provides a pattern language.
[Figure: word count flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left with Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count, with M and R stage labels]
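
As a rough illustration of the stages named in the flow diagram (Tokenize, Scrub, stop word removal, GroupBy, Count), here is a sketch using only Java streams. It is not Cascading code: `WordCountSketch` and `wordCount` are hypothetical names, and a set lookup stands in for the HashJoin against the stop word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the pipeline stages; hypothetical names,
// not the Cascading API.
final class WordCountSketch {

    static Map<String, Long> wordCount(List<String> documents, Set<String> stopWords) {
        return documents.stream()
            // Tokenize: split each text line on spaces and punctuation
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Scrub: trim and lowercase, then drop empty tokens
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .filter(token -> !token.isEmpty())
            // Stop words: a set lookup stands in for the HashJoin + Regex filter
            .filter(token -> !stopWords.contains(token))
            // GroupBy token, then Count
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area.",
            "The rain, the shadow");
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "is"));
        System.out.println(wordCount(docs, stop));
    }
}
```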

20 Pattern Language
a structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices, i.e., conveying expertise
A Pattern Language, Christopher Alexander, et al., amazon.com/dp/
[Figure: workflow diagram with nodes employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

21 Workflow Abstraction: literate programming
Cascading workflows generate their own visual documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology called literate programming
provides intuitive, visual representations for apps; great for cross-team collaboration
[Figure: word count flow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left with Stop Word List (RHS), Regex token, GroupBy token, Count, Word Count, with M and R stage labels]

22 Literate Programming, by Don Knuth
Literate Programming, Univ of Chicago Press, 1992, literateprogramming.com/
"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

23 Workflow Abstraction: business process
following the essence of literate programming, Cascading workflows provide statements of business process
this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps: "Specify what you require, not how to achieve it."
by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

24 Business Process, by Edgar Codd
"A relational model of data for large shared data banks", Communications of the ACM, 1970, dl.acm.org/citation.cfm?id=
rather than arguing between SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on what apps do: the process of structuring data

25 Cascading: functional programming
Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading, used for their large-scale production deployments
new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
  Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
  Scalding in Scala (2012): github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology, Dan Woods, Forbes, forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

26 Functional Programming for Big Data
WordCount with token scrubbing:
  Apache Hive: 52 lines HQL + 8 lines Python (UDF)
  compared to Scalding: 18 lines Scala/Cascading
functional programming languages help reduce software engineering costs at scale, over time

27 Two Avenues to the App Layer
Enterprise: must contend with complexity at scale every day. Incumbents extend current practices and infrastructure investments, using J2EE, ANSI SQL, SAS, etc., to migrate workflows onto Apache Hadoop while leveraging existing staff.
Start-ups: crave complexity and scale to become viable. New ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
[Figure: chart with axes labeled complexity and scale]

29 PMML: standard
established XML standard for predictive model markup
organized by Data Mining Group (DMG), since
members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.
PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
"PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application."
wikipedia.org/wiki/predictive_model_markup_language

30 PMML: model coverage
Association Rules: AssociationModel element
Cluster Models: ClusteringModel element
Decision Trees: TreeModel element
Naïve Bayes Classifiers: NaiveBayesModel element
Neural Networks: NeuralNetwork element
Regression: RegressionModel and GeneralRegressionModel elements
Rulesets: RuleSetModel element
Sequences: SequenceModel element
Support Vector Machines: SupportVectorMachineModel element
Text Models: TextModel element
Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-pmml2/

31 PMML vendor coverage

33 Pattern: model scoring
migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
great open source tools: R, Weka, KNIME, Matlab, RapidMiner, etc.
integrate with other libraries: Matrix API, etc.
leverage PMML as another kind of DSL
cascading.org/pattern
[Figure: Enterprise data workflow architecture diagram]

34 Pattern: create a model in R
    ## load the packages used below
    library(randomForest)  # randomForest()
    library(pmml)          # pmml()
    library(XML)           # saveXML()

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

35 Pattern: capture model parameters as PMML
    <?xml version="1.0"?>
    <PMML version="4.0" xmlns=" xmlns:xsi=" xsi:schemaLocation="
      <Header copyright="copyright (c)2012 Concurrent, Inc." description="random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="rattle/pmml"/>
        <Application name="rattle/pmml" version="1.2.30"/>
        <Timestamp> :39:28</Timestamp>
      </Header>
      <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
          <Value value="0"/>
          <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
      </DataDictionary>
      <MiningModel modelName="randomforest_model" functionName="classification">
        <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
          <Segment id="1">
            <True/>
            <TreeModel modelName="randomforest_model" functionName="classification" algorithmName="randomforest" splitCharacteristic="binarySplit">
              <MiningSchema>
                <MiningField name="label" usageType="predicted"/>
                <MiningField name="var0" usageType="active"/>
                <MiningField name="var1" usageType="active"/>
                <MiningField name="var2" usageType="active"/>
              </MiningSchema>
    ...
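
The Segmentation element in this PMML combines the per-tree predictions by majority vote. A minimal sketch of that combination step, in plain Java, might look like the following. `MajorityVoteSketch` and `majorityVote` are hypothetical names, this is not Pattern's implementation, and the tie-breaking rule (first label seen wins) is an assumption made here for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of majority voting across segment (tree) predictions.
// Hypothetical names; not taken from the Pattern code base.
final class MajorityVoteSketch {

    /** Returns the label predicted by the most segments. */
    static String majorityVote(List<String> segmentPredictions) {
        if (segmentPredictions.isEmpty()) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        // LinkedHashMap keeps first-seen order, which drives the tie-break below
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (String label : segmentPredictions) {
            tally.merge(label, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            // strict '>' means the first label seen wins a tie (an assumption)
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // three trees vote on the label for one tuple
        System.out.println(majorityVote(Arrays.asList("1", "0", "1")));
    }
}
```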

36 Pattern: score a model, within an app
    public static void main( String[] args ) throws RuntimeException
      {
      String inputPath = args[ 0 ];
      String classifyPath = args[ 1 ];

      // set up the config properties
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
      Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

      // handle command line options
      OptionParser optParser = new OptionParser();
      optParser.accepts( "pmml" ).withRequiredArg();
      OptionSet options = optParser.parse( args );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classify" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      if( options.hasArgument( "pmml" ) )
        {
        String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput( new File( pmmlPath ) )
          .retainOnlyActiveIncomingFields()
          .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

        flowDef.addAssemblyPlanner( pmmlPlanner );
        }

      // write a DOT file and run the flow
      Flow classifyFlow = flowConnector.connect( flowDef );
      classifyFlow.writeDOT( "dot/classify.dot" );
      classifyFlow.complete();
      }

37 Pattern: score a model, using pre-defined Cascading app
[Figure: flow diagram: Customer Orders, Classify (PMML Model), Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps, with M and R stage labels]
cascading.org/pattern

38 Pattern: score a model, using pre-defined Cascading app
    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
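
The commands above mention two measures: a confusion matrix for classifiers and RMSE for regression-style models. As a reminder of what those measures compute, here is a small plain-Java sketch. `MeasureSketch`, `rmse`, and `confusion` are hypothetical names, and this is not how the Pattern app implements its `--measure` or `--rmse` options; it only shows the arithmetic, assuming binary labels 0 and 1.

```java
// Plain-Java illustration of the two measures; hypothetical names,
// not taken from the Pattern code base.
final class MeasureSketch {

    /** Root mean squared error between predicted and actual values. */
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and equal in length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }

    /**
     * Confusion matrix for binary labels (each label must be 0 or 1).
     * counts[a][p] is the number of tuples with actual label a and predicted label p,
     * so counts[1][0] holds false negatives and counts[0][1] holds false positives.
     */
    static int[][] confusion(int[] predicted, int[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("arrays must be equal in length");
        }
        int[][] counts = new int[2][2];
        for (int i = 0; i < actual.length; i++) {
            counts[actual[i]][predicted[i]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(rmse(new double[] {3.0, -3.0}, new double[] {0.0, 0.0}));
        int[][] counts = confusion(new int[] {1, 0, 1, 0}, new int[] {1, 1, 0, 0});
        System.out.println("FN=" + counts[1][0] + " FP=" + counts[0][1]);
    }
}
```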

39 Roadmap: existing algorithms for scoring
Random Forest
Decision Trees
Linear Regression
GLM
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial
Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles
cascading.org/pattern

40 Roadmap: next priorities for scoring
Time Series (ARIMA forecast)
Association Rules (basket analysis)
Naïve Bayes
Neural Networks
algorithms extended based on customer use cases; contact groups.google.com/forum/?fromgroups#!forum/pattern-user
cascading.org/pattern

41 Roadmap: top priorities for creating models at scale
Random Forest
Logistic Regression
K-Means Clustering
Association Rules
plus all models which can be trained via sparse matrix factorization (TSQR => PCA, SVD, least squares, etc.)
a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop
cascading.org/pattern

43 Experiments: comparing models
much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale:
  run multiple variants, then measure relative lift
  Concurrent runtime: tag and track models
the following example compares two models trained with different machine learning algorithms
this is exaggerated: one has an important variable intentionally omitted to help illustrate the experiment

44 Experiments: Random Forest model
    ## train a Random Forest model
    ## example:
    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix: 0 1 class.error

45 Experiments: Logistic Regression model
    ## train a Logistic Regression model (special case of GLM)
    ## example:
    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

[Output: coefficient table with columns Estimate, Std. Error, z value, Pr(>|z|) for (Intercept), var0, and var2; significance codes shown are *** for (Intercept), ** for var0, *** for var2]
NB: this model has var1 intentionally omitted

46 Experiments: comparing results
use a confusion matrix to compare results for the classifiers
Logistic Regression has a lower false negative rate (5% vs. 11%); however, it has a much higher false positive rate (52% vs. 14%)
assign a cost model to select a winner; for example, in an ecommerce anti-fraud classifier:
  FN: chargeback risk
  FP: customer support costs
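
One way to make "assign a cost model to select a winner" concrete is to weight each error rate by a cost. The sketch below uses the false negative and false positive rates from this slide, read as rates within the actual positive and actual negative classes. Everything else is an assumption for illustration: the fraud share of 2%, the cost of 100 per false negative (chargeback), the cost of 5 per false positive (support contact), and the names `CostModelSketch` and `expectedCost`.

```java
// Plain-Java illustration of a simple cost model; the costs and the
// share of positive (fraud) cases are hypothetical.
final class CostModelSketch {

    /**
     * Expected cost per scored transaction:
     *   positiveShare * fnRate * fnCost + (1 - positiveShare) * fpRate * fpCost
     */
    static double expectedCost(double fnRate, double fpRate,
                               double positiveShare, double fnCost, double fpCost) {
        return positiveShare * fnRate * fnCost
            + (1.0 - positiveShare) * fpRate * fpCost;
    }

    public static void main(String[] args) {
        double positiveShare = 0.02;  // assumed share of fraudulent orders
        double fnCost = 100.0;        // assumed chargeback cost per missed fraud
        double fpCost = 5.0;          // assumed support cost per false alarm

        // error rates as reported on the slide
        double logisticRegression = expectedCost(0.05, 0.52, positiveShare, fnCost, fpCost);
        double randomForest = expectedCost(0.11, 0.14, positiveShare, fnCost, fpCost);

        System.out.println("Logistic Regression: " + logisticRegression);
        System.out.println("Random Forest:       " + randomForest);
        System.out.println("winner: "
            + (randomForest <= logisticRegression ? "Random Forest" : "Logistic Regression"));
    }
}
```

Under these assumed costs the Random Forest model is cheaper, because false positives dominate. With a much higher cost per false negative the choice can flip toward Logistic Regression, which is why the cost model, not either error rate alone, selects the winner.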

48 Two Cultures
"A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets."
Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/euth9l
in other words, seeing the forest for the trees
this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)

49 Why Do Ensembles Matter?
[Figure: two panels, "The World" and "The World per Data Modeling"]

50 Algorithmic Modeling
"The trick to being a scientist is to be open to using a wide variety of tools." Breiman
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team

51 Ensemble Models
Breiman: "a multiplicity of data models"
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity, Todd Holloway, ETech (2008), abeautifulwww.com/ensemblelearningetech.pdf
The Story of the Netflix Prize: An Ensembler's Tale, Lester Mackey, National Academies Seminar, Washington, DC (2011), stanford.edu/~lmackey/papers/

52 KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago, (accepted)
19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
kdd13pmml.wordpress.com

54 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

55 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
ANSI SQL for ETL
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

56 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
J2EE for business logic
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

57 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
SAS for predictive models
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

58 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
most of the licensing costs: ANSI SQL for ETL, SAS for predictive models
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

59 Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies
most of the project costs: J2EE for business logic
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

60 Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app, one among many, typically based on 100% open source
Lingual: DW -> ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. -> PMML
source taps for Cassandra, JDBC, Splunk, etc.
sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all
cascading.org
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

61 Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app, one among many, typically based on 100% open source
Lingual: DW -> ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. -> PMML
source taps for Cassandra, JDBC, Splunk, etc.
sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all
cascading.org
ETL stage:
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "etl" )
      .addSource( "example.employee", emplTap )
      .addSource( "example.sales", salesTap )
      .addSink( "results", resultsTap );

    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

62 Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app, one among many, typically based on 100% open source
Lingual: DW -> ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. -> PMML
source taps for Cassandra, JDBC, Splunk, etc.
sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all
predictive model stage:
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses]

63 Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app, one among many, typically based on 100% open source
Lingual: DW -> ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. -> PMML
source taps for Cassandra, JDBC, Splunk, etc.
sink taps for Memcached, HBase, MongoDB, etc.
visual collaboration for the business logic is a great way to improve how teams work together
cascading.org
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses, overlaid with the workflow diagram: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

64 Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app, one among many, typically based on 100% open source
Lingual: DW -> ANSI SQL
business logic in Java, Clojure, Scala, etc.
Pattern: SAS, R, etc. -> PMML
source taps for Cassandra, JDBC, Splunk, etc.
sink taps for Memcached, HBase, MongoDB, etc.
multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster
business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
cascading.org
[Figure: pipeline from data sources through ETL, data prep, and predictive model to end uses, overlaid with the workflow diagram: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

65 references
Enterprise Data Workflows with Cascading, O'Reilly, 2013, amazon.com/dp/
newsletter updates: liber118.com/pxn/

66 acknowledgements
Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern: Chris (Concurrent), Girish (AgilOne), Vijay Srinivas (Impetus), Chris (eBay), Ofer (Hortonworks), Sergey (Nokia), Quinton (IZAZI Solutions), Chris (Airbnb), Villu (JPMML project)

67 drill-down
blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:
cascading.org
zest.to/group11
github.com/cascading
conjars.org
goo.gl/kqtul
concurrentinc.com


More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

@Scalding. https://github.com/twitter/scalding. Based on talk by Oscar Boykin / Twitter

@Scalding. https://github.com/twitter/scalding. Based on talk by Oscar Boykin / Twitter @Scalding https://github.com/twitter/scalding Based on talk by Oscar Boykin / Twitter What is Scalding? Why Scala for Map/Reduce? How is it used at Twitter? What s next for Scalding? Yep, we re counting

More information

BIRT in the World of Big Data

BIRT in the World of Big Data BIRT in the World of Big Data David Rosenbacher VP Sales Engineering Actuate Corporation 2013 Actuate Customer Days Today s Agenda and Goals Introduction to Big Data Compare with Regular Data Common Approaches

More information

This Symposium brought to you by www.ttcus.com

This Symposium brought to you by www.ttcus.com This Symposium brought to you by www.ttcus.com Linkedin/Group: Technology Training Corporation @Techtrain Technology Training Corporation www.ttcus.com Big Data Analytics as a Service (BDAaaS) Big Data

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

BIG DATA SOLUTION DATA SHEET

BIG DATA SOLUTION DATA SHEET BIG DATA SOLUTION DATA SHEET Highlight. DATA SHEET HGrid247 BIG DATA SOLUTION Exploring your BIG DATA, get some deeper insight. It is possible! Another approach to access your BIG DATA with the latest

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

Building Your Big Data Team

Building Your Big Data Team Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling

More information

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

What s Cooking in KNIME

What s Cooking in KNIME What s Cooking in KNIME Thomas Gabriel Copyright 2015 KNIME.com AG Agenda Querying NoSQL Databases Database Improvements & Big Data Copyright 2015 KNIME.com AG 2 Querying NoSQL Databases MongoDB & CouchDB

More information

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Openbus Documentation

Openbus Documentation Openbus Documentation Release 1 Produban February 17, 2014 Contents i ii An open source architecture able to process the massive amount of events that occur in a banking IT Infraestructure. Contents:

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Integrating a Big Data Platform into Government:

Integrating a Big Data Platform into Government: Integrating a Big Data Platform into Government: Drive Better Decisions for Policy and Program Outcomes John Haddad, Senior Director Product Marketing, Informatica Digital Government Institute s Government

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Comprehensive Analytics on the Hortonworks Data Platform

Comprehensive Analytics on the Hortonworks Data Platform Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page

More information

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager steve.gonzales@thinkbiganalytics.com

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users. Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major

More information

Cascading 2 User Guide

Cascading 2 User Guide Cascading 2 User Guide Concurrent, Inc. Copyright 2007-2012 Concurrent, Inc. Publication date October 2012 Table of Contents 1. About Cascading... 1 1.1. What is Cascading?... 1 1.2. Usage Scenarios...

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Model Deployment. Dr. Saed Sayad. University of Toronto 2010 saed.sayad@utoronto.ca. http://chem-eng.utoronto.ca/~datamining/

Model Deployment. Dr. Saed Sayad. University of Toronto 2010 saed.sayad@utoronto.ca. http://chem-eng.utoronto.ca/~datamining/ Model Deployment Dr. Saed Sayad University of Toronto 2010 saed.sayad@utoronto.ca http://chem-eng.utoronto.ca/~datamining/ 1 Model Deployment Creation of the model is generally not the end of the project.

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Please note the following IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

The R pmmltransformations Package

The R pmmltransformations Package The R pmmltransformations Package Tridivesh Jena Alex Guazzelli Wen-Ching Lin Michael Zeller Zementis, Inc.* Zementis, Inc. Zementis, Inc. Zementis, Inc. Tridivesh.Jena@ Alex.Guazzelli@ Wenching.Lin@ Michael.Zeller@

More information

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Real World Big Data Architecture - Splunk, Hadoop, RDBMS Copyright 2015 Splunk Inc. Real World Big Data Architecture - Splunk, Hadoop, RDBMS Raanan Dagan, Big Data Specialist, Splunk Disclaimer During the course of this presentagon, we may make forward looking

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MSTR 10 Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single

More information

Customer Behaviour Analytics: Billions of Events to one Customer-Product Graph. Budapest BI Forum, 6th November 2013 Presented by Paul Lam

Customer Behaviour Analytics: Billions of Events to one Customer-Product Graph. Budapest BI Forum, 6th November 2013 Presented by Paul Lam Customer Behaviour Analytics: Billions of Events to one Customer-Product Graph Budapest BI Forum, 6th November 2013 Presented by Paul Lam About Paul Lam Joined uswitch.com as first Data Scientist in 2010

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

Big Data Analytics and Optimization

Big Data Analytics and Optimization Big Data Analytics and Optimization C e r t i f i c a t e P r o g r a m i n E n g i n e e r i n g E x c e l l e n c e e.edu.in http://www.insof LIST OF COURSES Essential Business Skills for a Data Scientist...

More information

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten MC Brown, Director of Documentation Linas Virbalas, Senior Software Engineer. About Tungsten Replicator Open source drop-in

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015 R Tools Evaluation A review by Analytics @ Global BI / Local & Regional Capabilities Telefónica CCDO May 2015 R Features What is? Most widely used data analysis software Used by 2M+ data scientists, statisticians

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required. What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

Customer Case Study. Sharethrough

Customer Case Study. Sharethrough Customer Case Study Customer Case Study Benefits Faster prototyping of new applications Easier debugging of complex pipelines Improved overall engineering team productivity Summary offers a robust advertising

More information

HDP Enabling the Modern Data Architecture

HDP Enabling the Modern Data Architecture HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,

More information

BIG DATA TOOLS. Top 10 open source technologies for Big Data

BIG DATA TOOLS. Top 10 open source technologies for Big Data BIG DATA TOOLS Top 10 open source technologies for Big Data We are in an ever expanding marketplace!!! With shorter product lifecycles, evolving customer behavior and an economy that travels at the speed

More information

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs 1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be

More information

AtScale Intelligence Platform

AtScale Intelligence Platform AtScale Intelligence Platform PUT THE POWER OF HADOOP IN THE HANDS OF BUSINESS USERS. Connect your BI tools directly to Hadoop without compromising scale, performance, or control. TURN HADOOP INTO A HIGH-PERFORMANCE

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform. PENTAHO PERFORMANCE ENGINEERING

More information

Sunnie Chung. Cleveland State University

Sunnie Chung. Cleveland State University Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo 178627 Database And Data Mining Research Group

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo 178627 Database And Data Mining Research Group RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo 178627 Database And Data Mining Research Group Summary RapidMiner project Strengths How to use RapidMiner Operator

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

In-Memory BigData. Summer 2012, Technology Overview

In-Memory BigData. Summer 2012, Technology Overview In-Memory BigData Summer 2012, Technology Overview Company Vision In-Memory Data Processing Leader: > 5 years in production > 100s of customers > Starts every 10 secs worldwide > Over 10,000,000 starts

More information

Data Mining + Business Intelligence. Integration, Design and Implementation

Data Mining + Business Intelligence. Integration, Design and Implementation Data Mining + Business Intelligence Integration, Design and Implementation ABOUT ME Vijay Kotu Data, Business, Technology, Statistics BUSINESS INTELLIGENCE - Result Making data accessible Wider distribution

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer Automated Data Ingestion Bernhard Disselhoff Enterprise Sales Engineer Agenda Pentaho Overview Templated dynamic ETL workflows Pentaho Data Integration (PDI) Use Cases Pentaho Overview Overview What we

More information

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information