DRIVING INNOVATION THROUGH DATA HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara
ABOUT ME I am a Data Engineer, not a Data Scientist I help Enterprises develop decisions on building their Big Data roadmap and technical strategy use cases, products, technology decisions, employee skills I design Hadoop applications with the intent to operationalize them in Enterprise settings applications on which business depend, and last longer than the technologies underneath them This talk is about learning how to design your BD strategy that leverages best 2
BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN Open Language Open Data Open Hardware Open Compute Platform Open Development Platform 3
OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE Don t equate architecture with language; develop architecture to support multiple Support SQL and SQL-like languages Encourage development in proven & scalable languages as Java Develop architecture to support change of programming languages (even for same app) Have common performance-management tools across all programming environments 4
OPEN DATA ENABLES REUSE OF DATA AND APPS Develop a common operating picture by promoting reuse with open data Prevent exclusive access to data sets through proprietary tools Promote a common meta-data repository Forbid storing data in proprietary formats Build seamless integration capabilities 5
OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE Get commodity hardware commodity hardware will always cost less than optimized specialized hardware (note: definition of specialized is up for debate) Develop and maintain a cluster that can be reused by different applications and technology stacks avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks Harness the power of collective from the cluster avoid fragmenting the cluster if possible 6
OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM Make tradeoffs between reliability & speed based on your business context Ensure that moving your application from one Hadoop compute platform (e.g. MapReduce) to another (e.g., Tez) does not: impact application code impact production-monitoring tools Resist compute platforms that require your enterprise to acquire significantly new skills (even if it is easy) to become productive Avoid new platforms that partition the cluster Avoid platforms that do not support Open Data 7
OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY Development platforms improve developer productivity and operational excellence picking a correct platform gives you best practices developed by the community, achieving higher quality Invest in picking the correct development platform open, easy, scalable, popular, tools, Bet on a sustainable open source platform Measure the vitality of the community: number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability A proven platform provides tools to get your apps to production 8
GET TO KNOW CONCURRENT Leader in Application Infrastructure for Big Data Building enterprise software to simplify Big Data application development and management Products and Technology CASCADING Open Source - The most widely used application infrastructure for Founded: 2008 HQ: San Francisco, CA CEO: Gary Nakamura CTO, Founder: Chris Wensel www.concurrentinc.com building Big Data apps with over 175,000 downloads each month DRIVEN Enterprise data application management for Big Data apps Proven Simple, Reliable, Robust Thousands of enterprises rely on Concurrent to provide their data application infrastructure. 9
CASCADING - DE-FACTO STANDARD FOR DATA APPS Cascading Apps Standard for enterprise SQL Clojure Ruby data app development Your programming language of choice Supported Fabrics and Data Stores New Fabrics Cascading applications that run on MapReduce Mainframe DB / DW In-Memory Data Stores Hadoop Tez Storm will also run on Apache Spark, Storm, and 10
DEMO: WORD COUNT EXAMPLE WITH CASCADING String docpath = args[ 0 ]; String wcpath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowconnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap doctap = new Hfs( new TextDelimited( true, "\t" ), docpath ); Tap wctap = new Hfs( new TextDelimited( true, "\t" ), wcpath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" ); // only returns "token" Pipe docpipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcpipe = new Pipe( "wc", docpipe ); wcpipe = new GroupBy( wcpipe, token ); wcpipe = new Every( wcpipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow definition FlowDef flowdef = FlowDef.flowDef().setName( "wc" ).addsource( docpipe, doctap ).addtailsink( wcpipe, wctap ); // create the Flow Flow wcflow = flowconnector.connect( flowdef ); // <<-- Unit of Work wcflow.complete(); // <<-- Runs jobs on Cluster configuration integration processing scheduling 11
THE STANDARD FOR DATA APPLICATION DEVELOPMENT Application platform that addresses: Build data apps that are scale-free" Systems Integration" Application Portability" Design principals ensure best practices at any scale Hadoop never lives alone. Easily integrate to existing systems Write once, then run on different computation fabrics Staffing Bottleneck" Test-Driven Development" Operational Complexity" Use existing Java, SQL, modeling skill sets Efficiently test code and process local files before deploying on a cluster Simple - Package up into one jar and hand to operations Proven application development framework for building data apps www.cascading.org 12
STRONG ORGANIC GROWTH 175,000+ downloads / month" 7000+ Deployments" 13
BUSINESSES DEPEND ON US Cascading Java API Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts Easy to operationalize heavy lifting of data in one framework 14
BUSINESSES DEPEND ON US Cascalog (Clojure) Weather pattern modeling to protect growers against loss ETL against 20+ datasets daily Machine learning to create models Purchased by Monsanto for $930M US 15
BUSINESSES DEPEND ON US Scalding (Scala) TWITTER Makes complex analysis of very large data sets simple Machine learning, linear algebra to improve User experience Ad quality (matching users and ad effectiveness) All revenue applications are running on Cascading/Scalding 16
CASCADING DATA APPLICATIONS Enterprise IT" Extract Transform Load Log File Analysis Systems Integration Operations Analysis Corporate Apps" HR Analytics Employee Behavioral Analysis Customer Support ecrm Business Reporting Telecom" Data processing of Open Data Geospatial Indexing Consumer Mobile Apps Location based services Marketing / Retail" Mobile, Social, Search Analytics Funnel Analysis Revenue Attribution Customer Experiments Ad Optimization Retail Recommenders Consumer / Entertainment" Music Recommendation Comparison Shopping Restaurant Rankings Real Estate Rental Listings Travel Search & Forecast Finance" Fraud and Anomaly Detection Fraud Experiments Customer Analytics Insurance Risk Metric Health / Biotech" Aggregate Metrics For Govt Person Biometrics Veterinary Diagnostics Next-Gen Genomics Argonomics Environmental Maps 17
BROAD SUPPORT Hadoop ecosystem supports Cascading 18
CASCADING DEPLOYMENTS 19 Confidential
AND INCLUDES RICH SET OF EXTENSIONS http://www.cascading.org/extensions/ 20
CASCADING 3.0 - CURRENTLY WIP Write once and deploy on your fabric of choice. The Innovation Cascading 3.0 will allow for data apps to execute on existing and emerging fabrics through its new customizable query Enterprise Data Applications planner. Local In-Memory MapReduce Apache Tez, Storm, Cascading 3.0 will support Local Computation Fabrics In-Memory, Apache MapReduce and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm 21
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps CLI / Shell Enterprise Java Supports integrating with any data source that can be accessed through JDBC Cascading Tap can be created for any source supporting JDBC Lingual Provider API Catalog JDBC API Lingual API Query Planner Cascading Great for migration of data, integrating with non- Big Data assets extends life of existing IT assets in an organization Apache Hadoop Data Stores 22
SCALDING Scalding is a language binding to Cascading for Scala The name Scalding comes from the combining of SCALa and cascading Scalding is great for Scala developers; can crisply write constructs for matrix math Scalding has very large commercial deployments at: Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality Ebay - Use cases include search analytics and other production data pipelines 23
PATTERN SCORES MODELS AT SCALE Pattern is an open source project that allows to leverage Predictive Model Markup Language (PMML) models and translate them into Cascading apps. PMML is an XML-based popular analytics framework that allows applications to describe data mining and machine learning algorithms PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows" Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle Open source frameworks - R, Weka, KNIME, RapidMiner Pattern is great for migrating your model scoring to Hadoop from your decision systems 24
PATTERN SCORES MODELS AT SCALE Step 1: Train your model with industry-leading Tools Step 2: Score your models at scale with Pattern 25 Confidential
PATTERN DEMO: FROM TRAINING TO SCORING 26
WHY PATTERN Standards compliance provides integration with many tools Models are independent of data and integration Only debugging Cascading, not an ensemble of applications 27
PATTERN: ALGOS IMPLEMENTED Hierarchical Clustering K-Means Clustering Linear Regression Logistic Regression Random Forest algorithms extended based on customer use cases 28 Confidential
OPERATIONAL EXCELLENCE Visibility Through All Stages of App Lifecycle From Development Building and Testing" Design & Development Debugging Tuning To Production Monitoring and Tracking" Maintain Business SLAs Balance & Controls Application and Data Quality Operational Health Real-time Insights 29
DRIVEN ARCHITECTURE
CONTACT INFORMATION Supreet Oberoi" supreet@concurrentinc.com 650-868-7675 (m) @supreet_online
DRIVING INNOVATION THROUGH DATA THANK YOU Supreet Oberoi