HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara


DRIVING INNOVATION THROUGH DATA
HADOOP IN ENTERPRISE: FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING
Supreet Oberoi
Nov. 4-6, 2014, Big Data Expo, Santa Clara

ABOUT ME
- I am a Data Engineer, not a Data Scientist
- I help enterprises make decisions about their Big Data roadmap and technical strategy: use cases, products, technology decisions, employee skills
- I design Hadoop applications with the intent to operationalize them in enterprise settings: applications on which businesses depend, and which last longer than the technologies underneath them
- This talk is about learning how to design a Big Data strategy that leverages best

BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN
- Open Language
- Open Data
- Open Hardware
- Open Compute Platform
- Open Development Platform

OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE
- Don't equate architecture with language; develop architecture to support multiple
- Support SQL and SQL-like languages
- Encourage development in proven and scalable languages such as Java
- Develop architecture to support a change of programming language (even for the same app)
- Have common performance-management tools across all programming environments

OPEN DATA ENABLES REUSE OF DATA AND APPS
- Develop a common operating picture by promoting reuse with open data
- Prevent exclusive access to data sets through proprietary tools
- Promote a common metadata repository
- Forbid storing data in proprietary formats
- Build seamless integration capabilities

OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE
- Get commodity hardware: commodity hardware will always cost less than optimized, specialized hardware (note: the definition of "specialized" is up for debate)
- Develop and maintain a cluster that can be reused by different applications and technology stacks: avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks
- Harness the collective power of the cluster: avoid fragmenting the cluster if possible

OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM
- Make tradeoffs between reliability and speed based on your business context
- Ensure that moving your application from one Hadoop compute platform (e.g., MapReduce) to another (e.g., Tez) does not:
  - impact application code
  - impact production-monitoring tools
- Resist compute platforms that require your enterprise to acquire significantly new skills (even if it is easy) to become productive
- Avoid new platforms that partition the cluster
- Avoid platforms that do not support Open Data
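
The portability point on this slide (switching compute platforms without touching application code) can be sketched in plain Java. This is an illustration only, not Cascading's API: `ComputeFabric`, `SequentialFabric`, `ParallelFabric` and `StatusReport` are hypothetical names, and the two "fabrics" are simply sequential and parallel JDK streams standing in for platforms such as MapReduce and Tez.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a compute platform (e.g. MapReduce or Tez).
interface ComputeFabric {
    // Groups records by a key and counts the records in each group.
    <K> Map<K, Long> countBy(List<String> records, Function<String, K> keyOf);
}

// One "platform": a single-threaded stream.
class SequentialFabric implements ComputeFabric {
    @Override
    public <K> Map<K, Long> countBy(List<String> records, Function<String, K> keyOf) {
        return records.stream()
                .collect(Collectors.groupingBy(keyOf, Collectors.counting()));
    }
}

// Another "platform": a parallel stream. Same contract, different execution.
class ParallelFabric implements ComputeFabric {
    @Override
    public <K> Map<K, Long> countBy(List<String> records, Function<String, K> keyOf) {
        return records.parallelStream()
                .collect(Collectors.groupingBy(keyOf, Collectors.counting()));
    }
}

// Application code: depends only on the interface, so swapping the fabric
// does not require changing this class.
class StatusReport {
    // Assumes each log line ends with a space followed by a status code.
    static Map<String, Long> requestsPerStatus(ComputeFabric fabric, List<String> logLines) {
        return fabric.countBy(logLines, line -> line.substring(line.lastIndexOf(' ') + 1));
    }
}
```

Calling `StatusReport.requestsPerStatus(new SequentialFabric(), lines)` and then passing `new ParallelFabric()` instead should give the same counts; only the object handed in changes.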

OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY
- Development platforms improve developer productivity and operational excellence: picking the correct platform gives you best practices developed by the community, achieving higher quality
- Invest in picking the correct development platform: open, easy, scalable, popular, tools
- Bet on a sustainable open source platform
- Measure the vitality of the community: number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability
- A proven platform provides tools to get your apps to production

GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
- Building enterprise software to simplify Big Data application development and management
Products and Technology
- CASCADING: open source. The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month
- DRIVEN: enterprise data application management for Big Data apps
Proven: Simple, Reliable, Robust
- Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com

CASCADING - DE-FACTO STANDARD FOR DATA APPS
- Cascading Apps: standard for enterprise data app development
- Your programming language of choice: SQL, Clojure, Ruby
- Supported Fabrics and Data Stores: Mainframe, DB / DW, In-Memory, Data Stores, Hadoop
- New Fabrics: Tez, Storm
- Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and

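DEMO: WORD COUNT EXAMPLE WITH CASCADING

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Main {
      public static void main( String[] args ) {
        String docPath = args[ 0 ]; // input: tab-delimited file with a "text" column
        String wcPath = args[ 1 ];  // output: word counts

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // specify a regex to split "document" text lines into token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps, pipes, etc., into a flow definition
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        // create the Flow
        Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
        wcFlow.complete(); // <<-- Runs jobs on Cluster
      }
    }

Slide callouts: configuration, integration, processing, scheduling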
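
To make the result of the word count flow concrete, here is a minimal plain-JDK sketch of the same three steps (split each line with the slide's regex, group by token, count each group). It does not use Cascading; `WordCountSketch` is a hypothetical name, and dropping empty tokens is an assumption of this sketch, not something the slide's flow does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {
    // Same delimiter regex as the slide: space, brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Returns token -> count, sorted by token.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // split step: one line in, many tokens out
                .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
                // assumption of this sketch: discard empty tokens that appear
                // between adjacent delimiters such as ", "
                .filter(token -> !token.isEmpty())
                // group-by and count steps
                .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }
}
```

For example, `WordCountSketch.count(Arrays.asList("the cat, the dog.", "a (small) cat"))` counts `the` and `cat` twice each and `a`, `dog` and `small` once each.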

THE STANDARD FOR DATA APPLICATION DEVELOPMENT
Application platform that addresses:
- Build data apps that are scale-free: design principles ensure best practices at any scale
- Systems Integration: Hadoop never lives alone. Easily integrate with existing systems
- Application Portability: write once, then run on different computation fabrics
- Staffing Bottleneck: use existing Java, SQL, modeling skill sets
- Test-Driven Development: efficiently test code and process local files before deploying on a cluster
- Operational Complexity: simple. Package up into one jar and hand to operations
Proven application development framework for building data apps
www.cascading.org

STRONG ORGANIC GROWTH
- 175,000+ downloads / month
- 7,000+ deployments

BUSINESSES DEPEND ON US
Cascading Java API
- Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
- Easy to operationalize heavy lifting of data in one framework

BUSINESSES DEPEND ON US
Cascalog (Clojure)
- Weather pattern modeling to protect growers against loss
- ETL against 20+ datasets daily
- Machine learning to create models
- Purchased by Monsanto for $930M US

BUSINESSES DEPEND ON US
Scalding (Scala): Twitter
- Makes complex analysis of very large data sets simple
- Machine learning, linear algebra to improve:
  - User experience
  - Ad quality (matching users and ad effectiveness)
- All revenue applications are running on Cascading/Scalding

CASCADING DATA APPLICATIONS
- Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
- Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support; eCRM; Business Reporting
- Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
- Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
- Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
- Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
- Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps

BROAD SUPPORT
- Hadoop ecosystem supports Cascading

CASCADING DEPLOYMENTS

AND INCLUDES A RICH SET OF EXTENSIONS
http://www.cascading.org/extensions/

CASCADING 3.0 - CURRENTLY WIP
Write once and deploy on your fabric of choice.
- The Innovation: Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner.
- Cascading 3.0 will support Local In-Memory and Apache MapReduce, and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm
- Diagram labels: Enterprise Data Applications; Computation Fabrics; Local In-Memory; MapReduce; Apache Tez; Storm

USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS
- Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps
- Supports integrating with any data source that can be accessed through JDBC: a Cascading Tap can be created for any source supporting JDBC
- Great for migration of data and for integrating with non-Big Data assets: extends the life of existing IT assets in an organization
- Diagram labels: CLI / Shell; Enterprise Java; Lingual; Provider API; Catalog; JDBC API; Lingual API; Query Planner; Cascading; Apache Hadoop; Data Stores
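
Because Lingual exposes a JDBC API, an existing Java application could in principle query it through the standard `java.sql` interfaces. The sketch below uses only JDK classes. The URL shape `jdbc:lingual:<platform>;schemas=<path>` is an assumption based on Lingual's published examples and should be checked against the version in use; `LingualJdbcSketch` and its methods are hypothetical names, and the query only runs with the Lingual JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class LingualJdbcSketch {
    // Assumption: Lingual URLs look like "jdbc:lingual:<platform>;schemas=<path>",
    // where platform is e.g. "local" or "hadoop".
    static String jdbcUrl(String platform, String schemaPath) {
        return "jdbc:lingual:" + platform + ";schemas=" + schemaPath;
    }

    // Runs a SQL query and returns the first column of every row as text.
    // Uses only standard JDBC calls, so the caller is independent of the engine.
    static List<String> firstColumn(String url, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                values.add(rows.getString(1));
            }
        }
        return values;
    }
}
```

The design point is that the calling code is ordinary JDBC: moving a query from a relational database to Hadoop would change the URL, not the application logic.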

SCALDING
- Scalding is a language binding to Cascading for Scala
- The name Scalding comes from combining SCALa and cascaDING
- Scalding is great for Scala developers; constructs for matrix math can be written crisply
- Scalding has very large commercial deployments at:
  - Twitter: use cases such as the revenue quality team, ad targeting and traffic quality
  - eBay: use cases include search analytics and other production data pipelines

PATTERN SCORES MODELS AT SCALE
- Pattern is an open source project that takes Predictive Model Markup Language (PMML) models and translates them into Cascading apps
- PMML is a popular XML-based format that lets applications describe data mining and machine learning models
- PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows:
  - Vendor frameworks: SAS, IBM SPSS, MicroStrategy, Oracle
  - Open source frameworks: R, Weka, KNIME, RapidMiner
- Pattern is great for migrating your model scoring from your decision systems to Hadoop

PATTERN SCORES MODELS AT SCALE
- Step 1: Train your model with industry-leading tools
- Step 2: Score your models at scale with Pattern
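
The train-then-score split can be illustrated with a minimal sketch of scoring alone. It assumes a logistic regression model whose intercept and coefficients were fitted elsewhere (Step 1) and handed over as plain numbers. It is not Pattern's API and does not parse PMML; `LogisticModelSketch` is a hypothetical name, and any coefficient values passed in are made up for illustration.

```java
class LogisticModelSketch {
    private final double intercept;
    private final double[] coefficients;

    // Parameters come from a model trained in another tool; this class never trains.
    LogisticModelSketch(double intercept, double[] coefficients) {
        this.intercept = intercept;
        this.coefficients = coefficients.clone();
    }

    // Returns the predicted probability for one record:
    // 1 / (1 + exp(-(intercept + sum of coefficient * feature))).
    double score(double[] features) {
        if (features.length != coefficients.length) {
            throw new IllegalArgumentException("expected " + coefficients.length + " features");
        }
        double z = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            z += coefficients[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

Since `score` depends only on the fixed parameters and one record, it can be applied to each record independently, which is what makes this kind of scoring straightforward to spread across a cluster.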

PATTERN DEMO: FROM TRAINING TO SCORING

WHY PATTERN
- Standards compliance provides integration with many tools
- Models are independent of data and integration
- Only debugging Cascading, not an ensemble of applications

PATTERN: ALGOS IMPLEMENTED
- Hierarchical Clustering
- K-Means Clustering
- Linear Regression
- Logistic Regression
- Random Forest
- Algorithms extended based on customer use cases

OPERATIONAL EXCELLENCE
Visibility Through All Stages of App Lifecycle
- From Development (Building and Testing): Design & Development; Debugging; Tuning
- To Production (Monitoring and Tracking): Maintain Business SLAs; Balance & Controls; Application and Data Quality; Operational Health; Real-time Insights

DRIVEN ARCHITECTURE

CONTACT INFORMATION
Supreet Oberoi
supreet@concurrentinc.com
650-868-7675 (m)
@supreet_online

DRIVING INNOVATION THROUGH DATA THANK YOU Supreet Oberoi