Analytics Drives Big Data Drives Infrastructure: Confessions of Storage Turned Analytics Geeks

Dr. Aloke Guha, 29th IEEE Conference on Massive Data Storage (MSST), May 8th, 2013, aloke@cruxly.com

What's Common Between a Sensor that Could Distinguish a Fine Cognac, and Predicting Movies You'd Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

The Sommelier Robot

Predicting What Movies You'd Watch

(Analytics, BigData, DataStore)+

Many Analytics Techniques...
Techniques: Linear Regression, Time-Series, Decision Trees, Naïve Bayes, SVM, Random Forests, Neural Networks, Genetic Algorithms, LDA, K-nearest neighbor, ...
Fields and tools: AI, Machine Learning, Expert Systems, Statistics, R
Milestones cited: SNARC (Minsky) 1951; AI (McCarthy) 1956; Dendral (Feigenbaum) 1965; Fraser and Burnell (1970); Vapnik (1992); Ihaka and Gentleman (1993)

Common Analytics Processing: pre-2000
Sources: Local
Data: Numeric, Homogeneous
Processing: Local
Consumer: Local
Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems, ...

Flavor Predictor: Neural Networks. USPTO #5,373,452 (1994). 1988

Pattern Recognition: Genetic Algorithms. USPTO #5,140,530, 1992

Small to Big
http://article.wn.com/view/2013/04/04/big_data_forefather_michael_stonebraker_shows_no_signs_of_sl/#/related_news

Typical Analytics: 2000-2006
Sources: Global, Social Networks
Data: Heterogeneous, Numeric, Text
Processing: Hosted/Scale
Consumer: Global
Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.

2007- : Internet Data Analytics

Financial Risk Scoring: Detect
Risk scoring: detect the incremental change in the number of occurrences where corporate officers mention risk (or equivalent terms) during the earnings call.

Financial Risk Scoring: Listen
*Risk scoring: detect the incremental change in the number of occurrences where corporate officers mention risk (or semantically equivalent terms) during the corporate earnings call.
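The risk-scoring idea on the two slides above can be sketched as a call-over-call difference in term counts. This is only an illustration under stated assumptions: transcripts are plain text in chronological order, and the term list `RISK_TERMS` is hypothetical, since the talk does not say how "semantically equivalent terms" are identified.

```python
import re

# Hypothetical vocabulary; the talk does not list the actual risk terms.
RISK_TERMS = ("risk", "uncertainty", "exposure", "headwind")

def count_risk_mentions(transcript, terms=RISK_TERMS):
    """Count whole-word, case-insensitive mentions of any term (with an optional trailing 's')."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")s?\b"
    return len(re.findall(pattern, transcript, flags=re.IGNORECASE))

def incremental_changes(transcripts):
    """Return the change in mention count from each call to the next."""
    counts = [count_risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

# Two made-up earnings-call snippets, oldest first.
calls = [
    "We see limited risk this quarter.",
    "Risks remain: currency exposure and demand uncertainty are headwinds.",
]
print(incremental_changes(calls))
```

A rising sequence of differences would flag a company whose officers are talking about risk more often; a production system would presumably replace the fixed term list with real semantic matching.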

Banking: Credit Worthiness (remember 2008?)
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks.

Share of Voice: Online Buzz

Sentiment Analysis

Analytics Processing: 2007-
Sources: Global, Mobile, New Social (Instagram, ...)
Data: Multi-Dimensional, Heterogeneous, Audio/Video
Processing: Hosted/Scale
Consumer: Global
Analytics: Batch, Streaming, ...

2008- : Real-Time/Streaming Analytics

Brand Marketing

Brand Management

Customer Support

Customer Support (continued)

Lead Generation

... More Data, Faster
http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=ciominute05062013cioa

Internet of Things, Machine-to-Machine, Message Queuing Telemetry Transport
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-form2m-technology-to-drive-connected-smarter-cities/

AumniData: Batch Processing
[Architecture diagram; labels recovered from the slide]
Sources: Twitter, YouTube, RSS/ATOM Feed, Blog/Web Sites
Collection: Requestor / URL Scanner; Data Collectors (Batch Scheduled)
Stores: Content Store; Content / Metadata Index (MySQL); Dashboard Store (SQL Server)
Analysis: Cruxly Intent Detection (AWS, multiple instances); NLP Stack + AumniData Classifier + Analytics*
Presentation: Dashboard Application (3rd-party app); Dashboard Configuration (Tomcat); Custom Analytics Display; Ad-Hoc Query; Summary
Hosting labels: AWS, RackSpace VM

Cruxly: Stream Processing
[Architecture diagram; labels recovered from the slide]
Source: Twitter
Front end: Tracker Editor (web app, Heroku); Reports / Dashboard
Collection: Request (Keywords); Streaming API Clients (Heroku Workers, 24x7); Tweets (Keywords)
Stores: Content Store (DynamoDB) for tweets; Heroku PostgreSQL for Tweet ID + Intent Signal
Analysis: Cruxly Intent Detection (AWS, multiple instances); NLP (NER, etc.) + Cruxly Intent Detection
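The stream-processing flow on this slide (keywords in, tweets filtered as they arrive, a tweet ID plus intent signal out) can be sketched with Python generators. Everything here is a stand-in: the in-memory `stream` replaces the Twitter Streaming API client, and `detect_intent` is a hypothetical keyword rule, not Cruxly's NLP-based intent detection.

```python
from typing import Dict, Iterable, Iterator, Tuple

def track(tweets: Iterable[Dict], keywords: Iterable[str]) -> Iterator[Dict]:
    """Yield tweets that mention any tracked keyword, one at a time as they arrive."""
    wanted = [k.lower() for k in keywords]
    for tweet in tweets:
        text = tweet["text"].lower()
        if any(k in text for k in wanted):
            yield tweet

def detect_intent(text: str) -> str:
    """Hypothetical placeholder rule standing in for real intent detection."""
    lowered = text.lower()
    if "want" in lowered or "buy" in lowered:
        return "buy"
    if "?" in text:
        return "question"
    return "none"

def intent_signals(tweets: Iterable[Dict], keywords: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Emit (tweet id, intent signal) pairs for tweets that match the keywords."""
    for tweet in track(tweets, keywords):
        yield tweet["id"], detect_intent(tweet["text"])

# A made-up stream; a real client would read from a long-lived connection.
stream = iter([
    {"id": 1, "text": "I want a new phone"},
    {"id": 2, "text": "Lovely weather today"},
    {"id": 3, "text": "Is the phone waterproof?"},
])
for tweet_id, signal in intent_signals(stream, ["phone"]):
    print(tweet_id, signal)
```

Because each stage is a generator, a tweet is filtered and scored as soon as it is read, rather than waiting for a scheduled batch as in the AumniData design.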

Data Analytics Demands...
[Layered diagram; labels recovered from the slide]
Layers: View, Analyze, Process, Store
View labels: Dashboards, Chart, Report
Query labels: Query / RT Query, Ad Hoc / Search / SQL, Custom Analytics
Library labels: Machine Learning Library, Stats Library, R
Processing labels: NLP, Classify, Index; Data Collector (Text / Sensor Data / Stream...)
Platform labels: Storm, Yarn

Storage Implications: Back to the Future
IOPs: Stream
MB/s: Batch
Both?
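The two metrics on this slide are linked by I/O size: bandwidth equals IOPS times the size of each I/O. The sketch below works that arithmetic for two hypothetical workloads. The numbers are invented for illustration, and reading "stream" as many small I/Os and "batch" as fewer large ones is an assumption about the slide's intent, not something the talk states.

```python
def throughput_mib_per_s(iops: float, io_size_kib: float) -> float:
    """Bandwidth in MiB/s implied by an IOPS rate at a fixed I/O size in KiB."""
    return iops * io_size_kib / 1024.0

# Hypothetical workloads, chosen only to show the contrast.
stream_like = throughput_mib_per_s(iops=20_000, io_size_kib=4)    # many small I/Os
batch_like = throughput_mib_per_s(iops=500, io_size_kib=1024)     # few large I/Os

print(f"stream-like: 20000 IOPS -> {stream_like} MiB/s")
print(f"batch-like:    500 IOPS -> {batch_like} MiB/s")
```

Under these assumed numbers the stream-like workload needs 40 times the IOPS but under a sixth of the bandwidth, which is why a store serving both patterns has to be sized on both axes.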

Storage Implications: Back to the Future II, III
[Hadoop cluster diagram: HDFS, MapReduce; labels recovered from the slide]
Nodes: Master, Slave #1 ... Slave #N, Mgmt Node
Components: Job Tracker, Name Node, Task trackers, Data Nodes, Zookeeper, Hive, Pig, Oozie, HUE, HDFS client
Storage Capacity Scaling? Import/Export Data? Storage Tiering?

A More General Data Analytics Framework?
[Framework diagram; labels recovered from the slide]
Ingest: Data Ingesters (Basic); Data Ingesters (Smart); Sensor Processing: Data Integration
Processing: Stream and Batch; Map Reduce / Distributed Data Store; Analytics Processing
Presentation: Visualization Library / Interactive Query
Stores: Metadata / In-Mem Store; Content Store
Storage: Local Storage / Flash / DAS; SAN

Conclusion
Data Analytics, Big Data, Infrastructure
Big Data: Variety, Volume, Velocity
Infrastructure: Scale-Out, Bandwidth Support, Streaming Support
We Solved the Processing Problem
We Need to Solve the Larger Storage Problem

Grateful Acknowledgements
Kapil Tundwal, Dr. Kirill Kireyev, Dr. Andrew Lampert, Venky Madireddy, Dr. Shumin Wu, Joan Wrabetz