Analytics Drives Big Data Drives Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage, May 8th, 2013
aloke@cruxly.com
What's Common Between a Sensor that Could Distinguish a Fine Cognac, and Predicting Movies You'd Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
The Sommelier Robot
Predicting What Movies You'd Watch
(Analytics, BigData, DataStore)+
Many Analytics Techniques...
Fields and tools: Statistics, R, AI, Machine Learning, Expert Systems
Techniques: Linear Regression, Time-Series, Decision Trees, Naïve Bayes, SVM, Random Forests, Neural Networks, Genetic Algorithms, LDA, K-nearest neighbor...
Milestones: SNARC (Minsky) 1951; AI (McCarthy) 1956; Dendral (Feigenbaum) 1965; Fraser and Burnell (1970); Vapnik (1992); Ihaka and Gentleman (1993)
Common Analytics Processing pre-2000
Sources: Local
Data: Numeric, Homogeneous
Processing: Local
Consumer: Local
Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems...
Flavor Predictor Neural Networks USPTO #5,373,452 (1994) 1988
Pattern Recognition Genetic Algorithms USPTO #5,140,530, 1992
Small to Big http://article.wn.com/view/2013/04/04/big_data_forefather_michael_stonebraker_shows_no_signs_of_sl/#/related_news
Typical Analytics: 2000-2006
Sources: Global, Social Networks
Data: Heterogeneous, Numeric, Text
Processing: Hosted/Scale
Consumer: Global
Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.
2007- : Internet Data Analytics
Financial Risk Scoring: Detect
Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention risk (or equivalent terms) during the earnings call
Financial Risk Scoring: Listen
*Risk Scoring: detect incremental change in occurrences where corporate officers mention risk (or semantically equivalent terms) during the corporate earnings call
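The risk-scoring idea above can be sketched roughly as follows. This is a minimal illustration and not the method used in the talk: the term list `RISK_TERMS`, the function names, and the sample transcripts are all hypothetical, and a plain word match stands in for the semantic-equivalence matching the slide mentions.

```python
import re

# Hypothetical term list; the talk refers to "risk (or semantically
# equivalent terms)" without enumerating them.
RISK_TERMS = {"risk", "risks", "exposure", "uncertainty", "headwinds"}


def count_risk_mentions(transcript: str) -> int:
    """Count word tokens in the transcript that match a risk term."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    return sum(1 for token in tokens if token in RISK_TERMS)


def risk_deltas(transcripts: list[str]) -> list[int]:
    """Change in risk-mention count from each call to the next one."""
    counts = [count_risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up earnings-call snippets, oldest first.
calls = [
    "We see strong demand and limited risk this quarter.",
    "Currency exposure is a risk, and we face uncertainty in Europe.",
]
print(risk_deltas(calls))
```

A positive delta means officers mentioned risk more often than on the previous call; what size of change counts as a signal is left open here.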
Banking: Credit Worthiness (remember 2008?)
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks
Share of Voice: Online Buzz
Sentiment Analysis
Analytics Processing: 2007-
Sources: Global, Mobile, New Social (Instagram, ...)
Data: Multi-Dimensional, Heterogeneous, Audio/Video
Processing: Hosted/Scale
Consumer: Global
Analytics: Batch, Streaming, ...
2008 - : Real-Time/Streaming Analytics
Brand Marketing
Brand Management
Customer Support
Customer Support
Lead Generation
... More Data, Faster http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=ciominute05062013cioa
Internet of Things
Machine-to-Machine
Message Queuing Telemetry Transport
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-form2m-technology-to-drive-connected-smarter-cities/
AumniData: Batch Processing
Sources: Twitter, YouTube, RSS/ATOM Feed, Blog/Web Sites
Requestor / URL Scanner
Data Collectors (Batch Scheduled), multiple instances
Content Store
Content / Metadata Index (MySQL)
NLP + Cruxly Intent Detection (AWS), multiple instances
NLP Stack + AumniData Classifier + Analytics* (RackSpace VM)
Dashboard Store (SQL Server)
Dashboard Configuration (Tomcat)
Dashboard Application (3rd party App)
Outputs: Custom Analytics Display, Ad-Hoc Query, Summary
Cruxly: Stream Processing
Source: Twitter
Reports / Dashboard, Tracker Editor (web app - Heroku)
Request (Keywords) to Streaming API Clients (Heroku Workers, 24x7, multiple instances)
Tweets (Keywords) returned from the stream
Tweets Content Store (DynamoDB)
Tweet ID + Intent Signal (Heroku PostgreSQL)
NLP (NER, etc.) + Cruxly Intent Detection (AWS), multiple instances
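The flow on this slide (tracked keywords, streaming clients, intent detection, then a content store and a tweet ID + intent signal store) can be sketched in miniature. Everything below is an assumption made for illustration: the function names, the toy keyword rules in `detect_intent`, and the in-memory dictionaries that stand in for the hosted stores. It does not reflect Cruxly's actual NLP or any real Twitter client.

```python
from typing import Iterable, Iterator


def filter_stream(tweets: Iterable[dict], keywords: set[str]) -> Iterator[dict]:
    """Stand-in for a streaming client: pass through tweets that
    contain at least one tracked keyword as a whole word."""
    for tweet in tweets:
        words = set(tweet["text"].lower().split())
        if words & keywords:
            yield tweet


def detect_intent(text: str) -> str:
    """Toy placeholder for intent detection (hypothetical rules)."""
    lowered = text.lower()
    if "want" in lowered or "buy" in lowered:
        return "purchase"
    if "broken" in lowered or "help" in lowered:
        return "support"
    return "none"


def process(tweets: Iterable[dict], keywords: set[str]) -> tuple[dict, dict]:
    """Process tweets one at a time as they arrive from the stream."""
    content_store = {}  # stands in for the tweet content store
    signal_store = {}   # stands in for the tweet ID + intent signal store
    for tweet in filter_stream(tweets, keywords):
        content_store[tweet["id"]] = tweet["text"]
        signal_store[tweet["id"]] = detect_intent(tweet["text"])
    return content_store, signal_store


# Made-up sample stream.
sample = [
    {"id": 1, "text": "I want to buy a new phone"},
    {"id": 2, "text": "Lovely weather today"},
    {"id": 3, "text": "My phone is broken please help"},
]
content, signals = process(sample, {"phone"})
print(signals)
```

Keeping the raw text and the derived signal in separate stores mirrors the split on the slide, where only the tweet ID and intent signal go to the relational store.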
Data Analytics Demands...
Layers: View, Analyze, Process, Store
Components: NLP, Classify, Index; Dashboards, Chart, Report; Query / RT Query, Ad Hoc / Search / SQL, Custom Analytics; Machine Learning Library, Stats Library, R; Data Collector (Text / Sensor Data / Stream...); Storm, YARN
Storage Implications: Back to the Future
IOPS: Stream
MB/s: Batch
Both?
Storage Implications: Back to the Future II, III
HDFS, MapReduce
Master: Job Tracker, Name Node, Task tracker, Data Node
Slave #1 ... Slave #N: Task tracker, Data Node
Mgmt Node: Zookeeper, Hive, Pig, Oozie, HUE, HDFS client
Storage Capacity Scaling? Import/Export Data? Storage Tiering?
A More General Data Analytics Framework?
Ingest: Data Ingesters (Basic), Data Ingesters (Smart), Sensor Processing: Data Integration
Data Ingesters Processing: Stream and Batch
Stores: Metadata / In-Mem Store, Content Store, MapReduce / Distributed Data Store
Analytics Processing
Visualization Library / Interactive Query
Storage: Local Storage / Flash / DAS, SAN
Conclusion
Data Analytics, Big Data, Infrastructure
Big Data: Variety, Volume, Velocity
Infrastructure: Scale-Out, Bandwidth Support, Streaming Support
We Solved the Processing Problem
We Need to Solve the Larger Storage Problem
Grateful Acknowledgements
Kapil Tundwal, Dr. Kirill Kireyev, Dr. Andrew Lampert, Venky Madireddy, Dr. Shumin Wu, Joan Wrabetz