Machine Learning for Data-Driven Discovery



Similar documents
Machine Learning in the Big Data Era: Are We There Yet?

Big Data Analytics. Chances and Challenges. Volker Markl

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

BIG DATA What it is and how to use?

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Cray: Enabling Real-Time Discovery in Big Data

Large-Scale Data Processing

Challenges for Data Driven Systems

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Big Data Technologies Compared June 2014

Oracle Big Data SQL Technical Update

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Understanding the Value of In-Memory in the IT Landscape

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Introduction to Data Mining

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

The 4 Pillars of Technosoft s Big Data Practice

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

High-Performance Analytics

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

ANALYTICS IN BIG DATA ERA

ANALYTICS CENTER LEARNING PROGRAM

How To Handle Big Data With A Data Scientist

Data Centric Systems (DCS)

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

The Scientific Data Mining Process

Information Management course

Information Processing, Big Data, and the Cloud

Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Advanced Big Data Analytics with R and Hadoop

BIG DATA-AS-A-SERVICE

Find the Hidden Signal in Market Data Noise

Luncheon Webinar Series May 13, 2013

The 3 questions to ask yourself about BIG DATA

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data Challenges in Bioinformatics

Steven C.H. Hoi School of Information Systems Singapore Management University

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Hur hanterar vi utmaningar inom området - Big Data. Jan Östling Enterprise Technologies Intel Corporation, NER

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Parallel Data Warehouse

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Hybrid Software Architectures for Big

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Bringing Big Data Modelling into the Hands of Domain Experts

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

Big Data and Data Science: Behind the Buzz Words

Spatio-Temporal Networks:

How To Learn To Use Big Data

The? Data: Introduction and Future

Industry 4.0 and Big Data

The Internet of Things and Big Data: Intro

High Performance Predictive Analytics in R and Hadoop:

Internet of Things. Opportunity Challenges Solutions

Reference Architecture, Requirements, Gaps, Roles

Big Data. Lyle Ungar, University of Pennsylvania

Big-Data Computing with Smart Clouds and IoT Sensing

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Data management challenges in todays Healthcare and Life Sciences ecosystems

Oracle Big Data Building A Big Data Management System

Knowledge Discovery from patents using KMX Text Analytics

How To Use Hp Vertica Ondemand

Teradata s Big Data Technology Strategy & Roadmap

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Advanced In-Database Analytics

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

From Spark to Ignition:

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

RevoScaleR Speed and Scalability

Fast Analytics on Big Data with H20

Harnessing the power of advanced analytics with IBM Netezza

Big Data for Big Intel

Klarna Tech Talk: Mind the Data! Jeff Pollock InfoSphere Information Integration & Governance

Big Data solutions to support Intelligent Systems and Applications

COMP9321 Web Application Engineering

Transcription:

Machine Learning for Data-Driven Discovery Thoughts on the Past, Present and Future Sreenivas Rangan Sukumar, PhD Data Scientist and R&D Staff Computaional Data Analytics Group Computational Sciences and Engineering Division Oak Ridge National Laboratory Email: sukumarsr@ornlgov Phone: 865-241-6641 Tomorrow: Experience with Data Parallel Frameworks Food-for-thought towards the exascale data analysis supercomputer ORNL is managed by UT-Battelle for the US Department of Energy

Today s Outline Scalable Machine Learning Recent Advances and Trends State of the Practice Philosophy, Engineering, Process, Paradigms Are we there yet? If yes, how so? If not, why not? Concluding Future Thoughts Offline Debate and Discussion 2

Rows Machine Learning Given examples of a function (x, f(x)), Predict function f(x) for new examples x Columns f 1 f 2 f d Labels 1 2 3 c 1 c 1 c 3 N c 2 c k c k 3

Machine Learning in the Big Data Era Just in case you missed Assumption Complexity of data N ~ O(10 2 ), d ~ O(10 1 ) k ~ O(1) 1990 2000s 2010-Present Insight A model exists Better data will reveal the beautiful model (Knowing why is important) ( eg IRIS data) A model may not exist, but find a model anyway ( Why is not as important) N ~ O(10 6 ) d ~ O(10 4 ) (eg ImageNet) k ~ O(10 4 ) Dilemma: Better data or better algorithms Volume, Velocity, Variety and Veracity have all increased several orders of magnitude Data Model Relationship Model abstracts data Data is the model Models aggregated data It is not anymore about the 2 N ˆ 1 ( X ) 1 1 x x j ji i P ( X C c ) exp f ( x ) average It is about every j i 2 2 2 G ( ) ji ji N i 1 h h i i individual data point Model Parameter Complexity (eg Size of Neural Network) Accuracy, Precision, Recall eg Face Recognition Visual Scene Recognition Computing Capability Personal Computing High Performance Computing O(10 3 ) O(10 10 ) O(10 8 ) to O(10 10 ) in months ~ 70% was accepted Not possible 1 core, 256MB RAM, 8GB disk 1000 cores, 1 teraflops ~95% is the norm ~10% is the best result to date 16 cores,64 GB RAM,2TB disk 3 million cores, 34 petaflops 10-billion parameter network learned to recognize cats from videos Big Data also means Big Expectations Commercial tools are keeping pace with the PC market and not HPC market Number of Dwarves! 7 13 Big Data Magic: Dwarves are doubling 4

Today s Talk Compute is scaling up commensurate the data Is machine learning keeping pace with the data and compute scale-up? If Yes : How so? If Not : Why not? 5

Scalable Machine Learning: Philosophy The Lifecycle of Data-Intensive Discovery Better Data Collection Querying and Retrieval eg Google, Databases Predictive Modeling eg Climate Change Prediction Association Interrogation Modeling, Simulation, & Validation Data-fusion eg Robotic Vision Off-the-shelf Parallel Hardware Custom ICs eg FPGAs, Adapteva, Rasberry Pi) Customized Processing Eg Nvidia GPGPUs, YarcData Urika Multi-core HPC eg ( Cray XK, Cray XC, IBM Blue Gene) Virtual clusters / Cloud computing eg Amazon AWS, SAS (PaaS, + SaaS) 6

Scalable Machine Learning: Philosophy The Lifecycle of Data-Intensive Discovery Querying and Retrieval eg Google, Databases Business Intelligence Relationship analytics Interrogation Better Data Collection Association Data-fusion eg Robotic Vision Predictive Modeling eg Climate Change Prediction Modeling, Simulation, & Validation Predictive modeling appliance Simulation 7

Scalable Machine Learning: Discovery Process The Lifecycle of Data-Intensive Discovery Data-Driven Discovery Process Better Data Collection Querying and Retrieval eg Google, Databases Predictive Modeling eg Climate Change Prediction Association Interrogation Data-fusion eg Robotic Vision Modeling, Simulation, & Validation Descriptive Analytics Diagnostic Analytics Predictive Analytics Prescriptive Analytics What happened? Hindsight Why did it happen? Insight Concept adapted from Gartner s Webinar on Big Data What will happen? Foresight How can we make it happen? 8

Scalable Machine Learning: System Engineering Data-Driven Discovery Process Descriptive Analytics Diagnostic Analytics Predictive Analytics Prescriptive Analytics What happened? Hindsight Why did it happen? Insight Concept adapted from Gartner s Webinar on Big Data What will happen? Foresight How can we make it happen? Staging for Predictive Modeling Extract, Transform, Load Data Pre-processing Feature Engineering Predictive Modeling Rule-base extraction Pairwise-similarity (Distance Computation) Model-parameter estimation Cross validation Inference/ Model Deployment Data is model? Model is data? Adaptive model? Reinforcement? 9

Scalable Machine Learning: Production Staging for Predictive Modeling Extract, Transform, Load Data Pre-processing Feature Engineering Predictive Modeling Rule-base extraction Pairwise-similarity (Distance Computation) Model-parameter estimation Inference/ Model Deployment Cross-validation Data is model? Model is data? Adaptive model? Reinforcement? Disk Intensive File processing and repeated retrieval best done in massively parallel file systems or databases Disk, Memory and Compute Intensive Typically computing an aggregate measure, vector product, a kernel function etc Memory + Compute Intensive Real-time requirements 10

Scalable Machine Learning: Bleeding Edge The Berkeley Data Analysis Stack In-database processing In-memory processing In-disk processing Storage scale-up This is tremendous progress 11

But Is machine learning keeping pace with the data and compute scale-up? If Yes : How so? If Not : Why not? 12

The Big Data Problem (Volume, Velocity, Veracity) Multi-class scaling (Number, Hierarchies) The 5 Challenges of Scalable Machine Learning Given examples of a function (x, f(x)), Predict function f(x) for new examples x Dimensionality (Variety) f 1 f 2 f d Labels 1 c 1 2 c 1 3 N Science of Data (Quality, Structure, Content, Signal-to-noise) Data Science (Infrastructure, Hardware, Software, Algorithms) c 3 c 2 c k c k 13

Challenge #1: Data Science Systems Data Compute Analysis Infrastructure Design Operations Management Architecture Design Operations Databases SQL NoSQL Graph Management Quality Privacy Provenance Governance Structure Matrix/ Table Text, Image, Video Graphs Sequences Spatiotemporal Schema HPC TITAN CADES Cloud Urika Hadoop Programming OpenMP/MPI CUDA/ OpenML RDF/SPARQL SQL Map-Reduce Algorithms In-database In-memory In-situ Design Scalability V&V Viz HCI Interfaces Viz-Analytics Theory Performance of algorithm dependent on architecture Most data scientists/algorithm specialists are used to in-memory tools such as R, MATLAB etc Existing cloud-based solutions are designed for high performance storage and not high-performance compute or in-memory operations Steep learning curve towards programming new innovative algorithms Too many options without guiding benchmarks 14

Challenge #2: Science of Data Data-science is not the same as science of data Is the process of understanding characteristics of data before applying/designing a machine-learning algorithm SNR > 1 SNR >>> 1 SNR << 1 SNR >> 1 Data characterization (Avoid using machine learning as a black box) Signal-noise-ratio, bound on noise iid sampling assumptions stationarity, randomness, ergodicity, periodicity Generating models behind data 15

Challenge #3: The N-d-k problem The Big Data Problem The future is unstructured Text, images, videos, sequences Algorithms and infrastructure expected to handle Big Data ie, increasing N, d and k Feature engineering and requires automation Self-feature extracting methodologies encouraged Traditional (pain staking) pipeline of SMEs creating features from the data will fail or transform into a collaborative-parallel effort Increasing N does not imply increasing information content (Samples can still be good if not better than all of the data statistically) There can be hierarchies within the N-d-k dimensions 16

Challenge #4: The N-d-k problem (d) Traditional algorithms assume N >> d and d > k Most tools available today scale well for increasing N Time-complexity analysis Data characteristics [Chu et al, NIPS 2007] Not so much for increasing d or k [Donoho, 2000] The curse and blessings of dimensionality Methods are emerging : Multi-task learning, Spectral Hashing etc 17

Challenge #5: The N-d-k problem (k) Hays et al, SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes, CVPR 2012 Hasegawa et al, Online Incremental Attribute-based Zeroshot Learning, CVPR 2012 Berg et al, What Does Classifying More Than 10, 000 Image categories Tell Us?, ECCV 2010 What happens when K= K + 1? (adding a new class) Engineered features may not be good enough Trained model has to relearn from the entire feature set without guarantees on accuracy 18

Concluding Thoughts What can be scaled up? Compute Storage Memory Cores What aspect of data that needs scale up? What aspect of algorithm that needs scale up? Dimensions of Big Data Software Volume Hadoop, MPP, Spider Archival Reports Discovery Velocity Analytical Requirements Algorithms Programming Data-Parallel Complexity of Algorithms Compute on Data MapReduce, MPI, Threads Task-parallel Linear Iterative > O(N 2 ) Streaming Batch Retrieval Machine Learning I/O? Network? Variety SQL NoSQL Graph Speed of Execution Real-time Feasibility Future We need benchmarks before me make big investments (Fox et al, 2014) 19

Concluding Thoughts Storage/Memory and Memory/Compute Ratios that are critical for machine learning are smaller than Storage/Compute Ratio Associative memory and cognitively-inspired architectures may prove better than the Von-Neuman store-fetch-execute paradigm May be time to redesign from scratch The machine learning algorithms that scale all use either data-parallelism or the dwarves of parallel computing in some form Encouraging because gives us an intuition to build custom hardware for learning algorithms We have done well so far by treating Analysis as a retrieval problem We can do better 20

Thank You Questions? 21