Data Scientist: From Mathematics to data management

Frederic Precioso, 06/07/2015
Professor at University Nice Sophia Antipolis (UNS)
Laboratory I3S, Joint Research Unit from CNRS & UNS (UMR 7271)
Team Scalable and Pervasive software and Knowledge Systems (SPARKS)

Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future
These slides are partially based on "(Big) Data (Science) Skills" by Oscar Corcho.

Data Scientist: The Sexiest Job of the 21st Century?
October 2012: the Harvard Business Review published the article "Data Scientist: The Sexiest Job of the 21st Century" in its issue "Getting Control of Big Data".
Since then, a lot of work has led to the conclusion that there is actually more than one data scientist profile.

Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013)

Based on survey data from several hundred data science professionals, the authors applied data science algorithms and found that data scientists could be clustered into 4 subgroups, each with a different mix of skill sets:
- Data Businesspeople
- Data Creatives
- Data Developers
- Data Researchers

Abbreviations: ML = Machine Learning, OR = Operations Research.

From their answers, the data scientists see themselves as T-shaped experts.

More recently

Big Data Species 1. HPC and e-infrastructure Experts
Background: Computer Science (Systems), System Administration
Terms used in their native language: Blades, Infiniband, OpenMPI, racks, HDF, TBs, Gflops
Their daily life:
- Check system logs
- Make sure that queues are active
- Install a new rack
What's Big Data for them?
- A commercial term for something that they have done for a long time
- They really know how to configure and monitor a Hadoop cluster
- They would love to see those who talk about Big Data run fluid dynamics processes
[source: Oscar Corcho]

Big Data Species 2. Data Storage and Access Experts
Background: Computer Science, Database Administration
Terms used in their native language: SQL, NoSQL, column store, transactions, Hive, TBs/PBs/..., TPS (transactions per second)
Their daily life:
- Optimize several queries
- Run a new benchmark
- Design an optimizer/physical operator
What's Big Data for them?
- A new opportunity to work on optimization algorithms
- They know how to configure a database
- They often laugh at those who deploy a NoSQL solution for a problem that can be solved with a relational database
[source: Oscar Corcho]

Big Data Species 3. Machine Learning Experts
Background: Mathematics, Statistics, Physics, Computer Science
Terms used in their native language: complexity, algorithm, p-value, convergence, precision, recall, ROC curves, Bayesian networks, R
Their daily life:
- Read about a new problem
- Write down a few formulae on the whiteboard (or even a blackboard)
- Prove that the algorithm terminates
What's Big Data for them?
- The same problems applied to data of larger size, with new challenges
- Problems are not only solved in Hadoop or a powerful NoSQL DB
- They are astonished by those who still mix up correlation and causality
[source: Oscar Corcho]

Big Data Species 4. Slow-data Experts
Background: Computer Science, Statistics, Library Sciences, Linguistics
Terms used in their native language: information model, vocabulary, ontology, data quality, curation
Their daily life:
- Receive a database schema
- Talk to data producers and (re)users
- Obtain consensus and transform data
What's Big Data for them?
- The difficulty lies in the variety of data formats and structures
- We may integrate data from varied sources, although this is not always possible
- When you manage to integrate heterogeneous data, you can achieve better results
[source: Oscar Corcho]

Big Data Species 5. (Big Data) Consultants
Background: Computer Science, Economics, ...
Terms used in their native language: business model, business opportunity, Big Data, Data Value Chain, Hadoop, Spark, R, TBs, GFlops
Their daily life:
- Read a Gartner Big Data report
- Talk to potential customers
- Transfer needs to technicians
What's Big Data for them?
- It's the 4 Vs, plus a few more
- "I have a PPT presentation with a Big Data infrastructure, architecture, and previous projects, which I will use to sell a project to my customers"
[source: Oscar Corcho]

BigData Ecosystem
Visualization
- Dashboard (Kibana / Datameer)
- Maps (InstantAtlas, Leaflet, CartoDB, ...)
- Charts (GoogleCharts, Charts.js, ...)
- D3.js / Tableau / Flame
Analysis
- Machine Learning (Scikit Learn, Mahout, Spark)
- Search / retrieval (Elastic Search, Solr)
Storage / Access / Exploitation
- File System (HDFS, GGFS, Cassandra, ...)
- Access (Hadoop / Spark / Both, Sqoop)
- Databases / Indexing (SQL / NoSQL / Both, MongoDB, HBase, Infinispan)
- Exploit (LogStash, Flume, ...)
Infrastructures
- Grid Computing / HPC
- Cloud / Virtualization

Intermediate Conclusions
We all know that there are big opportunities in Big Data, but we need to be more productive. For that, we need to:
- Understand that simply using Hadoop, Spark or R does not necessarily mean we are doing Big Data, just as coding in Java does not necessarily mean we understand object-oriented programming
- Understand that we have to interpret results adequately, from a scientific point of view
- Understand the importance of homogenizing datasets, in order to facilitate their integration (slow data)
- Create real multidisciplinary teams
[source: Oscar Corcho]

Future Profile: multidisciplinary
- Alex Szalay's T-shaped vs Pi-shaped
- Drew Conway's Data Science Venn Diagram
- Jim Gray's idea of the "Fourth Paradigm" of scientific discovery
- Volker Markl: Data Scientist, "Jack of All Trades"!

Future Profile: multidisciplinary
A recent report (in French) * leads to the same conclusion:
«The consensus nowadays is to define the data scientist at the intersection of three areas of expertise: (i) Computer Science, (ii) Statistics and Mathematics, and (iii) Business knowledge. (...) Depending on the training program, one will most probably receive training with a major in either Computer Science, Statistics or Business knowledge.»
* Serge Abiteboul, François Bancilhon, François Bourdoncle, Stephan Clemencon, Colin De La Higuera, et al. L'émergence d'une nouvelle filière de formation : "data scientists". [Interne] INRIA Saclay. 2014. <hal-01092062>

BigData Academic Research
[Figure: the BigData ecosystem layers (Visualization; Analysis; Storage / Access / Exploitation; Infrastructures), each tagged with "R" markers]

What does BigData academic research mean?
- Push the limits of existing approaches or design new ones, even if it is risky or (very) difficult
- Demonstrate that contributions are theoretically sound
- Compare with others by participating in challenges, or at least on BigData benchmarks
- Complexity and scalability are always better when they can be proven

2 success stories of Machine Learning among many
Classification: how to separate the data?
$\mathrm{Error}_{Real}(\mathrm{Algorithm}) \le \mathrm{Error}_{Empirical}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$
The two success stories: Boosting and Random Forests.

Ideas of boosting: Football Bets
- If Varane and Sakho play together, the French football team wins.
- If Ntep is not injured, the French football team wins.
- If Benzema is substituted before the end, the French football team loses.
- If Pogba is happy, the French football team wins.
[From Antoine Cornuéjols's lecture slides]

How to win? Ask professional gamblers
Let's assume:
- That professional gamblers cannot provide one single decision rule that is both simple and relevant
- But that, faced with several games, they can always provide decision rules that are a little bit better than random
Can we become rich?
[From Antoine Cornuéjols's lecture slides]

Idea
- Ask the expert for heuristics
- Gather a set of cases for which these heuristics fail (difficult cases)
- Ask the expert again to provide heuristics for the difficult cases
- And so on
- Combine these heuristics
Here "expert" stands for weak learner.
[From Antoine Cornuéjols's lecture slides]

Questions
How to choose games (i.e. learning examples) at each step?
- Focus on the most difficult games (examples), i.e. the ones on which the previous heuristics are the least relevant
How to merge heuristics (decision rules) into one single decision rule?
- Take a weighted vote of all decision rules
[From Antoine Cornuéjols's lecture slides]

Boosting
Boosting = a general method to convert several poor decision rules into one very powerful decision rule.
More precisely: given a weak learner that can always provide a decision rule better than random (even if only slightly), a boosting algorithm can (theoretically) build a global decision rule H with an error rate as low as desired.
A theorem of Schapire on the power of weak learning proves that H achieves a higher relevance than a global decision rule learnt directly on all training examples.
[From Antoine Cornuéjols's lecture slides]

Probabilistic boosting: AdaBoost
The standard algorithm is AdaBoost (Adaptive Boosting). 3 main ideas to generalize towards probabilistic boosting:
1. Use a set of specialized experts and ask them to vote to take a decision.
2. Weight the votes adaptively, by multiplicative update.
3. Modify the example distribution used to train each expert, iteratively increasing the weights of the examples misclassified at the previous iteration.
[From Antoine Cornuéjols's lecture slides]

AdaBoost: the algorithm
A training set: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ is the label (annotation) of example $x_i \in S$
A set of weak learners $\{h_t\}$
For $t = 1, \ldots, T$:
- Give a weight to every sample in $\{1, \ldots, m\}$ according to how difficult it was for $h_{t-1}$ to classify correctly; these weights form the distribution $D_t$
- Find the weak decision ("heuristic") $h_t : S \to \{-1, +1\}$ with the smallest error $\varepsilon_t$ on $D_t$:
  $\varepsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i] = \sum_{i \,:\, h_t(x_i) \ne y_i} D_t(i)$
- Compute the influence/impact of $h_t$
Final decision: $H_{final}$ = a weighted majority vote of all the $h_t$
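The loop above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions the slide does not state: inputs are 1-D numbers, the weak learners are threshold "stumps", the initial distribution is uniform, and the influence of each weak learner is the standard AdaBoost weight alpha_t = 0.5 ln((1 - eps_t) / eps_t). The names `train_adaboost` and `predict` are hypothetical.

```python
import math


def train_adaboost(xs, ys, rounds):
    """AdaBoost with threshold stumps on 1-D inputs; labels are in {-1, +1}."""
    m = len(xs)
    weights = [1.0 / m] * m           # D_t, assumed uniform at the start
    ensemble = []                     # list of (alpha_t, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        # Find the stump with the smallest weighted error eps_t on D_t.
        best = None
        for thr in thresholds:
            for pol in (+1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = min(max(err, 1e-12), 1 - 1e-12)   # keep the logarithm finite
        # Influence of h_t (standard AdaBoost choice, an assumption here).
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Multiplicative update: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol)
                for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up for illustration): no single stump separates it.
xs = [0, 1, 2, 3, 4, 5]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = train_adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set a single stump always misclassifies at least two points, while the weighted vote of three stumps classifies all six correctly, which is the "several poor rules into one powerful rule" idea in miniature.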

Error of generalization for AdaBoost
The generalization error of $H_T$ can be bounded by:
$E_{Real}(H_T) \le E_{Empirical}(H_T) + O\left(\sqrt{\frac{T \cdot d}{m}}\right)$
where
- $T$ is the number of boosting iterations
- $m$ is the number of training examples
- $d$ is the dimension of the $H_T$ space (the complexity of the weak learners)
[Figure: error versus number of boosting iterations]

The Task of Face Detection
Many slides adapted from P. Viola

Basic Idea
Slide a window across the image and evaluate a face model at every location.
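The scanning loop can be sketched as follows. This is a hedged illustration, not the detector's actual code: the image is assumed to be a list of rows, `face_model` is a hypothetical callable that returns True for a face patch, the 24-pixel window matches the training size mentioned later in the slides, and the step of 4 pixels is an arbitrary choice. Only one scale is scanned here.

```python
def sliding_windows(width, height, window=24, step=4):
    """Yield the (x, y) top-left corner of every window that fits in the image."""
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            yield x, y


def detect(image, face_model, window=24, step=4):
    """Return the corners of all windows on which the face model fires."""
    height, width = len(image), len(image[0])
    hits = []
    for x, y in sliding_windows(width, height, window, step):
        patch = [row[x:x + window] for row in image[y:y + window]]
        if face_model(patch):
            hits.append((x, y))
    return hits


# Usage with a stand-in model: "face" means the top-left pixel equals 1.
image = [[0] * 32 for _ in range(28)]
image[4][8] = 1
print(detect(image, lambda patch: patch[0][0] == 1))
```

Because the model is evaluated at every location, its cost per window dominates the run time, which is why the slides go on to cheap features and a rejection cascade.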

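Image Features
Feature Value = $\sum$ (pixels in white area) $-$ $\sum$ (pixels in black area)
[Figure: example feature values on sample windows: -33, 3, 27, -29, 1, 29, -30, 28, 6]
Each weak classifier thresholds one feature value:
$h_1(\cdot) = 1$ if feature value $< 29$, $0$ otherwise
$h_2(\cdot) = 1$ if feature value $< 26$, $0$ otherwise
$h_3(\cdot) = 1$ if feature value $> 11$, $0$ otherwise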
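A rectangle feature of this kind can be computed in constant time with an integral image (summed-area table). The slide itself does not mention this trick; it is assumed here as the usual companion of such features. The sketch below uses a two-rectangle feature (white left half minus black right half), and all function names are hypothetical.

```python
def integral_image(img):
    """ii[y][x] = sum of img[r][c] for all r < y and c < x (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the w-by-h rectangle whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """White (left half) pixel sum minus black (right half) pixel sum; w must be even."""
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return white - black


def weak_classifier(value, threshold, polarity=+1):
    """Return 1 if polarity * value < polarity * threshold, else 0.

    polarity=+1 gives "value < threshold"; polarity=-1 gives "value > threshold".
    """
    return 1 if polarity * value < polarity * threshold else 0


img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
ii = integral_image(img)
value = two_rect_feature(ii, 0, 0, 4, 2)   # (1+2+5+6) - (3+4+7+8)
print(value, weak_classifier(value, 29))
```

Each rectangle sum costs four table lookups regardless of its size, so thousands of features can be evaluated per window.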

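AdaBoost Cascade Principle
Each stage of the cascade is an AdaBoost classifier. A window rejected at any stage is discarded as Non Face; only accepted windows move on to the next stage.
- Stage 1 passes Face x 99% and Non Face x 30%; it rejects Non Face x 70%
- Stage 2 passes Face x 98% and Non Face x 9%; it rejects Non Face x 21%
- Stage N passes Face x 90% and Non Face x 0.00006%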
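The arithmetic behind these percentages can be checked with a short calculation. The sketch assumes that every stage behaves independently and has the same per-stage rates, 99% of faces kept and 30% of non-faces kept, which are read off the first stage of the figure; real cascades use different rates per stage. The function name `cascade_rates` is hypothetical.

```python
def cascade_rates(stages, detection_rate=0.99, false_positive_rate=0.30):
    """Fractions of faces and of non-faces that survive `stages` stages in a row."""
    faces_kept = detection_rate ** stages
    non_faces_kept = false_positive_rate ** stages
    return faces_kept, non_faces_kept


for n in (1, 2, 5, 10):
    faces, non_faces = cascade_rates(n)
    print(f"{n:2d} stages: faces kept {faces:.2%}, non-faces kept {non_faces:.6%}")
```

Under these assumptions, two stages keep about 98% of faces and 9% of non-faces, matching the figure: the false positive rate collapses geometrically while the detection rate erodes slowly.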

The Implemented System
Training data:
- 5000 faces, all frontal, rescaled to 24x24 pixels
- 300 million non-face sub-windows, from 9500 non-face images
- Faces are normalized: scale, translation
Many variations:
- Across individuals
- Illumination
- Pose

Results
[Figures: detections on fixed images and on a video sequence (frontal face, left profile face, right profile face)]

Extension
- Fast and robust
- Other descriptors
- Other cascades (rotation, ...)
- Eye detection, hand detection, body detection

2 success stories of Machine Learning among many
Classification: how to separate the data?
$\mathrm{Error}_{Real}(\mathrm{Algorithm}) \le \mathrm{Error}_{Empirical}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$
After Boosting, the second success story: Random Forests.

Decision tree to decide playing tennis or not
Objective: 2 classes (yes & no), i.e. predict whether a game will be played or not
Temperature can easily be converted into a numerical attribute
I.H. Witten and E. Frank, Data Mining, Morgan Kaufmann Pub., 2000.

Decision tree to decide playing tennis or not
[Figure: partial decision tree with leaves Class: NO, Class: YES, Class: YES]

Final decision tree

Decision trees do not converge? Make a forest.

Error of generalization for Random Forest
The generalization error of RF can be bounded by:
$E_{Real}(RF) \le \frac{\rho\,(1 - s^2)}{s^2}$
where
- $\rho$ is the mean correlation between two decision trees
- $s$ is the quality of prediction of the set of decision trees
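To see how the two quantities trade off, the bound rho (1 - s^2) / s^2 can be evaluated directly. This is a small sketch assuming that form of the bound; the function name `forest_error_bound` and the numeric values of rho and s are made up for illustration only.

```python
def forest_error_bound(rho, s):
    """Upper bound rho * (1 - s^2) / s^2 on the forest's generalization error.

    rho: mean correlation between two trees; s: prediction quality of the trees.
    """
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    return rho * (1 - s ** 2) / s ** 2


# Illustrative values only: same tree quality, different correlation.
print(forest_error_bound(rho=0.2, s=0.8))
print(forest_error_bound(rho=0.1, s=0.8))
```

For a fixed prediction quality, halving the correlation between trees halves the bound, which is the motivation for randomizing both the training samples and the features seen by each tree.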

Success story: Kinect
From "Real-Time Human Pose Recognition in Parts from a Single Depth Image", Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake, CVPR, June 2011.

Other success stories
Support Vector Machines
$E_{Real}(SVM) \le E_{Empirical}(SVM) + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) - \ln\frac{\alpha}{4}}{m}}$
where
- $m$ is the number of training examples
- $d$ is the dimension of the decision space
The bound is valid with probability $1 - \alpha$.
Artificial Neural Networks and Deep Learning
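The capacity term of this bound can be evaluated numerically to see how it reacts to more data or a richer decision space. The sketch assumes the slide shows the standard VC-style term sqrt((d (ln(2m/d) + 1) - ln(alpha/4)) / m); the function name `capacity_term` and the sample values of m, d and alpha are illustrative only.

```python
import math


def capacity_term(m, d, alpha):
    """sqrt((d * (ln(2m/d) + 1) - ln(alpha/4)) / m), valid with probability 1 - alpha."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(alpha / 4)) / m)


# Illustrative values: the term shrinks with more examples and grows with dimension.
for m, d in [(1_000, 10), (10_000, 10), (1_000, 100)]:
    print(m, d, round(capacity_term(m, d, alpha=0.05), 3))
```

The gap between real and empirical error narrows roughly like the square root of d/m, so ten times more examples buys about a threefold tighter guarantee at fixed dimension.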

Future trainees
- Before applying a method or a technology, make sure that its original conditions of validity are met
- When a method is extended outside its domain of validity, try to prove the mathematical consistency / stability of the new method
- Demonstrate, or at least provide insights into, its complexity and scalability
In the next few years, new students will graduate with a more global vision of data science challenges, a deep understanding of the layers involved and a better knowledge of powerful techniques.

I will be glad to answer any questions.
Frederic Precioso