Data Scientist: From Mathematics to Data Management
Frederic Precioso, 06/07/2015
Professor at University Nice Sophia Antipolis (UNS)
Laboratory I3S, Joint Research Unit from CNRS & UNS (UMR 7271)
Team Scalable and Pervasive software and Knowledge Systems (SPARKS)
Outlook
1. Data scientist: the sexiest job of the 21st century?
2. Data scientist future profile
3. Data science in academic research
4. Future
These slides are partially based on "(Big) Data (Science) Skills" by Oscar Corcho.
Data Scientist: The Sexiest Job of the 21st Century?
In October 2012, the Harvard Business Review published the article "Data Scientist: The Sexiest Job of the 21st Century" in its issue "Getting Control of Big Data".
Since then, a lot of work has led to the conclusion that there is actually more than one data scientist profile.
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013)
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013)
Based on survey data from several hundred data science professionals, the authors applied data science algorithms and found that data scientists could be clustered into 4 subgroups, each with a different mix of skillsets:
- Data Businesspeople
- Data Creatives
- Data Developers
- Data Researchers
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013)
[Figure legend: ML = Machine Learning; OR = Operations Research]
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (June 2013)
From their answers, the data scientists see themselves as T-shaped experts.
More recently
Big Data Species 1. HPC and e-infrastructure Experts
Background: Computer Science (Systems), System Administration
Terms used in their native language: Blades, Infiniband, OpenMPI, racks, HDF, TBs, Gflops
Their daily life:
- Check system logs
- Make sure that queues are active
- Install a new rack
What's Big Data for them?
- A commercial term for something that they have done for a long time
- They really know how to configure and monitor a Hadoop cluster
- They would love to see those who talk about Big Data run fluid dynamics processes
[Source: Oscar Corcho]
Big Data Species 2. Data Storage and Access Experts
Background: Computer Science, Database Administration
Terms used in their native language: SQL, NoSQL, column store, transactions, Hive, TBs/PBs, TPS (transactions per second)
Their daily life:
- Optimize several queries
- Run a new benchmark
- Design an optimizer / physical operator
What's Big Data for them?
- A new opportunity to work on optimization algorithms
- They know how to configure a database
- They often laugh at those who deploy a NoSQL solution for a problem that can be solved with a relational database
[Source: Oscar Corcho]
Big Data Species 3. Machine Learning Experts
Background: Mathematics, Statistics, Physics, Computer Science
Terms used in their native language: complexity, algorithm, p-value, convergence, precision, recall, ROC curves, Bayesian networks, R
Their daily life:
- Read about a new problem
- Write down a few formulae on the whiteboard (or even a blackboard)
- Prove that the algorithm terminates
What's Big Data for them?
- The same problems applied to larger data, with new challenges
- Problems are not only solved in Hadoop or a powerful NoSQL DB
- They are astonished by those who still mix up correlation and causality
[Source: Oscar Corcho]
Big Data Species 4. Slow-data Experts
Background: Computer Science, Statistics, Library Sciences, Linguistics
Terms used in their native language: information model, vocabulary, ontology, data quality, curation
Their daily life:
- Receive a database schema
- Talk to data producers and (re)users
- Obtain consensus and transform data
What's Big Data for them?
- The difficulty lies in the variety of data formats and structures
- We may integrate data from varied sources, although this is not always possible
- When you manage to integrate heterogeneous data, you can achieve better results
[Source: Oscar Corcho]
Big Data Species 5. (Big Data) Consultants
Background: Computer Science, Economics
Terms used in their native language: business model, business opportunity, Big Data, Data Value Chain, Hadoop, Spark, R, TBs, GFlops
Their daily life:
- Read a Gartner Big Data report
- Talk to potential customers
- Pass customer needs on to technicians
What's Big Data for them?
- It's the 4Vs, plus a few more
- "I have a PPT presentation with a Big Data infrastructure, architecture, and previous projects, which I will use to sell a project to my customers"
[Source: Oscar Corcho]
BigData Ecosystem
Visualization
- Dashboard (Kibana / Datameer)
- Maps (InstantAtlas, Leaflet, CartoDB)
- Charts (GoogleCharts, Charts.js)
- D3.js / Tableau / Flame
Analysis
- Machine Learning (Scikit Learn, Mahout, Spark)
- Search / retrieval (Elastic Search, Solr)
Storage / Access / Exploitation
- File System (HDFS, GGFS, Cassandra)
- Access (Hadoop / Spark / Both, Sqoop)
- Databases / Indexing (SQL / NoSQL / Both, MongoDB, HBase, Infinispan)
- Exploit (LogStash, Flume)
Infrastructures
- Grid Computing / HPC
- Cloud / Virtualization
Intermediate Conclusions
We all know that there are big opportunities in Big Data, but we need to be more productive. For that, we need to:
- Understand that simply using Hadoop, Spark or R does not necessarily mean we are doing Big Data, just as coding in Java does not necessarily mean we understand object-oriented programming
- Understand that we have to interpret results adequately, from a scientific point of view
- Understand the importance of homogenizing datasets in order to facilitate their integration (slow-data)
- Create real multidisciplinary teams
[Source: Oscar Corcho]
Future Profile: multidisciplinary
- Alex Szalay's T-shaped vs Pi-shaped
- Drew Conway's Data Science Venn Diagram
- Jim Gray's idea of the "Fourth Paradigm" of scientific discovery
- Volker Markl: Data Scientist, "Jack of All Trades"!
Future Profile: multidisciplinary
A recent report (in French)* leads to the same conclusion:
«The consensus nowadays is to define the data scientist at the intersection of three areas of expertise: (i) Computer Science, (ii) Statistics and Mathematics, and (iii) Business knowledge. (...) Depending on the training program, one will most probably receive training with a major in either Computer Science, Statistics, or Business knowledge.»
* Serge Abiteboul, François Bancilhon, François Bourdoncle, Stephan Clemencon, Colin De La Higuera, et al. L'émergence d'une nouvelle filière de formation : "data scientists". [Interne] INRIA Saclay. 2014. <hal-01092062>
BigData Academic Research
[Figure: the BigData ecosystem layers (Visualization; Analysis; Storage / Access / Exploitation; Infrastructures), annotated with "R" markers]
What does BigData academic research mean?
- Push the limits of existing approaches, or design new ones, even if it is risky or (very) difficult
- Demonstrate that contributions are theoretically sound
- Compare with others by participating in challenges, or at least on BigData benchmarks
- Complexity and scalability claims are always stronger when they can be proven
2 success stories of Machine Learning among many
Classification: how to separate the data?
$$E_{\mathrm{Real}}(\mathrm{Algorithm}) \le E_{\mathrm{Empirical}}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$$
[Figure label: Machine]
2 success stories of Machine Learning among many
Classification: how to separate the data?
$$E_{\mathrm{Real}}(\mathrm{Algorithm}) \le E_{\mathrm{Empirical}}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$$
[Figure labels: Boosting, Machine, Random Forests]
Ideas of boosting: Football Bets
- If Varane and Sakho play together, the French football team wins.
- If Ntep is not injured, the French football team wins.
- If Benzema is substituted before the end, the French football team loses.
- If Pogba is happy, the French football team wins.
From Antoine Cornuéjols' lecture slides
How to win?
Ask professional gamblers. Let's assume:
- That professional gamblers cannot provide one single decision rule that is both simple and relevant
- But that, faced with several games, they can always provide decision rules a little better than random
Can we become rich?
From Antoine Cornuéjols' lecture slides
Idea
- Ask the expert for heuristics
- Gather a set of cases for which these heuristics fail (difficult cases)
- Ask the expert again to provide heuristics for the difficult cases
- And so on
- Combine these heuristics
Here, "expert" stands for weak learner.
From Antoine Cornuéjols' lecture slides
Questions
How to choose games (i.e. learning examples) at each step?
- Focus on the most difficult games (examples), i.e. the ones on which the previous heuristics are the least relevant
How to merge heuristics (decision rules) into one single decision rule?
- Take a weighted vote of all decision rules
From Antoine Cornuéjols' lecture slides
Boosting
Boosting = a general method to convert several poor decision rules into one very powerful decision rule.
More precisely: given a weak learner which can always provide a decision rule that is (even just slightly) better than random, a boosting algorithm can (theoretically) build a global decision rule with an error rate as low as desired.
A theorem of Schapire on the power of weak learning proves that the combined rule H achieves a higher relevance than a global decision rule learnt directly on all training examples.
From Antoine Cornuéjols' lecture slides
Probabilistic boosting: AdaBoost
The standard algorithm is AdaBoost (Adaptive Boosting). 3 main ideas to generalize towards probabilistic boosting:
1. Use a set of specialized experts and ask them to vote in order to reach a decision.
2. Adaptively weight the votes through a multiplicative update.
3. Modify the example distribution used to train each expert, iteratively increasing the weights of the examples misclassified at the previous iteration.
From Antoine Cornuéjols' lecture slides
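AdaBoost: the algorithm
A training set: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ is the label (annotation) of example $x_i \in S$
A set of weak learners $\{h_t\}$
For $t = 1, \ldots, T$:
- Give a weight to every sample $i \in \{1, \ldots, m\}$ according to how difficult it was for $h_{t-1}$ to classify it correctly: this defines the distribution $D_t$
- Find the weak decision rule ("heuristic") $h_t : S \to \{-1, +1\}$ with the smallest error $\varepsilon_t$ on $D_t$:
$$\varepsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i] = \sum_{i \,:\, h_t(x_i) \neq y_i} D_t(i)$$
- Compute the influence (impact) $\alpha_t$ of $h_t$; in standard AdaBoost, $\alpha_t = \frac{1}{2} \ln\frac{1 - \varepsilon_t}{\varepsilon_t}$
Final decision: $H_{\mathrm{final}}$ = a weighted majority vote of all the $h_t$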
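
The loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the slides: it assumes the weak learners are threshold "stumps" on a single feature, uses the standard AdaBoost choices (uniform initial weights, vote weight 0.5 * ln((1 - eps) / eps), multiplicative weight update), and all function names are hypothetical.

```python
import math


def stump_predict(stump, x):
    """A stump votes `polarity` when x[feature] <= threshold, else the opposite."""
    feature, threshold, polarity = stump
    return polarity if x[feature] <= threshold else -polarity


def train_stump(X, y, weights):
    """Exhaustively pick the stump with the smallest weighted error on D_t."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (+1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for x, yi, w in zip(X, y, weights)
                          if stump_predict(stump, x) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, n_rounds):
    """Return a list of (alpha_t, h_t) pairs after n_rounds of boosting."""
    m = len(X)
    weights = [1.0 / m] * m  # assumption: the first distribution is uniform
    ensemble = []
    for _ in range(n_rounds):
        stump, eps = train_stump(X, y, weights)
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, stump))
        # Multiplicative update: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * yi * stump_predict(stump, x))
                   for x, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote of all weak decision rules."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D problem that no single stump can solve.
X_toy = [[0], [1], [2], [3], [4], [5]]
y_toy = [1, 1, -1, -1, 1, 1]
```

On this toy set a single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all of them correctly, which is the "several poor rules into one powerful rule" effect described earlier.
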
Error of generalization for AdaBoost
The generalization error of $H_T$ can be bounded by:
$$E_{\mathrm{Real}}(H_T) \le E_{\mathrm{Empirical}}(H_T) + O\!\left(\sqrt{\frac{T \cdot d}{m}}\right)$$
where
- $T$ is the number of boosting iterations
- $m$ is the number of training examples
- $d$ is the dimension of the $H_T$ space (weak learners' complexity)
[Figure: error as a function of the number of iterations]
The Task of Face Detection
Many slides adapted from P. Viola
Basic Idea
Slide a window across the image and evaluate a face model at every location.
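
As a rough illustration of this scanning scheme, here is a minimal pure-Python sketch. It assumes a single scale, a grayscale image stored as a list of rows, and a hypothetical `face_model` callable returning True or False for a patch; the default window size and step are illustrative values, and a real detector would also scan several scales.

```python
def sliding_windows(width, height, window=24, step=4):
    """Yield the (left, top) corner of every window that fits in the image."""
    for top in range(0, height - window + 1, step):
        for left in range(0, width - window + 1, step):
            yield left, top


def detect(image, face_model, window=24, step=4):
    """Evaluate the (hypothetical) face model at every window location."""
    height, width = len(image), len(image[0])
    hits = []
    for left, top in sliding_windows(width, height, window, step):
        patch = [row[left:left + window] for row in image[top:top + window]]
        if face_model(patch):
            hits.append((left, top))
    return hits
```

The number of evaluated windows grows quickly with the image size, which is why each evaluation has to be very cheap.
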
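Image Features
Feature value = $\sum$ (pixels in white area) $-$ $\sum$ (pixels in black area)
[Figure: example feature values -33, 3, 27, -29, 1, 29, -30, 28, 6]
Weak classifiers obtained by thresholding the value $f_j(x)$ of feature $j$ on a sub-window $x$:
$$h_1(x) = \begin{cases} 1 & \text{if } f_1(x) < 29 \\ 0 & \text{otherwise} \end{cases} \qquad h_2(x) = \begin{cases} 1 & \text{if } f_2(x) < 26 \\ 0 & \text{otherwise} \end{cases} \qquad h_3(x) = \begin{cases} 1 & \text{if } f_3(x) > 11 \\ 0 & \text{otherwise} \end{cases}$$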
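
A rectangle feature of this kind (white-area pixel sum minus black-area pixel sum, then a threshold) can be sketched as follows. The sketch assumes a two-rectangle feature whose left half is white and right half is black, and it uses an integral image (summed-area table), the technique from Viola and Jones's published detector, so that each rectangle sum costs four lookups. The slide itself does not say how the sums are computed, and all function names are hypothetical.

```python
def integral_image(img):
    """ii[r][c] = sum of img over rows < r and columns < c (one-pixel zero border)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])


def two_rect_feature(ii, top, left, height, width):
    """Feature value: white (left half) pixel sum minus black (right half) pixel sum."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black


def weak_classifier(value, threshold, greater=False):
    """Return 1 if the feature value passes the threshold test, 0 otherwise."""
    if greater:
        return 1 if value > threshold else 0
    return 1 if value < threshold else 0


tiny = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
tiny_ii = integral_image(tiny)
value = two_rect_feature(tiny_ii, 0, 0, 2, 4)  # (1+2+5+6) - (3+4+7+8)
```

Because the table is built once per image, the cost of a feature does not depend on the size of its rectangles.
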
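AdaBoost Cascade Principle
Each stage is an AdaBoost classifier that outputs 1 (possible face, passed on to the next stage) or 0 (non-face, rejected immediately). Percentages are relative to the initial set of sub-windows.
- Stage 1: keeps Face x 99% and Non-face x 30%; rejects Non-face x 70%
- Stage 2: keeps Face x 98% and Non-face x 9%; rejects Non-face x 21%
- ...
- Stage N: keeps Face x 90% and Non-face x 0.00006%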
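
The arithmetic behind these percentages can be checked with a short sketch. It assumes, as a simplification, that every stage keeps the same fraction of faces (99%) and of non-faces (30%), independently of the other stages; the slide only gives the figures for stages 1, 2 and N. The function names are hypothetical.

```python
def cascade_rates(n_stages, detection_rate=0.99, false_positive_rate=0.30):
    """Cumulative fraction of faces and non-faces still alive after each stage."""
    rates = []
    faces, non_faces = 1.0, 1.0
    for stage in range(1, n_stages + 1):
        faces *= detection_rate           # faces kept by this stage
        non_faces *= false_positive_rate  # non-faces wrongly kept by this stage
        rates.append((stage, faces, non_faces))
    return rates


def cascade_predict(stages, window):
    """Reject as soon as one stage says 'non-face'; accept only if all stages pass."""
    for stage in stages:
        if not stage(window):
            return False
    return True
```

After two such stages about 98% of the faces and 9% of the non-faces remain, matching the second stage above. The early exit in `cascade_predict` is what makes the detector fast: most sub-windows are discarded by the first, cheapest stages.
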
The Implemented System
Training Data
- 5000 faces: all frontal, rescaled to 24x24 pixels
- 300 million non-face sub-windows, from 9500 non-face images
- Faces are normalized: scale, translation
Many variations
- Across individuals
- Illumination
- Pose
Results
[Figures: detection results on fixed images and on a video sequence, for frontal faces, left profile faces and right profile faces]
Extension
- Fast and robust
- Other descriptors
- Other cascades (rotation)
- Eye detection, hand detection, body detection
2 success stories of Machine Learning among many
Classification: how to separate the data?
$$E_{\mathrm{Real}}(\mathrm{Algorithm}) \le E_{\mathrm{Empirical}}(\mathrm{Algorithm}) + \mathrm{Capacity}(\mathrm{Algorithm})$$
[Figure labels: Boosting, Machine, Random Forests]
Decision tree to decide playing tennis or not
Objective
- 2 classes: yes and no
- Predict whether a game will be played or not
- Temperature can easily be converted into a numerical attribute
I.H. Witten and E. Frank, Data Mining, Morgan Kaufmann Pub., 2000.
Decision tree to decide playing tennis or not
[Figure: partial decision tree with leaves labeled Class: NO, Class: YES, Class: YES]
Final decision tree
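
The slides do not state how the attribute tested at each node is chosen. One common criterion, shown here purely as an assumption, is the entropy-based information gain used by ID3-style tree learners. The class counts below are those usually quoted for the classic 14-example "weather" dataset (9 yes / 5 no overall, split by outlook into sunny, overcast and rainy); they are assumed here, since the data table itself is not reproduced in this text.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector, e.g. [yes, no]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in children_counts)
    return entropy(parent_counts) - remainder


# Assumed counts [yes, no] for the classic weather data.
all_games = [9, 5]
by_outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rainy
gain_outlook = information_gain(all_games, by_outlook)
```

Under these assumed counts the gain for outlook is roughly 0.25 bits. A tree learner would compute the same quantity for every attribute and split on the largest one.
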
Decision trees do not converge? Make a forest
Error of generalization for Random Forest
The generalization error of a random forest (RF) can be bounded by:
$$E_{\mathrm{Real}}(RF) \le \frac{\rho\,(1 - s^2)}{s^2}$$
where
- $\rho$ is the mean correlation between two decision trees
- $s$ is the quality of prediction of the set of decision trees
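
To get a feel for this bound, the sketch below simply evaluates the expression rho * (1 - s^2) / s^2 for given values. The function name and the numerical values are illustrative assumptions, not figures from the slides.

```python
def rf_error_bound(rho, s):
    """Upper bound rho * (1 - s**2) / s**2 on the forest's generalization error."""
    if not 0.0 < s <= 1.0:
        raise ValueError("s is expected in (0, 1]")
    return rho * (1.0 - s ** 2) / s ** 2


# Illustrative values only: less correlated trees give a tighter bound.
loose = rf_error_bound(rho=0.2, s=0.8)
tight = rf_error_bound(rho=0.1, s=0.8)
```

The bound shrinks when the trees are individually stronger (larger s) or less correlated (smaller rho), which is the motivation for randomizing the trees of a forest.
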
Success story: Kinect
From "Real-Time Human Pose Recognition in Parts from a Single Depth Image", Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake, CVPR, June 2011.
Success story: Kinect
Other success stories
Support Vector Machines
$$E_{\mathrm{Real}}(SVM) \le E_{\mathrm{Empirical}}(SVM) + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) - \ln\frac{\alpha}{4}}{m}}$$
where
- $m$ is the number of training examples
- $d$ is the dimension of the decision space
The bound is valid with probability $1 - \alpha$.
Artificial Neural Networks and Deep Learning
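
The capacity term of this bound can be evaluated numerically, as a rough sketch. It assumes the usual Vapnik form, sqrt((d (ln(2m/d) + 1) - ln(alpha/4)) / m); the function name and the example values are hypothetical.

```python
import math


def vc_confidence(m, d, alpha):
    """Capacity term sqrt((d * (ln(2m/d) + 1) - ln(alpha/4)) / m)."""
    if m <= 0 or d <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("expected m > 0, d > 0 and 0 < alpha < 1")
    return math.sqrt((d * (math.log(2.0 * m / d) + 1.0)
                      - math.log(alpha / 4.0)) / m)


# Illustrative values only: the term shrinks as the training set grows.
small_sample = vc_confidence(m=1_000, d=10, alpha=0.05)
large_sample = vc_confidence(m=10_000, d=10, alpha=0.05)
```

With d fixed, the term decreases roughly like sqrt(ln(m) / m), so more training examples tighten the gap between empirical and real error.
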
Future trainees
- Before applying a method or a technology, make sure that its original conditions of validity are satisfied
- When a method is extended outside its domain of validity, try to prove the mathematical consistency / stability of the new method
- Demonstrate, or at least provide insight into, its complexity and scalability
In the next few years, new students will graduate with a more global vision of data science challenges, a deep understanding of the layers involved, and a better knowledge of powerful techniques.
I will be glad to answer any questions.
Frederic Precioso