DSSP Data Science Starter Program - Polytechnique




A novel professional training program on Data Science and Big Data, offered jointly by the Applied Mathematics and Informatics Departments of École Polytechnique.

Year 1 / October 3 - December 13, 2014

1. Target Audience and Prerequisite(s)

The proposed modules are suitable for anyone with some basic knowledge of Computer Science or Statistics. No programming experience is required. The program is designed for individuals (researchers and practitioners). Its concepts and training give a sound understanding of the context and challenges of Big Data, which is shaping the evolution of the sciences and of many business domains. The program suits both early-career professionals and senior managers who need an understanding of this challenging area and its applications.

2. Data Science Starter Program

The training program is aimed at professionals and executives and comprises taught modules, labs and homework. It addresses state-of-the-art topics in Data Science and Big Data, ranging from data collection, storage and processing to analytics and visualization, as well as a range of real-world applications and business/laboratory cases. The program is broad in scope and covers, in sufficient detail, the methods and tools needed to tackle big data problems.

2.1 Master Structure

The training spans 140 taught hours (Fridays and Saturdays, from October to December). Each training day consists of two 3-hour slots plus a 1-hour conference or invited talk. The thematic articulation is as follows:

Week 1. Data Science introduction. Big Data ecosystem: players, software, hardware. Data project cycle/management. Legal issues/security framework.

Week 2. Data Management. Databases / SQL, data cleaning, normalization, feature selection & creation, spectral decompositions and dimensionality reduction.

Weeks 3-5. Data Analysis and Machine Learning. Descriptive (data quality). Exploratory (summary statistics, correlation, ANOVA). Inferential (theory of generalization, sampling, statistical testing). Predictive (supervised, unsupervised machine learning).

Weeks 6-7. Cloud computing & Big Data. Introduction to the basics of the cloud computing paradigm and to performance evaluation for applications in the cloud. Basic concepts of Big Data: Hadoop/MapReduce as a programming model for distributed processing of large datasets. Introduction to NoSQL languages.

Weeks 8-10. Graph & Text Mining and Bigdata Camp. Methods and tools for pre-processing, indexing, querying, retrieval and ranking of text at the document and collection levels. Algorithms for text-oriented applications in web and social networks. Methods and tools for pre-processing graphs and for searching, ranking and evaluating nodes and communities.

2.2 Courses structure and Syllabus

Course: Introduction to Data Science
Objective: To present the big picture of Data Science and of its cycles.
Syllabus:
  Big Data ecosystem: players, software, hardware
  Data project cycle/management
  Legal/security framework

Course: Data Management
Objective: To present the foundations of data management: accessing the data stored in a database and (pre)processing it to prepare its analysis.
Syllabus:
  Databases, SQL, design
  Data processing: normalization, feature selection & creation, spectral decompositions and dimensionality reduction

Course: Data Analysis and Machine Learning
Objective: To present the basis of Data Analysis and Machine Learning: how to describe and explore a dataset, and how to use data to find hidden information and to make predictions with statistical and machine learning algorithms.
Syllabus:
  Looking at the data: descriptive statistics, PCA and dimension reduction, statistical testing
  Unsupervised clustering: clustering, K-Means and K-Means++, DBSCAN, hierarchical clustering
  Linear model and diagnostic: generalization theory, prediction vs inference, linear model and diagnostic
  Logistic regression: logistic regression and variable selection, overfitting and cross validation, metric choice (AUC, Precision/Recall, F-Score, ...)
  Machine Learning: empirical criterion minimization, SVM, regularization for SVM and logistic regression
  Tree methods and ensemble methods: Classification And Regression Tree, bagging and boosting
  Further topics: Naive Bayes, non-parametric methods, neural networks and deep learning, spectral clustering

Course: Cloud Computing & Bigdata
Objective: To introduce the basics of the cloud-computing paradigm, to understand performance evaluation for applications in the cloud, and to understand the basic concepts of Hadoop/MapReduce as a programming model for distributed processing of large datasets.
Syllabus:
  Overview of computing paradigms: Grid Computing, Cluster Computing, Distributed Computing, Utility Computing, Cloud Computing
  Cloud Computing architecture: comparison with traditional computing architecture (client/server)
  Services provided at various levels, role of network protocols, Web services
  Service management in Cloud Computing
  Data security and privacy issues
  Principles of parallel processing and distributed systems
  Functional programming and parallel algorithms for MapReduce
  Hadoop storage, DFS, cluster architecture
  Visual Analytics
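To make the MapReduce programming model named in the Cloud Computing & Bigdata syllabus more concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is not Hadoop code and is not taken from the course material: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and the "shuffle" step only imitates what a distributed framework would do between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Illustrative input only (hypothetical documents).
counts = word_count(["big data big graphs", "big text"])
```

In a real deployment the map and reduce functions would run in parallel on different machines, which is possible because each call depends only on its own input.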

Course: Graph & Text Mining
Objective: Graphs and texts are ubiquitous in social and web data. This module provides methods and tools for pre-processing, indexing, querying, retrieval and ranking of text at the document and collection levels. It also describes algorithms for text-oriented applications in web and social networks. For graphs, the objective is to provide methods and tools for pre-processing graphs and for searching, ranking and evaluating nodes and communities.
Syllabus:
  Community mining methods, graph clustering methods (min-cut, spectral clustering), spectral clustering of graph data
  Ranking algorithms (PageRank), ranking evaluation measures (Kendall tau, NDCG)
  Degeneracy (k-core & extensions)
  Feature extraction for text, scoring, term weighting & the vector space representation, indexing, retrieval functions: term-frequency/inverse-document-frequency (TF-IDF), BM25
  Web Mining. Web personalization and recommendations (collaborative filtering)
  Web Advertising (Google AdWords, second-price auctions, campaign design principles, natural language generation for snippets, campaign optimization algorithms)

Course: Bigdata Camp
Objective: Apply the techniques described in the previous lectures to a case study from an industrial or academic problem, using state-of-the-art methods and machine learning tools.

Course: Conferences - Invited talks
  Case study from industry or academia
  Workshops from machine learning challenges
This is a horizontal activity spanning the whole duration of the master, with guests from academia and industry invited to present topics and experiences from data science and big data case studies.
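As an illustration of the ranking algorithms listed in the Graph & Text Mining syllabus, the following is a minimal sketch of PageRank computed by power iteration in plain Python. It is not taken from the course material: the function name pagerank, the damping factor of 0.85, the fixed iteration count and the small example graph are all assumptions chosen for illustration. The graph is assumed to be a dictionary mapping every node to the list of nodes it links to.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated redistribution of rank.

    graph: dict mapping each node to a list of nodes it links to.
    Every link target is assumed to appear as a key of the dict.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Every node receives the same "teleportation" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's damped rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
```

Because the total rank is redistributed rather than created at each step, the scores keep summing to one, and a page with no incoming links (here "d") ends up with only the teleportation share.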

3. Teaching staff

Faculty
S. Gaiffas (CMAP), http://www.cmap.polytechnique.fr/~gaiffas/
C. Giatsidis (LIX), http://www.lix.polytechnique.fr/~giatsidis/
B. Kegl (LAL), https://users.lal.in2p3.fr/kegl/
E. Le Pennec (CMAP), http://www.cmap.polytechnique.fr/~lepennec/
Eric Matzner-Lober (CMAP)
M. Vazirgiannis (LIX), www.lix.polytechnique.fr/~mvazirg/

Short CV

Stéphane Gaïffas is Professeur Chargé at the department of applied mathematics of Ecole Polytechnique. He does research in Statistics and Machine Learning, with current applications to web marketing, social networks, and health records data in partnership with Caisse Nationale d'Assurance Maladie. He defended his PhD in Statistics on «Nonparametric Regression and Inhomogeneous Information» under the supervision of Marc Hoffman at LPMA - Univ. Denis Diderot in 2005. He was Maître de Conférences at LSTA - Univ Paris 6 between 2007 and 2012. For the past three years he has worked as a scientific consultant on machine learning and big data for several French companies.

Christos Giatsidis is currently a post-doctoral researcher in the Computer Science Laboratory at Ecole Polytechnique in France. He received his Diploma in Computer Science from the Athens Univ. of Economics & Business, Greece, in 2009 and his PhD from Ecole Polytechnique, under the supervision of Prof. Michalis Vazirgiannis. In 2014 he received a "thesis prize" for his thesis entitled "Graph Mining and Community Detection with Degeneracy". He has experience in both the research and industrial domains. His recent industrial work includes predicting a player's obsession for a large French company in the gambling industry and building a prediction model for component failure for a large aeronautics company. His research interests include data/graph mining and algorithms for big data management.

Balázs Kégl received the Ph.D. degree in computer science from Concordia University, Montreal, in 1999. From January to December 2000 he was a Postdoctoral Fellow at the Department of Mathematics and Statistics at Queen's University, Kingston, Canada, holding an NSERC Postdoctoral Fellowship. He was an Assistant Professor from 2001 to 2006 in the Department of Computer Science and Operations Research at the University of Montreal. Since 2006 he has been a research scientist in the Linear Accelerator Laboratory of the CNRS (DR since 2013). He has published more than a hundred papers on unsupervised and supervised learning (principal curves, intrinsic dimensionality estimation, boosting), large-scale Bayesian inference and optimization, and on various applications ranging from music and image processing to systems biology and experimental physics. In his current position he heads the AppStat team, which works on machine learning and statistical inference problems motivated by applications in high-energy particle and astroparticle physics. Since 2014, he has been the chair of the Center for Data Science of the University of Paris Saclay.

Erwan Le Pennec has been an Associate Professor (Professeur associé) at the Applied Math department of École Polytechnique since September 2013. He does his research in statistics and signal processing at the CMAP of the same school. He completed a PhD in Signal Processing with Stéphane Mallat at the Centre de Mathématiques Appliquées de l'École Polytechnique. The subject of his thesis is the introduction of geometry in image representation. He defended it on December 19, 2002, under the title Bandelettes et représentations géométriques des images (Bandelets and geometric representation of images). From 2002 to 2004, he worked as a "post-doc" in a joint project between the CMAP and Let It Wave, a company created by Stéphane Mallat, Christophe Bernard, Jérôme Kalifa and himself to exploit their research on bandelets. From 2004 to 2010, he was a "Maître de Conférences" (Assistant Professor) at the university Paris Diderot (Paris 7) in the "laboratoire de Probabilités et Modèles Aléatoires" (Statistics team). From 2010 to 2013, he was a "Chargé de Recherche" (Research Associate) in the project SELECT of Inria Saclay, a project in which he had already worked in 2009-2010. He also continued to work with Let It Wave as a scientific consultant, even after it was sold to Zoran.

Eric Matzner-Lober has been professor of Statistics at Rennes 2 university since 2007 and is also affiliated with Los Alamos National Laboratory. From this year on, he is also a part-time professor at Ecole Polytechnique. He is a specialist in non-parametric statistics and machine learning. He is a renowned expert in R, a language for which he runs a book series. He also founded a statistical consulting company that was later bought by a major consulting firm.

Dr. Vazirgiannis is a Professor in LIX, Ecole Polytechnique. He currently works in the area of Data Science for Big Data, aiming at harnessing the potential of machine learning algorithms for large-scale data sets including text and graphs. More specifically, his current work is on graph degeneracy for large-scale graph mining, graph-based text retrieval, learning models from time series data and text mining for the web (e.g. advertising, news streams). He teaches data mining and machine learning for big data at Ecole Polytechnique. He has supervised nine completed Ph.D. theses and is supervising six more. He has published chapters in books and encyclopedias, two international books and more than one hundred and twenty (120) papers in international refereed journals and conferences. He has received the ERCIM and Marie Curie EU fellowships. He has also coauthored three patents and attracted significant R&D funding, including national and international research & development projects. He currently leads industrial projects in the area of large-scale machine learning.

4. Master Schedule (3/10/2014 - 15/12/2014)

Session  Date        Topic                               Teaching Faculty              Amphi
1        3/10/2014   Introduction to Data Science        Gaiffas, Le Pennec, Matzner   Painlevé
2        4/10/2014   Introduction to Data Science        Gaiffas, Le Pennec, Matzner   Painlevé
3        10/10/2014  Data Management                     Giatsidis, Vazirgiannis       Painlevé
4        11/10/2014  Data Management                     Giatsidis, Vazirgiannis       Painlevé
5        17/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
6        18/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
7        24/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
8        25/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
9        07/11/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
10       08/11/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
11       14/11/2014  Cloud Computing & Bigdata           Gaiffas, Matzner              Painlevé
12       15/11/2014  Cloud Computing & Bigdata           Gaiffas, Matzner              Painlevé
13       21/11/2014  Cloud Computing & Bigdata           Giatsidis, Vazirgiannis       Painlevé
14       22/11/2014  Cloud Computing & Bigdata           Giatsidis, Vazirgiannis       Painlevé
15       28/11/2014  Graph/Text Mining                   Vazirgiannis, Giatsidis       Painlevé
16       29/11/2014  Graph/Text Mining                   Vazirgiannis, Malliaros       Painlevé
17       5/12/2014   Bigdata Camp                        Kegl, Giatsidis               Painlevé
18       6/12/2014   Bigdata Camp                        Kegl, Giatsidis               Labs
19       12/12/2014  Bigdata Camp                        Kegl, Giatsidis               Painlevé
20       13/12/2014  Bigdata Camp                        Kegl, Giatsidis               Painlevé