DSSP Data Science Starter Program - Polytechnique

A novel professional training program in Data Science and Big Data, offered by École Polytechnique jointly through its Applied Mathematics and Informatics Departments.

Year 1 / October 3 - December 13, 2014

1. Target Audience and Prerequisite(s)

The proposed modules are suitable for anyone with some basic knowledge of Computer Science or Statistics. No programming experience is required. The program is designed for individuals (researchers and practitioners). The concepts and training delivered in this program give a sound understanding of the context and challenges of Big Data, which is shaping the evolution of the sciences and of many business domains. The program suits both early-career professionals and senior managers who need an understanding of this challenging area and its applications.

2. Data Science Starter Program

The training program is aimed at professionals and executives and comprises taught modules, labs and homework. It addresses state-of-the-art topics in Data Science and Big Data, ranging from data collection, storage and processing to analytics and visualization, as well as a range of real-world applications and business/laboratory cases. The program is broad in scope and covers, in sufficient detail, the methods and tools needed to tackle big data problems.

2.1 Master Structure

The training spans 140 taught hours (Fridays and Saturdays, from October to December). Each training day consists of 2 x 3h slots + 1h conference/invited talk. The thematic articulation is as follows:

Week 1. Data Science introduction. Big Data ecosystem: players, software, hardware. Data project cycle/management. Legal issues/security framework.

Week 2. Data Management. Databases / SQL, data cleaning, normalization, feature selection & creation, spectral decompositions and dimensionality reduction.

Weeks 3-5. Data Analysis and Machine Learning. Descriptive (data quality), exploratory (summary statistics, correlation, ANOVA), inferential (theory of generalization, sampling, statistical testing) and predictive (supervised, unsupervised machine learning) analysis.

Weeks 6-7. Cloud Computing & Big Data. Introduction to the basics of the cloud computing paradigm and to performance evaluation for applications in the cloud. Basic concepts of Big Data: Hadoop/MapReduce as a programming model for distributed processing of large datasets. Introduction to NoSQL languages.

Final weeks. Graph & Text Mining and Bigdata Camp. Methods and tools for pre-processing, indexing, querying, retrieval and ranking of text at the document and collection levels. Algorithms for text-oriented applications in web and social networks. Methods and tools for pre-processing graphs and for searching, ranking and evaluating nodes and communities.
2.2 Courses structure and Syllabus

Course: Introduction to Data Science
Objective: To present the big picture of Data Science and of its project cycle.
Syllabus:
- Big Data ecosystem: players, software, hardware
- Data project cycle/management
- Legal/security framework

Course: Data Management
Objective: To present the foundations of data management: accessing data stored in a database and (pre)processing it to prepare its analysis.
Syllabus:
- Databases, SQL, design
- Data processing: normalization, feature selection & creation, spectral decompositions and dimensionality reduction

Course: Data Analysis and Machine Learning
Objective: To present the basics of Data Analysis and Machine Learning: how to describe and explore a dataset, and how to use data to find hidden information and make predictions with statistical and machine learning algorithms.
Syllabus:
- Looking at the data: descriptive statistics, PCA and dimension reduction, statistical testing
- Unsupervised clustering: clustering, K-Means and K-Means++, DBSCAN, hierarchical clustering
- Linear model and diagnostic: generalization theory, prediction vs inference, linear model and diagnostic
- Logistic regression: logistic regression and variable selection, overfitting and cross-validation, metric choice (AUC, Precision/Recall, F-Score, ...)
- Machine Learning: empirical criterion minimization, SVM, regularization for SVM and logistic regression
- Tree methods and ensemble methods: Classification And Regression Trees, bagging and boosting
- Further topics: Naive Bayes, non-parametric methods, neural networks and deep learning, spectral clustering

Course: Cloud Computing & Bigdata
Objective: To introduce the basics of the cloud computing paradigm, to understand performance evaluation for applications in the cloud, and to understand the basic concepts of Hadoop/MapReduce as a programming model for distributed processing of large datasets.
Syllabus:
- Overview of computing paradigms: Grid Computing, Cluster Computing, Distributed Computing, Utility Computing, Cloud Computing
- Cloud Computing architecture: comparison with traditional computing architecture (client/server)
- Services provided at various levels, role of network protocols, web services
- Service management in Cloud Computing
- Data security and privacy issues
- Principles of parallel processing and distributed systems
- Functional programming and parallel algorithms for MapReduce
- Hadoop storage, DFS, cluster architecture, Visual Analytics

Course: Graph & Text Mining
Objective: Graphs and texts are ubiquitous in social and web data. This module provides methods and tools for pre-processing, indexing, querying, retrieval and ranking of text at the document and collection levels. It also describes algorithms for text-oriented applications in web and social networks. For graphs, the objective is to provide methods and tools for pre-processing graphs and for searching, ranking and evaluating nodes and communities.
Syllabus:
- Community mining methods, graph clustering methods (min-cut, spectral clustering), spectral clustering of graph data
- Ranking algorithms (PageRank), ranking evaluation measures (Kendall Tau, NDCG), degeneracy (k-core & extensions)
- Feature extraction for text, scoring, term weighting & the vector space representation, indexing, retrieval functions: term-frequency/inverse-document-frequency (TF-IDF), BM25
- Web Mining. Web personalization and recommendations (collaborative filtering)
- Web Advertising (Google AdWords, 2nd price auctions, campaign design principles, natural language generation for snippets, campaign optimization algorithms)

Course: Bigdata Camp
Objective: Apply the techniques described in the previous lectures to a case study drawn from an industrial or academic problem, using state-of-the-art methods and machine learning tools.

Course: Conferences - Invited talks
- Case study from industry or academia
- Workshops from machine learning challenges
This is a horizontal activity spanning the whole duration of the master, with invited speakers from academia and industry presenting topics and experiences from data science and big data case studies.
3. Teaching staff

Faculty: S. Gaiffas (CMAP), C. Giatsidis (LIX), B. Kegl (LAL), https://users.lal.in2p3.fr/kegl/

Short CV

Stéphane Gaïffas is Professeur Chargé at the department of applied mathematics of Ecole Polytechnique. He does research in Statistics and Machine Learning, with current applications to web marketing, social networks, and health records data in partnership with Caisse Nationale d'Assurance Maladie. He defended his PhD in Statistics, «Nonparametric Regression and Inhomogeneous Information», under the supervision of Marc Hoffman at LPMA - Univ. Denis Diderot. He was Maître de Conférences at LSTA - Univ. Paris 6 from 2007. For the past three years he has worked as a scientific consultant on machine learning and big data for several French companies.

Christos Giatsidis is currently a post-doctoral researcher in the Computer Science Laboratory at Ecole Polytechnique in France. He received his Diploma in Computer Science from the Athens Univ. of Economics & Business, Greece, in 2009 and his PhD from Ecole Polytechnique, under the supervision of Prof. Michalis Vazirgiannis. In 2014 he received a "thesis prize" for his thesis entitled "Graph Mining and Community Detection with Degeneracy". He has experience in both research and industry. Recent industrial work includes predicting a player's obsession for a large French company in the gambling industry and building a prediction model for component failure for a large aeronautics company. His research interests include data/graph mining and algorithms for big data management.

Balázs Kégl received the Ph.D. degree in computer science from Concordia University, Montreal. From January to December 2000 he was a Postdoctoral Fellow at the Department of Mathematics and Statistics at Queen's University, Kingston, Canada, holding an NSERC Postdoctoral Fellowship. He was an Assistant Professor from 2001 to 2006 in the Department of Computer Science and Operations Research at the University of Montreal. Since 2006 he has been a research scientist in the Linear Accelerator Laboratory of the CNRS (DR since 2013). He has published more than a hundred papers on unsupervised and supervised learning (principal curves, intrinsic dimensionality estimation, boosting), large-scale Bayesian inference and optimization, and on various applications ranging from music and image processing to systems biology and experimental physics. In his current position he has been the head of the AppStat team, working on machine learning and statistical inference problems motivated by applications in high-energy particle and astroparticle physics. Since 2014, he has been the chair of the Center for Data Science of the University of Paris Saclay.

Faculty: E. Le Pennec (CMAP), Eric Matzner-Lober (CMAP), M. Vazirgiannis (LIX)

Short CV

Erwan Le Pennec is an Associate Professor (Professeur associé) at the Applied Math department of École Polytechnique. He does his research in statistics and signal processing at the CMAP of the same school. He completed a Signal Processing PhD with Stéphane Mallat at the Centre de Mathématiques Appliquées of École Polytechnique. The subject of his thesis is the introduction of geometry in image representation. He defended it on December 19, 2002, under the title Bandelettes et représentations géométriques des images (Bandelets and geometric representation of images). He then worked as a "post-doc" in a joint project between the CMAP and Let It Wave, a company created by Stéphane Mallat, Christophe Bernard, Jérôme Kalifa and himself to exploit their research on bandelets. From 2004 to 2010, he was a "Maître de Conférences" (Assistant Professor) at the university Paris Diderot (Paris 7) in the "Laboratoire de Probabilités et Modèles Aléatoires" (Statistics team). From 2010 to 2013, he was a "Chargé de Recherche" (Research Associate) in the SELECT project of Inria Saclay, a project in which he had already worked. He has also accompanied Let It Wave as a scientific consultant, even after it was sold to Zoran.

Eric Matzner-Lober has been professor of Statistics at Rennes 2 university since 2007; he is also affiliated with Los Alamos National Laboratory. From this year on, he is also part-time professor at Ecole Polytechnique. He is a specialist in non-parametric statistics and machine learning. He is a renowned expert in R, a language for which he runs a book series. He also founded a statistics consulting company that has been bought by a major consulting firm.

Dr. Vazirgiannis is a Professor in LIX, Ecole Polytechnique. He is currently working in the area of Data Science for Big Data, aiming at harnessing the potential of machine learning algorithms for large-scale data sets including text and graphs. More specifically, his current work is on graph degeneracy for large-scale graph mining, graph-based text retrieval, learning models from time series data, and text mining for the web (e.g. advertising, news streams). He teaches data mining and machine learning for big data at Ecole Polytechnique. He has supervised nine completed Ph.D. theses and is supervising six more. He has published chapters in books and encyclopedias, two international books and more than 120 papers in international refereed journals and conferences. He has received the ERCIM and Marie Curie EU fellowships. He has also coauthored three patents and attracted significant R&D funding, including national and international research & development projects. Currently he leads industrial projects in the area of large-scale machine learning.
4. Master Schedule (3/10/2014 - 13/12/2014)

Session  Date        Topic                               Teaching Faculty              Amphi
1        03/10/2014  Introduction to Data Science        Gaiffas, Le Pennec, Matzner   Painlevé
2        04/10/2014  Introduction to Data Science        Gaiffas, Le Pennec, Matzner   Painlevé
3        10/10/2014  Data Management                     Giatsidis, Vazirgiannis       Painlevé
4        11/10/2014  Data Management                     Giatsidis, Vazirgiannis       Painlevé
5        17/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
6        18/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
7        24/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
8        25/10/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
9        07/11/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
10       08/11/2014  Data Analysis and Machine Learning  Gaiffas, Le Pennec, Matzner   Painlevé
11       14/11/2014  Cloud Computing & Bigdata           Gaiffas, Matzner              Painlevé
12       15/11/2014  Cloud Computing & Bigdata           Gaiffas, Matzner              Painlevé
13       21/11/2014  Cloud Computing & Bigdata           Giatsidis, Vazirgiannis       Painlevé
14       22/11/2014  Cloud Computing & Bigdata           Giatsidis, Vazirgiannis       Painlevé
15       28/11/2014  Graph/Text Mining                   Vazirgiannis, Giatsidis       Painlevé
16       29/11/2014  Graph/Text Mining                   Vazirgiannis, Malliaros       Painlevé
17       05/12/2014  Bigdata Camp                        Kegl, Giatsidis               Painlevé
18       06/12/2014  Bigdata Camp                        Kegl, Giatsidis               Labs
19       12/12/2014  Bigdata Camp                        Kegl, Giatsidis               Painlevé
20       13/12/2014  Bigdata Camp                        Kegl, Giatsidis               Painlevé