DATA SCIENCE CURRICULUM

Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks on iterative, project-centered skill acquisition. Over the course of five data science projects, they develop skills across key aspects of data science, and results from each project are added to the students' portfolios. In the last four weeks, students build out and complete their individual final projects, culminating in a presentation of their work to representatives from the Metis Hiring Network.

ONLINE PRE-WORK

Students work through a curated collection of tutorials that cover the basics so they can hit the ground running. First, they're guided through initial software setup. Introductory materials then start with productivity at the command line, using an editor effectively, and becoming familiar with Python basics. Students reinforce their statistics knowledge through a set of readings with exercises that start to blend the statistical and the computational. Metis teaching assistants review these preparatory exercises and provide feedback online.

INSTALLING PACKAGES, COMMAND LINE, CODE EDITOR, PYTHON, STATISTICS

WEEK 1 UNIT ONE: Introduction to the Data Science Toolkit

Students complete an entire bite-sized data science project from start to finish. They start using Git for version control and the IPython environment with the pandas and matplotlib packages to perform exploratory statistical analyses and visualizations.

Review probability and statistics, including distributions, bootstrapping, hypothesis testing, maximum likelihood estimation, and Bayes' theorem (this review spans the first three weeks)
Use UNIX, Git, and IPython to organize data science project resources
Load and manipulate data with the pandas Python package
Visualize results using the matplotlib Python package
Communicate data science results

PROJECTS

PROJECT 1: In the first week, students form small groups that each use MTA turnstile data to estimate the volume of people on the street, so that (theoretical) nonprofits and companies can deploy street teams efficiently. The students are provided with the data and guided through exploratory data analysis and plotting so they can focus on new tools, brainstorming, and communication.

PROJECT 2: For the first pass at machine learning, students dive deep into prediction with regression models. They experience the beauty of flat files and learn to scrape information from websites using tools like Python Requests, Beautiful Soup, and Selenium. After scraping some movie box office data, students find and scrape more resources on their own and present their movie industry regression to the class.

PROJECT 3 (CODENAME McNULTY): Students work in small groups as an internal data science team at a fictional company in the insurance industry (details are left to the students to determine). Supervised learning algorithms and relational databases have been covered in class. Students work together on classification models that fit within the overall goals of the company and the team. During McNulty, students perform a deep dive into the visualization package D3 and create their own APIs on the Python Flask micro framework.

PROJECT 4: The last guided project focuses on unsupervised learning and NLP algorithms, NoSQL databases, and API data collection. Students work individually and have very few constraints for the design of this project.

PROJECT 5 (FINAL PROJECT): Students are free to work on anything covered in class or learn something new to answer the questions they want to address. Some students know what their passion project will be at the admissions stage. Others embark on entirely new turf. Every student works intensely and challenges him or herself to create something cool, interesting, useful, or worthwhile.
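The Week 1 statistics review lists bootstrapping among its topics. As a rough illustration only, the sketch below computes a percentile bootstrap confidence interval for a mean using the Python standard library. The function name `bootstrap_mean_ci` and the daily entry counts are made up for this example; they are not taken from the course materials or from real MTA data.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read the interval endpoints off the sorted bootstrap means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical daily entry counts for one station (illustrative numbers).
entries = [1200, 1350, 980, 1500, 1100, 1420, 1275, 1010, 1390, 1180]
low, high = bootstrap_mean_ci(entries)
print(f"Approximate 95% interval for the mean: {low:.1f} to {high:.1f}")
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the shortest to write down.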

WEEK 2 UNIT TWO: PART 1: Design Process and Web Scraping

In preparation for Project 2, students start to learn one of the most important tools a data scientist uses: the iterative design process. They learn tools for web scraping and start fitting simple models to data. They are also introduced to cloud computing and work on remote servers.

Use the design process to iteratively explore the possible ways that a problem can be solved
Create and work in a virtual environment on a cloud computing service
Use Python's Requests and Selenium packages to obtain data from web pages
Use Python's Beautiful Soup package to parse the content of a web page to find useful data for subsequent analysis
Use the design process to iterate the concept for the Unit 2 projects
Complete a primer on web fundamentals including HTML, CSS, and JavaScript

WEEK 3 UNIT TWO: PART 2: Regression and Communicating Results

Students go in-depth on regression using scikit-learn and matplotlib. Choosing among the analysis methods and approaches to reporting their results, students finish the second project and present their findings.

Apply regression modeling with the Python packages scikit-learn and statsmodels
Load, clean, and explore data using the Python packages pandas, numpy, and matplotlib/pyplot
Experience how the design process influences analysis and results
Complete the second project and communicate results to one another
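Weeks 2 and 3 pair web scraping with regression. The sketch below shows that pipeline in miniature: it parses a small HTML table and fits a straight line by ordinary least squares. To stay self-contained it swaps in the standard library's `html.parser` for Beautiful Soup and a hand-written closed-form fit for scikit-learn, and it reads an inline string rather than fetching a page with Requests. The film titles and figures are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real project would download this HTML.
HTML = """
<table>
  <tr><th>title</th><th>budget_musd</th><th>gross_musd</th></tr>
  <tr><td>Film A</td><td>10</td><td>25</td></tr>
  <tr><td>Film B</td><td>20</td><td>45</td></tr>
  <tr><td>Film C</td><td>30</td><td>65</td></tr>
  <tr><td>Film D</td><td>40</td><td>85</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._row:  # the header row has no <td> cells, so it is skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


parser = TableParser()
parser.feed(HTML)
budgets = [float(row[1]) for row in parser.rows]
grosses = [float(row[2]) for row in parser.rows]
slope, intercept = fit_line(budgets, grosses)
print(f"gross is roughly {slope:.2f} * budget + {intercept:.2f}")
```

The invented figures lie exactly on a line, so the fit recovers it; scraped data would be noisier and would usually call for several predictors.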

WEEK 4 UNIT THREE: PART 1: Databases and Introduction to Machine Learning Concepts

Students cover relational (SQL) databases and more ways of obtaining, cleaning, and maintaining data. They are introduced to the concepts of machine learning and exposed to classification and supervised learning through a few examples such as logistic regression and k-nearest neighbors (KNN). They also discuss different types of feasibility related to data science questions and projects.

Use SQL databases to store and organize data
Explore supervised learning techniques including decision trees and random forests
Access stored data by querying MySQL with SQL
Complete a deep applied survey of classification (supervised learning) techniques, such as logistic regression and k-nearest neighbors
Design and evaluate the computational feasibility of a third data project

WEEK 5 UNIT THREE: PART 2: Machine Learning, Supervised Learning Techniques, Naive Bayes Algorithm

Students dig into more details and more algorithms for supervised learning, including SVM, decision trees, and random forests; techniques for feature selection and feature extraction; and concepts and applications for deep learning. Students choose to apply one or more of these algorithms as part of this Unit's project.

Connect regression modeling to the broader family of machine learning techniques
Use supervised learning on Project 3; work in groups simulating in-house data science teams
Refine models with feature selection and feature extraction
Evaluate the efficacy and computational feasibility of various ML algorithms in different contexts

WEEK 6 UNIT THREE: PART 3: JavaScript and D3

Students visualize projects using D3, a favorite tool for flexible and attractive presentations of data and relationships. Since D3 is a JavaScript library, students learn JavaScript essentials and how to incorporate other JavaScript libraries (jQuery, Bootstrap, etc.) that make the job much easier.

Learn the fundamentals of JavaScript
Explore basic principles of good visual design and communication
Use D3 to create interactive visualizations that are functional in any browser
Create novel data visualizations with D3 to illustrate Unit 3 project results in blog post format
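Unit Three combines SQL storage with classification techniques such as k-nearest neighbors. The sketch below stores a few labeled records in a database, reads them back with a SQL query, and classifies new points by majority vote among the nearest neighbors. It uses the standard library's in-memory SQLite in place of MySQL and a hand-written KNN in place of scikit-learn. The `claims` table, its columns, and every value in it are hypothetical.

```python
import sqlite3
from collections import Counter
from math import dist

# In-memory SQLite stands in for a MySQL server in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (age REAL, prior_claims REAL, risk TEXT)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (22, 3, "high"), (25, 4, "high"), (30, 3, "high"),
        (45, 0, "low"), (50, 1, "low"), (60, 0, "low"),
    ],
)

# Pull the labeled training rows back out with a SQL query.
training = conn.execute("SELECT age, prior_claims, risk FROM claims").fetchall()


def knn_predict(train, point, k=3):
    """Label `point` by majority vote among its k nearest training rows."""
    # Features are left unscaled for brevity; real data would usually be
    # standardized first so that no single column dominates the distance.
    nearest = sorted(train, key=lambda row: dist(row[:2], point))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


pred_young = knn_predict(training, (24, 3))
pred_old = knn_predict(training, (55, 0))
print(pred_young, pred_old)
conn.close()
```

With only six rows the vote is trivially clean; the same structure applies unchanged to a larger table, although a brute-force sort over every row becomes slow as the data grows.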

WEEK 7 UNIT FOUR: PART 1: APIs, Data Collection Methods, NoSQL Storage, WebApps with Flask

The project for the fourth unit involves text data. Students round out data acquisition methods with APIs and online database servers. Students also learn about NoSQL databases and start using MongoDB.

Use Python to download data from an API
Use NoSQL databases; parse and store unstructured data in MongoDB
Review database selection: non-relational (NoSQL) databases vs. relational (SQL) databases vs. no database (flat files)
Merge disparate data sets to practice data munging
Design and propose initial data collection for the Unit 4 project

WEEK 8 UNIT FOUR: PART 2: Natural Language Processing (NLP)

Students analyze the text data collected in the previous week and learn about NLP algorithms. They dive deeper into unsupervised learning, covering K-means, hierarchical clustering, mixture models, and topic models. They also learn how large amounts of data are handled, discussing parallel computing and Hadoop MapReduce. Project 4 results are presented as lightning talks.

Use Python's Natural Language Toolkit (NLTK) and the TextBlob library to perform natural language analyses on text data
Apply deep learning/neural networks, DBSCAN, and dimensionality reduction (with principal components analysis)
Learn algorithms including KD-trees and locality-sensitive hashing
Survey K-means, hierarchical clustering, and other unsupervised learning algorithms, with applications on real data
Reflect on the strengths and weaknesses of each algorithm and its appropriate use
Outline the data science stack and the design choices involved in engineering fault-tolerant data systems
Set up a Hadoop environment on cloud servers
Use Hadoop via Python bindings to write customized MapReduce jobs from scratch and run them in a Hadoop cloud environment
Discuss Hadoop: history & ecosystem, when & why, hype & reality
Complete Project 4 and present findings to the class in lightning talk format
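Week 8 introduces writing MapReduce jobs in Python. The sketch below imitates the three phases of such a job (map, shuffle, reduce) for a word count, running in a single process with plain Python. It does not use Hadoop or any Hadoop binding, and the function names and input lines are illustrative assumptions only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    for raw in line.lower().split():
        word = raw.strip(".,!?")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


lines = ["Big data is big.", "Data science is fun!"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data between them; keeping the phases as separate functions mirrors that split.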

WEEKS 9-12 UNIT FIVE: Final Project

Students work full time on their Final Projects, which they have been slowly designing through the first eight weeks. They also learn more about cloud computing, system architectures, and feasibility evaluations.

Use the design process to isolate an appropriate problem to solve
Evaluate the computational feasibility of the problem
Choose data sources that can be used to address the problem
Design and implement an appropriate computational architecture
Design and implement an appropriate set of analysis steps
Design and develop a data visualization to clearly convey the results of the analysis to a layperson
Assemble final portfolio and present project at Career Day

MORE ABOUT PROJECTS

Data science projects can be divided into useful dimensions. A dimension can be thought of as a facet along which a decision must be made to specify a project implementation. The bootcamp considers the dimensions of domain, design, data, algorithms, tools, and communication. Each Unit covers certain content from several domains, which are reinforced in that Unit's project.

The rigor with which we attack the topics covered in the bootcamp allows us to sleep soundly at night. We feel confident in saying that our graduates haven't simply learned about the tools that data scientists use. By the time they leave our classroom, our graduates are data scientists. They are ready to approach the problem space in their new careers and assemble the suite of tools and methods to answer insightful questions and communicate comprehensible results. They are competent, capable, and confident. And they are ready to work.