DATA SCIENCE CURRICULUM

Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks on iterative, project-centered skill acquisition. Over the course of five data science projects, they develop skills across key aspects of data science, and results from each project are added to the students' portfolios. In the last four weeks, students build out and complete their individual final projects, culminating in a presentation of their work to representatives from the Metis Hiring Network.

ONLINE PRE-WORK
Students work through a curated collection of tutorials that cover the basics so they can hit the ground running. First, they're guided through initial software setup. Introductory materials then start with productivity at the command line, using an editor effectively, and becoming familiar with Python basics. Students reinforce their statistics knowledge through a set of readings with exercises that start to blend the statistical and computational. Metis teaching assistants review these preparatory exercises and provide feedback online.

Pre-work topics: installing packages, command line, code editor, Python, statistics.

WEEK 1 UNIT ONE: Introduction to the Data Science Toolkit
Students complete an entire bite-sized data science project from start to finish. They start using Git for version control and the IPython environment with the pandas and matplotlib packages to perform exploratory statistical analyses and visualizations.

Review probability and statistics, including distributions, bootstrapping, hypothesis testing, maximum likelihood estimation, and Bayes' theorem (this review spans the first three weeks)
Use UNIX, Git, and IPython to organize data science project resources
Load and manipulate data with the pandas Python package
Visualize results using the matplotlib Python package
Communicate data science results

PROJECTS
Students use MTA turnstile data to estimate the volume of people on the street.
Students learn to scrape information from websites using tools like Python Requests, Beautiful Soup, and Selenium. After scraping some movie box office data, students find and scrape more resources on their own and present their movie industry regression to the class.
During McNulty, students perform a deep dive into the visualization package D3 and create their own APIs on the Python Flask micro framework.
Students are free to work on anything covered in class or learn something new to answer the questions they want to address.
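
The Week 1 statistics review names bootstrapping among its topics. As a rough illustration of the idea, the sketch below computes a percentile bootstrap confidence interval for a sample mean using only the Python standard library. The function name `bootstrap_mean_ci`, its parameters, and the sample counts are assumptions made for this example and do not come from the course materials.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n observations with replacement and record the resample mean.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    # Read the interval endpoints off the sorted resample means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up hourly counts, used only to exercise the function.
    counts = [112, 98, 130, 121, 105, 143, 117, 99, 126, 110]
    low, high = bootstrap_mean_ci(counts)
    print(f"sample mean: {statistics.fmean(counts):.1f}")
    print(f"95% bootstrap CI for the mean: ({low:.1f}, {high:.1f})")
```

The percentile method is the simplest bootstrap interval; other variants exist, and the choice here is only for brevity.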

WEEK 2 UNIT TWO: PART 1: Design Process and Web Scraping
In preparation for Project 2, students start to learn one of the most important tools a data scientist uses: the iterative design process. They learn tools for web scraping and start fitting simple models to data. They are also introduced to cloud computing and work on remote servers.

Use the design process to iteratively explore the possible ways that a problem can be solved
Create and work in a virtual environment on a cloud computing service
Use Python's Requests and Selenium packages to obtain data from web pages
Use Python's Beautiful Soup package to parse the content of a web page to find useful data for subsequent analysis
Use the design process to iterate the concept for the Unit 2 projects
Complete a primer on web fundamentals including HTML, CSS, and JavaScript

WEEK 3 UNIT TWO: PART 2: Regression and Communicating Results
Students go in-depth on regression using scikit-learn and matplotlib. Choosing among the analysis methods and approaches to reporting their results, students finish the second project and present their findings.

Apply regression modeling with Python packages scikit-learn and statsmodels
Load, clean, and explore data using Python packages pandas, numpy, and matplotlib/pyplot
Experience how the design process influences analysis and results
Complete second project and communicate results to each other
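
The Week 3 regression objectives can be made concrete with a small worked example. The course itself uses scikit-learn and statsmodels; the sketch below is a plain-Python stand-in that fits a one-variable ordinary least squares line from the closed-form formulas, so the arithmetic is visible. The function names `fit_line` and `r_squared` and the toy numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Hypothetical helper; a stand-in for a library regression routine.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy data lying exactly on y = 1 + 2x, chosen so the fit is easy to check.
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]
    intercept, slope = fit_line(xs, ys)
    print(intercept, slope, r_squared(xs, ys, intercept, slope))
```

With real project data, a library fit would normally replace these helpers; the closed form is shown only to connect the objective to the underlying calculation.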

WEEK 4 UNIT THREE: PART 1: Databases and Introduction to Machine Learning Concepts
Students cover relational (SQL) databases and more ways of obtaining, cleaning, and maintaining data. They are introduced to the concepts of machine learning and exposed to classification and supervised learning with a few examples such as logistic regression and KNN. They also discuss different types of feasibility related to data science questions and projects.

Use SQL databases to store and organize data
Explore supervised learning techniques including decision trees and random forests
Access stored data by querying a MySQL database with SQL
Complete a deep applied survey of classification (supervised learning) techniques, such as logistic regression, k-nearest neighbors, etc.
Design and evaluate the computational feasibility of a third data project

WEEK 5 UNIT THREE: PART 2: Machine Learning, Supervised Learning Techniques, Naive Bayes Algorithm
Students dig into more details and more algorithms for supervised learning, including SVM, decision trees, and random forests; techniques for feature selection and feature extraction; and concepts and applications for deep learning. Students choose to apply one or more of these algorithms as part of this unit's project.

Connect regression modeling to the broader family of machine learning techniques
Use supervised learning on Project 3; work in groups simulating in-house data science teams
Refine models with feature selection and feature extraction
Evaluate the efficacy and computational feasibility of various ML algorithms in different contexts

WEEK 6 UNIT THREE: PART 3: JavaScript and D3
Students visualize projects using D3, a favorite tool for flexible and attractive presentations of data and relationships. Since D3 is a JavaScript library, students learn JavaScript essentials and how to incorporate other JavaScript libraries (jQuery, Bootstrap, etc.) that make the job much easier.

Learn the fundamentals of JavaScript
Explore basic principles of good visual design and communication
Use D3 to create interactive visualizations that are functional in any browser
Create novel data visualizations with D3 to illustrate Unit 3 project results in blog post format
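
The Week 4 objectives of storing data in a SQL database and querying it can be sketched in a few lines. The curriculum names MySQL; the example below swaps in SQLite through Python's built-in `sqlite3` module so that it runs without a database server. The table name `movies`, its columns, and the rows are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with a small, made-up movies table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies ("
        "title TEXT PRIMARY KEY, genre TEXT NOT NULL, gross_musd REAL NOT NULL)"
    )
    # Parameterized inserts keep values separate from the SQL text.
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?, ?)",
        [
            ("Film A", "drama", 10.0),
            ("Film B", "drama", 30.0),
            ("Film C", "comedy", 50.0),
        ],
    )
    conn.commit()
    return conn


def average_gross_by_genre(conn):
    """Return {genre: average gross} using a GROUP BY aggregate query."""
    rows = conn.execute(
        "SELECT genre, AVG(gross_musd) FROM movies GROUP BY genre ORDER BY genre"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(average_gross_by_genre(conn))
    conn.close()
```

The same `CREATE TABLE`, `INSERT`, and `SELECT ... GROUP BY` statements carry over to MySQL with minor dialect differences; only the connection library would change.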

WEEK 7 UNIT FOUR: PART 1: APIs, Data Collection Methods, NoSQL Storage, WebApps with Flask
The project for the fourth unit involves text data. Students round out data acquisition methods with APIs and online database servers. Students also learn about NoSQL databases and start using MongoDB.

Use Python to download data from an API
Use NoSQL databases; parse and store unstructured data in MongoDB
Review database selection: non-relational (NoSQL) databases vs. relational (SQL) databases vs. no database (flat files)
Merge disparate data sets to practice data munging
Design and propose initial data collection for Unit 4 project

WEEK 8 UNIT FOUR: PART 2: Natural Language Processing (NLP)
Students analyze the text data collected in the previous week and learn about NLP algorithms. They dive deeper into unsupervised learning, covering K-means, hierarchical clustering, mixture models, and topic models. They also learn how large amounts of data are handled, discussing parallel computing and Hadoop MapReduce. Project 4 results are presented as lightning talks.

Use Python's Natural Language Toolkit (NLTK) and TextBlob library to perform natural language analyses on text data
Apply deep learning/neural networks, DBSCAN, and dimensionality reduction (with principal component analysis)
Learn algorithms including k-d trees and locality-sensitive hashing
Survey K-means, hierarchical clustering, and other unsupervised learning algorithms; applications on real data
Reflect on the strengths and weaknesses of each algorithm and its appropriate use
Outline the data science stack and design choices in data engineering, including fault-tolerant systems
Set up Hadoop environment on cloud servers
Use Hadoop via Python bindings to write customized map-reduce jobs from scratch and run in Hadoop cloud environment
Discuss Hadoop: history & ecosystem, when & why, hype & reality
Complete Project 4 and present findings to class in lightning talk format
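
The Week 8 objective of writing map-reduce jobs from scratch follows a map, shuffle, reduce pattern. The sketch below imitates that pattern for a word count in a single Python process; it does not use Hadoop or any Hadoop Python binding, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if word:
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["the cat sat", "The cat ran."]
    print(word_count(docs))
```

In a real cluster the mapper and reducer would run on different machines and the shuffle would be handled by the framework; keeping them as separate functions here mirrors that division of labor.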

WEEKS 9-12 UNIT FIVE: Final Project
Students work full time on their final projects, which they have been slowly designing through the first eight weeks. They also learn more about cloud computing, system architectures, and feasibility evaluations.

Use the design process to isolate an appropriate problem to solve
Evaluate the computational feasibility of the problem
Choose data sources that can be used to address the problem
Design and implement an appropriate computational architecture
Design and implement an appropriate set of analysis steps
Design and develop a data visualization to clearly convey the results of the analysis to a layperson
Assemble final portfolio and present project at Career Day

MORE ABOUT PROJECTS
Data science projects can be divided into useful dimensions. A dimension can be thought of as a facet along which a decision must be made to specify a project implementation. The bootcamp considers the dimensions of domain, design, data, algorithms, tools, and communication. Each unit covers certain content from several domains, which are reinforced in that unit's project.

The rigor with which we attack the topics covered in the bootcamp allows us to sleep soundly at night. We feel confident in saying that our graduates haven't simply learned about the tools that data scientists use. By the time they leave our classroom, our graduates are data scientists. They are ready to approach the problem space in their new careers and assemble the suite of tools and methods to answer insightful questions and communicate comprehensible results. They are competent, capable, and confident. And they are ready to work.