CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview




CSE 6040 Computing for Data Analytics: Methods and Tools
Lecture 1: Course Overview
Da Kuang, Polo Chau
Georgia Tech, Fall 2014
Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS

Course Staff
  Instructor: Da Kuang, Postdoctoral Researcher, CSE
    Office: Klaus 1305 (facing the kitchen door)
    Office hour: Thu 4-5pm, Klaus 1315
  Instructor: Duen Horng (Polo) Chau, Assistant Professor, CSE
    Office hour: Thu 4-5pm, Klaus 1315
  TA: Lianxiao (Shawn) Qiu, MS CS Student
    Office hour: Mon 1-2pm, Klaus 2108

MS Analytics Curriculum
  Axis: Introductory to Advanced
  Computing
    Computing for Data Analysis: Methods and Tools
    Data and Visual Analytics
    Computational Data Analysis
    High Performance Computing
  Statistics/Optimization
    Introduction to Analytical Methods
    Regression Analysis
    Deterministic Optimization
    Probabilistic Models
    Data Mining and Statistical Learning
    Simulation
    Time Series Analysis
  Business
    Introduction to Business for Analytics
    Risk Analytics
    Project Management
    Pricing Analytics and Revenue Management
    Business Process Analysis and Design
    Customer Relationship Management

Data Analytics Problems
  Regression: Predicting a numerical variable
  [Figure: y-axis is the number of new homes sold in the US; shaded areas indicate US recessions]
  [Hal Varian, Predicting the present with search engine data, 2013]

Data Analytics Problems
  Regression: Predicting a numerical variable
  Predictors: search frequencies on Google
  Target variable: number of new homes sold in the US
  [Hal Varian, Predicting the present with search engine data, 2013]
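To make the regression setup concrete, here is a minimal sketch of ordinary least squares with a single predictor, in plain Python. The numbers are made up for illustration (they are not Varian's data), and the names fit_line, search_index, and homes_sold are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of cross-deviations) / (sum of squared x-deviations).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b

# Made-up illustrative data: a search-frequency index (predictor)
# and new homes sold, in thousands (target).
search_index = [1.0, 2.0, 3.0, 4.0, 5.0]
homes_sold = [30.0, 35.0, 40.0, 45.0, 50.0]

a, b = fit_line(search_index, homes_sold)
predicted = a + b * 6.0  # prediction for a new, unseen index value
print(predicted)
```

With several predictors (for example, frequencies of several search terms), the same idea generalizes to a matrix least-squares problem.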

Data Analytics Problems
  Classification: Predicting a categorical variable (or its probability)
  Examples: query classification, news classification, statistical machine translation
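As a rough illustration of "predicting a categorical variable (or its probability)", the sketch below applies a logistic model with hand-picked weights to a toy news-classification task. The weights, features, and function names are hypothetical; a real classifier would learn the weights from labeled data.

```python
import math

def predict_probability(weights, bias, features):
    """Logistic model: estimated P(label = 1 | features)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def classify(weights, bias, features, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    probability = predict_probability(weights, bias, features)
    return 1 if probability >= threshold else 0

# Hypothetical features: counts of the words "game" and "election"
# in an article. Label 1 = sports, label 0 = politics.
weights = [1.5, -2.0]
bias = 0.0

p_sports = predict_probability(weights, bias, [2, 0])  # score = 3.0
label_a = classify(weights, bias, [2, 0])              # sports-like article
label_b = classify(weights, bias, [0, 1])              # score = -2.0
print(p_sports)
```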

Data Analytics Problems
  Clustering: Finding patterns without human labeling
  Both topic modeling and recommender systems can be viewed as clustering problems.
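A minimal sketch of clustering without labels, using k-means (one of the algorithms listed in the course outline) on one-dimensional toy data. The function name kmeans_1d, the data, and the starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data; returns (centers, assignments)."""
    centers = list(centers)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave an empty cluster's center unchanged
                centers[k] = sum(members) / float(len(members))
    return centers, assignments

# Toy data with two obvious groups; no labels are given to the algorithm.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, assignments = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
```

The result depends on the starting centers; practical implementations usually try several initializations.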

Data Analytics Pipeline
  Data storage/retrieval
  Data collection
  Data analysis
  Data visualization

Data Analytics Pipeline (with Python packages for each stage)
  Data storage/retrieval: sqlite
  Data collection: Scrapy, Selenium, BeautifulSoup
  Data analysis: numpy, pandas, scikit-learn
  Data visualization: igraph, bokeh
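For the storage/retrieval stage, a minimal sketch using sqlite3, the SQLite module in Python's standard library. The table name, columns, and rows are made up for illustration.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; pass a file
# path instead of ":memory:" to persist the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (business TEXT, stars INTEGER)")

# Storage: insert rows using parameter placeholders rather than
# building SQL strings by hand.
rows = [("Cafe A", 5), ("Cafe A", 3), ("Diner B", 4)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
conn.commit()

# Retrieval: average rating per business, in alphabetical order.
query = ("SELECT business, AVG(stars) FROM reviews "
         "GROUP BY business ORDER BY business")
averages = conn.execute(query).fetchall()
conn.close()
print(averages)
```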


What you will learn in this course
  Python programming (and a little bit of Java and Matlab): 4 lectures
    One of Google's 3 main languages
  Python packages
    Data collection: 2 lectures
    Data storage and retrieval: 1 lecture
    Data analysis
    Data visualization: 2 lectures
  Basic linear algebra (math tools, matrices, etc.): 2 lectures
  Basic numerical computing (how to do math programmatically): 4 lectures
  Several fundamental machine learning algorithms (focusing on intuitive ideas and software development for them)
    Linear regression: 2 lectures
    Logistic regression: 1 lecture
    K-means: 2 lectures
    Singular value decomposition: 4 lectures
  (More detailed topics are in the online tentative syllabus.)

Logistics
  Course website (with tentative schedule; slides and assignments will be posted here):
    http://www.cc.gatech.edu/~dkuang3/cse6040/
  Discussion, Q&A, and finding teammates on Piazza (please sign up):
    https://piazza.com/gatech/fall2014/cse6040/home
  Homework/Project submissions on T-square (only for submission; use Piazza for discussion):
    https://t-square.gatech.edu/

Logistics
  3 homework assignments (30%)
  Mid-term (20%)
  Project (40%): more details coming soon!
  Class and Piazza participation (10%)
  No late homework allowed.
  Start looking for project teammates now (2~3 people per team).

What you will do in this course
  Attend the lectures
  ACTIVELY participate in class discussion (10% of your grade)
    Based on both in-class and Piazza activities
    Chat with / help out your classmates on Piazza (but DO NOT share your answers)
  Read tutorials/references for programming languages
  Read documentation for software packages
  Solve simple math problems
    Included in the mid-term: 20% of your grade

What you will do in this course (cont'd)
  Coding, of course!
    Homework #1: Collect real data online (10%)
    Homework #2: Visualize the data you collected (10%)
    Homework #3: Implement a machine learning algorithm (10%)
    Play with different machine learning frameworks/packages
  Project (40%): Work on the Yelp Dataset
    Data for five cities (US, Canada, UK); four of them were released just this month
    Includes businesses, attributes, check-ins, tips, users, user connections, reviews
    Work in teams of 2~3 students
    Get inspired: https://github.com/yelp/dataset-examples (again, in Python!)
      (DO NOT copy these examples for your project)
    Write your own team proposal
    (Optional) Enter the challenge ($5K prize): Round 4 runs through Dec 31, 2014
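A first step with a dataset like this is usually loading the records and computing a simple summary. The sketch below assumes the files store one JSON object per line and that each business record has "city" and "stars" fields; both are assumptions to verify against the dataset's own documentation, and the sample records are invented.

```python
import json
from collections import defaultdict

# Stand-in for lines read from a dataset file. The field names
# ("name", "city", "stars") are assumptions for illustration.
sample_lines = [
    '{"name": "Cafe A", "city": "Phoenix", "stars": 4.5}',
    '{"name": "Diner B", "city": "Phoenix", "stars": 3.5}',
    '{"name": "Pub C", "city": "Edinburgh", "stars": 4.0}',
]

def average_stars_by_city(lines):
    """Parse one JSON record per line and average 'stars' within each city."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum of stars, count]
    for line in lines:
        record = json.loads(line)
        totals[record["city"]][0] += record["stars"]
        totals[record["city"]][1] += 1
    return {city: s / n for city, (s, n) in totals.items()}

averages = average_stars_by_city(sample_lines)
print(averages)
```

Because the function only iterates over lines, an open file object can be passed in place of the list, which avoids loading a large file into memory at once.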

Course Expectation
  You will never say "I don't have data."
  You will be exposed to the entire lifecycle of data analytics (in a simplified way).
  You will be able to code in Python, a common scripting language for data analytics that is used by many companies (e.g., it is one of Google's 3 main languages), and you will gain experience with many useful packages.
  You will know some of the most fundamental machine learning algorithms.
    If you already know them, you will gain a deeper understanding of them from the computational aspect.
  Hopefully, you will know how to write fast code for data analytics.

Why Python?
  One of Google's 3 main languages
  Simpler code: Focus on concepts rather than machine details
  More readable
  Many useful packages
    Data manipulation
    Machine learning
    Image processing
    Natural language processing
    Spatial analysis
    Web application
    ...
  Reasonably fast
  Easier to parallelize than C++

Python Setup
  A text editor + a terminal (command-line window)
    This is the convention for (Python) developers in companies
  Text editor suggestions:
    Windows: Notepad++ (open source, with auto-indent and auto-fill)
    Linux: Vim, Emacs, Sublime
    Mac: Sublime, TextWrangler
  We use Python 2.7, NOT the latest version 3.x
    Many packages support Python 2.x only

Go Jackets!
  Everyone: Sign up on Piazza: https://piazza.com/gatech/fall2014/cse6040/home
  Windows users: Install Python on your own machine: https://www.python.org/downloads/
    Make sure it's Python 2.7.8, NOT Python 3.x
    Make sure python can be called on the command line (you may need to set up environment variables)
    Make sure the Python27 directory is located in a root directory, NOT in Program Files
  Everyone: Set up your development environment
    See https://developers.google.com/edu/python/set-up
  Everyone: Download your own copy of the Yelp dataset (423M tarball): http://www.yelp.com/dataset_challenge
    Under the terms and conditions, we cannot share it
    Tip: Save the page that contains the Download Data button for future use
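One way to confirm which interpreter runs when python is called on the command line is to check the version from within Python. This is a minimal sketch; is_supported is a hypothetical helper, not part of the course materials.

```python
import sys

def is_supported(version_info, major=2, minor=7):
    """Return True if the version's (major, minor) pair matches the target."""
    return tuple(version_info[:2]) == (major, minor)

# Full version string of the interpreter that is actually running.
print(sys.version)

if not is_supported(sys.version_info):
    print("Warning: this course assumes Python 2.7")
```

Saving this as a script and running it from the terminal also verifies that the python command is reachable through the environment variables.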