BUDT 758B-0501: Big Data Analytics (Fall 2015)
Decisions, Operations & Information Technologies
Robert H. Smith School of Business



Instructor: Kunpeng Zhang (kzhang@rmsmith.umd.edu)
Lecture-Discussions: Monday/Wednesday, 12:30-1:45 PM, Room VMH 1520
Office Hour: Monday, 9:30-11:30 AM, Room VMH 4316
Textbook: Mining of Massive Datasets (hardcopy: Amazon.com; e-version: freely available here)

About the course

As web technology and mobile use rapidly evolve, people are becoming more and more enthusiastic about interacting, sharing, and communicating with each other through different social platforms, communities, and media. In recent years, this collective intelligence has spread to many different domains, particularly e-commerce, healthcare, and social networks, causing the volume of user-generated data to expand exponentially. Distilling knowledge from such a large amount of unstructured, dynamically changing data is extremely difficult without the help of distributed techniques. Typical data include millions of online customer reviews, social comments from Facebook, Twitter, and other popular social platforms, shopping transaction records, mobile messages, financial news, climate data, and others.

BUDT 758 (Big Data Analytics) is a graduate-level class that introduces state-of-the-art big data analytical concepts, techniques, and data management. Most current intelligent marketing decisions are based on analyzing user-generated data, such as the sentiment of comments and customer reviews, purchase transaction records, and user friendship networks. As business data takes on the 3Vs (volume, variety, and velocity), distributed techniques for analyzing and managing data have been widely and successfully deployed in many areas. In addition, knowledge of business big data analytics can make us more competitive in our future careers.

Prerequisites: data mining and information retrieval techniques (optional); basic computer programming skills (Java or Python preferred); basic college-level math (probability, statistics, matrices).

Because big data is a newly emerging and quickly evolving topic, we do not have a specific, fixed curriculum. The main format of this course will be lectures, class discussion, hands-on case studies, and projects. We will cover the basic concepts of the big data frameworks from Apache: Hadoop and MapReduce. More importantly, we will cover how to solve big data problems using the right distributed algorithms. The ultimate goal of this course is to master the basic big data analytical techniques and tools for solving business problems through hands-on experience and projects.

What this course offers:
- Installation and configuration of Hadoop in a multi-node environment.
- Basic concepts and ideas about Big Data.
- Introduction to the MapReduce framework.
- Distributed algorithms: Recommender Systems, Clustering, Classification, Topic Models, and Network Analysis.
- Distributed data management and NoSQL techniques: Apache Hive and Apache Pig.
- Hands-on experience with big data analysis to solve business problems.

What this course does NOT offer:
- This is NOT a machine learning or data mining course. We will touch on very few details of machine learning algorithms. If you want to learn the principles of learning algorithms, I recommend taking a statistical machine learning class and an optimization in machine learning class, which are usually offered by the computer science department.
- This is NOT a programming course. We assume you have basic programming skills and are familiar with Linux/Unix systems (for example, how to create folders, delete files, and execute files from the command line).

Lab sessions

This course has a lab component. The labs give you a chance to get hands-on experience with the computer and with programming. The instructor, the TA, or your fellow classmates can help you work through bugs. Most labs will involve using popular distributed data analysis (machine learning) algorithms. In total, we will have about 7 labs, as listed below. For most labs, you need to submit a lab report before the next lab (the due date may change depending on the difficulty of the lab).

1. How to configure and install a Hadoop environment on a multi-node cluster
2. How to set up and use the Amazon EC2 cloud
3. How to write and run a basic MapReduce program using Java or Python
4. K-means algorithm for clustering under Mahout
5. Recommender system algorithm under Mahout
6. Topic modeling algorithm under Mahout
7. Social network analysis

Assignments

We have 2 homework assignments, drawn mainly from the lectures. They may cover basic MapReduce, Frequent Itemset Mining, Decision Trees, K-Means, Recommendation System Algorithms, Topic Models, Locality Sensitive Hashing, some network analysis, or data management (NoSQL). These assignments will help you understand the concepts and ideas you have learned in class.

Plagiarism Policy: Inevitably in a programming course, a few people turn in work that is not their own. You should understand that copying of programs is usually easy to detect, even when a program is modified to try to disguise its source. Copying a program, or letting someone else copy your program, is a form of academic dishonesty, and the penalties can be found here.
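
One of the labs listed above asks for a basic MapReduce program in Java or Python. As a rough illustration of the programming model only (not the lab solution, and not Hadoop code), the sketch below simulates the map, shuffle, and reduce steps for word counting in plain Python on a single machine. All function names are hypothetical; a real Hadoop job would distribute these steps across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical toy input; in Hadoop each document could live on a different node.
    docs = ["big data analytics", "big data tools", "data mining"]
    print(word_count(docs))
```

The map and reduce functions never share state, which is what lets a framework such as Hadoop run many copies of them in parallel on separate pieces of the input.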

Class project

There is a class project for each group. Each group has at most 3 members. Two formats are acceptable: a consulting case study or a runnable system (frontend + backend). For the case study, each group will be assigned a case (mostly real data and problems from industry). For the system, you can use existing online datasets or download your own datasets from online resources such as Facebook, Twitter, Yelp, Amazon.com, or Yahoo financial news, and then run existing big data analysis algorithms to show some interesting results.

Grading

Your final grade for the course will be composed of the following items:

Attendance:     5% x 1 = 5%
Class project: 35% x 1 = 35%
Lab reports:   10% x 4 = 40%
Assignments:   10% x 2 = 20%

Letter grades are assigned as follows:

Letter Grade    Percentage
A+              100-97
A               96.9-93
A-              92.9-90
B+              89.9-87
B               86.9-83
B-              82.9-80
C+              79.9-77
C               76.9-73
C-              72.9-70
D+              69.9-67
D               66.9-63
D-              62.9-60
F               Below 60

Attendance, etc.

I assume that you understand the importance of attending class. While I do not check attendance in every lecture, I expect you to be present unless circumstances make that impossible. If you miss your project presentation without an extremely good excuse, you will receive a grade of ZERO for it. If you think you have an excuse for missing your presentation, please discuss it with me, in advance if possible. If I judge that your excuse is reasonable, I will, depending on the circumstances, either give you a make-up presentation or average your other grades so that the missing grade does not count against you.

Although it should not need to be said, I expect you to maintain a reasonable level of decorum in class. This means that there is usually no eating or drinking in class. Please turn off cell phones, and avoid walking in late or walking in and out of the room during lecture.

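The grading weights and letter-grade cutoffs above can be sketched as a small calculation. This is only an illustration under stated assumptions: every component is scored on a 0-100 scale, the four graded lab reports and the two assignments are each averaged before weighting, and a score that falls between two listed ranges (for example 92.95) takes the lower band. The function names are hypothetical.

```python
# Component weights in percent, taken from the grading breakdown (they sum to 100).
WEIGHTS = {"attendance": 5, "project": 35, "labs": 40, "assignments": 20}

# Lower bound of each letter grade, highest first; anything below 60 is an F.
CUTOFFS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def final_score(attendance, project, lab_reports, assignments):
    """Weighted course score, assuming every input is on a 0-100 scale."""
    labs_avg = sum(lab_reports) / len(lab_reports)
    assignments_avg = sum(assignments) / len(assignments)
    weighted = (
        WEIGHTS["attendance"] * attendance
        + WEIGHTS["project"] * project
        + WEIGHTS["labs"] * labs_avg
        + WEIGHTS["assignments"] * assignments_avg
    )
    return weighted / 100


def letter_grade(score):
    """Return the first letter whose lower bound the score reaches."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical student: 4 lab reports and 2 assignments, as in the breakdown.
    score = final_score(100, 90, [85, 95, 80, 100], [90, 70])
    print(score, letter_grade(score))
```
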
Disability Services

The Office of Disability Services works to ensure the accessibility of UMD programs, classes, and services to students with disabilities. Services are available for students who have documented disabilities, including vision or hearing impairments and emotional or physical disabilities. Students with disability/access needs or questions may contact the Office of Disability Services at (301) 314-7682.

Office Hours, E-mail, WWW

I am on campus most days, and you are welcome to come in anytime you can find me there. My office hours are Monday afternoon, 4:00-6:00 PM, but office visits are not restricted to my regular office hours (for other times, appointments by e-mail are preferred). My e-mail address is kpzhang@umd.edu. E-mail is a good way to communicate with me, since I usually answer messages within a day of receiving them.

The home page for this course will be up soon. This page contains a weekly guide to the course and links to corresponding readings. We also use ELMS to post announcements, lectures, and assignments during the semester.

Tentative Schedule

Here is a tentative schedule of lectures, readings, and labs for this course. We will try to keep approximately to this schedule. We will not cover every topic in every section, but I recommend that you read the first seven chapters of the book in their entirety if you are really interested in learning Java. (Note that we may change the schedule during the semester. Chapters refer to the book Mining of Massive Datasets.)

Dates                  Topics                                               Readings
08/24 & 08/26          Introduction to Big Data                             Chapter 1: Data Mining
08/31 & 09/02 & 09/09  Configuration and Installation of Hadoop             Hadoop Cluster Setup; Running Hadoop on Linux (Single-Node Cluster); Running Hadoop on Linux (Multi-Node Cluster); Examples
09/14 & 09/16          Basic Hadoop Programming: MapReduce                  MapReduce Tutorial; MapReduce: Simplified Data Processing on Large Clusters; Chapter 2: Large-Scale File Systems and Map-Reduce
09/21 & 09/23          Frequent Itemsets and Association Rules              Chapter 6: Frequent itemsets
09/28 & 09/30          K-means and Hierarchical K-means Clustering          Chapter 7: Clustering
10/05 & 10/07          Collaborative Filtering                              Chapter 9: Recommendation systems; Item-based Collaborative Filtering
10/12 & 10/14          Vector Similarity; Locality Sensitive Hashing (LSH)  Chapter 3: Finding Similar Items; Cosine Similarity
10/19 & 10/21          Latent Dirichlet Allocation (LDA)                    Latent Dirichlet Allocation; Finding Scientific Topics; Studying the History of Ideas Using Topic Models
10/26 & 10/28          Sentiment Identification                             Opinion Mining and Sentiment Analysis
11/02 & 11/04          Network Analysis                                     Chapter 10: Analysis of Social Networks; Community Detection in graphs
11/09 & 11/11          Amazon EMR and Spark                                 Amazon Elastic MapReduce; Spark
11/16 & 11/18          Distributed Data Management                          Apache HBase
11/23 & 11/30          Distributed Data Management                          Apache Pig
12/02 & 12/07 & 12/09  Project presentation                                 TBD