BUDT 758B-0501: Big Data Analytics (Fall 2015)
Decisions, Operations & Information Technologies
Robert H. Smith School of Business

Instructor: Kunpeng Zhang (kzhang@rmsmith.umd.edu)
Lecture-Discussions: Monday/Wednesday, 12:30--1:45 PM, Room: VMH 1520
Office Hour: Monday, 9:30--11:30 AM, Room: VMH 4316
Textbook: Mining of Massive Datasets
Hardcopy: Amazon.com
E-version: freely available here

About the course

As web technology and mobile use rapidly evolve, people are becoming more and more enthusiastic about interacting, sharing, and communicating with each other through different social platforms, communities, and media. In recent years, this collective intelligence has spread to many different domains, with particular focus on e-commerce, healthcare, and social networks, causing the volume of user-generated data to expand exponentially. Distilling knowledge from such a large amount of unstructured, dynamically changing data is an extremely difficult task without the help of distributed techniques. Typical data include millions of online customer reviews, social comments from Facebook, Twitter, and other popular social platforms, shopping transaction records, mobile messages, financial news, climate data, and others.

BUDT 758 (Big Data Analytics) is a graduate-level class that introduces state-of-the-art big data analytical concepts, techniques, and data management. Most current intelligent marketing decisions are based on analyzing user-generated data, such as the sentiments of comments and customer reviews, purchase transaction records, and user friendship networks. As business data takes on the 3Vs (volume, variety, and velocity), distributed techniques for analyzing and managing data have been widely and successfully deployed in many areas. In addition, knowledge of business big data analytics can make us more competitive in our future careers.

This course has some prerequisites: data mining and information retrieval techniques (optional); basic computer programming skills (Java or Python is preferred); and basic college-level math knowledge (probability/statistics/matrices). Since big data is a newly emerging topic that has been evolving quickly, we do not have a specific, fixed curriculum. The main format of this course will be lectures, class discussion, hands-on case studies, and projects. In this course, we will cover the basic concepts of the big data framework from Apache: Hadoop and MapReduce. More importantly, we will cover how to solve big data problems using the right distributed algorithms. The ultimate goal of this course is to master the basic big data analytical techniques and tools for solving business problems through hands-on experience and projects.

What this course offers:
- Installation and configuration of Hadoop in a multi-node environment.
- Basic concepts and ideas about Big Data.
- Introduction to the MapReduce framework.
- Distributed algorithms: Recommender Systems, Clustering, Classification, Topic Models, and Network Analysis.
- Distributed data management and NoSQL techniques: Apache Hive and Apache Pig.
- Hands-on experience with big data analysis to solve business problems.

What this course does NOT offer:
- This is NOT a machine learning or data mining course. We will touch on very few details of machine learning algorithms. If you want to learn the principles of learning algorithms, I recommend taking a statistical machine learning class and an optimization in machine learning class, which are usually offered by the computer science department.
- This is NOT a programming course. We assume you have basic programming skills and are familiar with how to interact with Linux/Unix systems (such as how to create folders, delete files, and execute files from the command line).

Lab sessions

This course has a lab component. The labs give you a chance to get hands-on experience with the computer and with programming. The instructor, TA, or your fellow classmates can help you work through bugs. Most labs will involve the use of some popular distributed data analysis (machine learning) algorithms. In total, we will have about 7 labs, as listed below. For most labs, you need to submit a lab report before the next lab (the due date may change depending on the difficulty of the lab).

1. How to configure and install a Hadoop environment on a multi-node cluster;
2. How to set up and use Amazon EC cloud;
3. How to write and run a basic MapReduce program using Java or Python;
4. K-means algorithm for clustering under Mahout;
5. Recommender system algorithm under Mahout;
6. Topic modeling algorithm under Mahout;
7. Social network analysis.

Assignments

We have 2 homework assignments. These assignments are mainly from the lectures.
They could cover basic MapReduce, Frequent Itemset Mining, Decision Trees, K-Means, Recommendation System Algorithms, Topic Models, Locality Sensitive Hashing, some network analysis, or data management (NoSQL). These assignments will help you understand the concepts and ideas you have learned in class.

Plagiarism Policy: Inevitably in a programming course, a few people will turn in work that is not their own. You should understand that it is usually easy to detect copying of programs -- even when a program is modified to try to disguise its source. Copying a program, or letting someone else copy your program, is a form of academic dishonesty, and the penalties can be found here.

Class project

There is a class project for each group. Each group may have at most 3 members. Two formats are acceptable: a consulting case study or a runnable system (frontend + backend). For the case study, each group will be assigned a case (mostly real data and problems from industry). For the system, you can use existing online datasets or download your own datasets from online resources such as Facebook, Twitter, Yelp, Amazon.com, Yahoo financial news, etc. Then run existing big data analysis algorithms to show some interesting results.

Grading

Your final grade for the course will be composed of the following items:

Attendance:     5% * 1 =  5%
Class project: 35% * 1 = 35%
Lab report:    10% * 4 = 40%
Assignments:   10% * 2 = 20%

Letter grades are assigned as follows (points as a percentage):

Letter Grade   Percentage
A+             97--100
A              93--96.9
A-             90--92.9
B+             87--89.9
B              83--86.9
B-             80--82.9
C+             77--79.9
C              73--76.9
C-             70--72.9
D+             67--69.9
D              63--66.9
D-             60--62.9
F              Below 60

Attendance, etc.

I assume that you understand the importance of attending class. While I do not check attendance in every lecture, I expect you to be present unless circumstances make that impossible. If you miss your project presentation without an extremely good excuse, you will receive a grade of ZERO for it. If you think you have an excuse for missing your presentation, please discuss it with me, in advance if possible. If I judge that your excuse is reasonable, I will -- depending on the circumstances -- either give you a make-up presentation or average your other grades so that the missing grade does not count against you.

Although it should not need to be said, I expect you to maintain a reasonable level of decorum in class. This means that there is usually no eating or drinking in class. Please turn off your cell phone. Please do not walk in late or walk in and out of the room during lecture.
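
As a rough illustration of how the grading weights and letter-grade cutoffs in the Grading section combine, here is a minimal Python sketch. It assumes that each component (attendance, class project, lab reports, assignments) has already been averaged to a score between 0 and 100, and that the lower bound of each percentage band acts as the cutoff. Neither assumption is stated in this syllabus, and the names WEIGHTS, CUTOFFS, final_percentage, and letter_grade are hypothetical.

```python
# Component weights from the Grading section, as percent of the final grade.
WEIGHTS = {
    "attendance": 5,
    "project": 35,
    "labs": 40,         # 4 lab reports at 10% each
    "assignments": 20,  # 2 assignments at 10% each
}

# Lower bound of each letter-grade band, highest first (assumed to be cutoffs).
CUTOFFS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def final_percentage(scores):
    """Return the weighted total for a dict of 0-100 component scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Map a final percentage to a letter grade; anything below 60 is an F."""
    for lower_bound, letter in CUTOFFS:
        if percentage >= lower_bound:
            return letter
    return "F"


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {"attendance": 100, "project": 90, "labs": 85, "assignments": 80}
    total = final_percentage(example)
    print(total, letter_grade(total))  # prints: 86.5 B
```

With these made-up scores the weighted total is (5*100 + 35*90 + 40*85 + 20*80) / 100 = 86.5, which falls in the B band.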

Disability Services

The Office of Disability Services works to ensure the accessibility of UMD programs, classes, and services to students with disabilities. Services are available for students who have documented disabilities, including vision or hearing impairments and emotional or physical disabilities. Students
with disability/access needs or questions may contact the Office of Disability Services at (301) 314-7682.

Office Hours, E-mail, WWW

I am on campus most days, and you are welcome to come in anytime you can find me there. My office hours are Monday afternoon, 4:00--6:00 PM, but your office visits are certainly not restricted to my regular office hours (appointments by email are preferred outside regular office hours). My e-mail address is kpzhang@umd.edu. E-mail is a good way to communicate with me, since I usually answer messages within a day of receiving them. The home page for this course will be up soon. This page contains a weekly guide to the course and links to the corresponding readings. We also use ELMS to post announcements, lectures, and assignments during the semester.

Tentative Schedule

Here is a tentative schedule of lectures, readings, and labs for this course. We will try to keep approximately to this schedule. We will not cover every topic in every section, but I recommend that you read the first seven chapters of the book in their entirety if you are really interested in learning Java. (Note that we may change the schedule during the semester. Chapters refer to the book Mining of Massive Datasets.)

Dates                   Topics                                        Readings
08/24 & 08/26           Introduction to Big Data                      Chapter 1. Data Mining
08/31 & 09/02 & 09/09   Configuration and Installation of Hadoop      Hadoop Cluster Setup
                                                                      Running Hadoop on Linux (Single-Node-Cluster)
                                                                      Running Hadoop on Linux (Multi-Node-Cluster)
                                                                      Examples
09/14 & 09/16           Basic Hadoop Programming: MapReduce           MapReduce Tutorial
                                                                      MapReduce: Simplified Data Processing on Large Clusters
                                                                      Chapter 2: Large-Scale File Systems and Map-Reduce
09/21 & 09/23           Frequent Itemsets and Association Rules       Chapter 6: Frequent itemsets
09/28 & 09/30           K-means and Hierarchical K-means Clustering   Chapter 7: Clustering
10/05 & 10/07           Collaborative Filtering                       Chapter 9: Recommendation systems
                                                                      Item-based Collaborative Filtering
10/12 & 10/14           Vector Similarity                             Chapter 3: Finding Similar Items
                        Locality Sensitive Hashing (LSH)              Cosine Similarity
10/19 & 10/21           Latent Dirichlet Allocation (LDA)             Latent Dirichlet Allocation
                                                                      Finding Scientific Topics
                                                                      Studying the History of Ideas Using Topic Models
10/26 & 10/28           Sentiment Identification                      Opinion Mining and Sentiment Analysis
11/02 & 11/04           Network Analysis                              Chapter 10: Analysis of Social Networks
                                                                      Community Detection in graphs
11/09 & 11/11           Amazon EMR and Spark                          Amazon Elastic MapReduce
                                                                      Spark
11/16 & 11/18           Distributed Data Management                   Apache Hbase
11/23 & 11/30           Distributed Data Management                   Apache Pig
12/02 & 12/07 & 12/09   Project presentation                          TBD