CSE 427: CLOUD COMPUTING WITH BIG DATA APPLICATIONS
COURSE OVERVIEW & STRUCTURE
Fall 2015
Marion Neumann
ABOUT
Marion Neumann
  email: m dot neumann at wustl dot edu
  office: Jolley Hall 403
  office hours: THU 11:00am-1:00pm
Course website: http://sites.wustl.edu/neumann/courses/fall-2015/cse-427/
Please use Piazza (piazza.com/wustl/fall2015/cse427/home) for any questions about the course!
Sign up here: piazza.com/wustl/fall2015/cse427
8/25/15
LECTURES AND HOMEWORKS
Lectures: Tuesday & Thursday 2:30-4:00pm in Cupples II / L009
Homework assignments
  Assigned on THU (before 5pm)
  Due the following THU (before 2:30pm)
  Use the SVN repository for submissions; instructions on how to use it are on the course webpage
TA office hours
  Kunyao Liu: WED 5:00-7:00pm in Jolley 431
  Paul Scheid: TUE 9:30-11:30am in Jolley 431
IN-CLASS EXAMS
Two in-class exams, each counting for 25% of the total class performance
Dates:
  Midterm: 13 Oct 2015 or 15 Oct 2015
  Final: 16 Dec 2015
GRADING POLICY
Grading Summary
  50% homework assignments
  25% midterm
  25% final
Lecture participation is beneficial
  Black/white board notes
  Hands-on/practical examples
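As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is already expressed as a percentage between 0 and 100 and that the homework scores have been averaged into a single number; the slides do not say how individual assignments are aggregated, and the name course_score is hypothetical.

```python
# Weights taken from the grading summary: 50% homework, 25% midterm, 25% final.
WEIGHTS = {"homework": 0.50, "midterm": 0.25, "final": 0.25}


def course_score(homework, midterm, final):
    """Return the weighted course score on a 0-100 scale.

    Assumption: every argument is a percentage in [0, 100], and
    `homework` is already the average over all assignments.
    """
    scores = {"homework": homework, "midterm": midterm, "final": final}
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Example: 90% on homework, 80% on the midterm, 70% on the final.
print(course_score(homework=90.0, midterm=80.0, final=70.0))  # prints 82.5
```

Because homework carries half of the weight, a 10-point change in the homework average moves the total by 5 points, while the same change on either exam moves it by 2.5 points.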
LATE POLICY, COLLABORATION AND ACADEMIC DISHONESTY

Late Policy
Your homework assignments must be turned in on time. No late assignments will be accepted except under extraordinary circumstances. I will grant the occasional extension, but you must make your extension request at least two days before the deadline. There are absolutely no makeup quizzes or assignments for any reason.

Collaboration Policy
You are encouraged to discuss the course material with other students. Discussing the material and the general form of solutions to the labs is a key part of the class. Since many of the assignments have no single right answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignments with another student, you need to tell us explicitly on the hand-in. You are not allowed to copy answers or parts of answers from anyone else, or from material you find on the Internet. This will be considered willful cheating and will be dealt with according to the official collaboration policy. Your solutions will be compared to the solutions of other students and to solutions available ONLINE!

Academic Dishonesty
Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else's work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class. There is zero tolerance for academic dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F in the course and be referred to the School of Engineering Discipline Committee. In severe cases, this can lead to expulsion from the University, as well as possible deportation for international students.
If you copy from anyone in the class, both parties will be penalized, regardless of which direction the information flowed.
This is your only warning.
COURSE OBJECTIVE
Introduction to
  big data
  applied parallel computing
  MapReduce
  Hadoop
  big data technologies/tools
  large-scale data management and analysis
  large-scale machine learning
  large-scale network/graph analysis
  handling large feature spaces
Contents may be subject to change!
TOPICS TO BE COVERED (SYLLABUS)
PART I: Data Storage and Analysis
MapReduce
  General introduction
  Practical use of Hadoop MapReduce
  Algorithms using MapReduce
Data Analysis
  Hadoop Pig, Hive, and Impala
Data Management
  HDFS
  Hadoop tools (Crunch, Sqoop, Flume)
TOPICS TO BE COVERED (SYLLABUS)
PART II: Algorithms
Data Algorithms
  Introduction to Apache Spark
  Sorting/secondary sort
  Recommendation engines
Large-scale Machine Learning
  Clustering in MapReduce and Spark
  Classification using MapReduce and Spark
  Introduction to Apache Mahout
  Large-scale support vector machines(*)
TOPICS TO BE COVERED (SYLLABUS)
PART III: Structured and High-dimensional Data
Graph Data
  Link Analysis using PageRank
  Introduction to Apache Giraph (GraphLab(*))
  Social network analysis(*)
Information Retrieval/Finding Similar Items
  Big feature spaces
  Document retrieval
  Locality-sensitive hashing
(*) we might not have time to talk about this
BACKGROUND & PREREQUISITES
Programming
  Java*, Python**, or Perl***
  (SQL) databases & computer architecture
Algorithms
  sorting
  hashing
  CSE 241
Maths
  matrices, linear algebra
  probabilities
  graphs
  machine learning (classification, clustering, SVMs)
  (SVD, PCA)
* fully supported   ** supported   *** not supported
COURSE MATERIALS
The content of this class is derived largely from the Cloudera Developer Training for Apache Hadoop and the Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, which are made available to Washington University through the Cloudera Academic Partnership program.
Further materials are adapted from the Mining of Massive Datasets book (http://www.mmds.org/) and the class taught at Stanford by Jure Leskovec.
Books
  Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available online!)
  Hadoop: The Definitive Guide by Tom White
  Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian
SLIDE LAYOUT
Notes!
  Note: These are usually useful.
Questions?
  Question: What are your expectations of the class?
Examples
  Quick calculations or examples: Small examples, ideas/thoughts, or calculations will appear in blue boxes.
SLIDE LAYOUT (2)
Advantages, benefits, properties
Problems and challenges
  more data!
  even more data
New Section
Additional Reading
  further readings
  videos/video lectures
  I will consider the materials to be course content.
SUMMARY
All relevant information can be found on the course webpage: http://sites.wustl.edu/neumann/courses/fall-2015/cse-427/
Ask all questions on Piazza!
Question: Do you have any questions?