Cleveland State University




CIS 695 Big Data Processing and Data Analytics (3-0-3)
2016 Section 51, Class Nbr. 5493. Tues, Thur TBA

Prerequisites: CIS 505 and CIS 530. CIS 612, CIS 660 Preferred.

Instructor: Dr. Sunnie S. Chung
Office Location: FH211
Phone: 216 687 4732
Email: sschung.cis@gmail.com, s.chung@csuohio.edu
Webpage: http://grail.csuohio.edu/~sschung
Office Time: Tues, Thurs 2:00-3:30 PM and 5:45-6:45 PM (or by appointment)
Class Location: TBA (Section 51, Tue & Thu TBA)

Key Concepts:
Big Data Processing and Parallel Computing: Google's MapReduce and Apache Hadoop
Hadoop/Map Reduce Based Big Data Processing Systems: Google's Bigtable, Facebook HBase, Hive, Apache Pig Latin
Key Value Store Systems: MongoDB, Cassandra
Cloud Computing Systems: Google Cloud, Amazon Elastic Cloud
Parallel Data Warehouse Based Big Data Processing Systems
Data Analytics using OLAP Cube
Text Data Analytics, Data Analytics for Big Data
Building Big Data Processing Infrastructures for Data Analytics
Data Analytics in Twitter and LinkedIn
Practicum in Data Analytics: Case Study from Progressive, Explorys by IBM, and Data Think

List of Required Materials:
Microsoft SQL Server 2012, Microsoft Visual Studio 2012 or higher
Microsoft SQL Server Business Intelligence Data Analytic Tool 2012: OLAP Server and SQL Server Data Tool
These are available through the Microsoft Academic Alliance program: http://e5.onthehub.com/webstore/productsbymajorversionlist.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897
Adventure Works 2012 Data Warehouse Database for SQL Server 2012 (will be directed in class)
Applied Analytics Using SAS Enterprise Miner
Apache Pig Latin: http://pig.apache.org/docs/r0.11.1/basic.html
WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
R and MapR
Detailed instructions on installing and setting up the Big Data processing systems will be given in class.

Text Book:
1. Will be given in class
2. List of Selected Industry R&D Papers on Data Analytics and Big Data Processing will be given in class

Supplement Text Book:
1. Data Mining with Microsoft SQL Server BI Data Tool (2008 or 2012), Jamie MacLennan, Bogdan Crivat, Publisher: Wiley; 1st Edition, ISBN-10: 0470277742, ISBN-13: 978-0470277744

Official Calendar: Please consult the page http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html
Final exam: Thur, Dec 11 4:00-6:00 PM.

Grading: The course grade is based on a student's overall performance through the entire semester. The final grade is distributed among the following components:
1. Exams (Midterm & Final): 45% (15% Midterm, 30% Final)
2. Computer Labs: 25% (about 3-4 assignments)
3. One project on Big Data processing, a 2-3 person group project: 20%
4. Research Topic Presentation: 10% (tentative)

Grade scale:
A    94%+        Outstanding (student's performance is genuinely excellent)
A-   90% - 93%
B+   87% - 89%
B    80% - 86%   Very Good (student's performance is clearly commendable but not necessarily outstanding)
B-   70% - 79%
C    <70%        Good (student's performance meets every course requirement and is acceptable; not distinguished)
D                Below Average (student's performance fails to meet course objectives and standards)
F    <60%        Failure (student's performance is unacceptable)

Examination Policy: Students are allowed to bring to the tests a summary page (standard letter size) with their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices is prohibited, and (2) students must not share any materials.

Make-Up Exam Policy: No makeup exams will be given unless notified and agreed to in advance. Requests will be considered only in case of exceptional demonstrated need.

Homework Policy: Students are expected to attend all classes. Students are responsible for collecting the notes, handouts and any other course material distributed during the class period. All assignments must be individually and independently completed and must represent the effort of the student turning in the assignment. Should two or more students turn in substantially the same solution or output, in the judgment of the instructor, the solution will be considered group effort. All involved in group effort homework will receive a zero grade for that assignment. A student turning in a group effort assignment more than once will automatically receive an F grade for the course.
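As a rough illustration of how the component weights in the Grading section combine, the following Python sketch computes an overall course percentage. The function name and the sample scores are hypothetical and are not part of the official grading procedure; the sketch stops at the percentage and does not assign a letter grade.

```python
# Component weights as listed in the Grading section (they sum to 100%).
WEIGHTS = {
    "midterm": 0.15,
    "final": 0.30,
    "labs": 0.25,
    "project": 0.20,
    "presentation": 0.10,
}


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one course percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"midterm": 80, "final": 90, "labs": 100, "project": 85, "presentation": 90}
print(round(weighted_score(example), 2))
```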
Late Assignment: All lab assignments are due at the beginning of class on the date specified. Laboratory assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a week and then not accepted at all. All laboratory assignments must be completed. Failure to do so will lower your course grade one additional letter grade.

Student Conduct: Students are expected to do their own work. Academic misconduct, student misconduct, cheating and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified in the CSU Student Conduct Code. A copy can be obtained on the web page at http://www.csuohio.edu/studentlife/studentcodeofconduct.pdf or by contacting Valerie Hinton Hannah, Judicial Affairs Officer in the Department of Student Life (MC 106, email: v.hintonhannah@csuohio.edu). For more information, consult the CSU Judicial Affairs web page at http://www.csuohio.edu/studentlife/jaffairs/faq.html

Course Schedule: The tentative schedule of topics and their order of coverage is given below. The schedule and topics to be covered may vary depending on the students' progress.

Week of / Topic / Reading

Weeks 1-2: Big Data Processing and Parallel Computing; Google's MapReduce and Apache Hadoop
Listed Papers:
  Google Map Reduce Tutorial
  Apache Hadoop File System
  MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean (Google) and Sanjay Ghemawat (Google), in the Proceedings of OSDI 2004
    http://grail.csuohio.edu/~sschung/cis695/mapreduce-osdi04.pdf
  Apache Hadoop in White Papers by Apache, Yahoo
    https://hadoop.apache.org/
Lab 1: MapReduce Programming on Hadoop

Weeks 3-5: Hadoop/Map Reduce Based Big Data Processing Systems
Google's Bigtable: Data Models for Unstructured Data Processing
Listed Papers:
  Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat (Google, Inc.), in the Proceedings of OSDI 2006
    http://grail.csuohio.edu/~sschung/cis695/googlebigtable-osdi06.pdf
Facebook HBase, Hive:
  Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010, by Ashish Thusoo (Facebook), et al.
    http://grail.csuohio.edu/~sschung/cis695/facebook_sigmodwarehouse2010_cis695.pdf
    http://grail.csuohio.edu/~sschung/cis695/hive.pdf
  Petabyte Scale Databases and Storage Systems Deployed at Facebook, in SIGMOD 2013, by Dhruba Borthakur
    http://grail.csuohio.edu/~sschung/ist734/facebook_sig2013_paper_ist734.pdf
    http://grail.csuohio.edu/~sschung/ist734/facebook_sigmod2013_presentation_ist734.pdf
    https://hbase.apache.org/
Apache Pig Latin:
  Pig Latin: A Not-So-Foreign Language for Data Processing, in SIGMOD 2008
    http://pig.apache.org/docs/r0.11.1/basic.html
    http://grail.csuohio.edu/~sschung/cis612/webdataprocessingondwyahoosig08.pdf
Key Value Stores: MongoDB, Cassandra
    https://www.mongodb.com/
    http://cassandra.apache.org/
Cloud: Google Cloud, Amazon Elastic Cloud
    http://aws.amazon.com/
    http://aws.amazon.com/what-is-cloud-computing/
    https://cloud.google.com/storage/docs/resources-support#samples
    http://en.m.wikipedia.org/wiki/google_cloud_platform
    http://www.wired.com/2014/03/urs-google-story/

Lab 2: Processing Streaming Twitter Logging Data to Build a Data Warehouse or MongoDB on Hadoop

Weeks 6-7: Parallel Data Warehouse Based Big Data Processing Systems
Data Warehouse and OLAP:
  - Decision Support Technology
  - On Line Analytical Processing
  - Star Schema
  - OLAP Aggregation Operators: Data Cube, Roll Up, Drill Down
  - Building BI Data Analytics using DW and OLAP
  - MDX, DMX
Listed Papers:
  An Overview of Data Warehousing and OLAP Technology, by Surajit Chaudhuri (Microsoft) and Umeshwar Dayal (HP Labs), in the proceedings of IEEE 1995
  Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross Tab, and SubTotals, by Jim Gray (Microsoft), et al., in the proceedings of IEEE 1996
MS PDW Optimization:
  Query Optimization in Microsoft SQL Server Parallel Data Warehouse, in SIGMOD 2012, by Srinath Shankar (Microsoft), David DeWitt (Microsoft), César Galindo-Legaria (Microsoft), et al.
    http://grail.csuohio.edu/~sschung/ist734/microsoftsqloptimizationsig2012-shankar.pdf
Parallel Data Warehouse with OLAP Query Processing: Microsoft
Extended PDW with Map Reduce and Hadoop: Oracle, Teradata
Columnar Databases: SAP HANA Databases
Extended PDW with Columnar Data Processing: Teradata, Microsoft
PDW on Cloud: Azure Cloud System by Microsoft, published in IEEE 2011
    http://grail.csuohio.edu/~sschung/ist734/azure2_microsoft_ieee2011.pdf
    http://grail.csuohio.edu/~sschung/ist734/azuresqlmicrosoft_sigmod2010-campbell
Big Data Processing System with PDW on Cloud
    http://grail.csuohio.edu/~sschung/ist734/cloudvista_huiqixu_vldb2012.pdf

Weeks 8-9: Big Data Processing Techniques for Data Analytic Tasks
  MapReduce Join Algorithms
  Pig Latin Job Execution Steps and Operators: Filter By, Group, Cogroup
  Unstructured Data Processing: Data Transformation from Unstructured/Semi-Structured Data to Object Exchange Model (JSON, XML) or Relation/CSV
  Introduction to Information Retrieval and Web Logging Data Processing
  Text Data Analytics, Data Analytics for Big Data
  Optimization Techniques for Big Data Processing
Lab 3: Unstructured Logging Data Processing to JSON Format for Object Exchange Model (OEM)
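To give a feel for the kind of transformation Lab 3 refers to, here is a minimal Python sketch that turns plain-text log lines into JSON records. The log layout (date, time, level, free-text message), the field names and the sample lines are all assumptions made for illustration; the actual lab data and required output format will be specified in class.

```python
import json
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def log_line_to_record(line):
    """Return a dict for a line that matches the layout, else None."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def logs_to_json_lines(lines):
    """Convert matching lines to JSON strings, one per record; skip the rest."""
    records = (log_line_to_record(line) for line in lines)
    return [json.dumps(r, sort_keys=True) for r in records if r is not None]


# Made-up sample input, for illustration only.
sample = [
    "2016-01-19 10:15:02 INFO user=alice action=login",
    "not a log line",
    "2016-01-19 10:15:09 ERROR timeout contacting tweet stream",
]
for row in logs_to_json_lines(sample):
    print(row)
```

Lines that do not match the assumed layout are skipped here; a real pipeline would more likely route them to a separate reject file for inspection.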

Weeks 10-11: Data Analytics in Twitter and LinkedIn
Listed Papers:
  Twitter: Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture, by Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.)
    http://grail.csuohio.edu/~sschung/ist734/fastdatatwitter.pdf
  LinkedIn: The Big Data Ecosystem at LinkedIn, by Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn)
    http://grail.csuohio.edu/~sschung/ist734/bigdataecosystemlinkedinsigmod2014.pdf
  Avatara: OLAP for Webscale Analytics Products, by Lili Wu, et al. (LinkedIn)
    http://grail.csuohio.edu/~sschung/cis612/linkedin_liliwu_vldb2012.pdf
Lab 4: Data Analytics with Streaming Twitter Messages

Weeks 12-13: Practicum in Data Analytics: Case Study
  Progressive
  Explorys by IBM
  Data Think

Week 14: Data Analytics and Security
  Text mining
  Intrusion detection in Network and System
  Database Security
  Security in Cloud

Weeks 15-16: Presentation of Significant Research Papers on Data Analytics and Big Data Processing. The list will be given in class.

NOTE: The instructor reserves the right to retain, for pedagogical reasons, either the original or a copy of your work submitted either individually or as a group project for this class. Students' names will be deleted from any retained items.
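For orientation on the MapReduce model that opens the schedule, the following single-process Python sketch walks through the map, shuffle and reduce steps using the classic word-count example. It only imitates the data flow in memory and is not Hadoop code; the function names and sample documents are illustrative assumptions, and Lab 1 itself will use the tools introduced in class.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up input records, for illustration only.
docs = ["big data processing", "big data analytics"]
print(word_count(docs))
```

On a real cluster the map and reduce functions run in parallel on different machines and the shuffle moves data across the network; the sketch keeps everything in one process so the data flow is easy to follow.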