Data Management Course Syllabus




Data Management: This course is designed to give students a broad understanding of modern storage systems, data management techniques, and how these systems are used to store, access, and analyze Big Data. Topics include data modeling; storage system design of disk arrays, network attached storage, clusters, and data centers; relational databases and the use of MADlib techniques for data analytics; NoSQL databases and their advantages; cloud data storage and the use of clouds for big data; data warehouses and data mining; and the MapReduce paradigm for data analytics and the Hadoop file system. Homework assignments will give students practical experience with important topics covered in the course, including the use of cloud storage, relational databases, NoSQL databases, and Hadoop/MapReduce.

Schedule (Week, Topic, Readings, Homework, Exams)

Week 1: Introduction, data modeling
Readings:
- J. Gray, "Evolution of data management," Computer, 29(10):38-46, 1996.
- J. Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. J. DeWitt, and G. Heber, "Scientific data management in the coming decade," ACM SIGMOD Record, vol. 34, pp. 34-41, 2005.
- "Jim Gray on eScience: a transformed scientific method," in The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle.

Week 2: Disk arrays, Network Attached Storage, Clusters
Readings:
- Ch. 2 of A First Course in Database Systems, by Jeff Ullman and Jennifer Widom. http://infolab.stanford.edu/~ullman/fcdb.html
- D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in ACM SIGMOD International Conference on Management of Data (SIGMOD '88), ACM, 1988, pp. 109-116.
- Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence, "FAB: building distributed enterprise disk arrays from commodity components," ACM SIGOPS Operating Systems Review, vol. 38, pp. 48-58, 2004.
- G. A. Gibson and R. Van Meter, "Network attached storage architecture," Communications of the ACM, vol. 43, pp. 37-45, 2000.
- Dillow, Z. Zhang, and B. W. Settlemyer, "Workload characterization of a leadership class storage cluster," in Petascale Data Storage Workshop (PDSW), 2010 5th, 2010, pp. 1-5.
Homework: Homework 1

Week 3: Data Centers
Readings:
- Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle (Google, Inc.), The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, second edition, July 2013, 154 pages. http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024

Week 4: Cloud Storage Systems
Readings:
- M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica, "A view of cloud computing," Communications of the ACM, vol. 53, pp. 50-58, 2010.
- C. Wang, K. Ren, W. Lou, and J. Li, "Toward publicly auditable secure cloud data storage services," IEEE Network, vol. 24, pp. 19-24, 2010.
- R. Grossman and Y. Gu, "Data mining using high performance data clouds: experimental studies using Sector and Sphere," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 920-927.
Homework: Homework 1, Homework 2

Week 5: Commercial Cloud Storage Systems
Readings:
- Amazon S3, Cloud Computing Storage for Files, Images, Videos. aws.amazon.com/s3/
- Amazon SimpleDB - Amazon Web Services. aws.amazon.com/simpledb/
Homework: Homework 2, Homework 3

Week 6: File Systems for Massive Storage
Readings:
- Amazon Glacier - Amazon Web Services. aws.amazon.com/glacier/
- B. Welch, M. Unangst, Z. Abbasi, G. A. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou, "Scalable Performance of the Panasas Parallel File System," in FAST, 2008, pp. 1-17.
Homework: Homework 3 due

Week 7: Midterm 1 and relational databases and analytics
Readings:
- S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, ACM, 2003.
- M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 866-883, 1996.
Exams: Midterm 1

Week 8: Relational databases and analytics (cont.)
Readings:
- Optional background reading: A First Course in Database Systems, by Jeff Ullman and Jennifer Widom.
- J. Cohen et al., "MAD skills: new analysis practices for big data," Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492.
Homework: Homework 4

Week 9: NoSQL
Week 10: NoSQL (cont.)
Readings (weeks 9-10):
- J. M. Hellerstein et al., "The MADlib analytics library: or MAD skills, the SQL," Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
- R. Cattell, "Scalable SQL and NoSQL data stores," ACM SIGMOD Record, vol. 39, pp. 12-27, 2011.
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, p. 4, 2008.
- Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35-40, 2010.
- G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, 2007, pp. 205-220.
Homework: Homework 4, Homework 5

Week 11: Distributed
Readings:
- M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, "Mariposa: a wide-area distributed database system," The VLDB Journal, vol. 5, pp. 48-63, 1996.
Homework: Homework 5, Homework 6

Week 12: Map Reduce, Hadoop File System
Readings:
- J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, and P. Hochschild, "Spanner: Google's globally-distributed database," in Proceedings of OSDI, 2012.
- J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.

Week 13: Map Reduce, Hadoop File System (cont.)
Week 14: Midterm 2, Data Warehouses
Week 15: Data Warehouses (cont.)
Readings (weeks 13-15):
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-10.
- D. Wegener, M. Mock, D. Adranale, and S. Wrobel, "Toolkit-based high-performance Data Mining of large Data on MapReduce Clusters," in Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, 2009, pp. 296-301.
- S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, 26(1), 1997, 65-74.
- J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales, M. L. Hage, and W. E. Hammond, "Medical data mining: knowledge discovery in a clinical data warehouse," in Proceedings of the AMIA Annual Fall Symposium, 1997, p. 101.
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 2010, pp. 996-1005.
Homework: Homework 6 due
Exams: Midterm 2