Data Management Course Syllabus

Data Management: This course is designed to give students a broad understanding of modern storage systems, data management techniques, and how these systems are used to store, access, and analyze Big Data. Topics include data modeling; storage system design of disk arrays, network attached storage, clusters, and data centers; relational databases and the use of MADlib techniques for data analytics; NoSQL databases and their advantages; cloud data storage and the use of clouds for big data; data warehouses and data mining; and the MapReduce paradigm for data analytics and the Hadoop file system. Homework assignments will give students practical experience with important topics covered in the course, including the use of cloud storage, relational databases, NoSQL databases, and Hadoop/MapReduce.

Week 1
Topic: Introduction, data modeling
Readings:
  - J. Gray, "Evolution of data management," Computer, 29(10):38-46, 1996.
  - J. Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. J. DeWitt, and G. Heber, "Scientific data management in the coming decade," ACM SIGMOD Record, vol. 34, pp. 34-41, 2005.
  - "Jim Gray on eScience: a transformed scientific method," in The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle.

Week 2
Topic: Disk arrays, Network Attached Storage, Clusters
Readings:
  - Ch. 2 of A First Course in Database Systems, by Jeff Ullman and Jennifer Widom. http://infolab.stanford.edu/~ullman/fcdb.html
  - D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in ACM SIGMOD International Conference on Management of Data (SIGMOD '88), ACM, 1988, pp. 109-116.
  - Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence, "FAB: building distributed enterprise disk arrays from commodity components," ACM SIGOPS Operating Systems Review, vol. 38, pp. 48-58, 2004.
  - G. A. Gibson and R. Van Meter, "Network attached storage architecture," Communications of the ACM, vol. 43, pp. 37-45, 2000.
  - Dillow, Z. Zhang, and B. W. Settlemyer, "Workload characterization of a leadership class storage cluster," in Petascale Data Storage Workshop (PDSW), 2010 5th, 2010, pp. 1-5.
Homework: Homework 1

Week 3
Topic: Data Centers
Readings:
  - Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle (Google, Inc.), The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition, July 2013, 154 pages. http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024

Week 4
Topic: Cloud Storage Systems
Readings:
  - M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica, "A view of cloud computing," Communications of the ACM, vol. 53, pp. 50-58, 2010.
  - C. Wang, K. Ren, W. Lou, and J. Li, "Toward publicly auditable secure cloud data storage services," Network, IEEE, vol. 24, pp. 19-24, 2010.
  - R. Grossman and Y. Gu, "Data mining using high performance data clouds: experimental studies using Sector and Sphere," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 920-927.
Homework: Homework 1, Homework 2

Week 5
Topic: Commercial Cloud Storage Systems
Readings:
  - Amazon S3, Cloud Computing Storage for Files, Images, Videos. aws.amazon.com/s3/
  - Amazon SimpleDB - Amazon Web Services. aws.amazon.com/simpledb/
Homework: Homework 2, Homework 3

Week 6
Topic: File Systems for Massive Storage
Readings:
  - Amazon Glacier - Amazon Web Services. aws.amazon.com/glacier/
  - B. Welch, M. Unangst, Z. Abbasi, G. A. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou, "Scalable Performance of the Panasas Parallel File System," in FAST, 2008, pp. 1-17.
Homework: Homework 3 due

Week 7
Topic: Midterm 1, Relational databases and analytics
Readings:
  - S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, ACM, 2003.
  - M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 866-883, 1996.
Exams: Midterm 1

Week 8
Topic: Relational databases and analytics (cont.)
Readings:
  - Optional background reading: A First Course in Database Systems, by Jeff Ullman and Jennifer Widom.
  - J. Cohen et al., "MAD skills: new analysis practices for big data," Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492.
Homework: Homework 4

Weeks 9-10
Topics: 9. NoSQL; 10. NoSQL (cont.)
Readings:
  - J. M. Hellerstein et al., "The MADlib analytics library: or MAD skills, the SQL," Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
  - R. Cattell, "Scalable SQL and NoSQL data stores," ACM SIGMOD Record, vol. 39, pp. 12-27, 2011.
  - F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, p. 4, 2008.
  - Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35-40, 2010.
  - G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, 2007, pp. 205-220.
Homework: Homework 4, Homework 5

Week 11
Topic: Distributed
Readings:
  - M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, "Mariposa: a wide-area distributed database system," The VLDB Journal, vol. 5, pp. 48-63, 1996.
Homework: Homework 5, Homework 6

Week 12
Topic: MapReduce, Hadoop File System
Readings:
  - J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, and P. Hochschild, "Spanner: Google's globally-distributed database," in Proceedings of OSDI, 2012.
  - J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.

Weeks 13-15
Topics: 13. MapReduce, Hadoop File System (cont.); 14. Midterm 2, Data Warehouses; 15. Data Warehouses (cont.)
Readings:
  - K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-10.
  - D. Wegener, M. Mock, D. Adranale, and S. Wrobel, "Toolkit-based high-performance Data Mining of large Data on MapReduce Clusters," in Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, 2009, pp. 296-301.
  - S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, 26(1), 1997, pp. 65-74.
  - J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales, M. L. Hage, and W. E. Hammond, "Medical data mining: knowledge discovery in a clinical data warehouse," in Proceedings of the AMIA Annual Fall Symposium, 1997, p. 101.
  - A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 2010, pp. 996-1005.
Homework: Homework 6 due
Exams: Midterm 2