B561 Advanced Database Concepts. 0 Introduction. Qin Zhang



Self introduction: my research interests
  Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; I/O-efficient algorithms; data structures.
  Complexity: communication complexity.
I am a theoretician, and occasionally work on databases and data mining.
You may ask: why do you teach databases (and probably make our lives harder)?
Hope you will not ask me again after this course :) I am learning together with you.

What does a typical undergrad database course cover?

How to represent data?
How to represent the data in the computer?

  Title          Year  Length  Type
  Star Wars      1977  124     color
  Mighty Ducks   1991  104     color
  Wayne's World  1992  95      color

OR?

  Movie
    Sci-Fi                Cartoon                  Comedy
    Star Wars, 1977, ...  Mighty Ducks, 1991, ...  Wayne's World, 1992, ...

OR?

How to operate on data?
Given the data, say, a set of tables, how to answer queries?
Difficulty: Queries may depend crucially on the data in all tables.

Product
  PName        Price   Category     Manufacturer
  Gizmo        19.99   Gadgets      GizmoWorks
  Powergizmo   29.99   Gadgets      GizmoWorks
  SingleTouch  149.99  Photography  Canon
  MultiTouch   203.99  Household    Hitachi

Company
  CName       StockPrice  Country
  GizmoWorks  25          USA
  Canon       65          Japan
  Hitachi     15          Japan

Q: Find all products priced under 200 that are manufactured in Japan.

How to operate on data? (cont.)
Using the Product and Company tables from the previous slide:

SQL
  SELECT x.pname, x.price
  FROM Product x, Company y
  WHERE x.manufacturer = y.cname
    AND y.country = 'Japan'
    AND x.price < 200

Relational Algebra
  $\pi_{\text{PName},\,\text{Price}}\bigl(\sigma_{\text{Price} < 200 \,\wedge\, \text{Country} = \text{'Japan'}}(\text{Product} \bowtie_{\text{Manufacturer} = \text{CName}} \text{Company})\bigr)$
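To try the query on the example data, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the column types, and the function name `japanese_products_under` are assumptions made for illustration; the slides do not prescribe a particular database system.

```python
import sqlite3

PRODUCTS = [
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),
    ("MultiTouch", 203.99, "Household", "Hitachi"),
]
COMPANIES = [
    ("GizmoWorks", 25, "USA"),
    ("Canon", 65, "Japan"),
    ("Hitachi", 15, "Japan"),
]

def japanese_products_under(limit):
    """Return (PName, Price) of products cheaper than `limit` made by Japanese companies."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT)"
    )
    conn.execute("CREATE TABLE Company (CName TEXT, StockPrice INTEGER, Country TEXT)")
    conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", PRODUCTS)
    conn.executemany("INSERT INTO Company VALUES (?, ?, ?)", COMPANIES)
    # Same join and selection as the SQL on the slide, with the price bound as a parameter.
    rows = conn.execute(
        "SELECT x.PName, x.Price FROM Product x, Company y "
        "WHERE x.Manufacturer = y.CName AND y.Country = 'Japan' AND x.Price < ?",
        (limit,),
    ).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(japanese_products_under(200))
```

With a limit of 200, only SingleTouch qualifies: MultiTouch is made in Japan but costs 203.99, and the two GizmoWorks products are made in the USA.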

How to speed up the operation?
Relational operations can sometimes be computed much faster if we have precomputed a suitable data structure on the data. This is called Indexing.
Most notably, two kinds of index structures are essential to database performance:
  1. B-trees.
  2. External hash tables.
For example, hash tables may speed up relational operations that involve finding all occurrences in a relation of a particular value.
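The idea of finding all occurrences of a value through a precomputed hash structure can be sketched in a few lines. This is only an in-memory illustration: a Python dictionary stands in for an external hash table, which in a real system would live on disk and be organized in blocks. The names `build_hash_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

def build_hash_index(rows, column):
    """Precompute a map from each value of `column` to the positions of rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def lookup(rows, index, value):
    """Return all rows whose indexed column equals `value`, without scanning `rows`."""
    return [rows[pos] for pos in index.get(value, [])]

PRODUCTS = [
    {"PName": "Gizmo", "Price": 19.99, "Manufacturer": "GizmoWorks"},
    {"PName": "Powergizmo", "Price": 29.99, "Manufacturer": "GizmoWorks"},
    {"PName": "SingleTouch", "Price": 149.99, "Manufacturer": "Canon"},
    {"PName": "MultiTouch", "Price": 203.99, "Manufacturer": "Hitachi"},
]

if __name__ == "__main__":
    manufacturer_index = build_hash_index(PRODUCTS, "Manufacturer")
    for row in lookup(PRODUCTS, manufacturer_index, "GizmoWorks"):
        print(row["PName"])
```

Building the index costs one pass over the relation; afterwards each equality lookup touches only the matching rows instead of the whole table.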

How to make a good operation plan?
How to optimize the order of the operations?
R(A, B, C, D), S(E, F, G)
Find all pairs (x, y), x ∈ R, y ∈ S, such that (1) x.D = y.E, (2) x.A = 5 and (3) y.G = 9.

  $\sigma_{A=5 \,\wedge\, G=9}(R \bowtie_{D=E} S) = \sigma_{A=5}(R) \bowtie_{D=E} \sigma_{G=9}(S)$

Q: Use the LHS or RHS?
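One way to get a feel for the question is to evaluate both sides on a toy instance and compare how many intermediate tuples each plan materializes. The sketch below does this with naive nested-loop joins; the sample relations, the function names, and the tuple-count cost measure are assumptions for illustration, not the cost model used in the course.

```python
def join_then_select(R, S):
    """LHS plan: join R and S on D = E first, then apply A = 5 and G = 9."""
    joined = [(x, y) for x in R for y in S if x["D"] == y["E"]]
    result = [(x, y) for (x, y) in joined if x["A"] == 5 and y["G"] == 9]
    return result, len(joined)  # intermediate size = size of the full join

def select_then_join(R, S):
    """RHS plan: apply A = 5 to R and G = 9 to S first, then join on D = E."""
    r_sel = [x for x in R if x["A"] == 5]
    s_sel = [y for y in S if y["G"] == 9]
    result = [(x, y) for x in r_sel for y in s_sel if x["D"] == y["E"]]
    return result, len(r_sel) + len(s_sel)  # intermediate size = filtered inputs

# Hypothetical sample data, chosen only to make the difference visible.
R = [
    {"A": 5, "B": 0, "C": 0, "D": 1},
    {"A": 5, "B": 1, "C": 0, "D": 2},
    {"A": 7, "B": 2, "C": 0, "D": 1},
    {"A": 8, "B": 3, "C": 0, "D": 1},
]
S = [
    {"E": 1, "F": 0, "G": 9},
    {"E": 1, "F": 1, "G": 3},
    {"E": 2, "F": 2, "G": 4},
]

if __name__ == "__main__":
    lhs_result, lhs_size = join_then_select(R, S)
    rhs_result, rhs_size = select_then_join(R, S)
    print(lhs_result == rhs_result, lhs_size, rhs_size)
```

Both plans return the same answer, but pushing the selections below the join keeps the intermediate results smaller on this instance. Whether that holds in general depends on the data and on available indexes, which is what query optimization has to estimate.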

How to deal with transactions?
Transactions with the ideal ACID properties resolve the semantic problems that arise when many concurrent users access and change the same database.
  Atomicity (= recovery)
  Consistency
  Isolation (= concurrency control)
  Durability
We will talk about how transactions are implemented using locking and timestamp mechanisms. This knowledge is useful in database programming, e.g., it makes it possible in some cases to avoid (or reduce) rollbacks of transactions, and generally to make transactions wait less for each other.
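As a small illustration of atomicity only (not of locking or timestamps), the sketch below uses Python's built-in sqlite3 module: a transfer either applies both updates or neither. The `Account` table, the balances, and the function names are hypothetical.

```python
import sqlite3

def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    with conn:  # commit the setup so later rollbacks cannot undo it
        conn.execute("CREATE TABLE Account (name TEXT PRIMARY KEY, balance INTEGER)")
        conn.executemany(
            "INSERT INTO Account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
        )
    return conn

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM Account"))

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE Account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            (lowest,) = conn.execute("SELECT MIN(balance) FROM Account").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
        return True
    except ValueError:
        return False

if __name__ == "__main__":
    bank = make_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))
    print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The second transfer fails after both updates have already been executed, yet the balances are unchanged afterwards: the rollback undoes the partial work, which is the recovery side of atomicity.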

Summary
Database =
  Logic (express the query): Data Representation, Relational Algebra, SQL (Datalog), etc.
  Algorithm (solve the query): Indexing, Query Optimization, Concurrency Control, etc.
  System (implementation)
Logic and Algorithm: Concept (our focus). And you need math!!
System: Implementation (see B662 Database System and Internal Design)

What's more in this course?

Advanced topics
Beyond SQL, Relational Algebra, Data Models, Storage, Views and Indexing, Query Processing, Query Optimization, Transaction Recovery, Concurrency Control, I will give you a taste of
  1. Data Privacy (Data Suppression, Differential Privacy)
  2. External Memory, a.k.a. I/O-Efficient Algorithms (Sorting, List Ranking)
  3. Streaming Algorithms (Sampling, Heavy Hitters, Distinct Elements)
  4. Data Integration / Cleaning (Deduplication)
  5. MapReduce

Other important topics in databases
More topics that we probably will not cover:
  1. Tree-based data models, e.g., XML
  2. Graph-based data models, e.g., RDF
  3. Spatial databases
  4. Parallel and distributed databases (partly covered in MapReduce)
  5. Social networks
  6. Uncertainty in databases
  etc.

Tentative course plan
  Part 0: Introductions
  Part 1 & 2: Basics (SQL, Relational Algebra; Data Models, Storage, Indexing)
  Part 3: Optimization
  Part 4: Transactions
  Part 5: Data Privacy
  Part 6: I/O-Efficient Algorithms
  Part 7: Streaming Algorithms
  Part 8: Data Integration
  Part 9: MapReduce
We will also have some student presentations at the end of the course.

Resources
Main reference book (we will go beyond this):
  Database Systems: The Complete Book, by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom, 2nd Edition
Other reference books (undergrad textbooks...):
  Database Management Systems, by Ramakrishnan and Gehrke, 3rd Edition
  Database System Concepts, by Silberschatz, Korth and Sudarshan, 6th Edition

Resources (cont.)
Other reference books (cont.):
  Readings in Database Systems ("Red book"), Hellerstein and Stonebraker, eds., 4th Edition (will be one of our readings)
  Foundations of Databases: The Logical Level ("Alice book"), by Abiteboul, Hull, Vianu
  Concurrency Control and Recovery in Database Systems, by Bernstein, Hadzilacos, Goodman
    http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx

Resources (cont.)
Other reference books (cont.):
  Algorithms and Data Structures for External Memory, by Vitter
    http://www.ittc.ku.edu/~jsv/papers/vit.io_book.pdf
  Data Streams: Algorithms and Applications, by S. Muthukrishnan
    http://www.cs.rutgers.edu/~muthu/stream-1-1.ps
These are surely not enough, and sometimes dated. Do you want to learn more? Read original papers! In fact, some of my slides are taken directly from VLDB tutorials.

Instructors
Instructor: Qin Zhang
  Email: qzhangcs@indiana.edu
  Office hours: Tuesday 2:45pm-3:45pm (Lindley 215E temporarily, then Lindley 430A)
Associate Instructors: Erfan Sadeqi Azer, Le Liu, Yifan Pan, Ali Varamesh, Prasanth Velamala
  Office hours: Posted on the course website

Grading
Assignments 50%: Three written assignments (10% each). Solutions should be typeset in LaTeX (highly recommended) or Word. And one reading assignment (20%) (see the next slide for details). Selected/volunteer students will give presentations.
Exams 50%: Mid-term (20%) and Final (30%).
Each item (assignments, exams) is graded with a letter (A, B, ...). The final grade will be a weighted average (according to XX%).

Reading assignment
Individually or in a group of two, read some (1 to +∞) papers/surveys/articles and write a report (4 pages for one person, 8 pages for a group of two) on what you think of the articles you read (not just a repeat of what they have said).
Topics can be found in the Red book, http://redbook.cs.berkeley.edu/bib4.html, and more topics are on the course website. For the additional reading topics, google the papers/surveys yourself; contact an AI if you cannot find them.
Selected students/groups (volunteers first) will give 25-minute talks (20 minutes presentation + 5 minutes Q&A) in class. The best 1/3 of individuals/groups will get a bonus in their final grades. A penalty will be given if you agree to give a talk but do not deliver it in the end; the quality of the talk itself is irrelevant to the penalty.

LaTeX
LaTeX: Highly recommended tool for assignments/reports
  1. Read wiki articles: http://en.wikipedia.org/wiki/latex
  2. Find a good LaTeX editor.
  3. Learn how to use it, e.g., read "A Not So Short Introduction to LaTeX 2e" (Google it)

Prerequisite
Participants are expected to have a background in algorithms and data structures. For example, have taken
  1. C241 Discrete Structures for Computer Science
  2. C343 Data Structures
  3. B403 Introduction to Algorithm Design and Analysis
or equivalent courses, and know some basics of databases.

Frequently asked questions
Q: Is this a good course for my job hunting in industry?
A: Yes, if you get to know some advanced concepts in databases, that will certainly help. But this is a course on the theoretical foundations of databases. It is not designed to teach commercially available techniques, it is not a programming language (SQL? PHP?) course, and it is not a hands-on course (this is not professional training; this is a graduate course at a major research university and thus should be much more advanced).
Q: I haven't taken B403 Introduction to Algorithm Design and Analysis or equivalent courses. Can I take the course? Or, will this course fit me?
A: Generally speaking, this is an advanced course. It will be difficult if you do not have enough background. You can take the touch-base exam into consideration.

The goal of this course
Open / change your views of the world (of databases).
Seriously, it is not just SQL programming.
Read "The relational model is dead, SQL is dead, and I don't feel so good myself".

Big Data

Big Data
Big data is everywhere (the company logos that preceded each item are not preserved):
  : over 2.5 petabytes of sales transactions
  : an index of over 19 billion web pages
  : over 40 billion pictures
  ...
Magazine covers: Nature '06, Nature '08, CACM '08, Economist '10

Source and challenge
Source
  Retailer databases: Amazon, Walmart
  Logistics, financial & health data: stock prices
  Social networks: Facebook, Twitter
  Pictures by mobile devices: iPhone
  Internet traffic: IP addresses
  New forms of scientific data: Large Synoptic Survey Telescope
Challenge
  Volume, Velocity: the main technical challenges
  Variety (documents, stock records, personal profiles, photographs, audio & video, 3D models, location data, ...)

What does Big Data really mean?
We don't define Big Data in terms of TB, PB, EB, ...
The data is too big to fit in memory. What can we do?
  Store it on disk, and read/write a block of data each time.
  Process the items one by one as they come, and throw some of them away on the fly.
  Store it on multiple machines, which collaborate via communication.
The RAM model does not fit.
  RAM: a processor (CPU) and a memory of infinite size. Probing each cell of the memory has a unit cost.
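The second option, processing items one by one and discarding most of them, can be illustrated with reservoir sampling, a standard streaming technique. The slides do not say which sampling method the course uses, so this is only a sketch under that assumption; the function name and the `rng` parameter are hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are held in memory at any time; every other item is
    examined once and then thrown away.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                sample[j] = item  # item i replaces a kept item with probability k/(i+1)
    return sample

if __name__ == "__main__":
    # The generator is consumed lazily, so the million items never sit in memory together.
    print(reservoir_sample((x * x for x in range(1_000_000)), 5))
```

The memory footprint depends on k, not on the length of the stream, which is the property the RAM model with its "infinite memory" does not capture.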

Big Data: A marketing buzzword??
A good reading topic.

Popular models for big data (see the separate slides)

Summary for the introduction
  We have discussed topics that will be covered in this course.
  We have introduced some models for big data computation.
  We have talked about the course plan and assessment.

Thank you! Questions?
A few introductory slides are based on Rasmus Pagh's slides: http://www.itu.dk/people/pagh/adbt06/