Introduction to Databases and Data Mining




Introduction to Databases and Data Mining
Computer Science 105
Boston University
David G. Sullivan, Ph.D.

Welcome to CS 105!
This course examines how collections of data are organized, stored, and processed.
Topics include:
- databases
- programming
- data mining
- data visualization
We'll consider applications from a variety of domains:
- business, the arts, the life sciences, the social sciences

Broad Goals of the Course
- To give you computational tools for working with data
- To give you insights into databases and data mining
  - to help you understand their increasingly important role
- To expose you to the discipline of computer science
  - to how computer scientists think and solve problems

"Computer science is not so much the science of computers as it is the science of solving problems using computers."
- Eric Roberts, Stanford

Data, Data Everywhere!
- financial data
- commercial data
- scientific data
- socioeconomic data
- etc.
[Slide background: raw DNA sequence data, beginning AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAG ...]

Databases
- A database is a collection of related data.
  - example: the database behind StudentLink
  - other examples?
- Managed by some type of database management system (DBMS)
  - a piece of software (a program) that allows users to store, retrieve, and update collections of data

The Amount of Data Is Exploding!
- Example: the GenBank database of genetic sequences
[Figure: amount of data in GenBank (billions of DNA bases, axis from 0 to 140), plotted from Aug-97 to Aug-05]
  - on this graph, the data doubles every 12-14 months
  - as of May 2006, the doubling time was less than a year and getting shorter!
from: NCBI Field Guide presentation (ftp://ftp.ncbi.nih.gov/pub/fieldguide/slides/current/mtholyoke.05.10.06/)
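To get a feel for what a doubling time of 12-14 months implies, here is a minimal Python sketch. The function name `growth_factor` and the choice of an 8-year span (Aug-97 to Aug-05, the range of the graph's time axis) are illustrative assumptions, not part of the original slides.

```python
def growth_factor(months_elapsed, doubling_months):
    """Return how many times larger a quantity becomes if it doubles every doubling_months."""
    return 2 ** (months_elapsed / doubling_months)


span = 8 * 12  # Aug-97 to Aug-05 is 8 years, i.e. 96 months (assumed span)
for doubling in (12, 14):
    factor = growth_factor(span, doubling)
    print(f"doubling every {doubling} months -> roughly {factor:.0f}x growth over 8 years")
```

Even the slower 14-month doubling time yields growth of more than a hundredfold over the period, which is why the curve on the slide looks so steep.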

The Amount of Data Is Exploding!
- Example: the UN Database (data.un.org)
  from "An Analysis of Factors Relating to Energy and Environment in Predicting Life Expectancy", CS 105 Final Project by Valerie Belding '12

The Amount of Data Is Exploding!
- Example: the Google Ngrams Corpus (books.google.com/ngrams)

The Amount of Data Is Exploding!
xkcd.com/1140

Relational Databases
- Most data collections are managed by a DBMS that employs a way of organizing data known as the relational model.
  - examples: IBM DB2, Oracle, Microsoft SQL Server, Microsoft Access
- In the relational model, data is organized into tables of records.
  - each record consists of one or more fields
  - example: a table of information about students

id        name           address               class  dob
12345678  Jill Jones     Warren Towers 100     2007   3/10/85
25252525  Alan Turing    Student Village A210  2010   2/7/88
33566891  Audrey Chu     300 Main Hall         2008   10/2/86
45678900  Jose Delgado   Student Village B300  2009   7/13/88
66666666  Count Dracula  The Dungeon           2007   11/1431
...       ...            ...                   ...    ...
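As a rough illustration of "tables of records," the student table above can be built with Python's built-in sqlite3 module. SQLite is not one of the systems named on the slide. It is used here only because it ships with Python, and the table name `Student` and the column types are assumptions.

```python
import sqlite3

# A throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each column is a field; each row inserted below is one record.
conn.execute(
    "CREATE TABLE Student (id TEXT PRIMARY KEY, name TEXT, address TEXT, class INTEGER, dob TEXT)"
)

records = [
    ("12345678", "Jill Jones", "Warren Towers 100", 2007, "3/10/85"),
    ("25252525", "Alan Turing", "Student Village A210", 2010, "2/7/88"),
    ("33566891", "Audrey Chu", "300 Main Hall", 2008, "10/2/86"),
    ("45678900", "Jose Delgado", "Student Village B300", 2009, "7/13/88"),
    ("66666666", "Count Dracula", "The Dungeon", 2007, "11/1431"),
]
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?, ?)", records)

# Print every record in the table, one tuple of fields per line.
for record in conn.execute("SELECT * FROM Student"):
    print(record)
```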

SQL
- A relational DBMS has an associated query language called SQL that is used to:
  - define the tables
  - add records to a table
  - modify or delete existing records
  - retrieve data according to some criteria
    - example: get the names of all students who live in Warren Towers
    - example: get the names of all students in the class of 2017, and the number of courses they are taking
  - perform computations on the data
    - example: compute the average age of all students who live in Warren Towers

Example Database
- A relational database containing data obtained from imdb.com
- We'll use SQL to answer (or at least explore) questions like:
  - How many of the top-grossing films of all time have won one or more Oscars?
  - Does the Academy discriminate against older women?
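Two of the example queries above (students who live in Warren Towers, and their average age) might look roughly like the following sketch, run through Python's sqlite3 module. Several things here are assumptions rather than course material: the table layout, the ISO-formatted birth dates, the fixed reference date used to compute ages, and the hypothetical third student added so that the average covers more than one record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE Student (id TEXT PRIMARY KEY, name TEXT, address TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [
        ("12345678", "Jill Jones", "Warren Towers 100", "1985-03-10"),
        ("25252525", "Alan Turing", "Student Village A210", "1988-02-07"),
        ("99999999", "Pat Doe", "Warren Towers 200", "1987-03-10"),  # hypothetical record
    ],
)

# Retrieve data according to some criteria:
# the names of all students whose address starts with "Warren Towers".
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM Student WHERE address LIKE 'Warren Towers%' ORDER BY name"
    )
]
print(names)

# Perform a computation on the data:
# the average age in years of those students, as of an assumed reference date.
avg_age = conn.execute(
    """SELECT AVG((julianday('2015-03-10') - julianday(dob)) / 365.25)
       FROM Student
       WHERE address LIKE 'Warren Towers%'"""
).fetchone()[0]
print(round(avg_age, 1))
```

The class-of-2017 example is left out because counting each student's courses would need a second table of enrollments, which the slides do not show.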

Beyond Relational Databases
- While relational databases are extremely powerful, they are sometimes inadequate/insufficient for a given problem.
- Example: DNA sequence data

>gi|49175990|ref|NC_000913.2| Escherichia coli K12, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTA
AATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAA
ACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGA
CAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTC
TGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGAT
GATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAG
CCGGGGTTCCCGCTGGCGCAA

  - common queries involve looking for similarities or patterns
    - what genes in mice are similar to genes in humans?
  - need special algorithms (problem-solving procedures)
  - biologists store this data in text files and use simple computer programs to process it
  - we'll learn to write simple programs using Python

Data MINING Everywhere!
- Informally, data mining is the process of finding patterns in data.
- Example: customized recommendations (netflix.com)
- Example: detecting credit-card fraud
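The kind of simple Python program mentioned for DNA data could look something like the sketch below. It does only exact pattern matching and position-by-position comparison, which is far simpler than the special algorithms real similarity searches rely on. The function names `find_all` and `fraction_matching` and the search pattern `ATG` are illustrative assumptions.

```python
def find_all(sequence, pattern):
    """Return the starting index of every (possibly overlapping) occurrence of pattern."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume the search one character later so overlapping matches are found too.
        start = sequence.find(pattern, start + 1)
    return positions


def fraction_matching(seq_a, seq_b):
    """Return the fraction of aligned positions at which two sequences agree."""
    compared = min(len(seq_a), len(seq_b))
    if compared == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / compared


# First line of the E. coli sequence shown on the slide.
ecoli_start = (
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTA"
)

print(find_all(ecoli_start, "ATG"))        # where the pattern ATG occurs
print(fraction_matching("ACGT", "ACGA"))   # three of four positions agree
```

In practice the sequence would be read from a text file rather than typed into the program, as the slide notes.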

Data MINING Everywhere!
- Target may know that your friend is pregnant before you do!

Structure of the Course
- databases (4 weeks)
- programming in Python (4 weeks)
- data graphics/visualization (1 week)
- data mining (4 weeks)

Requirements
- Attendance at and participation in lectures and labs (10%)
  - everyone has an allowance of 3 missed classes; do not email unless there are extreme circumstances
  - labs begin next week
    - held in the CS teaching lab, EMA 304
    - complete Lab 0 sometime this week (see course website)
- Nine homework assignments (30%)
- Final project (10%): done in teams of three
  - use the techniques covered in the course to explore a dataset that interests you
- Three quizzes (20%)
- Final exam (30%)

Textbooks
- optional: Database Concepts, 7th edition, by Kroenke & Auer (Prentice Hall, 2015)
- optional: Python Programming: An Introduction to Computer Science by John Zelle (second edition, Franklin Beedle Publishing, 2010)
- required: The CS 105 Coursepack
  - contains all of the lecture notes
  - will be available at Fedex Office at the corner of Comm Ave and Cummington Mall (the Warren Towers building)

Course Staff
- Instructor: Dave Sullivan (dgs@cs.bu.edu)
- Teaching fellow: Nabeel Akhtar (nabeel@bu.edu)
- Office hours and contact info will be available on the course Web site: http://www.cs.bu.edu/courses/cs105
- For general course-related questions, email cs105-staff@cs.bu.edu, which will forward your question to the full course staff.

Algorithm for Finding My Office
1. Go to the entrance to the MCS (math/cs) building at 111 Cummington Mall behind Warren Towers. Do not enter this building!
2. Turn around and cross the street to the doors across from MCS.
3. Enter those doors and take an immediate right.
4. As you turn right, you should see the door shown on the slide. Open it and go up the stairs to the second floor.
5. As you leave the stairs, turn right and then go left into a small hallway. My office is the first door on the left (PSY 228D).

Other Details of the Syllabus
- Policies:
  - collaboration
  - lateness
    - please don't request an extension unless it's an emergency!
  - grading
- Please read the syllabus carefully and make sure that you understand the policies and follow them carefully.
- Let us know if you have any questions.