A Journey from DBMS to Data Mining

Aditya Bagchi
Short-Term Training Programme on Knowledge Discovery in Databases (DInK 10)
Indian Statistical Institute, Kolkata
January 11-15, 2010

Introduction to Database Management Systems

What is a Database Management System?
A Database Management System, popularly called a DBMS, manages a set of logically interconnected files that describe a problem domain.

A DBMS provides facilities to:
- create the database that contains the files
- port data to the database
- add to and alter the database structure
- access and manipulate data stored in the database according to the needs of the users

Additional facilities:
- recover from sudden system crashes without disturbing the content of the database
- control the access of a large group of users in a multi-user environment
- provide adequate data security, so that access to different parts of a database in different modes (Read/Write/Update etc.) can be controlled for different sets of users

Data(base) Models
A data model:
- serves to describe the structure of the database
- does not care about the actual values
- concentrates on the relationships among the data
A data model has a structure, a set of operators and a set of navigational rules. It can be designed either as an Object-based Data Model or as a Record-based Data Model.

In the Object-based Data Model, the problem domain is broken into a number of real-life object types with their associated attributes and constraints. Facilities to manipulate each type of object may also be associated with the corresponding object structure. Interconnections between different types of objects are also defined. Some of these models are not implementable but provide a better understanding of a problem domain.

Entity Relationship Modeling
Proposed by Peter Chen in 1976. Its key component is the Entity Relationship Diagram (ERD):
- entity: identifiable object or concept of significance
- relationship: association between entities
- attribute: property of an entity (or relationship)

Entity:
- mutually exclusive in all cases
- must be uniquely identifiable
- type: regular (strong) or weak
An entity set is a set of entities of the same type that share the same properties. Entity sets need not be disjoint.

An attribute is a function that maps from the entity set into a domain.
Attribute: value, domain.
Attribute types: simple, composite, single-valued, multi-valued, derived, key, null, complex.
A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity.
A candidate key of an entity set is a minimal super key, that is, a super key from which no attribute can be removed without losing uniqueness.
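The two key definitions above can be checked mechanically on a sample of entities. The following is a minimal sketch, assuming a small invented Employee sample; the function names and the data are hypothetical and not part of the slides. A sample can only refute a key, since whether a set of attributes is really a key depends on the problem domain, not on the rows that happen to be present.

```python
from itertools import combinations

# Hypothetical sample of an Employee entity set (values invented for illustration).
employees = [
    {"emp_no": 1, "emp_name": "Asha", "dob": "1980-01-05", "dept_name": "Accounts"},
    {"emp_no": 2, "emp_name": "Ravi", "dob": "1980-01-05", "dept_name": "Accounts"},
    {"emp_no": 3, "emp_name": "Asha", "dob": "1975-09-30", "dept_name": "Sales"},
]


def is_super_key(rows, attrs):
    """Return True if no two rows agree on every attribute in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, attrs):
    """Return True if attrs is a super key and no proper subset of it is one."""
    if not is_super_key(rows, attrs):
        return False
    return not any(
        is_super_key(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )
```

On this sample, `["emp_no", "emp_name"]` is a super key but not a candidate key, because `["emp_no"]` alone already identifies every row.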

Relationship:
A relationship is an association among several entities.
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from an entity set. If E1, E2, ..., En are entity sets, then a relationship set R is a subset of E1 × E2 × ... × En, that is, a set of tuples (e1, e2, ..., en) with each ei drawn from Ei. The entity sets E1, E2, ..., En participate in the relationship set R.

Degree of a relationship set: the number of entity sets that participate in the relationship set.
Mapping cardinality (or cardinality ratio): the number of entities of one entity set that can be associated with an entity of another entity set via a relationship set. Possible mapping cardinalities are 1:1, 1:N, N:1 and N:M.

Participation constraints:
If every entity of E (an entity set) participates in at least one relationship in R (a relationship set), the participation of E in R is total. If only some entities of E participate in R, the participation of E in R is partial.
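The definitions of relationship set, mapping cardinality and participation can be illustrated on a tiny binary relationship. This is a sketch under stated assumptions: the entity sets, the `posting` relationship instance and all function names are invented for illustration, and the functions classify one particular instance rather than the constraint declared in a design.

```python
from itertools import product

# Hypothetical entity sets and one instance of a binary relationship set.
employees = {"e1", "e2", "e3"}
departments = {"Accounts", "Sales", "Research"}
posting = {("e1", "Accounts"), ("e2", "Accounts"), ("e3", "Sales")}


def is_relationship_set(rel, *entity_sets):
    """A relationship set must be a subset of the Cartesian product of its entity sets."""
    return rel <= set(product(*entity_sets))


def participation(rel, entity_set, position):
    """Return 'total' if every entity appears in rel at the given position, else 'partial'."""
    participants = {tup[position] for tup in rel}
    return "total" if participants == entity_set else "partial"


def cardinality(rel):
    """Classify a binary relationship instance as 1:1, 1:N, N:1 or N:M."""
    left_partners, right_partners = {}, {}
    for left, right in rel:
        left_partners.setdefault(left, set()).add(right)
        right_partners.setdefault(right, set()).add(left)
    many_left = any(len(s) > 1 for s in right_partners.values())
    many_right = any(len(s) > 1 for s in left_partners.values())
    if many_left and many_right:
        return "N:M"
    if many_left:
        return "N:1"
    if many_right:
        return "1:N"
    return "1:1"
```

Here every employee is posted somewhere (total participation), the Research department has no employee (partial participation), and several employees share one department (N:1 from Employee to Department).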

[Figure: ER diagram notation. Symbols shown: (strong) entity, weak entity, identifying relationship, recursive relationship with roles (role1, role2), and the attribute types single/simple, composite, multi-valued, derived, (strong) key and weak key.]

Problem
An IT company is involved in the design of software products. It has many departments at various locations. Each employee of the company is posted to one department only. The following information is maintained about each employee: name, address, date of birth, date of joining, designation and monthly salary. Departments are identified by a unique name. The company gets projects from various organizations, whose names and addresses are stored.

Each project is identified by a unique project number and a unique name. In addition, the budget, starting date and expected date of completion of each project are maintained. The company also maintains information on the number of projects in which each employee is involved. Each employee may be associated with one or more projects. An employee associated with a project has a duration of service in that project and a responsibility, either as a member or as the leader. A project will have only one project leader.

[Figure: ER diagram for the problem. Entities: Employee (e-no, e-name, e-address, dob, doj, desig, sal, no of p), Department (d-name, location), Project (p-name, budget, starting-dt, completion-dt), Organization (o-name, address). Relationships: posting, involved in (with attribute duration), has leader, give.]

A Record-based Model describes the record types associated with a problem domain. The most popular record-based model is the Relational Model, where each record type is modeled as a 2-dimensional table. Specific mapping rules are available to convert an Object-based design to a Record-based design:
- mapping of entity / weak entity types
- mapping of relationship types (binary, n-ary)
- mapping of multivalued attributes
The relational model does not allow set or tuple type attributes.

Mapping of strong entity types.
For each (strong) entity type E in the ER schema, create a relation R that includes all the simple attributes of E. Choose one of the key attributes of E as the primary key for R. If the chosen key of E is composite, the set of simple attributes that form it will together form the primary key of R.

Mapping of weak entity types.
For each weak entity type W in the ER schema with owner entity type E, create a relation R and include all simple attributes (or simple components of composite attributes) of W as attributes of R. In addition, include as foreign key attributes of R the primary key attribute(s) of the relation(s) that correspond to the owner entity type(s). The primary key of R is the combination of the primary key(s) of the owner(s) and the partial key of the weak entity type W, if any.

Mapping Binary 1:1 Relationship Types. For a binary 1:1 relationship type R between the entity types represented by relations S and T, there are three alternatives:
i. Choose one of the relations, say S, and include as a foreign key in S the primary key of T. It is better to choose an entity type with total participation in R in the role of S.
ii. Merge the two entity types and the relationship into a single relation.
iii. Set up a third relation R for the purpose of cross-referencing the primary keys of the two relations S and T representing the entity types.

Mapping Binary 1:N Relationship Types.
i. For each regular binary 1:N relationship type R, identify the relation S that represents the participating entity type at the N-side of the relationship type.
ii. Include as a foreign key in S the primary key of the relation T that represents the other entity type participating in R.
iii. Include any simple attributes of the 1:N relationship type as attributes of S.

Mapping Binary M:N Relationship Types.
i. For each regular binary M:N relationship type R, create a new relation S to represent R.
ii. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types; their combination will form the primary key of S.
iii. Also include any simple attributes of the M:N relationship type (or simple components of composite attributes) as attributes of S.

Mapping Multivalued Attributes.
i. For each multivalued attribute A, create a new relation R. This relation R will include an attribute corresponding to A, plus the primary key attribute K (as a foreign key in R) of the relation that represents the entity type or relationship type that has A as an attribute.
ii. The primary key of R is the combination of A and K. If the multivalued attribute is composite, we include its simple components.

Mapping N-ary Relationship Types.
i. For each n-ary relationship type R, where n > 2, create a new relation S to represent R.
ii. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types.
iii. Also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S.

ER to Relational Mapping:
Employee (emp-no, emp-name, emp-address, dob, doj, desig, sal, dept-name)
Department (dept-name, location)
Project (proj-name, proj-budget, starting-dt, proj-duration, dept-name, o-name)
Involvement (emp-no, proj-name, duration, responsibility)
Organization (o-name, address)
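The relational schema above can be written down as table definitions. The following is a minimal sketch using SQLite through Python's standard `sqlite3` module. Several things are assumptions rather than part of the slides: hyphens in names are replaced by underscores, the column types and the choice of primary keys are guesses from the problem statement, and the partial unique index is just one possible way to express "a project will have only one project leader".

```python
import sqlite3

# Column types, key choices and the leader index are illustrative assumptions.
SCHEMA = """
CREATE TABLE Department (dept_name TEXT PRIMARY KEY, location TEXT);
CREATE TABLE Organization (o_name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE Employee (
    emp_no INTEGER PRIMARY KEY,
    emp_name TEXT, emp_address TEXT, dob TEXT, doj TEXT, desig TEXT, sal REAL,
    dept_name TEXT REFERENCES Department(dept_name)   -- 1:N posting, key on the N-side
);
CREATE TABLE Project (
    proj_name TEXT PRIMARY KEY,
    proj_budget REAL, starting_dt TEXT, proj_duration INTEGER,
    dept_name TEXT REFERENCES Department(dept_name),
    o_name TEXT REFERENCES Organization(o_name)
);
CREATE TABLE Involvement (                            -- M:N relationship as its own relation
    emp_no INTEGER REFERENCES Employee(emp_no),
    proj_name TEXT REFERENCES Project(proj_name),
    duration INTEGER,
    responsibility TEXT CHECK (responsibility IN ('member', 'leader')),
    PRIMARY KEY (emp_no, proj_name)
);
-- At most one row per project may carry the 'leader' responsibility.
CREATE UNIQUE INDEX one_leader_per_project
    ON Involvement(proj_name) WHERE responsibility = 'leader';
"""


def create_database():
    """Create an in-memory database with foreign key checking switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Employee and Project carry foreign keys because they sit on the N-side of 1:N relationships, while Involvement follows the M:N rule: a separate relation whose primary key combines the keys of the participating entity types.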

Operators:
- Unary operators (applicable to a single relation)
- Binary operators (manipulate two relations)

Unary Operators:
Selection (σ): selects one or more rows (tuples) of a relation.
    σ_{θ}(R), where θ is the set of conditions or predicates for selection.
Find all employees working in the Accounts department and having a salary greater than Rs. 10000/-:
    σ_{sal > 10000 ∧ dept-name = 'Accounts'}(Employee)

Projection (π): selects one or more attributes of a relation.
    π_{C}(R), where C is the set of attributes selected.
List the name and address of all the employees:
    π_{name, address}(Employee)
Combination: list the name and address of all the employees working in the Accounts department and having a salary greater than Rs. 10000/-:
    π_{name, address}(σ_{sal > 10000 ∧ dept-name = 'Accounts'}(Employee))
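Selection and projection correspond directly to the WHERE clause and the column list of an SQL query. The sketch below runs the three expressions above against SQLite via Python's `sqlite3` module. The sample rows are invented, and the column names `emp_name`, `emp_address` and `dept_name` are assumed spellings of the attributes the slides call name, address and dept-name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (emp_no INTEGER PRIMARY KEY, emp_name TEXT, "
    "emp_address TEXT, sal REAL, dept_name TEXT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Asha", "Kolkata", 12000, "Accounts"),
        (2, "Ravi", "Howrah", 9000, "Accounts"),
        (3, "Meera", "Mumbai", 15000, "Sales"),
    ],
)

# Selection: keep the rows that satisfy the predicate.
selection = conn.execute(
    "SELECT * FROM Employee WHERE sal > 10000 AND dept_name = 'Accounts'"
).fetchall()

# Projection: keep the chosen columns. DISTINCT mirrors the set semantics of
# relational algebra, where duplicate tuples are removed.
projection = conn.execute(
    "SELECT DISTINCT emp_name, emp_address FROM Employee"
).fetchall()

# Combination: select first, then project.
combined = conn.execute(
    "SELECT DISTINCT emp_name, emp_address FROM Employee "
    "WHERE sal > 10000 AND dept_name = 'Accounts'"
).fetchall()
```

Only employee 1 passes both conditions in this sample, so `combined` holds a single (name, address) pair.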

Binary Operators:
Natural Join (⋈): joins two relations by equating the values of their common attributes.
Find the name and address of the employees working in the Accounts department and placed in Mumbai:
    π_{name, address}(σ_{location = 'Mumbai' ∧ dept-name = 'Accounts'}(Department ⋈ Employee))
Query Language (SQL):
    SELECT name, address
    FROM Department, Employee
    WHERE Department.location = 'Mumbai'
      AND Employee."dept-name" = 'Accounts'
      AND Employee."dept-name" = Department."dept-name"

Multiple Joins:
Find the name and address of the employees working in the Accounts department, placed in Mumbai and associated with the project DST 55/10:
    π_{name, address}(σ_{location = 'Mumbai' ∧ dept-name = 'Accounts' ∧ proj-name = 'DST 55/10'}(Department ⋈ Employee ⋈ Project))
Query Language (SQL):
    SELECT name, address
    FROM Department, Employee, Project
    WHERE Department.location = 'Mumbai'
      AND Employee."dept-name" = 'Accounts'
      AND Project."proj-name" = 'DST 55/10'
      AND Employee."dept-name" = Department."dept-name"
      AND Employee."dept-name" = Project."dept-name"
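The join queries above can be tried out on a small database. This is a sketch with invented sample rows, using SQLite through Python's `sqlite3` module; underscores replace the hyphens in attribute names, and only the columns the queries need are created. It shows that SQL's NATURAL JOIN and the explicit equality condition in the WHERE clause give the same answer, and that a second join simply adds one more relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (dept_name TEXT PRIMARY KEY, location TEXT);
CREATE TABLE Employee (emp_no INTEGER PRIMARY KEY, emp_name TEXT,
                       emp_address TEXT, dept_name TEXT);
CREATE TABLE Project (proj_name TEXT PRIMARY KEY, dept_name TEXT);
-- Hypothetical sample data.
INSERT INTO Department VALUES ('Accounts', 'Mumbai'), ('Research', 'Kolkata');
INSERT INTO Employee VALUES
    (1, 'Asha', 'Andheri', 'Accounts'),
    (2, 'Ravi', 'Salt Lake', 'Research'),
    (3, 'Meera', 'Dadar', 'Accounts');
INSERT INTO Project VALUES ('DST 55/10', 'Accounts'), ('CSIR 12/09', 'Research');
""")

# Natural join: rows are matched on the common attribute dept_name.
natural = conn.execute(
    "SELECT emp_name, emp_address FROM Department NATURAL JOIN Employee "
    "WHERE location = 'Mumbai' AND dept_name = 'Accounts' ORDER BY emp_no"
).fetchall()

# The same join written with an explicit equality condition, as on the slide.
explicit = conn.execute(
    "SELECT emp_name, emp_address FROM Department, Employee "
    "WHERE Department.location = 'Mumbai' AND Employee.dept_name = 'Accounts' "
    "AND Employee.dept_name = Department.dept_name ORDER BY emp_no"
).fetchall()

# Multiple joins: Project is matched on dept_name as well.
three_way = conn.execute(
    "SELECT emp_name, emp_address "
    "FROM Department NATURAL JOIN Employee NATURAL JOIN Project "
    "WHERE location = 'Mumbai' AND dept_name = 'Accounts' "
    "AND proj_name = 'DST 55/10' ORDER BY emp_no"
).fetchall()
```

Note that, as in the slide's query, employees are linked to a project through their department. Linking individual employees to projects would instead go through the Involvement relation of the mapped schema.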

Set Operators:
- the two relations must have the same arity (same number of attributes)
- attributes in the corresponding positions must be of the same domain
Example (Banking Environment):
Deposit (b_name, c_name, ac_no, balance)
Borrow (b_name, c_name, ln_no, amount)
where b_name = branch name, c_name = customer name, ac_no = account number, ln_no = loan number.

List the names of customers who are depositors as well as borrowers in the ISI branch:
    π_{c_name}(σ_{b_name = 'ISI'}(Deposit)) ∩ π_{c_name}(σ_{b_name = 'ISI'}(Borrow))
(??)
    π_{c_name}(σ_{b_name = 'ISI'}(Deposit ⋈ Borrow))
Query Language (SQL):
    SELECT c_name FROM Deposit WHERE b_name = 'ISI'
    INTERSECT
    SELECT c_name FROM Borrow WHERE b_name = 'ISI'
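The question mark on the slide asks how the intersection form relates to the natural-join form. The sketch below evaluates both on invented sample data with SQLite via Python's `sqlite3` module. Because Deposit and Borrow share both b_name and c_name, the natural join pairs a deposit with a loan only when branch and customer both match, so on this data the two forms return the same customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Deposit (b_name TEXT, c_name TEXT, ac_no INTEGER, balance REAL);
CREATE TABLE Borrow (b_name TEXT, c_name TEXT, ln_no INTEGER, amount REAL);
-- Hypothetical sample data.
INSERT INTO Deposit VALUES
    ('ISI', 'Amit', 1, 500), ('ISI', 'Bina', 2, 700), ('Park Street', 'Chitra', 3, 100);
INSERT INTO Borrow VALUES
    ('ISI', 'Amit', 10, 1000), ('Park Street', 'Bina', 11, 2000), ('ISI', 'Chitra', 12, 300);
""")

# Intersection of the two projected selections.
by_intersection = conn.execute(
    "SELECT c_name FROM Deposit WHERE b_name = 'ISI' "
    "INTERSECT "
    "SELECT c_name FROM Borrow WHERE b_name = 'ISI'"
).fetchall()

# Natural join on the common attributes (b_name, c_name), then select and project.
by_join = conn.execute(
    "SELECT DISTINCT c_name FROM Deposit NATURAL JOIN Borrow WHERE b_name = 'ISI'"
).fetchall()
```

Bina deposits at ISI but borrows elsewhere, and Chitra borrows at ISI but deposits elsewhere, so neither appears in either result.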

A First Visit to the World of Data Mining

Data Mining is a method of finding interesting trends or patterns in large datasets. Discovered patterns help and guide the appropriate authority in taking future decisions, so Data Mining is regarded as a tool for Decision Support. Data Mining tools are expected to involve minimal user intervention. Since the data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.

Data Mining Communities
1. Statistics: provides the background for the algorithms.
2. Artificial Intelligence: provides the required heuristics for machine learning/conceptual clustering.
3. Data Management: provides the platform for storage and retrieval of raw and summary data.

A Data Mining Effort Involves:
- data collection
- data preprocessing and feature extraction
- discovery of patterns
- visualization of data
- evaluation of results

Initial Activities
1. Data Cleaning: data may be incomplete, noisy and inconsistent. Cleaning identifies outliers, fills in missing values and corrects inconsistencies.
2. Data Integration & Transformation: data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files. Data need to be transformed or consolidated into forms suitable for mining, e.g. attribute values converted from absolute values to ranges.
3. Data Reduction: since both the data volume and the attribute set may be too large, data reduction becomes necessary, e.g. removal of irrelevant and redundant attributes, generation of summary data etc.
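Two of the steps above, filling in missing values and converting absolute values to ranges, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the median is just one of several reasonable fill strategies, and the income cut-offs of 20K and 50K are assumptions chosen for the example.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the known values (one possible strategy)."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def income_range(income):
    """Map an absolute income to a coarse range label (cut-offs are assumptions)."""
    if income <= 20000:
        return "<=20K"
    if income <= 50000:
        return "21-50K"
    return ">50K"


# Hypothetical raw incomes with one missing value.
raw_incomes = [15000, None, 35000, 80000]
cleaned = fill_missing(raw_incomes)
ranges = [income_range(v) for v in cleaned]
```

Converting to ranges also reduces the number of distinct values a mining algorithm has to consider, which ties the transformation step to data reduction.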

Mining Activities
1. Rule Discovery: discovery of association rules from different features involved in a problem domain.
2. Data Clustering: grouping based on conceptual clustering, maximizing the intra-cluster similarity and minimizing the inter-cluster similarity.
3. Data Classification: grouping of data and placement of such data groups in a taxonomy.
4. Searching of Sequential Patterns: discovery of patterns involved in a temporal sequence.

Knowledge Discovery from Databases
- Discovery of patterns among attributes of a relation for possible classification of data.
- Discovery of patterns among attributes of multiple relations.
- Discovery of patterns from temporal variation of data (discovery of patterns from a Data Warehouse).

CAEP (Classification by Aggregating Emerging Patterns) uses the method of support computation to find Emerging Patterns (EPs). Let there be two classes, C1 (buys_car = yes) and C2 (buys_car = no). The itemset (age 25, income 20K) is a typical EP if its support increases from, say, 0.2% in C1 to 57.6% in C2, a growth rate of 57.6/0.2 = 288. Usually an equality test is done for a categorical attribute, while membership in a range or interval is checked for a numerical attribute. An EP is a multi-attribute test whose differentiating power is checked for class membership. The differentiating power of an EP is derived from its growth rate and its support in the target class.
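The support and growth-rate computation behind an emerging pattern can be sketched as follows. This is not the CAEP algorithm itself, only the two quantities it builds on. The function names, the sample records and the comparison operators in the pattern are assumptions; the slide text does not show which operators its example uses.

```python
def support(pattern, records):
    """Fraction of records that satisfy every test in the pattern."""
    if not records:
        return 0.0
    hits = sum(1 for record in records if all(test(record) for test in pattern))
    return hits / len(records)


def growth_rate(support_from, support_to):
    """Ratio of the support in the target class to the support in the other class."""
    if support_from == 0:
        return float("inf") if support_to > 0 else 0.0
    return support_to / support_from


# Hypothetical multi-attribute test; the operators are assumed.
pattern = [
    lambda r: r["age"] <= 25,
    lambda r: r["income"] <= 20000,
]

# Hypothetical records for the two classes.
c1 = [  # buys_car = yes
    {"age": 24, "income": 18000},
    {"age": 40, "income": 60000},
    {"age": 35, "income": 45000},
    {"age": 52, "income": 90000},
]
c2 = [  # buys_car = no
    {"age": 22, "income": 15000},
    {"age": 25, "income": 20000},
    {"age": 23, "income": 12000},
    {"age": 45, "income": 30000},
]

rate = growth_rate(support(pattern, c1), support(pattern, c2))
```

On this sample the pattern holds for one of four records in C1 and three of four in C2, a growth rate of 3. With the slide's figures the same function gives 57.6/0.2, about 288.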

[Figure: Decision tree on the concept buys_new_car. Tests: Income (≤ 20K, 21-50 K, > 50 K), Marital Status (Married, Single) and Age (≤ 40, > 40); leaves are labelled Yes or No.]

Discovery of Patterns from Multiple Relations
The usual approach joins all relations to generate a large Universal Relation. This:
- creates unnecessary repetition of data
- brings in too many attributes
- needs a massive data cleaning and reduction effort before any mining algorithm can be applied

Discovery of Patterns from Temporal Variation of Data
- Data in an operational database varies over time.
- Temporally invariant data is stored in a Data Warehouse.
- Temporal patterns can be discovered from such Data Warehouses.
- Important in long-term planning, the study of social and economic changes etc.

References
- R. Elmasri and S. B. Navathe, Fundamentals of Database Systems
- A. Silberschatz, H. F. Korth and S. Sudarshan, Database System Concepts
- R. Ramakrishnan and J. Gehrke, Database Management Systems
- J. Han and M. Kamber, Data Mining: Concepts and Techniques

Thank You
aditya@isical.ac.in