A Journey from DBMS to Data Mining

Aditya Bagchi
Short-Term Training Programme on Knowledge Discovery in Databases (DInK 10)
Indian Statistical Institute, Kolkata
January 11-15, 2010

Introduction to Database Management Systems

What is a Database Management System?
A Database Management System, popularly called a DBMS, manages a set of logically interconnected files that describe a problem domain.

A DBMS provides facilities to:
- create the database that contains the files
- port data to the database
- add to and alter the database structure
- access and manipulate data stored in the database according to the needs of the users

Additional Facilities:
- recovering from sudden system crashes without disturbing the content of the database
- controlling the access of a large group of users in a multi-user environment
- providing adequate data security arrangements so that access to different parts of a database in different modes (Read/Write/Update etc.) can be controlled for different sets of users

Data(base) Models
A data model:
- serves to describe the structure of the database
- does not care about the actual values
- concentrates on the relationships among the data
A data model has a structure, a set of operators and a set of navigational rules. It can be designed either as an Object-based Data Model or as a Record-based Data Model.

In the Object-based Data Model, the problem domain is broken into a number of real-life object types with their associated attributes and constraints. Facilities to manipulate each type of object may also be associated with the corresponding object structures. Interconnections between different types of objects are also defined. Some of these models are not implementable but provide a better understanding of a problem domain.

Entity Relationship Modeling
Proposed by Peter Chen in 1976.
Key component: Entity Relationship Diagram (ERD)
- entity: identifiable object or concept of significance
- relationship: association between entities
- attribute: property of an entity (or relationship)

Entity:
- mutually exclusive in all cases
- must be uniquely identifiable
- Type: regular/strong, weak
An entity set is a set of entities of the same type that share the same properties. Entity sets need not be disjoint.

An attribute is a function that maps from the entity set into a domain.
Attribute: value, domain
Type: simple, composite, single-valued, multi-valued, derived, key, null, complex
A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity.
A candidate key of an entity set is a minimal super key.
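The two key definitions above can be checked mechanically against a sample of entities. The following is a minimal sketch, not a definitive method: the entity data and the function names `is_super_key` and `is_candidate_key` are hypothetical, and a check on sample data can only show that a set of attributes is consistent with being a key, since a key is a constraint on every possible state of the entity set.

```python
from itertools import combinations

# Hypothetical sample of an Employee entity set (attribute -> value).
employees = [
    {"e_no": 1, "e_name": "Asha", "dob": "1980-01-01"},
    {"e_no": 2, "e_name": "Ravi", "dob": "1980-01-01"},
    {"e_no": 3, "e_name": "Asha", "dob": "1975-06-30"},
]


def is_super_key(entities, attrs):
    """True if no two entities agree on all attributes in attrs."""
    seen = set()
    for entity in entities:
        key = tuple(entity[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_candidate_key(entities, attrs):
    """A candidate key is a super key none of whose proper subsets is a super key."""
    if not is_super_key(entities, attrs):
        return False
    return not any(
        is_super_key(entities, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )
```

On this sample, `("e_no", "e_name")` is a super key but not a candidate key, because `("e_no",)` alone already identifies each entity.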

Relationship:
A relationship is an association among several entities.
A relationship set is a mathematical relation among $n \geq 2$ entities, each taken from an entity set. If $E_1, E_2, \ldots, E_n$ are entity sets, then a relationship set $R$ is a subset of the possible relationships $\{(e_1, e_2, \ldots, e_n) \mid e_1 \in E_1, e_2 \in E_2, \ldots, e_n \in E_n\}$.
The entity sets $E_1, E_2, \ldots, E_n$ participate in the relationship set $R$.

Degree of relationship set: the number of entity sets that participate in a relationship set.
Mapping cardinalities (or cardinality ratio): express how many entities of one entity set can be associated with the entities of another entity set via a relationship set.
Possible relationships are: 1:1, 1:N, N:1 and N:M

Participation constraints:
- If every entity of E (an entity set) participates in at least one relationship in R (a relationship set), the participation of E in R is called total.
- If only some entities of E participate in R, the participation of E in R is called partial.
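Mapping cardinalities and participation constraints can both be read off a set of relationship instances. The sketch below is only an illustration: the function names and the sample pairs are hypothetical, and a classification made from sample data describes that sample, not the constraint intended by the designer.

```python
from collections import defaultdict


def cardinality_ratio(pairs):
    """Classify (a, b) pairs of a binary relationship set as 1:1, 1:N, N:1 or N:M."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    a_has_many = any(len(bs) > 1 for bs in a_to_b.values())   # one a, several b
    b_has_many = any(len(as_) > 1 for as_ in b_to_a.values())  # one b, several a
    if a_has_many and b_has_many:
        return "N:M"
    if a_has_many:
        return "1:N"
    if b_has_many:
        return "N:1"
    return "1:1"


def participation(entities, pairs, position):
    """'total' if every entity appears at the given position of some pair, else 'partial'."""
    participating = {pair[position] for pair in pairs}
    return "total" if set(entities) <= participating else "partial"


# Hypothetical instances: (employee, department) and (employee, project).
all_employees = ["e1", "e2", "e3"]
posting = [("e1", "Accounts"), ("e2", "Accounts"), ("e3", "Design")]
involved_in = [("e1", "p1"), ("e1", "p2"), ("e2", "p1")]
```

Here `posting` is N:1 from employee to department with total participation of employees, while `involved_in` is N:M with partial participation, since `e3` is on no project.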

[Figure: ER diagram notation. Symbols shown: (strong) entity, weak entity, relationship, identifying relationship, recursive relationship with roles (role1, role2), and the attribute types single/simple, composite, multi-valued, derived, (strong) key and weak key.]

Problem
An IT company is involved in the design of software products. It has many departments at various locations. Each employee of the company is posted to one department only. The following information about the employees is maintained: name, address, date of birth, date of joining, designation and monthly salary. Departments are identified by a unique name. The company gets projects from various organizations, whose names and addresses are stored.

Each project is identified by a unique project number and a unique name. In addition, the budget, starting date and expected date of completion of each project are maintained. The company also maintains information on the number of projects in which each employee is involved. Each employee may be associated with one or more projects. An employee associated with a project has a duration of service in that project and a responsibility, either as a member or as the leader. A project will have only one project leader.

[Figure: ER Diagram. Entities: Employee (e-no, e-name, e-address, dob, doj, desig, sal, no of p), Department (d-name, location), Project (p-name, budget, starting-dt, completion-dt), Organization (o-name, address). Relationships: posting, involved in (duration), has leader, give.]

A Record-based Model describes the record types associated with a problem domain. The most popular Record-based Model is the Relational Model, where each record type is modeled as a 2-dimensional table.
Specific mapping rules are available to convert an Object-based design to a Record-based design:
- Mapping of Entity / Weak Entity Types
- Mapping of Relationship Types (Binary, N-ary)
- Mapping of Multivalued attributes
The relational model does not allow set or tuple type attributes.

For each (strong) entity type E in the ER schema, create a relation R that includes all the simple attributes of E. Choose one of the key attributes of E as the primary key for R. If the chosen key of E is composite, the set of simple attributes that form it will together form the primary key of R.

For each weak entity type W in the ER schema with owner entity type E, create a relation R and include all simple attributes (or simple components of composite attributes) of W as attributes of R. In addition, include as foreign key attributes of R the primary key attribute(s) of the relation(s) that correspond to the owner entity type(s). The primary key of R is the combination of the primary key(s) of the owner(s) and the partial key of the weak entity type W, if any.

Mapping Binary 1:1 Relationship Types.
For a binary 1:1 relationship type R between the entity types represented by the relations S and T, there are three possible approaches:
i. Foreign key approach: choose one of the relations, say S, and include as a foreign key in S the primary key of T. It is better to choose an entity type with total participation in R in the role of S.
ii. Merged relation approach: merge the two entity types and the relationship into a single relation.
iii. Cross-reference approach: set up a third relation R for the purpose of cross-referencing the primary keys of the two relations S and T representing the entity types.

Mapping Binary 1:N Relationship Types.
i. For each regular binary 1:N relationship type R, identify the relation S that represents the participating entity type at the N-side of the relationship type.
ii. Include as foreign key in S the primary key of the relation T that represents the other entity type participating in R.
iii. Include any simple attributes of the 1:N relationship type as attributes of S.

Mapping Binary M:N Relationship Types.
i. For each regular binary M:N relationship type R, create a new relation S to represent R.
ii. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types; their combination will form the primary key of S.
iii. Also include any simple attributes of the M:N relationship type (or simple components of composite attributes) as attributes of S.

Mapping Multivalued attributes.
i. For each multivalued attribute A, create a new relation R. This relation R will include an attribute corresponding to A, plus the primary key attribute K (as a foreign key in R) of the relation that represents the entity type or relationship type that has A as an attribute.
ii. The primary key of R is the combination of A and K. If the multivalued attribute is composite, we include its simple components.

Mapping N-ary Relationship Types.
i. For each n-ary relationship type R, where n > 2, create a new relation S to represent R.
ii. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types.
iii. Also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S.

ER to Relational Mapping:
Employee (emp-no, emp-name, emp-address, dob, doj, desig, sal, dept-name)
Department (dept-name, location)
Project (proj-name, proj-budget, starting-dt, proj-duration, dept-name, o-name)
Involvement (emp-no, proj-name, duration, responsibility)
Organization (o-name, address)
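The relational schema above could be declared in SQL roughly as follows. This is a sketch under stated assumptions, run here through Python's `sqlite3` module: the column types, the `CHECK` on `responsibility` and the replacement of hyphens by underscores in identifiers are assumptions made for the illustration, not part of the original design. The foreign key `dept_name` in `Employee` follows the 1:N mapping rule, and `Involvement` follows the M:N mapping rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Department (
    dept_name TEXT PRIMARY KEY,
    location  TEXT
);
CREATE TABLE Organization (
    o_name  TEXT PRIMARY KEY,
    address TEXT
);
CREATE TABLE Employee (
    emp_no      INTEGER PRIMARY KEY,
    emp_name    TEXT,
    emp_address TEXT,
    dob         TEXT,
    doj         TEXT,
    desig       TEXT,
    sal         REAL,
    dept_name   TEXT REFERENCES Department(dept_name)  -- 1:N 'posting'
);
CREATE TABLE Project (
    proj_name     TEXT PRIMARY KEY,
    proj_budget   REAL,
    starting_dt   TEXT,
    proj_duration INTEGER,
    dept_name     TEXT REFERENCES Department(dept_name),
    o_name        TEXT REFERENCES Organization(o_name)
);
CREATE TABLE Involvement (                               -- M:N 'involved in'
    emp_no         INTEGER REFERENCES Employee(emp_no),
    proj_name      TEXT REFERENCES Project(proj_name),
    duration       INTEGER,
    responsibility TEXT CHECK (responsibility IN ('member', 'leader')),
    PRIMARY KEY (emp_no, proj_name)
);
""")
```

With foreign keys switched on, an attempt to post an employee to a department that does not exist is rejected with `sqlite3.IntegrityError`.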

Operators:
- Unary Operators (applicable to a single relation)
- Binary Operators (manipulate two relations)

Unary Operators:
Selection ($\sigma$): selects one or more rows or tuples of a relation.
$\sigma_{\theta}(R)$, where $\theta$ is the set of conditions or predicates for selection.
Find all employees working in the Accounts department and having a salary greater than Rs. 10000/-:
$\sigma_{\text{sal} > 10000 \,\wedge\, \text{dept-name} = \text{'Accounts'}}(\text{Employee})$

Projection ($\pi$): selects one or more attributes of a relation.
$\pi_{C}(R)$, where $C$ is the set of attributes selected.
List the name and address of all the employees:
$\pi_{\text{name}, \text{address}}(\text{Employee})$
Combination: list the name and address of all the employees working in the Accounts department and having a salary greater than Rs. 10000/-:
$\pi_{\text{name}, \text{address}}(\sigma_{\text{sal} > 10000 \,\wedge\, \text{dept-name} = \text{'Accounts'}}(\text{Employee}))$
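Selection and projection can be mimicked on a relation held as a list of rows. The sketch below is an illustration only: the helper names `select` and `project` and the employee rows are hypothetical. It evaluates the combined query above, the name and address of Accounts employees earning more than Rs. 10000/-.

```python
def select(relation, predicate):
    """Selection: keep the rows (dicts) that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only attrs and drop duplicate rows, since a relation is a set."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


# Hypothetical Employee relation, reduced to the attributes the query needs.
employee = [
    {"name": "Asha", "address": "Kolkata", "sal": 12000, "dept-name": "Accounts"},
    {"name": "Ravi", "address": "Mumbai", "sal": 9000, "dept-name": "Accounts"},
    {"name": "Meera", "address": "Delhi", "sal": 15000, "dept-name": "Design"},
]

selected = select(employee, lambda r: r["sal"] > 10000 and r["dept-name"] == "Accounts")
result = project(selected, ["name", "address"])
```

Only Asha satisfies both conditions, so `result` holds a single row with her name and address.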

Binary Operators:
Natural Join ($\bowtie$): joins two relations by equating values of the common attributes.
Find the name and address of the employees working in the Accounts department and placed in Mumbai.
$\pi_{\text{name}, \text{address}}(\sigma_{\text{location} = \text{'Mumbai'} \,\wedge\, \text{dept-name} = \text{'Accounts'}}(\text{Department} \bowtie \text{Employee}))$
Query Language (SQL):
    Select name, address
    From Department, Employee
    Where Department.location = 'Mumbai'
      and Employee.dept-name = 'Accounts'
      and Employee.dept-name = Department.dept-name
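A natural join can be sketched as a nested loop that pairs rows agreeing on every common attribute. The helper `natural_join` and the sample rows are hypothetical; this is the simplest possible formulation, not how a DBMS would evaluate the join.

```python
def natural_join(r, s):
    """Join two relations (lists of dicts) on all the attributes they share."""
    if not r or not s:
        return []
    common = sorted(set(r[0]) & set(s[0]))
    joined = []
    for row_r in r:
        for row_s in s:
            if all(row_r[a] == row_s[a] for a in common):
                joined.append({**row_r, **row_s})
    return joined


# Hypothetical Department and Employee relations.
department = [
    {"dept-name": "Accounts", "location": "Mumbai"},
    {"dept-name": "Design", "location": "Kolkata"},
]
employee = [
    {"name": "Asha", "address": "Andheri", "dept-name": "Accounts"},
    {"name": "Meera", "address": "Salt Lake", "dept-name": "Design"},
    {"name": "Ravi", "address": "Bandra", "dept-name": "Accounts"},
]

joined = natural_join(department, employee)
answer = [
    (row["name"], row["address"])
    for row in joined
    if row["location"] == "Mumbai" and row["dept-name"] == "Accounts"
]
```

The only common attribute is `dept-name`, so each employee is paired with the row of the department the employee is posted to, and the selection and projection are then applied to the joined rows.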

Multiple Joins:
Find the name and address of the employees working in the Accounts department, placed in Mumbai and associated with the project DST 55/10.
$\pi_{\text{name}, \text{address}}(\sigma_{\text{location} = \text{'Mumbai'} \,\wedge\, \text{dept-name} = \text{'Accounts'} \,\wedge\, \text{proj-name} = \text{'DST 55/10'}}(\text{Department} \bowtie \text{Employee} \bowtie \text{Project}))$
    Select name, address
    From Department, Employee, Project
    Where Department.location = 'Mumbai'
      and Employee.dept-name = 'Accounts'
      and Project.proj-name = 'DST 55/10'
      and Employee.dept-name = Department.dept-name
      and Employee.dept-name = Project.dept-name
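The three-way join can be tried out on a small in-memory database. This is a sketch under stated assumptions: the tables are cut down to the columns the query uses, the rows are invented, and hyphens in identifiers are replaced by underscores because a hyphen is not valid in an unquoted SQL identifier. As in the query above, employees are linked to the project through `dept_name`; linking them through the Involvement relation instead would restrict the result to employees individually assigned to the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (dept_name TEXT PRIMARY KEY, location TEXT);
CREATE TABLE Employee (name TEXT, address TEXT, dept_name TEXT);
CREATE TABLE Project (proj_name TEXT PRIMARY KEY, dept_name TEXT);

INSERT INTO Department VALUES ('Accounts', 'Mumbai');
INSERT INTO Department VALUES ('Design', 'Kolkata');
INSERT INTO Employee VALUES ('Asha', 'Andheri', 'Accounts');
INSERT INTO Employee VALUES ('Meera', 'Salt Lake', 'Design');
INSERT INTO Project VALUES ('DST 55/10', 'Accounts');
INSERT INTO Project VALUES ('Other project', 'Design');
""")

rows = conn.execute("""
    SELECT name, address
    FROM Department, Employee, Project
    WHERE Department.location = 'Mumbai'
      AND Employee.dept_name = 'Accounts'
      AND Project.proj_name = 'DST 55/10'
      AND Employee.dept_name = Department.dept_name
      AND Employee.dept_name = Project.dept_name
""").fetchall()
```

With these rows the query returns the single employee posted to the Accounts department in Mumbai.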

Set Operators:
- the two relations must have the same arity (same number of attributes)
- attributes in the corresponding positions must be of the same domain
Example (Banking Environment):
Deposit (b_name, c_name, ac_no, balance)
Borrow (b_name, c_name, ln_no, amount)
b_name = branch name
c_name = customer name
ac_no = account number
ln_no = loan number

List the names of customers who are depositors as well as borrowers in the ISI branch.
$(\pi_{\text{c\_name}}(\sigma_{\text{b\_name} = \text{'ISI'}}(\text{Deposit}))) \cap (\pi_{\text{c\_name}}(\sigma_{\text{b\_name} = \text{'ISI'}}(\text{Borrow})))$
(??) $\pi_{\text{c\_name}}(\sigma_{\text{b\_name} = \text{'ISI'}}(\text{Deposit} \bowtie \text{Borrow}))$
    Select c_name From Deposit Where b_name = 'ISI'
    Intersect
    Select c_name From Borrow Where b_name = 'ISI'
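Both formulations of the query can be evaluated on sample data and compared. The rows below are hypothetical, and the sketch only shows the two results for this one sample; it is not a proof about the expressions in general. The natural join of Deposit and Borrow equates their common attributes `b_name` and `c_name`.

```python
# Hypothetical Deposit and Borrow relations.
deposit = [
    {"b_name": "ISI", "c_name": "Sen", "ac_no": 101, "balance": 5000},
    {"b_name": "ISI", "c_name": "Roy", "ac_no": 102, "balance": 800},
    {"b_name": "Park Street", "c_name": "Das", "ac_no": 201, "balance": 1200},
]
borrow = [
    {"b_name": "ISI", "c_name": "Sen", "ln_no": 11, "amount": 20000},
    {"b_name": "Park Street", "c_name": "Roy", "ln_no": 12, "amount": 7000},
    {"b_name": "ISI", "c_name": "Das", "ln_no": 13, "amount": 3000},
]


def customers_at(relation, branch):
    """Project c_name from the rows selected by b_name = branch (as a set)."""
    return {row["c_name"] for row in relation if row["b_name"] == branch}


# Intersection of the two projections.
via_intersection = customers_at(deposit, "ISI") & customers_at(borrow, "ISI")

# Selection and projection over the natural join on (b_name, c_name).
via_join = {
    d["c_name"]
    for d in deposit
    for b in borrow
    if d["b_name"] == b["b_name"]
    and d["c_name"] == b["c_name"]
    and d["b_name"] == "ISI"
}
```

Roy deposits at ISI but borrows elsewhere, and Das borrows at ISI but deposits elsewhere, so neither appears in either result.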

A First Visit to the World of Data Mining

Data Mining is a method of finding interesting trends or patterns in large datasets.
Discovered patterns help and guide the appropriate authority in taking future decisions. So, Data Mining is regarded as a tool for Decision Support.
Data Mining tools are expected to involve minimal user intervention.
Since the data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.

Data Mining Communities
1. Statistics: Provides the background for the algorithms.
2. Artificial Intelligence: Provides the required heuristics for machine learning/conceptual clustering.
3. Data Management: Provides the platform for storage & retrieval of raw and summary data.

A Data Mining Effort Involves:
- Data Collection
- Data Preprocessing & Feature Extraction
- Discovery of Patterns
- Visualization of data
- Evaluation of results

Initial Activities
1. Data Cleaning: Data may be incomplete, noisy & inconsistent. Cleaning would identify outliers, fill in missing values and correct inconsistencies.
2. Data Integration & Transformation: Data analysis may involve data integration from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files. Data need to be transformed or consolidated into forms suitable for mining, e.g. attribute values converted from absolute values to ranges.
3. Data Reduction: Since both the data volume and the attribute set may be too large, data reduction becomes necessary, e.g. removal of irrelevant and redundant attributes, generation of summary data etc.
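Two of the activities above, filling in missing values and converting absolute values to ranges, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the income figures, the choice of the mean as the fill value (one of several possible strategies) and the boundaries of the ranges.

```python
from statistics import mean

# Hypothetical income column with missing values (None).
incomes = [18000, None, 42000, 65000, None, 21000]

# Cleaning: replace each missing value by the mean of the observed values.
observed = [value for value in incomes if value is not None]
fill_value = mean(observed)
cleaned = [fill_value if value is None else value for value in incomes]


def income_range(value):
    """Transformation: map an absolute income to an assumed range label."""
    if value <= 20000:
        return "<= 20K"
    if value <= 50000:
        return "20K-50K"
    return "> 50K"


ranges = [income_range(value) for value in cleaned]
```

After this step a mining algorithm sees three range labels instead of six distinct numeric values.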

Mining Activities
1. Rule Discovery: Discovery of association rules from different features involved in a problem domain.
2. Data Clustering: Grouping based on conceptual clustering; maximizing the intra-cluster similarity and minimizing the inter-cluster similarity.
3. Data Classification: Grouping of data and placement of such data groups in a taxonomy.
4. Searching of Sequential Patterns: Discovery of patterns involved in a temporal sequence.
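For rule discovery, an association rule X → Y is usually judged by its support (the fraction of transactions containing both X and Y) and its confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both measures on a hypothetical set of transactions; the function names and the data are assumptions made for the illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here two of the four transactions contain both bread and milk, and two of the three transactions containing bread also contain milk.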

Knowledge Discovery from Databases
- Discovery of pattern among attributes of a relation for possible classification of data.
- Discovery of pattern among attributes of multiple relations.
- Discovery of pattern from temporal variation of data (discovery of pattern from a Data Warehouse).

CAEP (Classification by Aggregating Emerging Patterns) uses the method of support computation to find Emerging Patterns (EPs).
Let there be two classes, C1 (buys_car = yes) and C2 (buys_car = no). Now, the itemset (age ≤ 25, income ≤ 20K) is a typical EP if its support increases from, say, 0.2% in C1 to 57.6% in C2, a growth rate of 57.6/0.2 = 288.
Usually an equality test is done for a categorical attribute, while membership in a range or interval is checked for a numerical attribute.
An EP is a multi-attribute test whose differentiating power is checked for class membership. The differentiating power of an EP is derived from its growth rate and its support in the target class.
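The growth rate of a pattern is the ratio of its supports in the two classes. The sketch below computes it for a pattern expressed as a list of attribute tests. The records, the two tests on age and income and the function names are assumptions made for the illustration; a pattern absent from the first class but present in the second is given an infinite growth rate, and one absent from both a growth rate of zero.

```python
def support(pattern, records):
    """Fraction of records that pass every test of the pattern."""
    if not records:
        return 0.0
    matches = sum(all(test(r) for test in pattern) for r in records)
    return matches / len(records)


def growth_rate(pattern, from_class, to_class):
    """Support in to_class divided by support in from_class."""
    s_from = support(pattern, from_class)
    s_to = support(pattern, to_class)
    if s_from == 0:
        return float("inf") if s_to > 0 else 0.0
    return s_to / s_from


# Assumed multi-attribute test: a range check on each numerical attribute.
pattern = [lambda r: r["age"] <= 25, lambda r: r["income"] <= 20000]

# Hypothetical records: 1 of 10 matches in C1, 5 of 10 match in C2.
c1 = [{"age": 22, "income": 15000}] + [{"age": 45, "income": 60000}] * 9
c2 = [{"age": 23, "income": 18000}] * 5 + [{"age": 50, "income": 70000}] * 5

rate = growth_rate(pattern, c1, c2)
```

With these records the pattern is five times as frequent in C2 as in C1; the same function applied to supports of 0.2% and 57.6% gives the growth rate of 288 quoted above.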

[Figure: Decision Tree on the concept buys_new_car. Internal nodes test Income (≤ 20K, 21-50 K, > 50 K), Marital Status (Married, Single) and Age (≤ 40, > 40); the leaves are labelled Yes or No.]

Discovery of Patterns from Multiple Relations
- Tends to join all relations to generate a large Universal Relation.
- Creates unnecessary repetition of data.
- Brings in too many attributes.
- Needs a massive data cleaning and reduction effort before applying any mining algorithm.

Discovery of pattern from temporal variation of data
- Data in an operational database varies over time.
- Temporally invariant data is stored in a Data Warehouse.
- Temporal Patterns can be discovered from such Data Warehouses.
- Important in long-term planning, study of social and economic changes etc.

Reference
- R. Elmasri and S. B. Navathe, Fundamentals of Database Systems
- A. Silberschatz, H. F. Korth and S. Sudarshan, Database System Concepts
- R. Ramakrishnan and J. Gehrke, Database Management Systems
- J. Han and M. Kamber, Data Mining: Concepts and Techniques

Thank You
aditya@isical.ac.in