Advanced Information Management
Chapter 1: Introduction
Holger Schwarz, Anwendersoftware, Universität Stuttgart
Summer Semester 2009

Overview
- Basics
  - Terms
  - Database Management Systems
  - Database Design
  - Query Languages
- Classes of Database Applications
  - Transaction Management
  - Business Intelligence
  - Geographic Information Systems
  - Engineering Applications
  - Enterprise Content Management
- Overview

Database Terms
- A data model is a collection of concepts for describing data.
- A schema is a description of a particular collection of data, using a given data model.
- The relational model of data is the most widely used model today.
  - Main concept: relation, basically a table with rows and columns.
  - Every relation has a schema, which describes the columns, or fields.

Example relation "projects" (relation name = table name, attribute = column name, tuple = row):

  projectno  manager  description         budget
  PJ23       Miller   main bodywork team  1 000 000
  PJ15       Maynard  specialized wings     100 000
  PJ47       Morris   electronics           500 000

Database Management Systems
DBMS: A tool for creating and managing large amounts of data efficiently and allowing it to persist over long periods of time, safely. (Garcia-Molina et al., 2002)
Levels of abstraction provide logical as well as physical data independence:
- Many external schemas (views) describe how users see the data.
- One conceptual schema defines the logical structure.
- One physical schema describes the files and indexes used.
[Figure: the DBMS and the database together form the database system (DBS).]

Advantages of a DBMS
- Data independence
- Efficient data access
- Data integrity and security
- Data administration
- Concurrent access, crash recovery
- Reduced application development time
Disadvantages:
- Can be expensive and complicated to set up and maintain
- This cost and complexity must be offset by need

Structure of a DBMS
- A typical DBMS has a layered architecture.
- Each layer provides some kind of data abstraction and data mapping.
- Concurrency control, recovery, and transaction management have to be supported (within some layers).
- This is one of several possible architectures; each system has its own variations.
Layers of the DBMS, from top to bottom, above the database (DB):
  Query Optimization and Execution
  Relational Operators
  Files and Access Methods
  Buffer Management
  Disk Space Management

Database Design
Process model (phases over time):
- Information collection: analysis of meaning (interview, noun analysis, brainstorming, document analysis, ...)
- Semantic data modeling: rough modeling, then precise modeling (ERM, UML, NIAM, EXPRESS-G, IDEF1X, STEP, ...); yields the conceptual schema; DBMS independent
- Logical data modeling: relational, object-oriented, XML, hierarchical, network
- Database installation: DB2, INGRES, ORACLE, ONTOS, ...; DBMS dependent
Design steps:
- Conceptual schema design
- Logical schema design: normalization
- Physical schema design: indexes, clustering, tuning

Relational Algebra
Operators: restriction (σ), projection (π), Cartesian product, union, set intersection, set difference, natural join (⋈), division.
[Figure: each operator illustrated on small example relations.]

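Three of the operators named above can be sketched in a few lines of Python. This is a minimal illustration, not how a DBMS implements them: relations are modelled as lists of dicts, the function names are chosen here for illustration, and the sample rows are taken from the projects and parts tables used in this chapter.

```python
# Relations are modelled as lists of dicts (attribute name -> value).

def restriction(relation, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def projection(relation, attributes):
    """pi: keep only the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def natural_join(left, right):
    """Combine tuples that agree on every attribute both relations share."""
    result = []
    for l in left:
        for r in right:
            common = set(l) & set(r)
            if all(l[a] == r[a] for a in common):
                result.append({**l, **r})
    return result

projects = [
    {"projectno": "PJ23", "manager": "Miller", "budget": 1_000_000},
    {"projectno": "PJ15", "manager": "Maynard", "budget": 100_000},
    {"projectno": "PJ47", "manager": "Morris", "budget": 500_000},
]
parts = [
    {"partno": "P050", "projectno": "PJ23"},
    {"partno": "P111", "projectno": "PJ15"},
]

# pi_manager(sigma_budget>=500000(projects))
big = restriction(projects, lambda t: t["budget"] >= 500_000)
managers = projection(big, ["manager"])

# parts natural-join projects (shared attribute: projectno)
joined = natural_join(parts, projects)
```

Composing `projection` over `restriction` mirrors how a relational algebra expression is nested; `joined` pairs each part with the row of its project.
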
Examples of Tables

parts
  partno  version  projectno  part_description
  P050    1.0      PJ23       bodywork
  P050    2.0      PJ23       bodywork
  P101    1.0      PJ23       front body section
  P101    1.1      PJ23       front body section
  P101    2.0      PJ23       front body section
  P102    1.2      PJ23       a column
  P103    1.2      PJ23       b column
  P104    1.2      PJ23       c column
  P111    1.0      PJ15       rear wing
  P111    1.2      PJ15       rear wing
  P112    1.0      PJ15       front wing

usage
  partno  version  uses_partno  uses_version  quantity
  P050    1.0      P101         1.0           1
  P050    1.0      P102         1.2           2
  P050    1.0      P103         1.2           2
  P050    1.0      P104         1.2           2
  P050    1.0      P111         1.0           2
  P050    2.0      P101         1.1           1
  P050    2.0      P102         1.2           2
  P050    2.0      P103         1.2           2
  P050    2.0      P104         1.2           2
  P050    2.0      P111         1.2           2
  P050    2.0      P112         1.0           2

projects
  projectno  manager  description         budget
  PJ23       Miller   main bodywork team  1 000 000
  PJ15       Maynard  specialized wings     100 000
  PJ47       Morris   electronics           500 000

SQL Queries
Example: List all parts of version 1.0 and the manager who is responsible for the corresponding project.

  SELECT partno, version, manager
  FROM parts, projects
  WHERE parts.projectno = projects.projectno
    AND version = '1.0'

Example: Which projects are responsible for more than two different parts?

  SELECT projectno, COUNT(DISTINCT partno)
  FROM parts
  GROUP BY projectno
  HAVING COUNT(DISTINCT partno) > 2

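The two queries can be tried out against the example tables with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for the DBMS, the column types (version stored as text, budget as integer) are guesses, and an ORDER BY is added to the first query only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (projectno TEXT PRIMARY KEY, manager TEXT,
                       description TEXT, budget INTEGER);
CREATE TABLE parts (partno TEXT, version TEXT, projectno TEXT,
                    part_description TEXT, PRIMARY KEY (partno, version));
""")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?, ?)", [
    ("PJ23", "Miller", "main bodywork team", 1000000),
    ("PJ15", "Maynard", "specialized wings", 100000),
    ("PJ47", "Morris", "electronics", 500000),
])
conn.executemany("INSERT INTO parts VALUES (?, ?, ?, ?)", [
    ("P050", "1.0", "PJ23", "bodywork"),
    ("P050", "2.0", "PJ23", "bodywork"),
    ("P101", "1.0", "PJ23", "front body section"),
    ("P101", "1.1", "PJ23", "front body section"),
    ("P101", "2.0", "PJ23", "front body section"),
    ("P102", "1.2", "PJ23", "a column"),
    ("P103", "1.2", "PJ23", "b column"),
    ("P104", "1.2", "PJ23", "c column"),
    ("P111", "1.0", "PJ15", "rear wing"),
    ("P111", "1.2", "PJ15", "rear wing"),
    ("P112", "1.0", "PJ15", "front wing"),
])

# Query 1: parts of version 1.0 with the manager of their project.
version_1_parts = conn.execute("""
    SELECT partno, version, manager
    FROM parts, projects
    WHERE parts.projectno = projects.projectno
      AND version = '1.0'
    ORDER BY partno  -- added here for a stable result order
""").fetchall()

# Query 2: projects responsible for more than two different parts.
busy_projects = conn.execute("""
    SELECT projectno, COUNT(DISTINCT partno)
    FROM parts
    GROUP BY projectno
    HAVING COUNT(DISTINCT partno) > 2
""").fetchall()

print(version_1_parts)
print(busy_projects)
```

On this data only PJ23 qualifies in the second query: it owns five distinct part numbers, while PJ15 owns two.
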
Other SQL Statements
In addition to retrieval, SQL supports data manipulation operations that rely on the query capabilities:
- INSERT
    INSERT INTO table [ (column-commalist) ]
      { VALUES row-constr-commalist | table-exp | DEFAULT VALUES }
- UPDATE
    UPDATE table SET update-assignment-commalist [WHERE cond-exp]
- DELETE
    DELETE FROM table [WHERE cond-exp]
In addition there are statements for:
- Data definition
- Data control
- Embedding of SQL into host languages

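A short sketch of the three data manipulation statements, run through Python's sqlite3 module. SQLite is an assumption standing in for the DBMS, and the budget increase of 50 000 is an invented value for illustration; the rows come from the projects table of this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projects (projectno TEXT PRIMARY KEY, manager TEXT,
                                       description TEXT, budget INTEGER)""")

# INSERT with an explicit column list and a list of row constructors.
conn.execute("""
    INSERT INTO projects (projectno, manager, description, budget)
    VALUES ('PJ23', 'Miller', 'main bodywork team', 1000000),
           ('PJ47', 'Morris', 'electronics', 500000)
""")

# UPDATE: the WHERE condition selects the rows to change.
conn.execute("UPDATE projects SET budget = budget + 50000 WHERE projectno = 'PJ47'")

# DELETE: without a WHERE condition every row would be removed.
conn.execute("DELETE FROM projects WHERE projectno = 'PJ23'")

remaining = conn.execute("SELECT projectno, budget FROM projects").fetchall()
print(remaining)
```

Both UPDATE and DELETE reuse the condition syntax of queries, which is what "rely on the query capabilities" refers to.
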
Other Important Concepts
Some other important concepts in SQL:1999:
- View: A virtual table that is formed by a query expression and does not physically exist.
- Routine: A procedure, function, or method that is known (in some cases also stored) by the system. It can be written in SQL or an external host language.
- Trigger: Allows actions to be taken when data is inserted, updated, or deleted.
- Schema: A named collection of objects in the database.
- Catalog: A named collection of schemas in a database.
- User: Authorization identifier to control access to the database.
- Privilege: Defines the allowed operations for each user.

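Views and triggers can be illustrated with a small sqlite3 sketch. SQLite's dialect differs in details from SQL:1999, and the names large_projects, budget_log, and log_budget_change as well as the new budget value are hypothetical, chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (projectno TEXT PRIMARY KEY, manager TEXT, budget INTEGER);
CREATE TABLE budget_log (projectno TEXT, old_budget INTEGER, new_budget INTEGER);

-- View: a named query; its rows are computed whenever the view is read.
CREATE VIEW large_projects AS
    SELECT projectno, manager FROM projects WHERE budget >= 500000;

-- Trigger: record every budget change in the log table.
CREATE TRIGGER log_budget_change AFTER UPDATE OF budget ON projects
BEGIN
    INSERT INTO budget_log VALUES (OLD.projectno, OLD.budget, NEW.budget);
END;

INSERT INTO projects VALUES ('PJ23', 'Miller', 1000000);
INSERT INTO projects VALUES ('PJ15', 'Maynard', 100000);
UPDATE projects SET budget = 600000 WHERE projectno = 'PJ15';
""")

# The view reflects the update because it is evaluated at query time.
large = conn.execute(
    "SELECT projectno FROM large_projects ORDER BY projectno").fetchall()
log = conn.execute("SELECT * FROM budget_log").fetchall()
print(large, log)
```

After the update, PJ15 shows up in the view without any change to the view itself, and the trigger has written one log row.
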
What we're skipping
- Access methods: disk layout for tuples and pages; indexes (B-trees, linear hashing)
- Query optimization: how to map a declarative query (SQL) to a query plan (relational algebra + implementations)
- Query processing algorithms: sort, hash, join algorithms
- Transaction concepts and processing: atomicity, consistency, isolation, and durability

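Transaction Processing
[Figure: a transaction program in the transaction system asks for card, PIN, account, and order, runs a transaction against the database system, and produces the output.]

  BOT (begin of transaction)
    UPDATE accounts SET balance = balance - 3 WHERE A# = 03874;
  EOT (end of transaction)

Account 03874: the balance changes from 17 to 14; the transaction ends with OK.
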
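The withdrawal above can be sketched as an explicit transaction with Python's sqlite3 module. Several details are assumptions made for illustration: SQLite stands in for the database system, the column A# is renamed account_no, and the CHECK constraint that forbids negative balances is not part of the slide.

```python
import sqlite3

# isolation_level=None: transactions are started and ended explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("""CREATE TABLE accounts (
                    account_no TEXT PRIMARY KEY,
                    balance INTEGER CHECK (balance >= 0))""")  # assumed rule
conn.execute("INSERT INTO accounts VALUES ('03874', 17)")

def withdraw(conn, account_no, amount):
    """Run the withdrawal as one transaction; return True if it committed."""
    conn.execute("BEGIN")                       # BOT
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE account_no = ?",
            (amount, account_no))
        conn.execute("COMMIT")                  # EOT
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")                # undo everything since BOT
        return False

ok = withdraw(conn, "03874", 3)          # 17 - 3 = 14, commits
rejected = withdraw(conn, "03874", 100)  # would go negative, rolled back
balance = conn.execute(
    "SELECT balance FROM accounts WHERE account_no = '03874'").fetchone()[0]
print(ok, rejected, balance)
```

The second call shows atomicity: the failed update leaves the balance at 14, as if the transaction had never started.
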
Business Intelligence
[Figure: layers of business intelligence, ranging from operational to strategic planning; axis labels: effectiveness, performance.]
- Transaction Processing (OLTP): heterogeneous information systems, isolated information islands, constantly increasing data sets
- Data Warehouse
- OLAP (OLAP engine)
- Data Mining (mining engine)

Optimization of Multi Queries
Information request: Find the top 25 products that show the highest rise in turnover in the last quarter of 2000 compared to the preceding quarter.
[Figure: components shown are the information request (SQL, OLAP + data mining?), a query generator with metadata, the generated queries (Query 1 to Query 8), the OLAP engine on top of the data warehouse, partial results, result preprocessing, and the result.]

Data Mining Techniques
- Association rule discovery. Example of application: store layout.
    {beer, nappies} → {potato chips}, support = 0.04, confidence = 0.81
- Classification. Example of application: insurance risk prediction.
- Clustering/segmentation. Example of application: customer mailings.
- Regression. Example of application: customer ranking.
[Figure: example plots with the axes revenue, #children, and age.]

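Support and confidence of an association rule follow directly from counting transactions. The sketch below uses five invented market baskets, so its numbers differ from the 0.04 and 0.81 quoted above; the function names are chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"beer", "nappies", "potato chips"},
    {"beer", "nappies", "potato chips"},
    {"beer", "nappies"},
    {"beer", "bread"},
    {"milk", "bread"},
]

# Rule {beer, nappies} -> {potato chips}
rule_support = support(baskets, {"beer", "nappies", "potato chips"})
rule_confidence = confidence(baskets, {"beer", "nappies"}, {"potato chips"})
print(rule_support, rule_confidence)
```

Here two of five baskets contain all three items (support 2/5), and two of the three baskets with beer and nappies also contain potato chips (confidence 2/3).
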
Classification: Training Phase
- Given: a set of training tuples carrying a class label
- Aim: a model (classifier) that assigns a class label to a given tuple by deriving the label from the values of the tuple's attributes

Training data (input to the classification algorithms):

  name   age    income  credit
  Mary   20-30  low     poor
  James  30-40  low     fair
  Bill   30-40  low     good
  John   20-30  medium  fair
  Marc   40-50  high    good
  Anne   40-50  high    good

Classifier (model), stored in the DB, e.g., as XML:

  IF age = 40-50 OR income = high THEN credit = good

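The rule shown as the classifier can be written as a small function and checked against the training tuples. This is only a sketch: the slide gives a single rule for the label "good", so the function returns None when that rule does not fire, and how the other labels would be assigned is left open.

```python
def classify(age, income):
    """Rule from the slide; None means the rule says nothing about the tuple."""
    if age == "40-50" or income == "high":
        return "good"
    return None

training = [
    ("Mary", "20-30", "low", "poor"),
    ("James", "30-40", "low", "fair"),
    ("Bill", "30-40", "low", "good"),
    ("John", "20-30", "medium", "fair"),
    ("Marc", "40-50", "high", "good"),
    ("Anne", "40-50", "high", "good"),
]

# Training tuples for which the rule fires, with their actual label.
fired = [(name, credit) for name, age, income, credit in training
         if classify(age, income) == "good"]
print(fired)
```

On the training data the rule fires for Marc and Anne, and both carry the label it predicts.
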
Classification: Test Phase
The classifier (model) is applied to test data, and its prediction is compared with the known label:

  name   age    income  credit (expected)  credit (predicted)
  Paul   20-30  high    good               fair
  Jenny  40-50  low     fair               fair
  Rick   30-40  high    fair               good

If there is a significant discrepancy between the expected (and by definition correct) result and the result predicted by the model, it may be necessary to adapt or rebuild the model.

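The discrepancy between expected and predicted labels is usually summarized as accuracy. A minimal sketch, assuming the predicted labels on the slide are listed in the same order as the test tuples (Paul, Jenny, Rick):

```python
# (name, expected credit, predicted credit), in slide order.
test_data = [
    ("Paul", "good", "fair"),
    ("Jenny", "fair", "fair"),
    ("Rick", "fair", "good"),
]

correct = sum(expected == predicted for _, expected, predicted in test_data)
accuracy = correct / len(test_data)
misclassified = [name for name, expected, predicted in test_data
                 if expected != predicted]
print(accuracy, misclassified)
```

Only one of the three test tuples is classified correctly here, which is the kind of discrepancy that would suggest adapting or rebuilding the model.
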
Classification: Application Phase
The classifier (model) predicts the class label for unseen data:

  name  age    income  credit (predicted)
  Jim   20-30  high    fair
  Phil  30-40  low     poor
  Kate  40-50  medium  fair

Master Data Management
Problems:
- Overlapping and redundant data, applications, infrastructure
- No single, consolidated view of critical enterprise data
- Hand-coded data integration "spaghetti"
Master Data Management (MDM): business processes plus a technical and data integration architecture to create and maintain a system of record for core business entities across disparate applications in the enterprise.
Single view of data:
- Single view of customers (cross-selling)
- Single view of suppliers (purchasing leverage)
- Single view of parts (inventory)
[Figure: a master data management system holding the master data, connected to legacy applications, historical/analytical systems, and new applications.]

Geographic Information System
Spatial query: Search for areas of high traffic and high emission.

  SELECT Position
  FROM Emission E, Traffic V
  WHERE overlaps(E.polygon, V.polygon)

[Figure: the spatial query is evaluated by a database system with a Spatial Extender.]

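The join with a spatial predicate can be imitated in plain Python. This sketch deliberately simplifies: axis-aligned rectangles stand in for polygons, the overlaps function is written here and is not the Spatial Extender's function, and all region names and coordinates are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle, a simplified stand-in for a polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def overlaps(a, b):
    """True if the interiors of the two rectangles intersect."""
    return (a.xmin < b.xmax and b.xmin < a.xmax and
            a.ymin < b.ymax and b.ymin < a.ymax)

# Hypothetical regions of high emission and high traffic.
emission = [("E1", Box(0, 0, 2, 2)), ("E2", Box(5, 5, 6, 6))]
traffic = [("V1", Box(1, 1, 3, 3)), ("V2", Box(8, 8, 9, 9))]

# Nested-loop join with the spatial predicate as join condition.
hits = [(e, v) for e, e_box in emission
               for v, v_box in traffic
               if overlaps(e_box, v_box)]
print(hits)
```

A real spatial extension would evaluate the predicate on actual geometries and use a spatial index instead of comparing every pair.
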
Enterprise Content Management
Kinds of content:
- Operational content: statements, invoices, reports, scanned images, fax
- Rich media: video, audio
- Web content: HTML, graphic files
- Business content: workgroup documents, word processing, spreadsheet, presentation, e-mail

ECM Infrastructure
[Figure: applications on top of an integration layer and the ECM platform.]
- Applications:
  - Customer service: Siebel, customer loyalty
  - Operational productivity: SAP, vertical applications, e-records
  - Rich media: e-commerce, e-learning, brand assets
- Integration layer
- ECM platform: archiving, search & access, rights management, media streaming, content manager
- Other content stores: relational data, e-mail (Exchange), legacy systems, other file systems

Challenges for DBMS
- Additional data models: XML, OO, ...
- Complex data structures: spatial objects, mining models, ...
- Various data types: documents, images, ...
- Integrating heterogeneous data sources
- Extended query languages needed
- Seamless integration with SQL

Overview: Object-Relational Technology
- Motivation
- Table Expression and Recursion
- Large Objects
- Structured Types
- Hands-on Training

Object-Relational Technology
- Seamless integration with SQL and database
- Extend RDBMS to handle nontraditional data types
- Co-existence of plug-in data with traditional data in a table
- Combined search in a single SQL statement
- Leverage existing data, applications, skills
[Figure: a DBMS with built-in data types and routines/methods, extended by plug-ins such as spatial, data mining, and still image.]

Hierarchy of Types and Tables
The identifiers are German: Straßen = roads, Ortsstraßen = local roads, Länge = length, Breite = width, Gebühr = toll, Ort = town.

  CREATE TYPE Straßen_T AS (
    Name   CHAR (40),
    Länge  DECIMAL (9,2),
    Breite DECIMAL (5,2))
    NOT FINAL;

  CREATE TYPE Autobahn_T UNDER Straßen_T AS (
    Gebühr Money_T)
    NOT FINAL;

  CREATE TYPE Ortsstraßen_T UNDER Straßen_T AS (
    Ort Orte_T)
    NOT FINAL;

  SELECT * FROM Straßen WHERE Breite > 7.50

Straßen
  OID  Name         Länge  Breite
  O21  Schillerweg  7.75   5.00

Autobahn (is_a Straßen)
  OID  Name  Länge   Breite  Gebühr
  O08  A6    324.00  18.20   20
  O71  A8    564.50  20.10   10

Ortsstraßen (is_a Straßen)
  OID  Name          Länge  Breite  Ort
  O12  Schillerstr.  2.50   7.50    Köln
  O13  Mozartstr.    3.25   8.75    München

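The effect of the table hierarchy on the query can be imitated with Python class inheritance. In SQL:1999 a query on a supertable also returns the rows of its subtables unless the query restricts it, and the sketch mirrors that with isinstance. The class and attribute names are ASCII transliterations chosen here; the rows are those of the three tables above.

```python
from dataclasses import dataclass

@dataclass
class Strassen:                  # corresponds to Straßen_T
    name: str
    laenge: float                # Länge
    breite: float                # Breite

@dataclass
class Autobahn(Strassen):        # Autobahn_T UNDER Straßen_T
    gebuehr: int                 # Gebühr

@dataclass
class Ortsstrassen(Strassen):    # Ortsstraßen_T UNDER Straßen_T
    ort: str                     # Ort

rows = [
    Strassen("Schillerweg", 7.75, 5.00),
    Autobahn("A6", 324.00, 18.20, 20),
    Autobahn("A8", 564.50, 20.10, 10),
    Ortsstrassen("Schillerstr.", 2.50, 7.50, "Köln"),
    Ortsstrassen("Mozartstr.", 3.25, 8.75, "München"),
]

# SELECT * FROM Straßen WHERE Breite > 7.50
# Every Autobahn and Ortsstrassen row is also a Strassen row.
wide = [r.name for r in rows if isinstance(r, Strassen) and r.breite > 7.50]

# Querying a subtable only sees rows of that subtable.
autobahn_only = [r.name for r in rows if isinstance(r, Autobahn)]
print(wide, autobahn_only)
```

Schillerstr. is excluded because its width is exactly 7.50, and Schillerweg because it is narrower.
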
Object-Relational Extensions
- Improve application development productivity
  - Encapsulate specialist expertise for use by non-experts
  - Vendors develop and support plug-ins so that you don't have to
  - Consistent semantics ready for use
- Open architecture
  - Write your own plug-ins
- Provide performance and scalability (combined with security, availability, ...)
  - Optimization
  - Parallelism
  - Industrial strength

Overview: XML and Databases
- XML Data Model
- Path Expressions
- XQuery
- XML Support in DBMSs
- XML Processing
- Hands-on Training

Example XML Document and XML Schema

XML schema document (library.xsd):

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="name" type="xs:string"/>
    <xs:element name="born" type="xs:date"/>
    <xs:element name="dead" type="xs:date"/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:element name="author">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="name"/>
          <xs:element ref="born"/>
          <xs:element ref="dead" minOccurs="0"/>
        </xs:sequence>
        <xs:attribute ref="id"/>
      </xs:complexType>
    </xs:element>
  </xs:schema>

XML document:

  <?xml version="1.0"?>
  <author id="cms">
    <name>Charles M Schulz</name>
    <born>1922-11-26</born>
    <dead>2000-02-12</dead>
  </author>

XQuery: FLWOR Expression
Syntax: (FOR clause | LET clause)+ [WHERE clause] [ORDER BY clause] RETURN clause
- FOR clause, LET clause: generate a list of tuples of bound variables (order preserving) by iterating over a set of nodes (possibly specified by a path expression), or by binding a variable to the result of an expression.
- WHERE clause: applies a predicate to filter the tuples produced by FOR/LET.
- ORDER BY clause: imposes an order on the surviving tuples.
- RETURN clause: is executed for each surviving tuple and generates an ordered list of outputs.

XQuery: FLWOR Expression (example)

  for $x in doc("bank2.xml")/bank/account
  let $acctno := $x/@account-number
  where $x/balance > 400
  return <account-number> {string($acctno)} </account-number>

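To see how the clauses interact, the example query can be traced in Python with the standard xml.etree.ElementTree module. This is an imitation of the evaluation order, not an XQuery engine, and the content of bank2.xml is not given in the slides, so the three accounts below are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical content for bank2.xml.
BANK_XML = """
<bank>
  <account account-number="A-101"><balance>500</balance></account>
  <account account-number="A-102"><balance>350</balance></account>
  <account account-number="A-201"><balance>900</balance></account>
</bank>
"""

bank = ET.fromstring(BANK_XML)
results = []
for x in bank.findall("account"):            # for $x in .../bank/account
    acctno = x.get("account-number")         # let $acctno := $x/@account-number
    if float(x.findtext("balance")) > 400:   # where $x/balance > 400
        out = ET.Element("account-number")   # return <account-number>...
        out.text = acctno
        results.append(ET.tostring(out, encoding="unicode"))

print(results)
```

The loop binds one tuple per account in document order, the condition drops A-102, and one output element is built for each surviving tuple.
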
XML and Databases
- Describe schema: XML Schema
- Query languages: XPath, XQuery
- Storage technologies
- Indexing technologies
- Native XML DBMS
- Hybrid systems: relational + XML
[Figure: layered DBMS architecture (Query Optimization and Execution, Operators, Files and Access Methods, Buffer Management, Disk Space Management) on top of the DB.]

Overview: Content Management
- Introduction and Basics
- Information Retrieval
- Technology:
  - Indexing
  - Search
- Hands-on Training

Elements of an ECM System
[Figure: components of an ECM system.]
- APIs, application development, workflow components
- Functions: capture, index, search/access, create, rights management, workflow
- Library server
- Resource manager
- Device support, storage management
- Management utilities

Overview: Applications Development
- Data-intensive Applications
- Web-based Applications
- Technology Overview
  - JDBC
  - JEE/EJB
  - Web Services
- Hands-on Training