IV Distributed Databases - Motivation & Introduction -

Contents
  - Motivation
  - Expected benefits
  - Technical issues
  - Types of distributed DBS
  - 12 rules of C. Date
  - Parallel vs. distributed DBS

References
  - M.T. Özsu, P. Valduriez: Principles of Distributed Database Systems, 2nd edition, Prentice-Hall, 1999
  - E. Rahm: Mehrrechner-Datenbanksysteme, Addison-Wesley, 1994
  - G. Weikum, G. Vossen: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001, ISBN 1558605088
  - J. Gray, A. Reuter: Transaction Processing - Concepts and Techniques, Morgan Kaufmann Publishers, San Mateo, 1993
  - P.A. Bernstein, V. Hadzilacos, N. Goodman: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987
  - P.A. Bernstein, E. Newcomer: Principles of Transaction Processing, Morgan Kaufmann, San Mateo, 1997

Material used from B. Kemme (McGill), H. Garcia-Molina (Stanford), A. Zaslavsky et al. (Monash), G. Alonso (ETH)

hs / FUB dbsii-03-10ddbintro-2

Motivation
  - Application: data "naturally" distributed
      - Companies with different branches
      - Airlines
      - Financial business
      - University / faculties
      - Any organization with a decentralized organizational structure
  - Technology: network infrastructure, processors, RAM
  - Economy: hardware cost
  - Software supporting distributed processing, e.g. RPC
  - Huge number of interconnected systems
  - Recent challenge: Web-based computing, e-commerce

Goals: improvement of non-functional characteristics
  - Performance: the more computing power, the better
      - Primary goal for parallel DBS, not necessarily for distributed DBS
  - Reliability: substitute faulty components (hardware, software and network) seamlessly
      - Fault tolerance: the ability to hide failures from users
      - Related to higher availability
      - Is 95.8 % availability too low? Definitely: it means 1 hour of downtime per day!
  - Scalability: upscale / downscale your system incrementally
      - Central components and algorithms are counterproductive
      - Requires distributed algorithms
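
The availability figure above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the additional availability levels in the loop are illustrative assumptions, not part of the slides.

```python
HOURS_PER_DAY = 24.0


def availability(downtime_hours_per_day: float) -> float:
    """Fraction of a day during which the system is up."""
    return (HOURS_PER_DAY - downtime_hours_per_day) / HOURS_PER_DAY


def downtime_minutes_per_day(avail: float) -> float:
    """Average downtime per day implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_DAY * 60.0


# One hour of downtime per day is 23/24 availability.
print(f"{availability(1.0):.1%}")  # prints 95.8%

# Illustrative levels (assumed, not from the slides).
for avail in (0.99, 0.999, 0.9999):
    print(f"{avail:.2%} -> {downtime_minutes_per_day(avail):.2f} min/day")
```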

The dark side of distribution
  - Systems are often less reliable
      - "You will never make a system of unreliable components more reliable by adding more unreliable components"
      - However: hot standby
      - But: data copies must be kept consistent, complex software, unreliable network
  - Scalability
      - Distributed systems are inherently complex
      - High development cost -> middleware efforts
      - High administration cost, lack of flexibility

The dark side: performance
  - Doubling the resources does not guarantee double performance
  - Network performance? Transfer time does not depend on bandwidth alone
  - Transfer of a 4 KB page:

    Distance     Latency    Bandwidth    Transfer time
    100 m        0.5 µs     10 Mbps      5 ms
    100 m        0.5 µs     100 Mbps     0.5 ms
    1 km         5 µs       100 Mbps     0.5 ms
    100 km       0.5 ms     100 Mbps     1 ms
    1000 km      5 ms       100 Mbps     5.5 ms
    10000 km     50 ms      1 Gbps       50 ms

  - Distance > 100 km: signal propagation time dominates
  - Compare: mean disk access time ~ 5 ms
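
The transfer-time table can be approximated with a simple model: total time = propagation latency + page size / bandwidth. The sketch below assumes a signal speed of 2 * 10^8 m/s (consistent with the 0.5 µs per 100 m in the table) and ignores protocol overhead, so the computed transmission times (about 3.3 ms at 10 Mbps, about 0.33 ms at 100 Mbps) come out lower than the rounded 5 ms and 0.5 ms on the slide. All names are illustrative.

```python
SIGNAL_SPEED_M_PER_S = 2.0e8   # assumption: roughly 2/3 of the speed of light
PAGE_BITS = 4 * 1024 * 8       # one 4 KB page


def propagation_delay_s(distance_m: float) -> float:
    """Time for the signal to travel the distance."""
    return distance_m / SIGNAL_SPEED_M_PER_S


def transmission_time_s(bits: int, bandwidth_bps: float) -> float:
    """Time to push all bits onto the link."""
    return bits / bandwidth_bps


def transfer_time_s(distance_m: float, bandwidth_bps: float,
                    bits: int = PAGE_BITS) -> float:
    """Simplified total transfer time, without protocol overhead."""
    return propagation_delay_s(distance_m) + transmission_time_s(bits, bandwidth_bps)


# (distance in m, bandwidth in bit/s), same rows as the table above
rows = [(100, 10e6), (100, 100e6), (1_000, 100e6),
        (100_000, 100e6), (1_000_000, 100e6), (10_000_000, 1e9)]

for distance_m, bandwidth_bps in rows:
    latency_ms = propagation_delay_s(distance_m) * 1e3
    total_ms = transfer_time_s(distance_m, bandwidth_bps) * 1e3
    print(f"{distance_m / 1000:>8.1f} km  latency {latency_ms:8.4f} ms  "
          f"total {total_ms:8.4f} ms")
```

In this model the propagation delay exceeds the transmission time once the distance is large enough, which is the point of the last rows of the table.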

What is a Distributed Database?
  - A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
  - A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
  - Distributed database system (DDBS) = DDB + DDBMS
  - Definition by M.T. Özsu and P. Valduriez

Example (1)
  - Transparency of distribution: one logical DB

      UPDATE empl
      SET sal = sal * 1.1
      WHERE id IN (SELECT ass.eid
                   FROM ass JOIN proj ON proj.id = ass.pid
                   WHERE proj.dur > 12)

  - Sites connected by a network:
      - Berlin: all projects, Berlin employees, all assignments
      - New York: NY employees
      - Munich: Munich projects, Munich employees, Munich assignments
  - Example by B. Kemme
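
To illustrate what "one logical DB" means for the user, the following sketch runs the UPDATE from Example (1) against a single in-memory SQLite database, written with a subquery so that it is valid SQL. The table and column names empl, proj, ass, sal, dur, id, eid and pid come from the slide; the name and site columns and all sample rows are assumptions made for illustration. A real DDBMS would have to route the statement to the Berlin, New York and Munich fragments behind this interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE empl (id INTEGER PRIMARY KEY, name TEXT, site TEXT, sal REAL);
CREATE TABLE proj (id INTEGER PRIMARY KEY, site TEXT, dur INTEGER);
CREATE TABLE ass  (eid INTEGER REFERENCES empl(id),
                   pid INTEGER REFERENCES proj(id));

-- hypothetical sample data
INSERT INTO empl VALUES (1, 'Anna', 'Berlin',   1000.0),
                        (2, 'Bob',  'New York', 2000.0),
                        (3, 'Cem',  'Munich',   3000.0);
INSERT INTO proj VALUES (10, 'Berlin', 24), (11, 'Munich', 6);
INSERT INTO ass  VALUES (1, 10), (2, 11), (3, 10), (3, 11);
""")

# Raise the salary of every employee assigned to a project longer than 12.
# The IN subquery ensures each employee is raised at most once.
conn.execute("""
UPDATE empl
SET sal = sal * 1.1
WHERE id IN (SELECT ass.eid
             FROM ass JOIN proj ON proj.id = ass.pid
             WHERE proj.dur > 12)
""")
conn.commit()

salaries = dict(conn.execute("SELECT id, sal FROM empl ORDER BY id"))
print(salaries)
```

With this sample data, employees 1 and 3 work on the long project and receive the raise, while employee 2 is unchanged.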

Example (2)
  - Cooperation: autonomous DBs cooperating on particular tasks

      SELECT *
      FROM flights
      WHERE departure = 'Montreal' AND arrival = 'Munich'
        AND date = '12/9/2002' AND price < 800    -- price in $

  - Sites connected by a network: lufthansa.com, Travel-overland.com, air-canada.com

Example (3)
  - Autonomous, heterogeneous systems, logically identical data types

      UPDATE empl
      SET sal = sal * 0.9
      WHERE jobtitle = 'product manager'

  - Sites connected by a network:
      - Daimler / Stuttgart: only Stuttgart data, IBM DB2
      - Daimler / Bremen: only Bremen data, MySQL
      - Chrysler / Detroit: only Detroit data, Oracle 9i

Example (4)
  - Sophisticated client / server computing
  - [Figure: four clients, Application Server A and Application Server B]
  - Possible R/W conflict

Classification criteria
  - Distribution
      - Physically independent systems
      - Peer-to-peer: data distribution and sharing
      - Client / server: function distribution, e.g. parsing in the client
  - Heterogeneity
      - DBMS software
      - Database schema (types) and languages (SQL variants)
  - Autonomy
      - No global control
      - Local DBS operations may not be influenced by global operations (e.g. of a global transaction)
      - Note: subsumes completely independent or semi-autonomous systems, see scenarios

Classification cube (by M.T. Özsu, P. Valduriez)
  - Distributed DB: looks like one DB
  - Federated DB: more autonomy, but not independent (Example 3)
  - Multi DB: independent, cooperative (Example 2)

Scenarios and common problems
  - There is not just one kind of distributed database system, but indefinitely many
  - Understand common problems, e.g. how to guarantee one state for replicated data from the user's point of view
  - Solve them by developing distributed algorithms, e.g. transaction commit
  - Main issue: are there any unsolvable problems? Partial failure
  - Example: Internet marriage with bride, priest and groom
      - All participants and the communication are unreliable
      - Distributed transaction: YES or NO, that is the question
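
The marriage example maps naturally onto an atomic commit protocol. The sketch below uses two-phase commit, a well-known protocol that the slide itself does not name: the coordinator (the priest) collects votes and decides COMMITTED only if every participant answers yes. The class and function names are hypothetical, and the sketch deliberately leaves out coordinator failure, timeouts and logging, which are exactly the partial-failure cases that make the real problem hard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Participant:
    name: str
    will_vote_yes: bool = True
    reachable: bool = True      # False models a crashed site or lost message
    state: str = "INIT"

    def prepare(self) -> Optional[bool]:
        """Phase 1: return the vote, or None if no answer arrives."""
        if not self.reachable:
            return None
        self.state = "READY" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def finish(self, decision: str) -> None:
        """Phase 2: apply the coordinator's decision if it can be delivered."""
        if self.reachable:
            self.state = decision


def two_phase_commit(participants: List[Participant]) -> str:
    """Coordinator logic: commit only on unanimous yes votes."""
    votes = [p.prepare() for p in participants]
    decision = "COMMITTED" if all(v is True for v in votes) else "ABORTED"
    for p in participants:
        p.finish(decision)
    return decision


bride, groom = Participant("bride"), Participant("groom")
print(two_phase_commit([bride, groom]))
```

A missing answer is treated like a no vote, so an unreachable participant forces the global decision to ABORTED.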

12 + 1 rules for DDBS (C. Date)
  - Rule 0: A DDB looks like a central DB to users
  - Rule 1: Sites should be as independent as possible - local autonomy
  - Rule 2: There should not be a central master that all sites depend on - no reliance on a central site
  - Rule 3: Never a need for a complete shutdown - continuous operation
  - Rule 4: Users should not need to know where data are stored - location transparency (independence)
  - Rule 5: If data are split (e.g. columns of one relation) and distributed over several sites, users should not be aware of it - fragmentation transparency
  - Rule 6: Users should not be aware of replicated data - replication independence
  - Rule 7: Efficient distributed query processing
  - Rule 8: Global concurrency control and recovery - distributed transaction management
  - Rule 9: Hardware independence
  - Rule 10: OS independence
  - Rule 11: Network independence
  - Rule 12: DBMS independence

Parallel versus Distributed Databases
  - More similarities than differences
  - Similar to the parallel / distributed processing distinction
  - Parallel DBS
      - Not geographically distributed
      - Goal: high performance
      - Homogeneous software
      - Fast interconnect
  - Distributed DBS
      - Data geographically distributed
      - Goal: data sharing
      - Disconnected operation possible -> autonomy
      - Transparency

Parallel / distributed DBS: query processing in parallel DBS
  - Distribute operators (sort, filter, ...) and data over processors to make complex processing fast
  - E.g. join on a shared-disk multiprocessor system with processors P and memories M_1 ... M_n

      Join (R, S) {   // R >> S
        1. Split R into n-1 partitions R_i and assign them to M_i / P_i;
           assign S to processor / memory P_n / M_n;
        2. Sort R_i and S;   // n sorts in parallel
        3. Join (n-1) + 1 streams
      }
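
The three steps of the join above can be sketched as follows. This is an illustrative single-machine sketch, not the slide's implementation: worker threads stand in for the processors P_1 ... P_n, rows are assumed to be (key, payload) tuples, and the round-robin split and all function names are assumptions. Because of Python's global interpreter lock the threads show the structure of the algorithm rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Row = Tuple[int, str]  # (join key, payload); assumed row layout


def split(rows: List[Row], parts: int) -> List[List[Row]]:
    """Step 1: round-robin split of R into `parts` partitions."""
    return [rows[i::parts] for i in range(parts)]


def merge_join(r: List[Row], s: List[Row]) -> List[Tuple[int, str, str]]:
    """Step 3: join two key-sorted inputs, handling duplicate keys."""
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key = r[i][0]
            j_end = j
            while j_end < len(s) and s[j_end][0] == key:
                j_end += 1
            while i < len(r) and r[i][0] == key:
                for k in range(j, j_end):
                    result.append((key, r[i][1], s[k][1]))
                i += 1
            j = j_end
    return result


def parallel_join(r: List[Row], s: List[Row], n: int) -> List[Tuple[int, str, str]]:
    """Join R and S with n workers (n >= 2), assuming R is much larger than S."""
    partitions = split(r, n - 1)                   # R_1 ... R_(n-1)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Step 2: n sorts in parallel, one per R partition plus one for S.
        sorted_runs = list(pool.map(sorted, partitions + [s]))
    *r_runs, s_sorted = sorted_runs
    r_sorted = list(heapq.merge(*r_runs))          # combine the n-1 sorted R streams
    return merge_join(r_sorted, s_sorted)


r = [(3, "r3"), (1, "r1"), (2, "r2"), (2, "r2b")]
s = [(2, "s2"), (3, "s3"), (4, "s4")]
print(parallel_join(r, s, n=3))
```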

Parallel / distributed DBS: distributed query processing
  - Given a data distribution
  - Find a strategy to evaluate the query with minimal cost, in particular communication cost
  - Example: compute R ⋈ S ⋈ T with minimal cost (time)
      - R = 10000 records, S = 100000 records, T = 1000 records
      - [Figure: relations R, S and T, with distances of 100 km and 10000 km]

Important terms
  - Motivation: technology, application, economy
  - Expected benefits: scalability, reliability, performance
  - Data / function distribution
  - Fault tolerance in case of partial failures
  - Autonomy, multi database, federated DB
  - Distribution transparency
  - Parallel versus distributed DBS