Advanced Database Management Systems
- Claude Marshall
- 9 years ago
Advanced Database Management Systems
Distributed DBMS: Introduction and Architectures
Alvaro A A Fernandes
School of Computer Science, University of Manchester
AAAF (School of CS, Manchester) Advanced DBMSs / 121

Outline
- Introduction to Distributed DBMSs
- Distributed DBMS Architectures
Introduction to Distributed DBMSs

Distributed Computing
Definition: A number of distinct processing elements, possibly administratively autonomous and possibly heterogeneous, interconnected by a computer network and cooperating in the performance of assigned tasks.
Several aspects of a DBMS can be distributed, e.g.:
- Control (e.g., over updates, allocation of resources, etc.)
- Processing logic (e.g., algebraic operators, data movements, etc.)
- Services (e.g., optimization, access control, etc.)
- Data (e.g., tuples, columns, relations, etc.)

What is a Distributed Database Management System?
Definition: A distributed database (DDB) is a collection of multiple, distinct, but logically interrelated databases, placed in different physical locations and linked by a computer network. A distributed database management system (DDBMS) is the software that manages the DDB and provides mechanisms that make this distribution transparent to applications and end users.
Introduction to Distributed DBMSs

DDBMS Environment (1)
Note that a (centralized) DBMS may be networked without being a DDBMS. This happens when there is no logical view over the data and resources that could be accessed and shared over a network.

DDBMS Environment (2)
When there is a logical view over data and resources, they can then be accessed and shared over a network and co-operation takes place.
Introduction to Distributed DBMSs

DDBMS Environment (3): Implicit Assumptions
- Data may be stored at a number of sites.
- Each site logically has a distinct (assumed single) processor.
- Processors at different sites are interconnected by a computer network, therefore no multiprocessors, no specialist interconnect, no specialist parallel hardware.
- The DDB is a DB, not a collection of data files, therefore the data is logically related (e.g., as manifested in the access patterns that are characteristic of the relational data model).

DDBMS Environment (4): Applications/Promises
Any organization which has a decentralized structure is a good a priori candidate for using DDBMSs. A DDBMS promises:
- Transparent management of distributed, fragmented, and replicated data
- Improved reliability/availability through distributed processes
- Improved performance by exploiting locality and parallelism
- Easier and more economical system expansion through scale out (i.e., more of the same "boxes") rather than scale up (i.e., ever bigger, more expensive "boxes").
Introduction to Distributed DBMSs

DDBMS Environment (5): Transparency
Transparency stems from abstraction, i.e., the separation of the higher-level semantics of a system from the lower-level implementation concerns. In a DDBMS environment, a fundamental issue is to provide data independence through several kinds of transparency:
- Network (or distribution) transparency
- Replication transparency
- Fragmentation transparency
  - horizontal, through selection
  - vertical, through projection
  - hybrid, combining both

DDBMS Environment (6): Example Relations

EMP =
ENO  ENAME      TITLE
E1   J. Doe     Elect. Eng.
E2   M. Smith   Syst. Anal.
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E5   B. Casey   Syst. Anal.
E6   L. Chu     Elect. Eng.
E7   R. Davis   Mech. Eng.
E8   J. Jones   Syst. Anal.

PROJ =
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    …       Tokyo
P2   Database Develop.  …       Oslo
P3   CAD/CAM            …       Oslo
P4   Maintenance        …       Paris
P5   CAD/CAM            …       Paris

ASG =
ENO  PNO  RESP        DUR
E1   P1   Manager     12
E2   P1   Analyst     24
E2   P2   Analyst     6
E3   P3   Consultant  10
E3   P4   Engineer    48
E4   P2   Programmer  18
E5   P2   Manager     24
E6   P4   Manager     48
E7   P3   Engineer    36
E7   P5   Engineer    23
E8   P3   Manager     40

PAY =
TITLE        SAL
Elect. Eng.  …
Syst. Anal.  …
Mech. Eng.   …
Programmer   …
Introduction to Distributed DBMSs

DDBMS Environment (7): Transparent Access
Example: Find the name and salary of employees on assignments not lasting 12 months.

DDBMS Environment (8): Potentially Improved Performance
- Locality: Data can be kept close to its points of use by means of data distribution strategies whilst still benefitting all by means of data integration strategies.
- Parallelism: Distribution allows for both inter-query parallelism (i.e., when whole query evaluation plans (QEPs) run in distinct sites) and intra-query parallelism (i.e., when QEP fragments of the same query run in distinct sites).
Introduction to Distributed DBMSs

DDBMS Environment (9): Scale-Out System Expansion
Scaling-out (i.e., deriving a positive response in performance from more of the same processing elements) is generally considered to be easier than scaling-up (i.e., deriving a positive response in performance from the same number of larger processing elements). With the widespread availability of high-performance commodity hardware, scale-out is more appealing now than in the past.

Distributed DBMS Architectures

DBMS Abstraction Through Schema Levels: The ANSI/SPARC Architecture
A DBMS supports abstractions by means of schemas that define different views at different levels. A DDBMS must provide transparency without breaking such expectations, and indeed while supporting them.
Distributed DBMS Architectures

DBMS Implementation Alternatives (1)
The DBMS implementation dimensions that matter the most in DDBMSs are:
- distribution: of various kinds (e.g., processing, data)
- heterogeneity: of various kinds (e.g., syntactic, semantic)
- autonomy: of various kinds (e.g., at instance level, at schema level)

DBMS Implementation Alternatives (2): Dimensions of the Problem
Distribution refers to whether the data and processing components of the system are located on the same machine or not.
There are many sources of heterogeneity, e.g.:
- infrastructural (e.g., different hardware, communications, OSs, etc.)
- syntactic (e.g., different data model, database languages, etc.)
- semantic (e.g., different names for the same concepts, different concepts with the same name)
Autonomy is the least understood, the most troublesome to contend with, and takes various forms:
- Design autonomy, i.e., the degree to which the design of a component DBMS can change without explicit co-ordination and control.
- Communication autonomy, i.e., the degree to which a component DBMS can decide whether and how to communicate with others.
- Execution autonomy, i.e., the degree to which a component DBMS can decide whether and how to execute operations locally.
Distributed DBMS Architectures

DBMS Implementation Alternatives (3)
There are some interesting triples in the space defined by the implementation dimensions (D = distribution, H = heterogeneity, A = autonomy) in the figure:
- centralized DBMSs: (D=none, H=none, A=none)
- client-server DBMSs: (D=some, H=some, A=none)
- federated DBMSs: (D=any, H=some, A=some)
- multi-DBMSs: (D=any, H=any, A=any)

Distributed DBMS Architectures (1): Federated DBMS from Component Schemas
In the figure, "federated" is used to connote the notion that the component DBMSs co-operate by neither exercising high degrees of autonomy nor inflicting high levels of heterogeneity on others. In this case, a global conceptual schema (GCS) arises from local conceptual schemas (LCS) and local internal schemas (LIS) by a negotiation process (or by imposition from the centre, if within organization boundaries). The external schemas (ES) can be more easily derived from the GCS.
Distributed DBMS Architectures

Distributed DBMS Architectures (2): Multi-DBMS from Component Schemas
In the figure, "multi-" is used to connote the notion that the component DBMSs are under no coercion as to their autonomy nor as to how heterogeneous they make themselves. In this case, a GCS does not normally arise by negotiation (e.g., there may be more than one GCS if the component DBMSs have public interfaces). The component DBMSs may still have local external schemas (LES) imposed upon them. The GCS too can have global external schemas (GES) imposed upon it.

Distributed DBMS Architectures (3): Multi-DBMS without a Global Schema
The absence of a global schema means that only partial views arise, i.e., there is no attempt at a unified description of all the component DBMSs. This is more likely in the case of ad-hoc, single-use scenarios, where there is no motivation to invest in creating a global view over the resources.
Distributed DBMS Architectures

Distributed DBMS Architectures (4): Multi-DBMS Execution Model
In a multi-DBMS there is a need to map a global request into local sub-requests and local sub-results into a global result. The component DBMSs still cater for local requests with local results.

Distributed DBMS Architectures (5): Time-Shared Access to a Centralized Database
In mere time-sharing of a centralized DBMS, all data and all applications reside remotely from the point of access. Requests are for batch tasks, and a response (not necessarily results) is sent back.
Distributed DBMS Architectures

Distributed DBMS Architectures (6): Multiple Clients/Single Server
In client-server approaches, clients are applications that interface through client-side services and communications with a server. The server runs server-side services in response to client requests. Because of the client-side services that support the application, high-level, fine-grained, interactive requests can be sent that cause results (i.e., filtered answers only) to flow back. In general, the client-side services offer query language interfaces (perhaps language-embedded, or form-based, or both).

Distributed DBMS Architectures (7): Pros/Cons of Client-Server Architectures
Pros:
- More efficient division of labor
- Client-side scale-up and scale-out
- Better price/performance on client machines
- Ability to use familiar tools on client machines
- Client access to remote data
- Full DBMS functionality provided to many
- Overall better system price/performance
Cons (vis-à-vis other distribution strategies):
- Possible bottleneck and single point of failure in the server
- Server-side scale-up and scale-out less easy
Distributed DBMS Architectures

Distributed DBMS Architectures (8): Multiple Clients/Multiple Servers
Distributing server-side load is possible. Mechanisms become more complex at the lower levels.

Distributed DBMS Architectures (9): Co-Operating Servers
Once servers start co-operating, one is coming close to a true DDBMS. The newest classes of DDBMSs have arisen in the last five years as a result of pressure to maintain extremely large repositories of either structured or unstructured data, supporting workloads that consist of either relatively few computationally intensive analyses or an extremely large number of relatively simple retrieval or update requests.
Distributed DBMS Architectures

Summary: Distributed DBMS Architectures
- DDBMSs have risen in importance due to structural changes in the computing landscape that saw the networking of high-quality PCs become the norm.
- Even so, they still retain their original role of emulating the operational decentralization of organizations.
- DDBMS architectures capitalize on localization and parallelization to offer a potential for performance gains.
- Nonetheless, autonomy and heterogeneity levels can create significant hurdles for full distribution.

Advanced Database Management Systems
Data Distribution Strategies
Alvaro A A Fernandes
School of Computer Science, University of Manchester
Outline
- Distributed DBMSs: The Design Problem
- Data Distribution Strategies
- Fragmentation and Allocation
- Fragmentation, in More Detail

Distributed DBMSs: The Design Problem

Distribution Strategies (1): The Design Problem
In the general setting, we need to decide:
- the placement of data and programs across the sites of a computer network
- as well as, possibly, the design of the network itself
In a DDBMS, the placement of applications entails:
- placement of the distributed DBMS software
- placement of the applications that run on the database
Distributed DBMSs: The Design Problem

Distribution Strategies (2): Dimensions of the Problem
- Whether only data is partitioned across sites (and programs are replicated everywhere) or whether programs are partitioned too
- Whether the access patterns are stable or not
- Whether knowledge of such access patterns is complete or not

Distribution Strategies (3): Design Approaches
- Top-Down: only possible, in practice, when the system is being designed from scratch, and only lasts if heterogeneity and autonomy are tightly controlled
- Bottom-Up: the only practical solution when the component databases already exist at a number of sites, and more likely to last when heterogeneity and autonomy cannot be controlled
Data Distribution Strategies

Data Distribution Strategies (4): Some Design Issues
- Why fragment at all?
- How to fragment?
- How much to fragment?
- How to ensure correctness of fragmentation?
- How to allocate fragments?
- What information is required?

Data Distribution Strategies (5): Fragmentation (1)
Why can't we just distribute relations? Because most relations are designed to be suitable for a great many applications, and different applications may be subject to different locality aspects and offer different parallelization opportunities.
What is a reasonable unit of distribution? Roughly, that view on a relation that is needed by one or more applications in one place.
Fragmentation and Allocation

Data Distribution Strategies (6): Fragmentation (2)
Consider the case of entire relations as the unit of distribution:
- Most relations have subsets whose semantics characterize special affinity (e.g., of location, of timing, etc.).
- For example, in a relation Employees, the attribute Department may characterize location affinity if different departments occupy different locations.
- If so, then unnecessary communication may be incurred if we distribute entire relations.

Data Distribution Strategies (7): Fragmentation (3)
Consider the case of sub-relations as the unit of distribution:
- A sub-relation, referred to as a fragment in the DDBMS context, is what is specified by a view (typically by selection or projection or both).
- Fragmentation can be derived from knowledge of applications and their affinities, and allows parallel/distributed execution.
- For example, if Employee is horizontally fragmented by the attribute Department, and each fragment is held where the corresponding department is located, computing the average salary in each department can be done in parallel.
- If, after fragmentation, a particular query/view cannot be defined over a single fragment, then extra processing will be needed.
- Also, semantic checks may be more difficult (e.g., enforcing referential integrity).
Fragmentation and Allocation

Data Distribution Strategies (8): Fragmentation Alternatives: Horizontal (1)
Broadly speaking, defined by a selection. Reconstruction is by union.
Example:
PROJ1 = σ_{BUDGET < …}(PROJ)
PROJ2 = σ_{BUDGET ≥ …}(PROJ)
PROJ = PROJ1 ∪ PROJ2

Data Distribution Strategies (9): Fragmentation Alternatives: Horizontal (2)
Example:

PROJ1 =
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    …       Tokyo
P2   Database Develop.  …       Oslo

PROJ2 =
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            …       Oslo
P4   Maintenance        …       Paris
P5   CAD/CAM            …       Paris

PROJ =
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    …       Tokyo
P2   Database Develop.  …       Oslo
P3   CAD/CAM            …       Oslo
P4   Maintenance        …       Paris
P5   CAD/CAM            …       Paris
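Horizontal fragmentation by selection, and reconstruction by union, can be sketched in a few lines of Python. This is only an illustration: the BUDGET figures and the threshold of 200 are hypothetical values chosen so that the fragments contain the same projects as in the slide; they are not the figures from the original example.

```python
# PROJ tuples: (PNO, PNAME, BUDGET, LOC).
# BUDGET values and THRESHOLD are hypothetical, for illustration only.
PROJ = [
    ("P1", "Instrumentation", 150, "Tokyo"),
    ("P2", "Database Develop.", 135, "Oslo"),
    ("P3", "CAD/CAM", 250, "Oslo"),
    ("P4", "Maintenance", 310, "Paris"),
    ("P5", "CAD/CAM", 500, "Paris"),
]
THRESHOLD = 200
BUDGET = 2  # position of the BUDGET attribute in each tuple


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# The two fragments are defined by complementary selection predicates.
PROJ1 = select(PROJ, lambda t: t[BUDGET] < THRESHOLD)
PROJ2 = select(PROJ, lambda t: t[BUDGET] >= THRESHOLD)

# Reconstruction is the union of the fragments.
reconstructed = sorted(set(PROJ1) | set(PROJ2))

print([t[0] for t in PROJ1])
print([t[0] for t in PROJ2])
print(reconstructed == sorted(PROJ))
```

Because the two predicates are complementary, every tuple lands in exactly one fragment, which is why the union gives back the original relation.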
Fragmentation and Allocation

Data Distribution Strategies (10): Fragmentation Alternatives: Vertical (1)
Broadly speaking, defined by a projection. Reconstruction is by a natural join on the replicated key.
Example:
PROJ1 = π_{PNO,BUDGET}(PROJ)
PROJ2 = π_{PNO,PNAME,LOC}(PROJ)
PROJ = PROJ1 ⋈_{PNO} PROJ2

Data Distribution Strategies (11): Fragmentation Alternatives: Vertical (2)
Example:

PROJ1 =
PNO  BUDGET
P1   …
P2   …
P3   …
P4   …
P5   …

PROJ2 =
PNO  PNAME              LOC
P1   Instrumentation    Tokyo
P2   Database Develop.  Oslo
P3   CAD/CAM            Oslo
P4   Maintenance        Paris
P5   CAD/CAM            Paris

PROJ =
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    …       Tokyo
P2   Database Develop.  …       Oslo
P3   CAD/CAM            …       Oslo
P4   Maintenance        …       Paris
P5   CAD/CAM            …       Paris
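Vertical fragmentation by projection, with reconstruction by a natural join on the replicated key PNO, can be sketched as follows. The helper names `project` and `natural_join` are ad hoc, and the BUDGET values are hypothetical placeholders rather than the figures of the original example.

```python
# PROJ rows as dictionaries; BUDGET values are hypothetical.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150, "LOC": "Tokyo"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135, "LOC": "Oslo"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250, "LOC": "Oslo"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500, "LOC": "Paris"},
]


def project(relation, attributes):
    """Relational projection onto the given attributes."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right, key):
    """Join two fragments on a shared key attribute (assumed unique in `right`)."""
    by_key = {row[key]: row for row in right}
    return [{**l, **by_key[l[key]]} for l in left if l[key] in by_key]


# Both fragments keep the key PNO so that the relation can be rebuilt.
PROJ1 = project(PROJ, ["PNO", "BUDGET"])
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])

rebuilt = natural_join(PROJ1, PROJ2, "PNO")
print(rebuilt == PROJ)
```

Dropping PNO from either fragment would make the join impossible, which is why the key is replicated in every vertical fragment.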
Fragmentation and Allocation

Data Distribution Strategies (12): Correctness of Fragmentation
- Completeness: The decomposition of a relation R into fragments R_1, R_2, ..., R_n is complete if and only if each data item in R can also be found in some R_i.
- Reconstructibility: If a relation R is decomposed into fragments R_1, R_2, ..., R_n, then there should exist some relational operator ∇ such that R = ∇_{1 ≤ i ≤ n} R_i.
- Disjointness: If a relation R is decomposed into fragments R_1, R_2, ..., R_n, and data item d_i is in R_j, then d_i should not be in any other fragment R_k (k ≠ j).

Data Distribution Strategies (13): Allocation Alternatives
- Non-replicated: the fragments form a proper partition, and each fragment resides at only one site.
- Replicated: the fragments overlap, either fully (i.e., each fragment exists at every site) or partially (i.e., each fragment exists at some sites only).
An often used rule of thumb is that if the number of proper (i.e., read-only) queries is larger than the number of updating queries, then replication tends to be advantageous in proportion; otherwise the opposite is the case.
Especially in the client/server case, caching is also part of the design considerations.
Web giants (e.g., Facebook, Amazon) use replication extensively.
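The three correctness rules for fragmentation stated on this page can be checked mechanically. The sketch below assumes horizontal fragments, where data items are whole tuples and the reconstruction operator is union; the function names and the sample ENO values are illustrative choices.

```python
from itertools import combinations


def is_complete(relation, fragments):
    """Every data item of the relation appears in at least one fragment."""
    return set(relation) <= set().union(*fragments)


def is_reconstructible(relation, fragments):
    """The union of the fragments gives back exactly the relation."""
    return set().union(*fragments) == set(relation)


def is_disjoint(fragments):
    """No data item appears in more than one fragment."""
    return all(set(a).isdisjoint(b) for a, b in combinations(fragments, 2))


# Data items are represented by their ENO values only, to keep the sketch short.
EMP = ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8"]

good = [["E1", "E2", "E3"], ["E4", "E5", "E6"], ["E7", "E8"]]
overlapping = [["E1", "E2", "E3", "E4"], ["E4", "E5", "E6", "E7", "E8"]]
lossy = [["E1", "E2"], ["E5"]]

for name, frags in [("good", good), ("overlapping", overlapping), ("lossy", lossy)]:
    print(name, is_complete(EMP, frags), is_reconstructible(EMP, frags), is_disjoint(frags))
```

A replicated allocation deliberately gives up disjointness, so a check like `is_disjoint` applies to the fragmentation itself rather than to the copies placed at sites.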
Fragmentation and Allocation

Data Distribution Strategies (14): Replication v. Caching: Some Contrasts

                   Replication      Caching
target             server           client or middle-tier
granularity        coarse           fine
storage device     typically disk   typically main memory
impact on catalog  yes              no
update protocol    propagation      invalidation
remove copy        explicit         implicit
mechanism          separate fetch   fault in and keep copy after use

Fragmentation, in More Detail

Data Distribution Strategies (15): Information Requirements
There are four kinds of information required:
- about the database
- about the applications (i.e., the queries, by and large)
- about the communication network
- about the computer system
Fragmentation, in More Detail

Fragmentation Kinds
- Horizontal Fragmentation (HF)
  - Primary Horizontal Fragmentation (PHF)
  - Derived Horizontal Fragmentation (DHF)
- Vertical Fragmentation (VF)
- Hybrid Fragmentation (HF)

Primary Horizontal Fragmentation (1): Information Requirements: Database
We draw a link from a relation R to a relation S if there is an equijoin on the key of R and the corresponding foreign key in S. We call R the owner, and S the member. We need the cardinalities of relations and the (average) length of their tuples.
Fragmentation, in More Detail

Primary Horizontal Fragmentation (2): Information Requirements: Application (1)
Given R with schema [A_1, ..., A_n], a simple predicate p_j has the form A_i θ c, where θ ∈ {=, ≠, <, >, ≤, ≥} and c ∈ Domain(A_i).
For a relation R, we define Pr = {p_1, ..., p_m}.
Given R and Pr, we define the set of minterm predicates M = {m_1, ..., m_r} as
M = { m_k | m_k = ∧_{p_j ∈ Pr} p_j* }, 1 ≤ j ≤ m, 1 ≤ k ≤ r,
where p_j* = p_j or else p_j* = ¬p_j.

Primary Horizontal Fragmentation (3): Information Requirements: Application (2)
Example: Some (but not all) simple predicates on PROJ are:
p_1: LOC = 'Tokyo'
p_2: LOC = 'Oslo'
p_3: LOC = 'Paris'
p_4: BUDGET ≤ …
Some (but not all) minterm predicates on PROJ are:
m_1: LOC = 'Tokyo' ∧ BUDGET ≤ …
m_2: ¬(LOC = 'Tokyo') ∧ BUDGET ≤ …
m_3: LOC = 'Tokyo' ∧ ¬(BUDGET ≤ …)
m_4: ¬(LOC = 'Tokyo') ∧ ¬(BUDGET ≤ …)
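The construction of minterm predicates from simple predicates can be sketched by enumerating every way of taking each simple predicate in its natural or its negated form. The budget threshold of 200 and the BUDGET values below are hypothetical, and the order in which the minterms are generated need not match the numbering m_1 to m_4 used in the slide.

```python
from itertools import product

# Simple predicates as (label, function) pairs. The threshold 200 is hypothetical.
SIMPLE_PREDICATES = [
    ("LOC = 'Tokyo'", lambda t: t["LOC"] == "Tokyo"),
    ("BUDGET <= 200", lambda t: t["BUDGET"] <= 200),
]

# PROJ rows with hypothetical BUDGET values.
PROJ = [
    {"PNO": "P1", "BUDGET": 150, "LOC": "Tokyo"},
    {"PNO": "P2", "BUDGET": 135, "LOC": "Oslo"},
    {"PNO": "P3", "BUDGET": 250, "LOC": "Oslo"},
    {"PNO": "P4", "BUDGET": 310, "LOC": "Paris"},
    {"PNO": "P5", "BUDGET": 500, "LOC": "Paris"},
]


def minterms(simple_predicates):
    """Return (label, function) pairs, one per conjunction of the simple
    predicates in which each is taken either as is or negated."""
    result = []
    for signs in product([True, False], repeat=len(simple_predicates)):
        label = " AND ".join(
            name if keep else f"NOT({name})"
            for (name, _), keep in zip(simple_predicates, signs)
        )

        def holds(t, signs=signs):
            # A tuple satisfies the minterm when each simple predicate
            # evaluates to the truth value the minterm requires of it.
            return all(p(t) == keep for (_, p), keep in zip(simple_predicates, signs))

        result.append((label, holds))
    return result


M = minterms(SIMPLE_PREDICATES)
for label, holds in M:
    print(label, "->", [t["PNO"] for t in PROJ if holds(t)])
```

With m simple predicates there are 2^m candidate minterms, and each tuple satisfies exactly one of them; minterms that no tuple can satisfy yield empty fragments and would normally be discarded.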
Fragmentation, in More Detail

Primary Horizontal Fragmentation (4): Information Requirements: Application (3)
We also need quantitative information about the application:
- The selectivity of a minterm m_i, denoted by sel(m_i), is the number of tuples in the corresponding relation R that would be produced by σ_{m_i}(R).
- The access frequency of an application q_i, denoted by acc(q_i), is the number of times that q_i accesses data in a given period.

Primary Horizontal Fragmentation (5)
Definition: A (primary) horizontal fragment R_j of a relation R is defined as R_j = σ_{m_i}(R), where m_i is a minterm predicate on R.
Given a set of minterm predicates M = {m_1, ..., m_r} over R, one can define r horizontal fragments of R. [Özsu and Valduriez, 1999] give an algorithm that, given a relation R and a set of simple predicates on R, produces a correct set of fragments from R.
Fragmentation, in More Detail

Primary Horizontal Fragmentation (6): Example (1)
Example (Information Required):
Let the relations PAY and PROJ be candidates for PHF. Let the following be the applications involved:
- A1: Find the name and budget of projects given their project number.
- A2: Find projects according to their budget.
Let A1 be issued at three sites. Let one site access A2 for budgets below …, and the other two access A2 for those above. Let the following be the simple predicates:
p_1: LOC = 'Tokyo'
p_2: LOC = 'Oslo'
p_3: LOC = 'Paris'
p_4: BUDGET ≤ …
p_5: BUDGET > …

Primary Horizontal Fragmentation (7): Example (2)
Example (Output): Applying the algorithm alluded to, the following minterm predicates result:
m_1: LOC = 'Tokyo' ∧ BUDGET ≤ …
m_2: LOC = 'Tokyo' ∧ BUDGET > …
m_3: LOC = 'Oslo' ∧ BUDGET ≤ …
m_4: LOC = 'Oslo' ∧ BUDGET > …
m_5: LOC = 'Paris' ∧ BUDGET ≤ …
m_6: LOC = 'Paris' ∧ BUDGET > …
Fragmentation, in More Detail

Primary Horizontal Fragmentation (8): Example (3)
Example (Fragments Obtained):

PROJ1 =
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    …       Tokyo

PROJ3 =
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  …       Oslo

PROJ4 =
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            …       Oslo

PROJ6 =
PNO  PNAME              BUDGET  LOC
P4   Maintenance        …       Paris
P5   CAD/CAM            …       Paris

Derived Horizontal Fragmentation (1)
Definition: A derived horizontal fragment is defined on a member relation according to a selection operation on its owner.
Recall that a link from owner to member is defined in terms of an equijoin.
A semijoin between R and S is defined as follows: R ⋉ S = π_A(R ⋈ S), where A is the list of attributes in the schema of R.
Given a link L, where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as R_i = R ⋉ S_i, 1 ≤ i ≤ w, where w is the maximum number of fragments to be generated and S_i = σ_{m_i}(S) is the primary horizontal fragment defined by the minterm predicate m_i.
Fragmentation, in More Detail

Derived Horizontal Fragmentation (2): Example (1)
Example (Information Required, Fragments Defined):
Let there be a link L_1 with owner(L_1) = PAY and member(L_1) = EMP.
Let PAY1 = σ_{SAL ≤ 30000}(PAY) and PAY2 = σ_{SAL > 30000}(PAY).
Then two DHFs are defined:
EMP1 = EMP ⋉ PAY1
EMP2 = EMP ⋉ PAY2

Derived Horizontal Fragmentation (3): Example (2)
Example (Fragments Obtained):

EMP1 =
ENO  ENAME      TITLE
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E7   R. Davis   Mech. Eng.

EMP2 =
ENO  ENAME     TITLE
E1   J. Doe    Elect. Eng.
E2   M. Smith  Syst. Anal.
E5   B. Casey  Syst. Anal.
E6   L. Chu    Elect. Eng.
E8   J. Jones  Syst. Anal.
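The derived horizontal fragmentation of EMP can be reproduced with a small semijoin helper. The SAL values below are hypothetical; they are chosen only so that Mech. Eng. and Programmer fall at or below 30000 and the other two titles above it, which is what the fragments obtained in the slide imply.

```python
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
]

# PAY tuples: (TITLE, SAL). The salaries are hypothetical.
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]

# Primary horizontal fragments of the owner relation PAY.
PAY1 = [p for p in PAY if p[1] <= 30000]
PAY2 = [p for p in PAY if p[1] > 30000]


def semijoin_on_title(emp, pay_fragment):
    """EMP semijoin PAY_i: keep the EMP tuples whose TITLE occurs in PAY_i."""
    titles = {title for title, _ in pay_fragment}
    return [e for e in emp if e[2] in titles]


# Derived horizontal fragments of the member relation EMP.
EMP1 = semijoin_on_title(EMP, PAY1)
EMP2 = semijoin_on_title(EMP, PAY2)

print([e[0] for e in EMP1])
print([e[0] for e in EMP2])
```

The semijoin returns only attributes of EMP, so the fragments have the same schema as EMP and follow the fragmentation of PAY without copying any salary data.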
Fragmentation, in More Detail

Vertical Fragmentation (1)
Vertical fragmentation has also been studied in the centralized context since it is important for:
- normalization of designs
- physical clustering
In terms of physical clustering, there is excitement in the DBMS industry (at the time of writing) about an extreme form of vertical partitioning in which single columns are stored separately. Certain access patterns are made easier by this, and compression levels an order of magnitude larger can be obtained, which is important when dealing with the massive volumes of data that are typical of analytics workloads.

Vertical Fragmentation (2)
Vertical fragmentation is more difficult than horizontal fragmentation, because more alternatives exist. Heuristic approaches that can be used are:
- grouping: one adds attributes to fragments one by one.
- splitting: one breaks down a relation into fragments based on access patterns.
See [Özsu and Valduriez, 1999] for an example (or else recall, from your earlier database studies, the theory of normal forms and how it is justifiably disobeyed).
Fragmentation, in More Detail

Summary: Data Distribution Strategies
- Fragmentation, allocation, replication and caching are all mechanisms that DDBMSs make use of to respond to the affinity of locality that data exhibits, particularly in decentralized organizations.
- The design decisions required are well-studied and well-founded solutions are available, but they require a great deal of information.
- The benefits can be significant, particularly for response time, because of the greater degree of natural parallelism that becomes possible.

Advanced Database Management Systems
Distributed Query Processing
Alvaro A A Fernandes
School of Computer Science, University of Manchester
Outline
- The Distributed Query Processing Problem
- Two-Phase Distributed Query Optimization
- Localization and Reduction
- Cost-Related Issues
- Join Ordering in DQP

The Distributed Query Processing Problem

Distributed Query Processing (1): What is the Problem? (1)
Assume the fragments EMP_i and ASG_j to be stored in the sites shown in the figure. Assume the double-shafted arrows to denote the transfer of data between sites.
Strategy 1 can be said to aim to do processing locally in order to reduce the amount of data that needs to be shipped to the result site, i.e., Site 5.
The Distributed Query Processing Problem

Distributed Query Processing (2): What is the Problem? (2)
Strategy 2 can be said to aim to ship all the data to, and do all the processing at, the site where results need to be delivered.

Distributed Query Processing (3): Cost of Alternatives (1)
Example (Assumptions):
- t_a (tuple access cost) = 1 unit
- t_t (tuple transfer cost) = 10 units
- |ASG| = 100, length(ASG) = 10, |EMP| = 80, length(EMP) = 5
- |ASG_1| = |σ_{ENO ≤ 'E3'}(ASG)| = 50
- |EMP_1| = |σ_{ENO ≤ 'E3'}(EMP)| = 40
- V(ASG, RESP) = 5
- length(ENO) = 2
The Distributed Query Processing Problem

Distributed Query Processing (4): Cost of Alternatives (2)
Example (Consequences):
size(ASG) = 100 × 10 = 1,000, size(EMP) = 80 × 5 = 400
|ASG_2| = |ASG| - |ASG_1| = 100 - 50 = 50
size(ASG_1) = size(ASG_2) = |ASG_1| × 10 = 50 × 10 = 500
|EMP_2| = |EMP| - |EMP_1| = 80 - 40 = 40
size(EMP_1) = size(EMP_2) = |EMP_1| × 5 = 40 × 5 = 200
|ASG'_1| = |ASG'_2| = |σ_{RESP='manager'}(ASG_1)| = |ASG_i| / V(ASG, RESP) = 50 / 5 = 10
|ASG'| = |ASG'_1| + |ASG'_2| = 10 + 10 = 20
length(EMP_i ⋈_{ENO} ASG'_i) = length(EMP) + length(ASG) - length(ENO) = 5 + 10 - 2 = 13

Distributed Query Processing (5): Cost of Alternatives (3)
Example (Comparison (1)):

Action                             Cost Formula                                                   Cost
produce ASG'_i                     2 × |ASG_i| × t_a = 2 × 50 × 1                                 100
transfer ASG'_i to sites 3, 4      2 × size(ASG'_i) × t_t = 2 × (10 × 10) × 10                    2,000
produce EMP'_i                     2 × |EMP_i| × |ASG'_i| × t_a = 2 × 40 × 10 × 1                 800
transfer EMP'_i to site 5          2 × size(EMP_i ⋈_{ENO} ASG'_i) × t_t = 2 × (10 × 13) × 10     2,600
Total Cost of Strategy 1                                                                          5,500
The Distributed Query Processing Problem

Distributed Query Processing (6): Cost of Alternatives (4)
Example (Comparison (2)):

Action                      Cost Formula                            Cost
transfer EMP to site 5      size(EMP) × t_t = 400 × 10              4,000
transfer ASG to site 5      size(ASG) × t_t = 1,000 × 10            10,000
produce ASG'                |ASG| × t_a = 100 × 1                   100
join EMP and ASG'           |EMP| × |ASG'| × t_a = 80 × 20 × 1      1,600
Total Cost of Strategy 2                                            15,700

Distributed Query Processing (7): Query Optimization Objectives
- Minimize a cost function such as total time or response time.
- All components may have different weights in different distributed environments.
- One could have different goals, e.g., maximize throughput.
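The cost comparison between the two strategies can be re-derived from the stated assumptions. The sketch below uses the parameter values given in the example; the variable names are ad hoc, and the code assumes (as the totals in the slides imply) that each selected ASG tuple joins with exactly one EMP tuple, so that the join result has as many tuples as the selected ASG fragment.

```python
# Parameters from the example.
T_ACCESS = 1        # t_a, tuple access cost
T_TRANSFER = 10     # t_t, tuple transfer cost
CARD_ASG, LEN_ASG = 100, 10
CARD_EMP, LEN_EMP = 80, 5
CARD_ASG_I = 50     # |ASG_1| = |ASG_2|
CARD_EMP_I = 40     # |EMP_1| = |EMP_2|
V_RESP = 5          # V(ASG, RESP), distinct RESP values
LEN_ENO = 2

# Derived quantities.
card_sel_i = CARD_ASG_I // V_RESP               # tuples per selected ASG fragment
card_sel = 2 * card_sel_i                       # tuples in the selected ASG overall
len_join = LEN_EMP + LEN_ASG - LEN_ENO          # length of a joined tuple
# Assumption: the join result has card_sel_i tuples per site.
size_join_i = card_sel_i * len_join

# Strategy 1: select and join locally, ship only the join results to site 5.
strategy_1 = (
    2 * CARD_ASG_I * T_ACCESS                   # produce the selected ASG fragments
    + 2 * (card_sel_i * LEN_ASG) * T_TRANSFER   # ship them to sites 3 and 4
    + 2 * CARD_EMP_I * card_sel_i * T_ACCESS    # join with EMP_i at sites 3 and 4
    + 2 * size_join_i * T_TRANSFER              # ship the join results to site 5
)

# Strategy 2: ship both relations to site 5 and do all the work there.
strategy_2 = (
    CARD_EMP * LEN_EMP * T_TRANSFER             # transfer EMP
    + CARD_ASG * LEN_ASG * T_TRANSFER           # transfer ASG
    + CARD_ASG * T_ACCESS                       # produce the selected ASG
    + CARD_EMP * card_sel * T_ACCESS            # join EMP with the selected ASG
)

print(strategy_1, strategy_2)
```

Under these figures communication accounts for 4,600 of the 5,500 units in Strategy 1 and 14,000 of the 15,700 units in Strategy 2, which illustrates why reducing data before shipping it matters when transfer is ten times as costly as access.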
The Distributed Query Processing Problem

Distributed Query Processing (8): Where Can Decisions Be Made?
- Centralized: A single site determines the schedule. This is simpler, but requires knowledge about the entire distributed database.
- Distributed: There is co-operation among sites to determine the schedule. This only requires sharing local information, but co-operation has a cost.
- Hybrid: One site determines the global schedule. Each site optimizes the local subqueries.

Distributed Query Processing (9): Issues Regarding the Network
Wide-Area Network (WAN):
- WANs have comparatively low bandwidth, low speed and high protocol overhead.
- As a result, communication cost will dominate, to the extent that it may be possible to ignore all other costs.
- Thus, the global schedule will aim to minimize communication cost.
- Local schedules are decided according to centralized query optimization decisions.
Local-Area Network (LAN):
- Communication cost is not as dominant as in WANs.
- Thus, all components in the total cost function must be considered.
- Broadcasting is an option.
36 Two-Phase Distributed Query Optimization
Distributed Query Optimization (1): Two-Phase Approach
One way to implement distributed query optimization as a continuum with the centralized case is to structure the decision-making stages so that the optimizer breaks the overall task into two phases.
In the first phase, a single-node QEP is produced (the one that would run if the DBMS were not a distributed DBMS); in the second phase, this single-node QEP is transformed into a multi-node one.
The second phase partitions the QEP into fragments linked by exchange operators, then schedules the fragments to execute on different component nodes.

Localization and Reduction
Distributed Query Optimization (2): Localizing a Global Query
Given an algebraic query on global relations:
- determine which relations are distributed;
- for those, determine which fragments are involved;
- replace references to global relations with the reconstruction expression (which is referred to as a localization program).
Each leaf that refers to a distributed relation is replaced by the localization program over that relation's fragments.
The result is sometimes referred to as a generic query and is likely to benefit from optimization by reduction.
37 Localization and Reduction
Distributed Query Optimization (3): Some Examples (1)
Assume EMP is horizontally fragmented into $EMP_1$, $EMP_2$ and $EMP_3$ as follows:
1. $EMP_1 = \sigma_{ENO \le 'E3'}(EMP)$
2. $EMP_2 = \sigma_{'E3' < ENO \le 'E6'}(EMP)$
3. $EMP_3 = \sigma_{ENO > 'E6'}(EMP)$
Assume ASG is horizontally fragmented into $ASG_1$ and $ASG_2$ as follows:
1. $ASG_1 = \sigma_{ENO \le 'E3'}(ASG)$
2. $ASG_2 = \sigma_{ENO > 'E3'}(ASG)$

Distributed Query Optimization (4): Some Examples (2)
Assume the following query:
SELECT E.ENAME FROM EMP E WHERE E.ENO = 'E5'
The figure shows the corresponding generic query, with the leaf replaced by its localization program. It then shows the query after optimization by reduction: it follows from the predicates that define the fragments that only $EMP_2$ can contribute to the specified results. In algebraic terms, $\pi_{ENAME}(\sigma_{ENO='E5'}(EMP_1 \cup EMP_2 \cup EMP_3))$ reduces to $\pi_{ENAME}(\sigma_{ENO='E5'}(EMP_2))$.
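The reduction of the selection query on ENO can be sketched as a check of the selection value against each fragment's defining predicate. The representation below is hypothetical (a `Range` with an exclusive lower bound and an inclusive upper bound), and it assumes that ENO values such as 'E3' and 'E5' compare as plain strings.

```python
from typing import Dict, List, NamedTuple, Optional


class Range(NamedTuple):
    """Fragment predicate low < ENO <= high; None means unbounded on that side."""
    low: Optional[str]
    high: Optional[str]

    def contains(self, value: str) -> bool:
        above_low = self.low is None or value > self.low
        below_high = self.high is None or value <= self.high
        return above_low and below_high


# Fragment definitions from the example.
EMP_FRAGMENTS: Dict[str, Range] = {
    "EMP_1": Range(None, "E3"),   # ENO <= 'E3'
    "EMP_2": Range("E3", "E6"),   # 'E3' < ENO <= 'E6'
    "EMP_3": Range("E6", None),   # ENO > 'E6'
}


def reduce_for_equality(fragments: Dict[str, Range], value: str) -> List[str]:
    """Keep only the fragments whose predicate is compatible with ENO = value."""
    return [name for name, rng in fragments.items() if rng.contains(value)]


if __name__ == "__main__":
    print(reduce_for_equality(EMP_FRAGMENTS, "E5"))
```

For the query with `E.ENO = 'E5'`, only `EMP_2` survives, which is the reduced query described on the slide.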
38 Localization and Reduction
Distributed Query Optimization (5): Some Examples (3)
Assume the following query:
SELECT E.ENAME FROM EMP E, ASG A WHERE E.ENO = A.ENO
The figure shows the corresponding generic query, with each leaf replaced by its localization program. We next show the query after reduction.

Distributed Query Optimization (6): Some Examples (4)
The figure shows the reduced join query.
Note that the optimizer has used the distributivity of join over union to push the joins upstream (i.e., towards the fragments) and reduce the amount of work.
This also helps in scheduling the joins to execute in parallel.
Note, finally, that the optimizer has made use of the fact that fragments such as $EMP_3$ and $ASG_1$ do not share tuples (their defining predicates lead to a contradiction, so the join would return an empty set) and has eliminated the need to join them.
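The elimination of fragment joins whose predicates contradict each other can be sketched as an interval-overlap test over the ENO ranges that define the EMP and ASG fragments. The tuple encoding of the ranges and the function names are hypothetical; the test is conservative, in that it only discards pairs that are provably empty.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# (low, high) encodes the predicate low < ENO <= high; None means unbounded.
Range = Tuple[Optional[str], Optional[str]]

EMP: Dict[str, Range] = {"EMP_1": (None, "E3"), "EMP_2": ("E3", "E6"), "EMP_3": ("E6", None)}
ASG: Dict[str, Range] = {"ASG_1": (None, "E3"), "ASG_2": ("E3", None)}


def may_overlap(r: Range, s: Range) -> bool:
    """False only when the two range predicates are contradictory."""
    lows = [low for low in (r[0], s[0]) if low is not None]
    highs = [high for high in (r[1], s[1]) if high is not None]
    if not lows or not highs:
        return True  # unbounded on one side: cannot be ruled out
    return max(lows) < min(highs)


def reduced_join(left: Dict[str, Range], right: Dict[str, Range]) -> List[Tuple[str, str]]:
    """Fragment pairs that still need to be joined after reduction."""
    return [
        (l_name, r_name)
        for (l_name, l_rng), (r_name, r_rng) in product(left.items(), right.items())
        if may_overlap(l_rng, r_rng)
    ]


if __name__ == "__main__":
    for pair in reduced_join(EMP, ASG):
        print(" JOIN ".join(pair))
```

Of the six possible fragment pairs, three remain, and these three joins are independent of each other, so they can be scheduled in parallel.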
39 Localization and Reduction
Distributed Query Optimization (7): Some Examples (5)
Assume EMP is vertically fragmented into $EMP_1$ and $EMP_2$ as follows:
1. $EMP_1 = \pi_{ENO,ENAME}(EMP)$
2. $EMP_2 = \pi_{ENO,TITLE}(EMP)$
Assume the following query:
SELECT E.ENAME FROM EMP E
The figure shows the corresponding generic query, with the leaf replaced by its localization program. It then shows the query after optimization by reduction: it follows from the projection lists that define the fragments that only $EMP_1$ can contribute to the specified results.

Distributed Query Optimization (8): A Detailed Example Derivation (1)
Assume PROJ is horizontally fragmented into $PROJ_1$, $PROJ_2$ and $PROJ_3$ as follows:
1. $PROJ_1 = \sigma_{LOC='Tokyo'}(PROJ)$
2. $PROJ_2 = \sigma_{LOC='Oslo'}(PROJ)$
3. $PROJ_3 = \sigma_{LOC='Paris'}(PROJ)$
Assume the following query:
SELECT AVG(P.BUDGET) FROM PROJ P WHERE P.LOC = 'Oslo'
40 Localization and Reduction
Distributed Query Optimization (9): A Detailed Example Derivation (2)
(translate) $\gamma_{AVG(BUDGET)}(\sigma_{LOC='Oslo'}(PROJ))$
(localize) $\gamma_{AVG(BUDGET)}(\sigma_{LOC='Oslo'}(PROJ_1 \cup (PROJ_2 \cup PROJ_3)))$
(expand) $\gamma_{AVG(BUDGET)}(\sigma_{LOC='Oslo'}(\sigma_{LOC='Tokyo'}(PROJ) \cup (\sigma_{LOC='Oslo'}(PROJ) \cup \sigma_{LOC='Paris'}(PROJ))))$
(combine) $\gamma_{AVG(BUDGET)}(\sigma_{LOC='Oslo' \wedge LOC='Tokyo'}(PROJ) \cup (\sigma_{LOC='Oslo' \wedge LOC='Oslo'}(PROJ) \cup \sigma_{LOC='Oslo' \wedge LOC='Paris'}(PROJ)))$
(simplify) $\gamma_{AVG(BUDGET)}(\sigma_{false}(PROJ) \cup (\sigma_{LOC='Oslo'}(PROJ) \cup \sigma_{false}(PROJ)))$
(simplify) $\gamma_{AVG(BUDGET)}(\emptyset \cup (\sigma_{LOC='Oslo'}(PROJ) \cup \emptyset))$
(simplify) $\gamma_{AVG(BUDGET)}(\sigma_{LOC='Oslo'}(PROJ))$
(simplify) $\gamma_{AVG(BUDGET)}(PROJ_2)$
This derivation shows that the query needs to be executed only over the Oslo horizontal fragment $PROJ_2$, at whichever site it is stored.

Cost-Related Issues
Distributed Query Optimization (10): Scheduling Query Fragments
Given a fragment query, find the best global schedule by minimizing a cost function.
Join processing in centralized DBMSs tends to prefer linear (e.g., left-deep) trees because the size of the search space is reduced by the linearity constraint.
However, in distributed DBMSs, join processing over bushy trees reveals opportunities for parallelism.
Other decisions include:
- Which relation to ship where?
- Whether to ship the whole relation or to ship as needed?
- Whether to use semijoins? (Semijoins save on communication at the expense of more local processing.)
41 Cost-Related Issues
Distributed Query Optimization (11): Cost Functions
Total Time (also referred to as Total Cost)
The overall strategy in this case is to:
- reduce the cost (i.e., time) in each component individually;
- do as little of each cost component as possible.
This optimizes the utilization of the resources and tends to increase system throughput.
Response Time
The overall strategy in this case is to do as many things as possible in parallel.
However, this may increase the total time because of the overall increase in activity.

Distributed Query Optimization (12): Total Cost
The total cost is the summation of all cost factors:
1. Total cost = CPU cost + I/O cost + communication cost
2. CPU cost = unit instruction cost $\times$ no. of instructions
3. I/O cost = unit disk I/O cost $\times$ no. of disk I/Os
4. Communication cost = (unit message initiation cost $\times$ no. of messages) + (unit transmission cost $\times$ no. of bytes)
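The total cost formula above translates directly into a small function. This is a minimal sketch: the `UnitCosts` container, its field names and the figures in the usage example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCosts:
    """Per-unit costs of the four cost factors (hypothetical container)."""
    instruction: float    # unit instruction cost
    disk_io: float        # unit disk I/O cost
    msg_init: float       # unit message initiation cost
    transmission: float   # unit transmission cost, per byte


def total_cost(unit: UnitCosts, instructions: int, disk_ios: int,
               messages: int, bytes_sent: int) -> float:
    """Total cost = CPU cost + I/O cost + communication cost."""
    cpu_cost = unit.instruction * instructions
    io_cost = unit.disk_io * disk_ios
    communication_cost = unit.msg_init * messages + unit.transmission * bytes_sent
    return cpu_cost + io_cost + communication_cost


if __name__ == "__main__":
    # Illustrative figures only, not taken from the slides.
    unit = UnitCosts(instruction=1, disk_io=10, msg_init=100, transmission=2)
    print(total_cost(unit, instructions=1000, disk_ios=20, messages=3, bytes_sent=500))
```

Changing the unit costs is how the different weights of the components in WAN and LAN settings would be expressed in such a model.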
42 Cost-Related Issues
Distributed Query Optimization (13): Response Time
The response time is the elapsed time between the initiation and the completion of a query.
Processing and communication costs that are incurred in sequence are added up, each counted once. If several tasks are executed in parallel, the cost that is counted is the maximum cost among those tasks.
1. Response time = CPU time + I/O time + communication time
2. CPU time = unit instruction time $\times$ no. of sequential instructions
3. I/O time = unit I/O time $\times$ no. of sequential I/Os
4. Communication time = (unit message initiation time $\times$ no. of sequential messages) + (unit transmission time $\times$ no. of sequential bytes)

Distributed Query Optimization (14): Some Cost Factors
Wide-area networks
- Message initiation and transmission costs are relatively high.
- Local processing cost is comparatively low (fast mainframes or minicomputers).
- The ratio of communication to I/O costs is high (2-digits to 1-digit?).
Local-area networks
- Communication and local processing costs are comparable.
- The ratio of communication to I/O costs is not high (close to 1:1?).
43 Cost-Related Issues
Distributed Query Optimization (15): Example: Total Cost v. Response Time
Assume that:
- only the communication cost is considered;
- one message conveys one unit of work (e.g., a tuple).
Let $UM$ denote the unit message initialization time and $UT$ the unit transmission time.
Let $T_{send}(r, s, t)$ denote the time to send $r$ from $s$ to $t$.
Total time $= (n + m)\,UM + (np + mq)\,UT$
Response time $= \max\{T_{send}(n, 1, 3),\ T_{send}(m, 2, 3)\}$
$T_{send}(n, 1, 3) = n\,UM + np\,UT$
$T_{send}(m, 2, 3) = m\,UM + mq\,UT$
If $n = 900$, $m = 1,000$, $p = 90$ and $q = 100$, then
Total time $= 1,900\,UM + 181,000\,UT$
Response time $= 1,000\,UM + 100,000\,UT$

Join Ordering in DQP
Distributed Query Optimization (16): Join Ordering in Fragment Queries
Given an n-ary relation $R$ with attributes $A_1, \ldots, A_n$, let $|R|$ denote the cardinality of $R$, and let $length(A_i)$ denote the (possibly average) length in bytes of a value from the domain of $A_i$, in which case the (possibly average) length of a tuple in $R$ is $length(R) = \sum_{i=1}^{n} length(A_i)$.
Let $size(R) = |R| \times length(R)$.
Given two relations $R$ and $S$ that are not co-located, we ship $R$ to the site of $S$ if $size(R) \le size(S)$, and we ship $S$ to the site of $R$ if $size(S) < size(R)$.
For many relations, there may be too many alternatives. Also, computing the cost of all alternatives and selecting the best one depends on computing the size of intermediate relations, which is difficult.
In practice, heuristics are needed.
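Two calculations from this page can be sketched in a few lines: the total time versus response time example, and the rule for choosing which of two relations to ship. The reading of `n` and `m` as message counts from sites 1 and 2, and of `p` and `q` as transmission units per message, is an assumption drawn from the formulas; all function names are hypothetical.

```python
from typing import Tuple

# A cost is kept symbolic as a pair (coefficient of UM, coefficient of UT).
Cost = Tuple[int, int]


def t_send(messages: int, units_per_message: int) -> Cost:
    """Time to send `messages` messages: messages*UM + messages*units*UT."""
    return (messages, messages * units_per_message)


def evaluate(cost: Cost, um: float, ut: float) -> float:
    """Plug concrete unit times into a symbolic cost."""
    return cost[0] * um + cost[1] * ut


n, m, p, q = 900, 1000, 90, 100
FROM_SITE_1 = t_send(n, p)
FROM_SITE_2 = t_send(m, q)

# Total time adds up both transfers.
TOTAL_TIME: Cost = (FROM_SITE_1[0] + FROM_SITE_2[0], FROM_SITE_1[1] + FROM_SITE_2[1])


def response_time(um: float, ut: float) -> float:
    """The transfers run in parallel, so only the slower one counts."""
    return max(evaluate(FROM_SITE_1, um, ut), evaluate(FROM_SITE_2, um, ut))


def size(cardinality: int, tuple_length: int) -> int:
    """size(R) = |R| * length(R)."""
    return cardinality * tuple_length


def relation_to_ship(size_r: int, size_s: int) -> str:
    """Ship R if size(R) <= size(S), otherwise ship S."""
    return "R" if size_r <= size_s else "S"


if __name__ == "__main__":
    print("total time coefficients (UM, UT):", TOTAL_TIME)
    print("response time for UM = UT = 1:", response_time(1, 1))
    print("ship:", relation_to_ship(size(80, 5), size(100, 10)))
```

Keeping the costs symbolic until the last step makes it easy to see that the transfer from site 2 determines the response time for any positive `UM` and `UT`, since both of its coefficients are the larger ones.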
44 Join Ordering in DQP
Distributed Query Optimization (17): Join Ordering: An Example (1)
Consider the query $PROJ \bowtie_{PNO} (ASG \bowtie_{ENO} EMP)$, which involves two joins.
The join graph shows the site where each relation is stored, and there is an edge between two relations if an equijoin on the edge label is required.
The many different execution alternatives are shown next, with a double-shafted arrow ($\Rightarrow$) denoting the shipment of the relation on the left to the site on the right, and the @ sign denoting that the expression on the left-hand side is evaluated at the site on the right-hand side.

Distributed Query Optimization (18): Join Ordering: An Example (2)
[Listing of the alternative execution strategies over EMP, ASG and PROJ; the last one evaluates $PROJ \bowtie (ASG \bowtie EMP)$ @ 2.]
45 Join Ordering in DQP
Distributed Query Optimization (19): Join Ordering: An Example (3)
1. An alternative to enumerating all possibilities is to use the heuristic of considering only the sizes of the operands and assuming that the cardinality of the join is the product of the input cardinalities.
2. In this case, relations are ordered by increasing size, and the order of execution is given by this ordering and the join graph.
3. For example, the order (EMP, ASG, PROJ) could use Strategy 1, and the order (PROJ, ASG, EMP) could use Strategy 4.

Distributed Query Optimization (20): Approaches Based on Semijoins (1)
Consider the join of two relations $R[A]$ (located at site 1) and $S[A]$ (located at site 2).
One could evaluate $R \bowtie_A S$ directly.
Alternatively, one could evaluate one of the equivalent semijoin programs:
$R \bowtie_A S \Leftrightarrow (R \ltimes_A S) \bowtie_A S \Leftrightarrow R \bowtie_A (S \ltimes_A R) \Leftrightarrow (R \ltimes_A S) \bowtie_A (S \ltimes_A R)$
46 Join Ordering in DQP
Distributed Query Optimization (21): Approaches Based on Semijoins (2)
1. Using a join:
1.1 $R \Rightarrow 2$
1.2 $R \bowtie_A S$ @ 2
2. Using a semijoin:
2.1 $S' \leftarrow \pi_A(S)$ @ 2
2.2 $S' \Rightarrow 1$
2.3 $R' \leftarrow R \ltimes_A S'$ @ 1
2.4 $R' \Rightarrow 2$
2.5 $R' \bowtie_A S$ @ 2
Semijoin is better if $size(\pi_A(S)) + size(R \ltimes_A S) < size(R)$.

Summary: Distributed Query Processing
- There is an evolutionary continuum from centralized to distributed query optimization.
- Localization and reduction are the main techniques by which a heuristically-efficient distributed QEP can be arrived at.
- In wide-area distributed query processing (DQP), communication costs tend to dominate, although in local-area networks this is not the case.
- The join ordering problem remains, here too, an important one.
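The semijoin-based join program and the condition under which it pays off can be sketched over in-memory relations. Relations are modelled here as lists of dictionaries and sites are only indicated in comments; the helper names and the sample tuples are hypothetical.

```python
from typing import Dict, Hashable, List, Set

Row = Dict[str, Hashable]
Relation = List[Row]


def project_values(rel: Relation, attr: str) -> Set[Hashable]:
    """pi_A(rel): the distinct values of the join attribute."""
    return {row[attr] for row in rel}


def semijoin(rel: Relation, values: Set[Hashable], attr: str) -> Relation:
    """rel semijoin S': keep the rows whose join value occurs in `values`."""
    return [row for row in rel if row[attr] in values]


def join(r: Relation, s: Relation, attr: str) -> Relation:
    """Nested-loop equijoin on `attr`."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


def semijoin_program(r: Relation, s: Relation, attr: str) -> Relation:
    s_values = project_values(s, attr)        # at site 2, then shipped to site 1
    r_reduced = semijoin(r, s_values, attr)   # at site 1, then shipped to site 2
    return join(r_reduced, s, attr)           # at site 2


def semijoin_is_better(size_proj_s: int, size_r_semijoin_s: int, size_r: int) -> bool:
    """size(pi_A(S)) + size(R semijoin S) < size(R)."""
    return size_proj_s + size_r_semijoin_s < size_r


if __name__ == "__main__":
    R = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}, {"A": 3, "B": "z"}]
    S = [{"A": 2, "C": "p"}, {"A": 2, "C": "q"}, {"A": 4, "C": "r"}]
    print(semijoin_program(R, S, "A"))
```

Both programs return the same result; the semijoin variant ships the projected join values and the reduced relation instead of the whole of R, at the price of extra local work at both sites.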
47 Advanced Database Management Systems
Data Integration Strategies
Alvaro A A Fernandes
School of Computer Science, University of Manchester

Outline
- Data Integration: Problem Definition
- Process Alternatives
- View-Based Data Integration
- Schema Matching, Mapping and Integration
- Dataspaces
48 Data Integration: Problem Definition
Data Integration (1): Problem Definition
Data(base) integration is the process by which a set of component DBMSs are conceptually integrated to form a multi-DBMS, i.e., a DDBMS that offers a single, logically coherent schema to users and applications.
Equivalently, given existing databases with their Local Conceptual Schemas (LCSs), data integration is the process by which they are integrated into a Global Conceptual Schema (GCS).
A GCS is also called a mediated schema or, more simply, a global schema.

Data Integration (2): Some Assumptions, Some Issues
In general, the problem only arises if the component DBMSs already exist, so data integration is typically a bottom-up process.
In some respects, it can be conceived of as the reverse of the data distribution (i.e., fragmentation and allocation) problem.
One of the most important concerns in data integration is the level of heterogeneity of the component DBMSs.
This, in turn, is strongly linked to the degree of autonomy that each component DBMS enjoys and exercises.
49 Process Alternatives
Data Integration (3): Some Alternatives
Physical Integration: in this case, the source databases are integrated and the outcome is materialized. It is the more common practice in data warehousing.
Logical Integration: in this case, the global schema that emerges from integrating the sources remains virtual. It is the more common practice when the component DBMSs enjoy autonomy (e.g., in scientific contexts, where different research groups maintain different data resources but still allow them to be part of a multi-DBMS of interest to them and to others).

Data Integration (4): A Bottom-Up Process
The most widely-used approach involves:
translation: Each LCS abstracts over a data source. A translator maps between concepts in the LCS and concepts in an intermediate schema (IS), in both directions.
integration: The ISs are cast in an interlingua, a canonical formalism in which the LCSs of the participating sources can be cast. The integrator uses the ISs to project out the GCS to users and applications.
50 Process Alternatives
Data Integration (5): Dealing with Autonomy and Heterogeneity
In contexts where heterogeneity is the norm (e.g., when the multi-DBMS is formed from public resources), the translators are often referred to as wrappers and the integrator is referred to as the mediator.
Wrappers can reconcile different kinds of heterogeneity, e.g.:
infrastructural: including those stemming from the system software or network level
syntactic: including those relating to data models and query languages (e.g., generating a relational view of a spreadsheet)
semantic: which are the hardest to capture and to keep in sync

View-Based Data Integration
Data Integration (6): Schemas as Views
There are two major possibilities for relating a GCS and its LCSs by means of views:
Global-As-View (GAV): in this case, the LCSs are the extents over which one writes a set of views that, together, comprise the GCS against which global queries are formulated.
Local-As-View (LAV): in this case, the GCS is assumed to exist and each LCS is treated as if it were a view over this postulated GCS.
We will focus on GAV, and a simple example of how it works follows shortly.
51 Schema Matching, Mapping and Integration
Database Integration Tasks (1): Postulating Semantic Equivalences
Assume three distributed, heterogeneous, autonomous data sources, S1-S3.
Lighter attributes with dark borders denote primary keys.
The first task is to postulate that there are (1:1, 1:n, m:n) relationships between tables and columns in different sources.
This is done by matching at schema level, i.e., using schema names and structures, and at instance level, i.e., using attribute values and structures.
The dashed lines show some postulated equivalences, explicitly, at column/attribute level (and, for simplicity, only implicitly, at table/relation level).

Database Integration Tasks (2): Postulating Mappings
From postulated equivalences, the next task is to write view expressions that define constructs in one schema in terms of one or more other schemas.
For example, one might define S3 in terms of S1 and S2 with the following mappings:
$R \leftarrow \pi_{x \to a,\ m \to b,\ n \to c}(X \bowtie_{x=y} Y)$
$S \leftarrow \pi_{t \to d,\ r+q \to e}(T) \cup \pi_{v \to d,\ w \to e}(U)$
When a mapping is written as above, we often call its left-hand side the head and its right-hand side the body.
52 Schema Matching, Mapping and Integration
Database Integration Tasks (3): Query Evaluation over Integrated Schemas
From postulated mappings, the next task is to issue queries against the integrated schema.
Assume a query against S3:
$\gamma_{avg(e)}(\sigma_{d>5}(S)) \Rightarrow \gamma_{avg(e)}(\sigma_{d>5}(Q' \cup Q''))$
i.e., we rewrite using the mappings that define $S$ in S3 in terms of $T$ in S2 and $U$ in S1:
S2: $Q' \leftarrow \pi_{t \to d,\ r+q \to e}(T)$
S1: $Q'' \leftarrow \pi_{v \to d,\ w \to e}(U)$
The subqueries $Q'$ and $Q''$ run remotely, at S2 and S1 respectively. Results are shipped to S3, where the union runs locally.

Database Integration Tasks (4): Schema Matching (1)
There are two major kinds of heterogeneity that make schema matching a hard problem:
Structural (or Syntactic) Heterogeneity
- Type conflicts (e.g., address as string v. address as struct)
- Dependency conflicts (e.g., net salary plus tax v. salary)
- Key conflicts (e.g., absence of foreign keys that would be required)
Schematic (or Semantic) Heterogeneity
These are conflicts that arise from the fact that the designers of the GCS and of the LCSs have different underlying ontologies in mind, i.e., they conceptualize the database domain in different terms.
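The GAV rewriting of the query over S can be sketched by treating the two view bodies as functions over source rows and unfolding S into their union. The sample rows for T and U are invented for illustration, and the function names are hypothetical; the union is computed set-wise, as in relational algebra.

```python
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, int]

# Invented sample extents for the source relations.
T: List[Row] = [{"t": 3, "r": 10, "q": 5}, {"t": 7, "r": 20, "q": 4}]   # held at S2
U: List[Row] = [{"v": 9, "w": 30}, {"v": 2, "w": 50}]                   # held at S1


def q_prime(t_rows: List[Row]) -> Set[Tuple[int, int]]:
    """Q': project T onto (d, e) with t -> d and r + q -> e; runs at S2."""
    return {(row["t"], row["r"] + row["q"]) for row in t_rows}


def q_double_prime(u_rows: List[Row]) -> Set[Tuple[int, int]]:
    """Q'': project U onto (d, e) with v -> d and w -> e; runs at S1."""
    return {(row["v"], row["w"]) for row in u_rows}


def avg_e_where_d_gt_5(t_rows: List[Row], u_rows: List[Row]) -> Optional[float]:
    """avg(e) over the rows of S with d > 5, with S unfolded into Q' union Q''."""
    s = q_prime(t_rows) | q_double_prime(u_rows)   # union runs locally at S3
    selected = [e for d, e in s if d > 5]
    return sum(selected) / len(selected) if selected else None


if __name__ == "__main__":
    print(avg_e_where_d_gt_5(T, U))
```

In this sketch the selection runs after the union, as in the rewritten expression; an optimizer could also push it into the two subqueries to reduce what is shipped to S3.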
53 Schema Matching, Mapping and Integration
Database Integration Tasks (5): Schema Matching (2)
Semantic heterogeneity takes many forms, including:
- Synonyms: when two words can be interchanged in a context, they are said to be synonymous relative to that context. For example, in sport, the word "match" can be synonymous with "tie".
- Homonyms: when two words are spelled the same way but have different meanings, they are said to be homonymous. For example, we may want to have an attribute spelled "price" in the GCS and find it in an LCS, but it may have a different meaning there (e.g., it excludes VAT, which is not what the GCS expects).
- Hypernyms: words that are more generic than a given word. For example, the GCS may expect a relation "employees" not to discriminate between temporary and permanent staff, whereas some LCS may only store the permanent staff in "employees" (e.g., because it stores temporary staff under, say, "temps").

Database Integration Tasks (6): Schema Matching (3)
There are other complications too:
- Insufficient schema and instance information: how can one find out how a derived attribute (e.g., VAT) is calculated?
- Subjectivity of the matching: how can one be sure that the correspondence (e.g., between two relations named "employees") is valid for all instances?
Many issues also conspire to make schema matching hard:
- Schema-level versus instance-level matching: which do we use? Both? If so, which weight does each have?
- Element-level versus structure-level matching: if we find a match for an attribute but all other attributes in the same relation do not match, do we trust the match?
- Matching cardinality is hard without additional information, as it is not normally captured in DDLs (although XML Schema, for example, can capture it).
54 Schema Matching, Mapping and Integration
Database Integration Tasks (7): Combined Schema Matching Approaches
To strengthen the validity of the decision, it is possible to use multiple matchers (i.e., different similarity-assigning algorithms).
This allows for specialization, e.g., different matchers may focus on different domains (e.g., names, or telephone numbers, or addresses, etc.).
A meta-matcher integrates these into one prediction (e.g., by taking the (possibly weighted) mean of the similarity values computed by the individual matchers).

Database Integration Tasks (8): Schema Integration
Once correspondences are deemed valid, we can use them to create a GCS.
While matching can be (and essentially is) an automated process, selecting matches to become mappings and combining these mappings into a GCS is largely a manual process, i.e., in a rule-based approach, like the previous examples, the rules are not normally generated automatically.
Approaches to data integration are illustrated in the figure.
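A meta-matcher that takes the weighted mean of the similarity values of individual matchers can be sketched as follows. The two matchers here are deliberately simplistic stand-ins, invented for illustration; real matchers would be specialized by domain as described above.

```python
from typing import Callable, List, Tuple

Matcher = Callable[[str, str], float]   # returns a similarity in [0, 1]


def exact_name(a: str, b: str) -> float:
    """1.0 if the two names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix(a: str, b: str) -> float:
    """Fraction of the shorter name that is covered by the common prefix."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / min(len(a), len(b))


def meta_match(a: str, b: str, weighted: List[Tuple[Matcher, float]]) -> float:
    """Weighted mean of the similarity values of the individual matchers."""
    total_weight = sum(weight for _, weight in weighted)
    if total_weight == 0:
        raise ValueError("at least one matcher needs a positive weight")
    return sum(weight * matcher(a, b) for matcher, weight in weighted) / total_weight


if __name__ == "__main__":
    matchers = [(exact_name, 1.0), (shared_prefix, 3.0)]
    print(meta_match("emp_name", "emp_id", matchers))
    print(meta_match("price", "price", matchers))
```

The weights are where the question of how much each matcher counts gets answered; a deployment would presumably tune them rather than fix them by hand as done here.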
55 Dataspaces
Dataspaces as a Data Integration Approach (1): The Question of Cost
What we have described so far can be called classical, mediator-based data integration.
It delivers high-quality results early, but with high upfront costs due to the need for human expertise in making up for the shortcomings of matching and mapping derivation.
Dataspaces [Franklin et al., 2005, Halevy et al., 2006, Hedeler et al., 2009] are a new approach to data integration: they automate the bootstrapping of an integrated view, accept the lower quality of results early on, but aggressively seek and use feedback to improve them over time.
The idea is that users get some results quickly at near to no cost: if this motivates them, they pay some cost in the form of feedback as their need spurs them.

Dataspaces as a Data Integration Approach (2): The Question of Quality v. Cost v. Time
[Figure: quality v. cost v. time]
56 Dataspaces
Dataspaces as a Data Integration Approach (3): The Broad Aim of a Dataspace Management System
Given a set of data sources, a dataspace management system (DSpMS) aims to obtain the best mappings with minimal human intervention.
This means bootstrapping the set-up stage (i.e., the postulation of semantic equivalences and the mappings derived from them) using automated means.
This also means being intelligent and efficient in seeking as few and as useful feedback instances from users as possible and, once they are obtained, making the most of them for improving results [Belhajjame et al., 2010].
One wants to pay as little as possible as late as possible and still obtain excellent results for the effort spent.

Dataspace Architecture (1): Seen as a Stack
As we saw, classical integration involves layering a mediator over data sources, which are assumed to be fully-fledged databases.
The same holds for dataspaces, except that the mediator (i.e., the ability to translate queries against an integrated schema into queries against sources, stitching the partial results into integrated ones on the return journey) is powered by automatically derived mappings from automatically derived matches.
For this latter task, some have used model management techniques (essentially an algebra over schema constructs, including mappings and matches) [Bernstein and Melnik, 2007].
A dataspace is then seen to be unique in introducing improvement via feedback.
57 Dataspaces
Dataspace Architecture (2): Seen as a Composition of Algebras (1)
[Figure: dataspace architecture seen as a composition of algebras]

Dataspace Architecture (3): Seen as a Composition of Algebras (2)
A DSpMS is a DBMS: it retains the ability to evaluate queries over sources to produce the specified results.
A DSpMS is also a data integration system: it retains the ability to use mappings and schemas over many distributed resources and use the former to make the latter seem an integrated source.
In one approach, a DSpMS is also a model management system [Hedeler et al., 2010]: it has the ability to sample sources and match them to generate correspondences as well as to operate on schemas (e.g., merge, compose, subtract, extract schema constructs).
58 Dataspaces
Dataspace Architecture (4): Seen as a Composition of Algebras (3)
In this approach, a significant part of a DSpMS is an engine to evaluate algebraic operations over schemas, matches and correspondences.
These operations are like programs that a DSpMS executes to allow users to derive many integrated views over a collection of resources, rather than a single one.
We can see, therefore, that what is truly unique to a DSpMS is the use of feedback for improving integration mappings that were generated algorithmically and are, for this reason, likely to produce poor-quality results.

Summary: Data Integration Strategies
- With the explosion in the availability of networked data and computational resources, the data integration problem has become an extremely important one.
- Superimposing a global conceptual schema over local ones is as important a task as it is costly. However, view-based techniques can be used to great effect.
- The greatest hurdle remains the reconciliation of schematic heterogeneity, the upfront cost of which is often prohibitive.
- The notion of a dataspace has been introduced recently to characterize a pay-as-you-go approach to data integration, i.e., one that avoids having to pay high upfront costs.
- The idea is that conflict reconciliation happens over time, incrementally, driven by the assimilation of user feedback.
59 Dataspaces
Acknowledgements
The material presented mixes original material by the author with material adapted from [Özsu and Valduriez, 1999].
The author gratefully acknowledges the work of the authors cited while assuming complete responsibility for any mistake introduced in the adaptation of the material.

References (1)
Belhajjame, K., Paton, N. W., Embury, S. M., Fernandes, A. A. A., and Hedeler, C. (2010). Feedback-based annotation, selection and refinement of schema mappings for dataspaces. In Manolescu, I., Spaccapietra, S., Teubner, J., Kitsuregawa, M., Léger, A., Naumann, F., Ailamaki, A., and Özcan, F., editors, EDBT, volume 426 of ACM International Conference Proceeding Series. ACM.
Bernstein, P. A. and Melnik, S. (2007). Model management 2.0: manipulating richer mappings. In Chan, C. Y., Ooi, B. C., and Zhou, A., editors, SIGMOD Conference. ACM.
60 Dataspaces
References (2)
Franklin, M. J., Halevy, A. Y., and Maier, D. (2005). From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4).
Halevy, A. Y., Franklin, M. J., and Maier, D. (2006). Principles of dataspace systems. In Vansummeren, S., editor, PODS, pages 1-9. ACM.

References (3)
Hedeler, C., Belhajjame, K., Mao, L., Paton, N. W., Fernandes, A. A. A., Guo, C., and Embury, S. M. (2010). Flexible dataspace management through model management. In Daniel, F., Delcambre, L. M. L., Fotouhi, F., Garrigós, I., Guerrini, G., Mazón, J.-N., Mesiti, M., Müller-Feuerstein, S., Trujillo, J., Truta, T. M., Volz, B., Waller, E., Xiong, L., and Zimányi, E., editors, EDBT/ICDT Workshops, ACM International Conference Proceeding Series. ACM.
Hedeler, C., Belhajjame, K., Paton, N. W., Campi, A., Fernandes, A. A. A., and Embury, S. M. (2009). Dataspaces. In Ceri, S. and Brambilla, M., editors, SeCO Workshop, volume 5950 of Lecture Notes in Computer Science. Springer.
61 Dataspaces
References (4)
Özsu, M. T. and Valduriez, P. (1999). Principles of Distributed Database Systems. Prentice Hall International, 2nd edition.
Distributed Databases in a Nutshell
Distributed Databases in a Nutshell Marc Pouly [email protected] Department of Informatics University of Fribourg, Switzerland Priciples of Distributed Database Systems M. T. Özsu, P. Valduriez Prentice
Distributed Database Management Systems
Page 1 Distributed Database Management Systems Outline Introduction Distributed DBMS Architecture Distributed Database Design Distributed Query Processing Distributed Concurrency Control Distributed Reliability
Distributed Database Design (Chapter 5)
Distributed Database Design (Chapter 5) Top-Down Approach: The database system is being designed from scratch. Issues: fragmentation & allocation Bottom-up Approach: Integration of existing databases (Chapter
Distributed Database Systems. Prof. Dr. Carl-Christian Kanne
Distributed Database Systems Prof. Dr. Carl-Christian Kanne 1 What is a Distributed Database System? A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed
Chapter 3: Distributed Database Design
Chapter 3: Distributed Database Design Design problem Design strategies(top-down, bottom-up) Fragmentation Allocation and replication of fragments, optimality, heuristics Acknowledgements: I am indebted
Outline. Distributed DBMSPage 4. 1
Outline Introduction Background Distributed DBMS Architecture Datalogical Architecture Implementation Alternatives Component Architecture Distributed DBMS Architecture Distributed Database Design Semantic
Chapter 5: Overview of Query Processing
Chapter 5: Overview of Query Processing Query Processing Overview Query Optimization Distributed Query Processing Steps Acknowledgements: I am indebted to Arturas Mazeika for providing me his slides of
Distributed Databases
Distributed Databases Chapter 1: Introduction Johann Gamper Syllabus Data Independence and Distributed Data Processing Definition of Distributed databases Promises of Distributed Databases Technical Problems
Distributed Databases. Concepts. Why distributed databases? Distributed Databases Basic Concepts
Distributed Databases Basic Concepts Distributed Databases Concepts. Advantages and disadvantages of distributed databases. Functions and architecture for a DDBMS. Distributed database design. Levels of
Chapter 2: DDBMS Architecture
Chapter 2: DDBMS Architecture Definition of the DDBMS Architecture ANSI/SPARC Standard Global, Local, External, and Internal Schemas, Example DDBMS Architectures Components of the DDBMS Acknowledgements:
DISTRIBUTED AND PARALLELL DATABASE
DISTRIBUTED AND PARALLELL DATABASE SYSTEMS Tore Risch Uppsala Database Laboratory Department of Information Technology Uppsala University Sweden http://user.it.uu.se/~torer PAGE 1 What is a Distributed
Distributed Data Management
Introduction Distributed Data Management Involves the distribution of data and work among more than one machine in the network. Distributed computing is more broad than canonical client/server, in that
Fragmentation and Data Allocation in the Distributed Environments
Annals of the University of Craiova, Mathematics and Computer Science Series Volume 38(3), 2011, Pages 76 83 ISSN: 1223-6934, Online 2246-9958 Fragmentation and Data Allocation in the Distributed Environments
Principles of Distributed Database Systems
M. Tamer Özsu Patrick Valduriez Principles of Distributed Database Systems Third Edition
Chapter 7: Distributed Database Management Systems
Chapter 7: Distributed Database Management Systems Distributed Database Management System When an organization is geographically dispersed, it may choose to store its databases on a central database
Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures
Chapter 18: Database System Architectures. Centralized Systems; Client-Server Systems; Parallel Systems; Distributed Systems; Network Types. Centralized Systems: Run on a single computer system and do
Chapter 18: Database System Architectures. Centralized Systems
Chapter 18: Database System Architectures: Centralized Systems; Client-Server Systems; Parallel Systems; Distributed Systems; Network Types. 18.1 Centralized Systems: Run on a single computer system and
Overview of Database Management
Overview of Database Management M. Tamer Özsu David R. Cheriton School of Computer Science University of Waterloo CS 348 Introduction to Database Management Fall 2012 CS 348 Overview of Database Management
AN OVERVIEW OF DISTRIBUTED DATABASE MANAGEMENT
AN OVERVIEW OF DISTRIBUTED DATABASE MANAGEMENT BY AYSE YASEMIN SEYDIM CSE 8343 - DISTRIBUTED OPERATING SYSTEMS FALL 1998 TERM PROJECT TABLE OF CONTENTS INTRODUCTION...2 1. WHAT IS A DISTRIBUTED DATABASE
Introduction to Parallel and Distributed Databases
Advanced Topics in Database Systems Introduction to Parallel and Distributed Databases Computer Science 600.316/600.416 Notes for Lectures 1 and 2 Instructor Randal Burns 1. Distributed databases are the
Client/Server Computing Distributed Processing, Client/Server, and Clusters
Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly user-friendly interface to the
Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation
Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed
Concepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches
Concepts of Database Management Seventh Edition Chapter 9 Database Management Approaches Objectives Describe distributed database management systems (DDBMSs) Discuss client/server systems Examine the ways
Distributed Database Design
Distributed Databases Distributed Database Design Distributed Database System Advanced Database Systems, mod1-1, 2004 Advantages
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems
Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 [email protected] Chapter 2 Architecture Chapter Outline Distributed transactions (quick
Mobile and Heterogeneous databases Database System Architecture. A.R. Hurson Computer Science Missouri Science & Technology
Mobile and Heterogeneous databases Database System Architecture A.R. Hurson Computer Science Missouri Science & Technology 1 Note, this unit will be covered in four lectures. In case you finish it earlier,
BBM467 Data Intensive Applications
Hacettepe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive Applications Dr. Fuat Akal [email protected] Foundations of Data[base] Clusters Database Clusters Hardware Architectures Data
An Overview of Distributed Databases
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 4, Number 2 (2014), pp. 207-214 International Research Publications House http://www.irphouse.com/ijict.htm An Overview
Distributed Database Management Systems
Distributed Database Management Systems (Distributed, Multi-database, Parallel, Networked and Replicated DBMSs) Terms of reference: Distributed Database: A logically interrelated collection of shared data
Chapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing: Overview; Measures of Query Cost; Selection Operation; Sorting; Join Operation; Other Operations; Evaluation of Expressions. 13.1 Basic Steps in Query Processing 1. Parsing
Distributed and Parallel Database Systems
Distributed and Parallel Database Systems M. Tamer Özsu Department of Computing Science University of Alberta Edmonton, Canada T6G 2H1 Patrick Valduriez INRIA, Rocquencourt 78153 LE Chesnay Cedex France
Principles and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza
Data Integration Maurizio Lenzerini Universitá di Roma La Sapienza DASI 06: Phd School on Data and Service Integration Bertinoro, December 11 15, 2006 M. Lenzerini Data Integration DASI 06 1 / 213 Structure
COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters
COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly
How To Understand The Concept Of A Distributed System
Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz [email protected], [email protected] Institute of Control and Computation Engineering Warsaw University of
Distributed Software Development with Perforce Perforce Consulting Guide
Distributed Software Development with Perforce Perforce Consulting Guide Get an overview of Perforce's simple and scalable software version management solution for supporting distributed development teams.
Topics. Distributed Databases. Desirable Properties. Introduction. Distributed DBMS Architectures. Types of Distributed Databases
Topics Distributed Databases Chapter 21, Part B Distributed DBMS architectures Data storage in a distributed DBMS Distributed catalog management Distributed query processing Updates in a distributed DBMS
Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Query optimization. DBMS Architecture. Query optimizer. Query optimizer.
DBMS Architecture INSTRUCTION OPTIMIZER Database Management Systems MANAGEMENT OF ACCESS METHODS BUFFER MANAGER CONCURRENCY CONTROL RELIABILITY MANAGEMENT Index Files Data Files System Catalog BASE It
Chapter 3. Database Environment - Objectives. Multi-user DBMS Architectures. Teleprocessing. File-Server
Chapter 3 Database Architectures and the Web Transparencies Database Environment - Objectives The meaning of the client server architecture and the advantages of this type of architecture for a DBMS. The
Physical Database Design and Tuning
Chapter 20 Physical Database Design and Tuning Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 1. Physical Database Design in Relational Databases (1) Factors that Influence
Distributed Architectures. Distributed Databases. Distributed Databases. Distributed Databases
Distributed Architectures Distributed Databases Simplest: client-server Distributed databases: two or more database servers connected to a network that can perform transactions independently and together
University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao
University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao CMPSCI 445 Midterm Practice Questions NAME: LOGIN: Write all of your answers directly on this paper. Be sure to clearly
Transaction Management in Distributed Database Systems: the Case of Oracle's Two-Phase Commit
Transaction Management in Distributed Database Systems: the Case of Oracle's Two-Phase Commit Ghazi Alkhatib Senior Lecturer of MIS Qatar College of Technology Doha, Qatar [email protected] and Ronny
CitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
1. Physical Database Design in Relational Databases (1)
Chapter 20 Physical Database Design and Tuning Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 1. Physical Database Design in Relational Databases (1) Factors that Influence
low-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : 933597 Hertford College Supervisor: Dr.
Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : 933597 Hertford College Supervisor: Dr. Dan Olteanu Submitted as part of Master of Computer Science Computing Laboratory
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Many corporations and Independent Software Vendors considering cloud computing adoption face a similar challenge: how should
PartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE [email protected] 2
AHAIWE Josiah Information Management Technology Department, Federal University of Technology, Owerri - Nigeria E-mail jahaiwe@yahoo.
Framework for Deploying Client/Server Distributed Database System for effective Human Resource Information Management Systems in Imo State Civil Service of Nigeria AHAIWE Josiah Information Management
Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange
Data Integration and Exchange L. Libkin 1 Data Integration and Exchange Traditional approach to databases A single large repository of data. Database administrator in charge of access to data. Users interact
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Tune That SQL for Supercharged DB2 Performance! Craig S. Mullins, Corporate Technologist, NEON Enterprise Software, Inc.
Tune That SQL for Supercharged DB2 Performance! Craig S. Mullins, Corporate Technologist, NEON Enterprise Software, Inc. Table of Contents Overview...................................................................................
Virtual machine interface. Operating system. Physical machine interface
Software Concepts User applications Operating system Hardware Virtual machine interface Physical machine interface Operating system: Interface between users and hardware Implements a virtual machine that
Data warehousing with PostgreSQL
Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenter efficiency; the total power consumption in the United States doubled from 2000 to 2005, representing
The Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
Chapter 15: Distributed Structures. Topology
Chapter 15: Distributed Structures Topology Network Types Operating System Concepts 15.1 Topology Sites in the system can be physically connected in a variety of ways; they are compared with respect
Distributed Database Systems
Distributed Database Systems Vera Goebel Department of Informatics University of Oslo 2011 1 Contents Review: Layered DBMS Architecture Distributed DBMS Architectures DDBMS Taxonomy Client/Server Models
1. INTRODUCTION TO RDBMS
Oracle For Beginners Page: 1 1. INTRODUCTION TO RDBMS What is DBMS? Data Models Relational database management system (RDBMS) Relational Algebra Structured query language (SQL) What Is DBMS? Data is one
Outline. Mariposa: A wide-area distributed database. Outline. Motivation. Outline. (wrong) Assumptions in Distributed DBMS
Mariposa: A wide-area distributed database Presentation: Shahed 7. Experiment and Conclusion Discussion: Dutch 2 Motivation 1) Build a wide-area Distributed database system 2) Apply principles of economics
Patterns of Information Management
PATTERNS OF MANAGEMENT Patterns of Information Management Making the right choices for your organization's information Summary of Patterns Mandy Chessell and Harald Smith Copyright 2011, 2012 by Mandy
Chapter 10 Practical Database Design Methodology and Use of UML Diagrams
Chapter 10 Practical Database Design Methodology and Use of UML Diagrams Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 10 Outline The Role of Information Systems in
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
Data Grids. Lidan Wang April 5, 2007
Data Grids Lidan Wang April 5, 2007 Outline Data-intensive applications Challenges in data access, integration and management in Grid setting Grid services for these data-intensive application Architectural
An Approach to High-Performance Scalable Temporal Object Storage
An Approach to High-Performance Scalable Temporal Object Storage Kjetil Nørvåg Department of Computer and Information Science Norwegian University of Science and Technology 791 Trondheim, Norway email:
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner's 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Big data management with IBM General Parallel File System
Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers
DATA WAREHOUSING AND OLAP TECHNOLOGY
DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are
Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.
Phases of database design Application requirements Conceptual design Database Management Systems Conceptual schema Logical design ER or UML Physical Design Relational tables Logical schema Physical design
IV Distributed Databases - Motivation & Introduction -
IV Distributed Databases - Motivation & Introduction - I OODBS II XML DB III Inf Retr DModel Motivation Expected Benefits Technical issues Types of distributed DBS 12 Rules of C. Date Parallel vs Distributed
Relational Databases
Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 18 Relational data model Domain domain: predefined set of atomic values: integers, strings,... every attribute
Fourth generation techniques (4GT)
Fourth generation techniques (4GT) The term fourth generation techniques (4GT) encompasses a broad array of software tools that have one thing in common. Each enables the software engineer to specify some
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG
1 RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates
1 Organization of Operating Systems
COMP 730 (242) Class Notes Section 10: Organization of Operating Systems 1 Organization of Operating Systems We have studied in detail the organization of Xinu. Naturally, this organization is far from
2009 Oracle Corporation 1
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
A Review of Database Schemas
A Review of Database Schemas Introduction The purpose of this note is to review the traditional set of schemas used in databases, particularly as regards how the conceptual schemas affect the design of
www.gr8ambitionz.com
Data Base Management Systems (DBMS) Study Material (Objective Type questions with Answers) Shared by Akhil Arora Powered by www. your A to Z competitive exam guide Database Objective type questions Q.1
Optimizing Performance. Training Division New Delhi
Optimizing Performance Training Division New Delhi Performance tuning : Goals Minimize the response time for each query Maximize the throughput of the entire database server by minimizing network traffic,
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Chapter 5. Learning Objectives. DW Development and ETL
Chapter 5 DW Development and ETL Learning Objectives Explain data integration and the extraction, transformation, and load (ETL) processes Basic DW development methodologies Describe real-time (active)
Unit 4.3 - Storage Structures 1. Storage Structures. Unit 4.3
Storage Structures Unit 4.3 Unit 4.3 - Storage Structures 1 The Physical Store Storage Capacity Medium Transfer Rate Seek Time Main Memory 800 MB/s 500 MB Instant Hard Drive 10 MB/s 120 GB 10 ms CD-ROM
Capacity Planning Process Estimating the load Initial configuration
Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting
Data Integration: A Theoretical Perspective
Data Integration: A Theoretical Perspective Maurizio Lenzerini Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza Via Salaria 113, I 00198 Roma, Italy [email protected] ABSTRACT
Query Processing in Data Integration Systems
Query Processing in Data Integration Systems Diego Calvanese Free University of Bozen-Bolzano BIT PhD Summer School Bressanone July 3 7, 2006 D. Calvanese Data Integration BIT PhD Summer School 1 / 152
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
bigdata Managing Scale in Ontological Systems
Managing Scale in Ontological Systems 1 This presentation offers a brief look at scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
IT Components of Interest to Accountants. Importance of IT and Computer Networks to Accountants
Chapter 3: AIS Enhancements Through Information Technology and Networks 1 Importance of IT and Computer Networks to Accountants To use, evaluate, and develop a modern AIS, accountants must be familiar
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Client/Server and Distributed Computing
Adapted from:operating Systems: Internals and Design Principles, 6/E William Stallings CS571 Fall 2010 Client/Server and Distributed Computing Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Traditional
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
