DATA WAREHOUSE PHYSICAL DESIGN

The physical design of a data warehouse specifies the:
- low-level storage structures underpinning the warehouse's logical table structures, e.g. partitions
- low-level structures supporting efficient access paths to logical table data, e.g. bitmap indexes
- additional structures supporting more efficient query processing, e.g. materialized views.

Physical design decisions are driven by query performance and warehouse maintenance considerations.
The previous refresher note, Database Definition, introduced the basic physical storage structures supported by a mainstream relational DBMS, including tablespaces, segments, extents, block structures, B-tree indexes and clusters. This note introduces the additional constructs often found at the physical level in a data warehouse and their use in query processing.

LARGE TABLE SUPPORT - PARTITIONING

A table is a logical structure forming the basic unit of access in SQL. It has already been seen that an Oracle table is normally contained in a single tablespace, which is the basic unit of storage organization, but the stored rows may be held in multiple files. For very large tables and indexes, partitioning allows decomposition into smaller, more manageable structures. This enables:
- enhanced query optimization
- easier administration of tables/indexes
- increased availability.
Partitioning can be on the basis of:
- ranges of values for a specific column or columns within a table, e.g. partitioning sales data by month
- an explicit list of values, e.g. partitioning inventory data by warehouse id
- hashing, e.g. to ensure an even distribution of data between partitions where there is no obvious range or list partitioning strategy.

Each partition of a table or index can often be treated as a separate object as well as part of the overall table/index. For example, consider a large table and its associated indexes containing sales data, range partitioned with one partition for each month's data. Then:
- Queries involving only the most recent month's data need access only the single partition containing that data; the remaining partitions may be offline.
- Queries may use multiple processes to access partitions in parallel.
- Partitions for months no longer required can easily be dropped without affecting partitions holding more recent months' data.
- If the table needs reorganization at the physical level, perhaps to reduce the number of chained rows, this can be done partition by partition, requiring much less temporary storage.
- If an index needs to be rebuilt, each index partition can be rebuilt one by one, again with much reduced overhead.

All of the above are likely to be beneficial in a warehousing application. Parallel queries, though, might reduce performance in OLTP applications: the extra overhead of initiating parallel queries may not be justified for short, fast transactions accessing few rows.
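The three partitioning bases above can be pictured as functions mapping a row attribute to a partition identifier. The following Python sketch is a toy illustration only: the partition names and warehouse-to-region mapping are invented, and a real DBMS (e.g. Oracle's PARTITION BY RANGE/LIST/HASH) performs this mapping inside the storage engine.

```python
import zlib

# Toy illustration of range, list and hash partitioning: each function
# maps a row attribute to a partition identifier.

def range_partition(sale_month):
    # Range partitioning: one partition per month's sales data.
    return f"sales_{sale_month:02d}"

# Invented warehouse-id -> partition mapping for list partitioning.
WAREHOUSE_PARTITIONS = {1: "north", 2: "north", 3: "south"}

def list_partition(warehouse_id):
    # List partitioning: an explicit list of values per partition.
    return WAREHOUSE_PARTITIONS[warehouse_id]

def hash_partition(key, n_partitions=4):
    # Hash partitioning: spreads rows evenly when there is no obvious
    # range or list strategy (crc32 chosen here for determinism).
    return zlib.crc32(str(key).encode()) % n_partitions
```

Because each function is deterministic, both the loader and the query optimizer can locate the single partition relevant to a given key without touching the others, which is what enables partition pruning.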
LARGE TABLE SUPPORT - PARALLELISM AND COMPRESSION

It has just been seen that partitioning provides one basis for exploiting parallelism in a warehouse. Striping data across multiple disks or RAID systems also allows parallelism to be exploited, whether or not tables are partitioned. Striping avoids I/O bottlenecks by allowing multiple controllers, I/O channels and internal buses to be used, increasing the bandwidth of data movement to and from disk. Striping of partitioned tables/indexes can also be used to increase the availability of data: if each partition has its own set of disks/files and a disk fails, access to data in the remaining partitions is still possible.

Data segments may be compressed, enabling reduced disk use and smaller buffer cache sizes at the cost of extra CPU, particularly on data updates. Hence, compression is most suitable for read-only data, as often found in warehouses. Bitmap indexes, introduced below, also lend themselves to efficient compression techniques.
BITMAP INDEXES

Traditional B-tree indexes allow fast access to indexed rows and can be updated efficiently as the indexed rows change. However, they can be very expensive in terms of storage space, particularly when multiple indexes are defined on a table to support ad hoc queries. Also, efficient updating may not be needed in a data warehouse whose data, once loaded, is not updated. Bitmap indexes provide an alternative which can be more space efficient while providing efficient access for the ad hoc queries often required of a data warehouse.

Bitmap indexes are most effective when the number of distinct values that a column can hold is small relative to the number of rows, e.g. < 1%. Such low cardinality certainly holds for a column such as gender, which holds values M or F, but it equally holds for a column with approaching 10,000 possible values in a table of 1 million rows.
In a bitmap index:
- A separate bitmap is held for each possible value of an indexed column in a table.
- Each position in the bitmap corresponds to a row in the table.
- The bit corresponding to a row is set to 1 in a bitmap if that row holds the value corresponding to the bitmap; otherwise the bit is set to 0.

For example, consider the customer table in the FoodMart database with three bitmap indexes on marital_status, gender and houseowner, and index entries shown in respect of the rows for customers with ids 1 to 4 (the column values are illustrative, chosen to be consistent with the worked examples that follow):

customer_id  marital_status=M  marital_status=S  gender=M  gender=F  houseowner=Y  houseowner=N
     1              0                 1              0         1          1             0
     2              0                 1              1         0          0             1
     3              1                 0              0         1          1             0
     4              1                 0              1         0          0             1
The bitmap indexes can be used to efficiently answer queries such as:

How many male married customers are there? The marital_status = M and gender = M bitmaps can be ANDed together, producing bitmap 0001 for the four customer rows shown, and the number of set bits counted.

Retrieve customer records for customers who are either female houseowners or single males. The gender = F and houseowner = Y bitmaps can be ANDed together, producing bitmap 1010 for the four customers shown. The gender = M and marital_status = S bitmaps can be ANDed together, producing bitmap 0100. Finally, the two ANDed bitmaps may be ORed, producing a final bitmap 1110 showing that the customers with ids 1 to 3 will be among those satisfying the query.
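The bitmap operations just described can be simulated directly. The sketch below builds one bitmap per (column, value) pair over four invented customer rows chosen to reproduce the worked results above (bit i corresponds to customer_id i+1); it is an illustration of the technique, not a DBMS implementation.

```python
# Minimal simulation of bitmap indexes over four customer rows.
# Row values are invented but consistent with the worked examples.
rows = [  # (marital_status, gender, houseowner) for customers 1..4
    ("S", "F", "Y"),
    ("S", "M", "N"),
    ("M", "F", "Y"),
    ("M", "M", "N"),
]

def bitmap(col, value):
    # One bitmap per (column, value): '1' where the row holds the value.
    return "".join("1" if r[col] == value else "0" for r in rows)

marital_M, marital_S = bitmap(0, "M"), bitmap(0, "S")
gender_M, gender_F = bitmap(1, "M"), bitmap(1, "F")
owner_Y = bitmap(2, "Y")

def AND(a, b):
    return "".join("1" if x == y == "1" else "0" for x, y in zip(a, b))

def OR(a, b):
    return "".join("1" if "1" in (x, y) else "0" for x, y in zip(a, b))

# Married male customers: AND two bitmaps, then count the set bits.
married_males = AND(marital_M, gender_M)
# Female houseowners OR single males: two ANDs, then an OR.
result = OR(AND(gender_F, owner_Y), AND(gender_M, marital_S))
```

A real implementation stores the bitmaps in compressed form and performs the AND/OR with bitwise machine instructions over words, but the logic is the same.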
Bitmap indexes may also be used to support join processing. For example, consider a sales table referencing the customer table through a customer_id foreign key (the values shown are illustrative):

sales_id  customer_id
   1          4
   2          2
   3          4
   4          2

Then a bitmap join index could be created on the sales table in respect of related customer marital_status values:

sales_id  marital_status=M  marital_status=S
   1             1                 0
   2             0                 1
   3             1                 0
   4             0                 1

Hence, each bitmap identifies the rows in sales which will join with rows in customer having a given value for marital_status.
The index would then efficiently support a query such as:

SELECT S.SALES_ID, C.MARITAL_STATUS
FROM SALES S, CUSTOMER C
WHERE S.CUSTOMER_ID = C.CUSTOMER_ID

Oracle supports bitmap indexes such as those introduced above. For example, indexes could be created for the customer and sales tables as follows:

CREATE BITMAP INDEX cust_marital_status_bx
  ON customer(marital_status);

CREATE BITMAP INDEX sales_cust_marital_status_jbx
  ON sales(customer.marital_status)
  FROM sales, customer
  WHERE sales.customer_id = customer.customer_id;

While bitmap indexes work well for read-only warehouse data, they are not suited to tables with many updates, since a stored bitmap record referencing many rows must be locked for update to reflect a changed value in the indexed column.
So far, each possible value of a column has been represented by a separate bitmap. For columns with more than a small number of values, holding a bitmap per value may require too much space. Encoded bitmap indexes use schemes which encode the possible values of a column more space-efficiently than a separate bitmap for each value. Some encoded bitmap indexes can also efficiently support a wider range of queries than just exact match with a single value.

For example, consider a column which may contain a value between 1 and 6. A conventional bitmap index would have 6 separate bitmaps. An alternative approach is to encode ranges of values: <2, <3, <4, <5, <6, resulting in only 5 bitmaps. Such a scheme enables queries with a search condition < n to be answered by reference to a single bitmap, while still enabling exact match queries to be answered by reference to only 2 bitmaps.
For example, a row with the value 4 for the column would have the corresponding bit set to 1 in the <5 and <6 bitmaps, with 0 in the remainder. Hence, rows with the value 4 are those whose corresponding bit in the <4 bitmap is 0 while the corresponding bit in the <5 bitmap is 1.

The example above encodes ranges of values. Another approach encodes intervals rather than ranges, e.g. 1-3, 2-4, 3-5, 4-6, resulting in only 4 bitmaps. Queries with both a lower and an upper bound on values are now directly supported, while exact match queries can again be answered by reference to only 2 bitmaps. For example, rows with the value 4 are those whose corresponding bit in the 1-3 bitmap is 0 while the corresponding bit in the 2-4 bitmap is 1.
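The range-encoding scheme can be sketched directly: one bitmap per bound "< n" for n in 2..6, a range query answered from a single bitmap, and an exact match answered from two. The sample column values below are invented for illustration.

```python
# Range-encoded bitmaps for a column holding values 1..6:
# 5 bitmaps ("<2" .. "<6") instead of one per value.
values = [4, 1, 6, 3, 4]  # sample column values, one per row (invented)

def lt_bitmap(n):
    # Bit i is set iff row i's value is < n.
    return "".join("1" if v < n else "0" for v in values)

bitmaps = {n: lt_bitmap(n) for n in range(2, 7)}

# A query "value < 3" needs only a single bitmap:
lt3 = bitmaps[3]

# Exact match "value == 4" needs only two bitmaps:
# rows where the <4 bit is 0 and the <5 bit is 1.
eq4 = "".join(
    "1" if a == "0" and b == "1" else "0"
    for a, b in zip(bitmaps[4], bitmaps[5])
)
```

The interval encoding (1-3, 2-4, ...) follows the same pattern, with the exact match for 4 computed from the 1-3 and 2-4 bitmaps instead.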
If a bitmap index exists on each foreign key in the fact table of a star schema, the query optimizer can use a technique known as a star transformation or star join optimization. Consider the following query on the FoodMart schema:

select t.week_of_year, p.product_id, p.product_name, s.store_city,
       sum(sf.store_sales) sales_sum
from sales_fact_1998 sf, store s, product p, time_by_day t
where sf.store_id = s.store_id
  and sf.product_id = p.product_id
  and sf.time_id = t.time_id
  and s.store_country = 'USA'
  and p.product_class_id = 94
  and t.quarter = 'Q1'
group by t.week_of_year, p.product_id, p.product_name, s.store_city
order by t.week_of_year, p.product_id, sales_sum desc, s.store_city

This query has a commonly found structure: an aggregate of a measure in a fact table involving a number of dimensions is computed, with selection conditions applied to the dimensions.
If a bitmap index exists on each of the foreign key columns store_id, product_id and time_id in the fact table sales_fact_1998, the optimizer can process the query as follows:
- Execute the selection conditions on the dimension tables (s.store_country = 'USA', p.product_class_id = 94, t.quarter = 'Q1') to identify the values of s.store_id, p.product_id and t.time_id satisfying them.
- Access the bitmap indexes on the sf.store_id, sf.product_id and sf.time_id foreign keys in sales_fact_1998, which identify the rows in the fact table with matching foreign key values in each case.
- Intersect the bitmaps to identify those rows in the fact table which have matching key values in all of the dimension tables involved in the selection conditions.
- Join only those matching rows in the fact table with rows in the dimension tables.
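The four steps above can be sketched with sets of fact-table row ids standing in for bitmaps. All keys and row ids below are invented; Oracle operates on real compressed bitmaps, but the select / look up / intersect / join structure is the same.

```python
# Sketch of the star transformation: selection on each dimension yields
# key values; a "bitmap index" per fact-table foreign key maps each key
# to fact row ids; intersecting identifies the fact rows to join.

# Step 1: dimension keys satisfying the selections (assumed results).
store_keys = {7, 12}     # stores in the USA
product_keys = {301}     # products in class 94
time_keys = {1, 2, 3}    # days in quarter Q1

# Bitmap index per foreign key: key value -> set of fact row ids.
store_ix = {7: {0, 2}, 12: {5}, 9: {1, 3, 4}}
product_ix = {301: {0, 2, 5}, 302: {1, 3, 4}}
time_ix = {1: {0, 1}, 2: {2, 3}, 3: {4, 5}, 4: set()}

def rows_for(index, keys):
    # Step 2: union the row-id sets of all matching key values.
    return set().union(*(index.get(k, set()) for k in keys))

# Step 3: intersect across dimensions.
matching = (rows_for(store_ix, store_keys)
            & rows_for(product_ix, product_keys)
            & rows_for(time_ix, time_keys))
# Step 4: only the rows in `matching` are joined back to the
# dimension tables to produce the aggregated result.
```

The payoff is that the (huge) fact table is never scanned: only the rows surviving the intersection are ever joined.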
Another use of a bitmap is in a Bloom filter, which tests whether an element with a particular value is a member of a set. Unlike data structures such as search trees and conventional hash tables, Bloom filters are more space efficient since the value itself is not stored; the structure can also represent a set with any number of elements. In the context of data warehouses, a Bloom filter is used in some architectures to test whether data with a particular key value is stored within the warehouse; if not, unnecessary disk accesses for non-existent data can be avoided. This is particularly valuable in warehouse architectures built on distributed file systems, where access to a remote node is thereby avoided.

The Bloom filter algorithm uses a number of different hash functions: we assume 3 in the example below, h1, h2, h3. A bit vector is also used, which initially has all elements set to 0.
If a data item with key k is inserted into the warehouse, each hash function h1, h2, h3 is applied to k, giving 3 results, each of which identifies a position in the bit vector; the bits in those positions are set to 1 if currently 0.

To test whether a data item with key k exists in the warehouse, each hash function is applied to k. If any of the bit positions identified holds 0, then the data item cannot exist in the warehouse. The approach may lead to false positives: all identified bit positions being 1 for a key value does not guarantee that the data item exists in the warehouse. It does not lead to false negatives, however: if any bit position holds 0, then the data item does not exist in the warehouse. As the size of the bit vector increases, the number of false positives decreases. For a given number of data items and bit vector size, the number of hash functions needed to minimize the probability of false positives can be calculated.
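The insert and test operations can be written down in a few lines. This is a minimal sketch, not a production filter: the bit-vector size is arbitrary, and the three hash functions h1..h3 are derived from slices of a single SHA-256 digest, a common trick where the note simply assumes independent hash functions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m  # bit vector, initially all 0

    def _positions(self, key):
        # Derive k bit positions from one digest (stand-in for h1..h3).
        digest = hashlib.sha256(str(key).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False => definitely absent; True => possibly present
        # (false positives possible, false negatives impossible).
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for key in ("k1", "k2", "k3"):
    bf.insert(key)
```

Every inserted key reports True, while a key whose positions include any 0 bit is rejected without touching disk, which is exactly the warehouse use case above.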
MATERIALIZED VIEWS

Materialized view tables have already been introduced as a logical data warehouse structure. Since their purpose is to increase query performance, and they may be transparent to users of a warehouse, they are perhaps better considered along with indexes as structures supporting efficient access paths. Although users of a warehouse may access materialized view tables directly, they are often used for query rewrite by the query optimizer without explicit reference to them in SQL queries.
For example, consider the FoodMart sales_fact_1998 table. Assume that the sales performance of stores needs to be analyzed frequently, resulting in queries like:

select store_id, time_id, sum(store_sales), sum(store_cost), sum(unit_sales)
from sales_fact_1998
group by store_id, time_id;

Query Plan
SELECT STATEMENT Cost = 396
  SORT GROUP BY
    TABLE ACCESS FULL SALES_FACT_1998

On a large fact table, performance for this query may be unacceptable. A materialized view to support this query could be created in Oracle as follows:
CREATE MATERIALIZED VIEW sales_group_store_1998
  build immediate
  refresh on demand
  enable query rewrite
as
  select store_id, time_id, sum(store_sales), sum(store_cost), sum(unit_sales)
  from sales_fact_1998
  group by store_id, time_id;

The materialized view definition specifies that:
- The materialized view should be populated from the underlying base table immediately (build immediate). The alternative parameter build deferred indicates that the materialized view will be populated by the next REFRESH operation.
- Modifications (updates/inserts/deletes) to the underlying base table are propagated to the materialized view by running a system procedure when required (refresh on demand). The alternative parameters include:
  - complete: the defining query of the materialized view is executed.
  - fast: an incremental refresh is performed which takes account of changes within the underlying tables.
  - force: a fast refresh is performed if possible, otherwise a complete refresh.
  - on commit: a fast refresh occurs whenever the database commits a transaction that operates on an underlying table.
- The materialized view can be used for query rewrite (enable query rewrite). This is explained further below.
The materialized view can be used directly in a query, for example:

select * from sales_group_store_1998
where store_id between 1 and 20;

Query Plan
SELECT STATEMENT Cost = 2
  TABLE ACCESS FULL SALES_GROUP_STORE_1998

The corresponding query on the base table would be:

select store_id, time_id, sum(store_sales), sum(store_cost), sum(unit_sales)
from fm_admin.sales_fact_1998
where store_id between 1 and 20
group by store_id, time_id;

Query Plan
SELECT STATEMENT Cost = 413
  SORT GROUP BY
    TABLE ACCESS FULL SALES_FACT_1998
With query rewrite enabled, the materialized view may also be used transparently:

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = ENFORCED;

select store_id, time_id, sum(store_sales), sum(store_cost), sum(unit_sales)
from fm_admin.sales_fact_1998
where store_id between 1 and 20
group by store_id, time_id;

Query Plan
SELECT STATEMENT Cost = 2
  TABLE ACCESS FULL SALES_GROUP_STORE_1998
Query rewrite has been enabled with the statement ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE. The statement ALTER SESSION SET QUERY_REWRITE_INTEGRITY = ENFORCED controls how Oracle rewrites queries. In this case, ENFORCED ensures that queries are only rewritten using constructs which Oracle itself enforces and which are thereby guaranteed correct, such as the materialized view in the example.
Alternative parameters to ALTER SESSION SET QUERY_REWRITE_INTEGRITY are:
- TRUSTED: additionally allows Oracle to use constructs which are user-specified. For example, a pre-existing table may be specified as a materialized view. The value TRUSTED enables such a materialized view to be used in query rewrite, Oracle trusting that the pre-built table contains correct data for the specified view.
- STALE_TOLERATED: additionally allows Oracle to use a materialized view even if its contents are out of synchronization with the source tables. This might be acceptable for some applications.
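The idea behind query rewrite can be illustrated outside SQL: an aggregate query whose grouping matches a materialized view can be answered from the small precomputed table instead of scanning the fact table. The rows below are invented toy data, and the dictionary stands in for the materialized view; Oracle's optimizer performs the matching internally and transparently.

```python
# Sketch of query rewrite: answer a GROUP BY aggregate from a
# precomputed "materialized view" rather than the fact table.
fact = [  # (store_id, time_id, store_sales) rows (invented)
    (1, 10, 5.0), (1, 10, 2.0), (2, 10, 3.0), (1, 11, 4.0),
]

# "Materialized view": sum(store_sales) per (store_id, time_id),
# computed once at build/refresh time.
mv = {}
for store_id, time_id, sales in fact:
    key = (store_id, time_id)
    mv[key] = mv.get(key, 0.0) + sales

def sales_for_stores(lo, hi):
    # Answer "where store_id between lo and hi" from the view:
    # no scan of the fact table is needed.
    return {k: v for k, v in mv.items() if lo <= k[0] <= hi}
```

The cost difference in the query plans above (396/413 versus 2) reflects exactly this: the rewritten query touches only the already-aggregated rows.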
DIMENSIONS

Dimensions have already been introduced as a logical design construct in multidimensional models. If dimensions are explicitly created, they may give performance advantages through the additional optimizations possible with query rewrite. They may also improve the performance of materialized view refresh operations. For example, consider the following dimension created with Oracle:

create dimension cust_dim
  level region   is customer.customer_region_id
  level province is customer.state_province
  level country  is customer.country
  hierarchy geog_rollup (
    region child of
    province child of
    country
  )
In this example, the 1:n hierarchical relationship between attributes in the customer table is defined. Dimensions may also be defined in respect of 1:n relationships across multiple tables, as well as relationships between a hierarchy level and other functionally dependent attributes. With such a dimension, a materialized view is not needed at each level of the hierarchy for efficient evaluation of an aggregation at that level. For example, given the definition of cust_dim, a materialized view aggregating at the region level may still be used in the evaluation of a query aggregating at the province or country level.
COLUMN STORES

Relational DBMS implementations have traditionally been row-oriented: at the storage level, the logical rows of tables are stored as individual records, or chained records if the record representing an entire row is too large to fit in a page. Alternative architectures have been researched for many years in which columns rather than rows are the primary basis of physical organization. These column stores have become of particular interest in recent years for data warehouse applications.

Since many data warehousing applications require the analysis of data in just some of the columns of a table, even without sophisticated implementation techniques column stores are likely to have performance advantages over row stores, since it is not necessary to retrieve records representing all the columns in a table. Relational DBMS vendors increasingly incorporate column store technology within their own architectures to enhance performance on data warehouse analytics workloads.
The advantages can be even greater if the following optimization techniques are incorporated in the storage and query execution levels of column store implementations.

Storing columns rather than rows is likely to allow compression techniques which let the values of a column in a table with very many rows be stored efficiently. For example, if a column is sorted and the same value repeats many times, it can be stored as a value and a repeat count rather than the value being stored multiple times. Whether sorted or not, a column of a table can conceptually be stored as an identifier for the row together with an encoding of the value of that column in that row. A fixed width may be used for both the row identifier and the encoded value, which means the row identifier need not be physically stored at all: the storage position of the value for a row can be computed as an offset.
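The sorted-column compression described above is run-length encoding, which can be sketched in a few lines (the column values are invented):

```python
# Run-length encoding of a sorted column: store (value, repeat_count)
# pairs instead of repeating the value once per row.
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

sorted_col = ["CA", "CA", "CA", "NY", "NY", "TX"]
runs = rle_encode(sorted_col)
```

A sorted, low-cardinality column of millions of rows collapses to a handful of runs, which is why column stores often keep at least one sort order per projection.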
The performance advantages of compression are enhanced if the compressed data can be operated on by the query execution engine without first being decompressed: this is possible in many cases with late materialization techniques. For example, if values are represented as offsets in columns, these can be manipulated efficiently using bit-level representations, as with the bitmap indexes seen earlier. Hence:
- The construction of tuples may be avoided in many cases, with only the columns required in the final result materialized at a late stage in query execution.
- The need to decompress anything other than the final result data may be avoided.
- Cache memory may be used efficiently, given the compact bit-level representation of values.
- Block iteration query processing is possible, in which blocks of values in columns are processed by an operator in a single function call, enabling efficient parallel execution using pipelining techniques.
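Operating on compressed data with late materialization can be sketched with dictionary encoding: a predicate is evaluated on small integer codes, and only the qualifying positions are decoded back to values at the end. The column values are invented; real column stores use packed fixed-width codes and vectorized comparisons, but the principle is the same.

```python
# Dictionary encoding + late materialization: filter on codes, decode
# only the rows that appear in the final result.
column = ["blue", "red", "blue", "green", "red"]

# Dictionary encoding: each distinct value -> a small integer code.
dictionary = {v: i for i, v in enumerate(sorted(set(column)))}
codes = [dictionary[v] for v in column]

# Evaluate "value == 'red'" directly on the codes, no decompression.
red_code = dictionary["red"]
matching_positions = [i for i, c in enumerate(codes) if c == red_code]

# Late materialization: decode only the qualifying rows.
decode = {c: v for v, c in dictionary.items()}
result = [decode[codes[i]] for i in matching_positions]
```

Comparing one small integer per row is both cheaper and far more cache-friendly than comparing variable-length strings, which is where the bullet points above come from.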
Storage structures have also been developed to scale to very large amounts of data processed on commodity hardware. An example is HDFS (Hadoop Distributed File System), developed as part of the Apache Hadoop project. HDFS is designed to be:
- Scalable, to store very large amounts of data.
- Economical, by utilising clusters of nodes running on commodity hardware with heterogeneous operating systems: any machine which runs Java can run HDFS.
- Fault tolerant, with automatic recovery in the presence of failed nodes.

The HDFS architecture supports interconnected clusters, each of which consists of:
- Multiple DataNodes, which store data in blocks in files.
- A single NameNode, which manages the file system namespace and maintains mappings of file blocks to DataNodes.
HDFS supports operations to create and delete directories, and to create, read, write and delete files within directories. A user application does not need to be aware of the distributed nature of the architecture.

HDFS replicates data blocks for fault tolerance: typically, HDFS clusters are spread across multiple hardware racks, and the NameNode aims to place replicated blocks on multiple racks. When a user application reads a file, the HDFS client asks the NameNode for a list of DataNodes that hold the replicas of the file's blocks. It then requests transfer of the desired block from the DataNode closest to the reader.

During normal operation, each DataNode sends periodic messages called heartbeats (by default every 3 seconds) to the NameNode to confirm availability. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode to have failed and schedules creation of the unavailable block replicas at other DataNodes.

When a user application writes to a file, the HDFS client caches data in a temporary local file, which is only written to the HDFS file when a block is filled. The client flushes the block to one DataNode, which itself flushes the block to the next DataNode holding replicas, in a pipeline fashion.
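The heartbeat-based failure detection just described reduces to a simple timestamp check. The sketch below is illustrative only: node names and timings are invented (the intervals follow the defaults quoted in the text), and a real NameNode tracks much more per-node state.

```python
# Sketch of NameNode failure detection: a DataNode that has not sent a
# heartbeat within the timeout is treated as failed, and its blocks
# would be scheduled for re-replication on other DataNodes.
HEARTBEAT_INTERVAL = 3      # seconds between DataNode heartbeats (default)
FAILURE_TIMEOUT = 10 * 60   # NameNode declares failure after 10 minutes

last_heartbeat = {"dn1": 0.0, "dn2": 0.0}  # DataNode -> last report time

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def failed_nodes(now):
    return [n for n, t in last_heartbeat.items()
            if now - t > FAILURE_TIMEOUT]

record_heartbeat("dn1", 595.0)  # dn1 reported recently
# dn2 last reported at t=0, so by t=601 it has exceeded the
# 10-minute timeout and is considered failed.
```

Re-replication then proceeds exactly as for the rack-failure case: the NameNode picks surviving replicas of the failed node's blocks and copies them to other DataNodes.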
READING

D J Abadi, S R Madden, N Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?", Proc SIGMOD 2008. (Section 6 optional.)
P-Å Larson et al., "Enhancements to SQL Server Column Stores", Proc SIGMOD 2013. (Section 5 optional.)
J Jeffrey Hanson, "An Introduction to the Hadoop Distributed File System", IBM developerWorks, February.

FOR REFERENCE

Oracle Database Data Warehousing Guide: Part I, Data Warehouse Fundamentals (Chapters 3-4); Part II, Optimizing Data Warehouses (Chapters 5-6, 9-11).
More informationData Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationThe Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
More informationModule 14: Scalability and High Availability
Module 14: Scalability and High Availability Overview Key high availability features available in Oracle and SQL Server Key scalability features available in Oracle and SQL Server High Availability High
More informationIN-MEMORY DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: In-Memory DBMS - SoSe 2015 1
IN-MEMORY DATABASE SYSTEMS Prof. Dr. Uta Störl Big Data Technologies: In-Memory DBMS - SoSe 2015 1 Analytical Processing Today Separation of OLTP and OLAP Motivation Online Transaction Processing (OLTP)
More informationChapter 13 File and Database Systems
Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation
More informationChapter 13 File and Database Systems
Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation
More informationSafe Harbor Statement
Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment
More informationInge Os Sales Consulting Manager Oracle Norway
Inge Os Sales Consulting Manager Oracle Norway Agenda Oracle Fusion Middelware Oracle Database 11GR2 Oracle Database Machine Oracle & Sun Agenda Oracle Fusion Middelware Oracle Database 11GR2 Oracle Database
More informationThe Vertica Analytic Database Technical Overview White Paper. A DBMS Architecture Optimized for Next-Generation Data Warehousing
The Vertica Analytic Database Technical Overview White Paper A DBMS Architecture Optimized for Next-Generation Data Warehousing Copyright Vertica Systems Inc. March, 2010 Table of Contents Table of Contents...2
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationA survey of big data architectures for handling massive data
CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationMaximizing Materialized Views
Maximizing Materialized Views John Jay King King Training Resources john@kingtraining.com Download this paper and code examples from: http://www.kingtraining.com 1 Session Objectives Learn how to create
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction... 3 2 Assumptions and Goals... 3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets... 3 2.4 Simple Coherency Model...3 2.5
More informationData Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A
Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical
More informationOracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.
Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE
More informationComparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
More informationCS54100: Database Systems
CS54100: Database Systems Date Warehousing: Current, Future? 20 April 2012 Prof. Chris Clifton Data Warehousing: Goals OLAP vs OLTP On Line Analytical Processing (vs. Transaction) Optimize for read, not
More informationSQL Server 2008 Performance and Scale
SQL Server 2008 Performance and Scale White Paper Published: February 2008 Updated: July 2008 Summary: Microsoft SQL Server 2008 incorporates the tools and technologies that are necessary to implement
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationF1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013
F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords
More informationIn-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
More informationIndex Selection Techniques in Data Warehouse Systems
Index Selection Techniques in Data Warehouse Systems Aliaksei Holubeu as a part of a Seminar Databases and Data Warehouses. Implementation and usage. Konstanz, June 3, 2005 2 Contents 1 DATA WAREHOUSES
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationThe Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationIBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop
IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop Frank C. Fillmore, Jr. The Fillmore Group, Inc. Session Code: E13 Wed, May 06, 2015 (02:15 PM - 03:15 PM) Platform: Cross-platform Objectives
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationThe Classical Architecture. Storage 1 / 36
1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage
More informationBigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
More informationOracle Database 11 g Performance Tuning. Recipes. Sam R. Alapati Darl Kuhn Bill Padfield. Apress*
Oracle Database 11 g Performance Tuning Recipes Sam R. Alapati Darl Kuhn Bill Padfield Apress* Contents About the Authors About the Technical Reviewer Acknowledgments xvi xvii xviii Chapter 1: Optimizing
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1
Slide 29-1 Chapter 29 Overview of Data Warehousing and OLAP Chapter 29 Outline Purpose of Data Warehousing Introduction, Definitions, and Terminology Comparison with Traditional Databases Characteristics
More informationHadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
More informationOracle Architecture, Concepts & Facilities
COURSE CODE: COURSE TITLE: CURRENCY: AUDIENCE: ORAACF Oracle Architecture, Concepts & Facilities 10g & 11g Database administrators, system administrators and developers PREREQUISITES: At least 1 year of
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationPerformance rule violations usually result in increased CPU or I/O, time to fix the mistake, and ultimately, a cost to the business unit.
Is your database application experiencing poor response time, scalability problems, and too many deadlocks or poor application performance? One or a combination of zparms, database design and application
More informationOracle Database 11g: SQL Tuning Workshop Release 2
Oracle University Contact Us: 1 800 005 453 Oracle Database 11g: SQL Tuning Workshop Release 2 Duration: 3 Days What you will learn This course assists database developers, DBAs, and SQL developers to
More informationOracle Database In-Memory The Next Big Thing
Oracle Database In-Memory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle do this Oracle Database In-Memory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes
More informationQuery Acceleration of Oracle Database 12c In-Memory using Software on Chip Technology with Fujitsu M10 SPARC Servers
Query Acceleration of Oracle Database 12c In-Memory using Software on Chip Technology with Fujitsu M10 SPARC Servers 1 Table of Contents Table of Contents2 1 Introduction 3 2 Oracle Database In-Memory
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationIn-Memory Databases MemSQL
IT4BI - Université Libre de Bruxelles In-Memory Databases MemSQL Gabby Nikolova Thao Ha Contents I. In-memory Databases...4 1. Concept:...4 2. Indexing:...4 a. b. c. d. AVL Tree:...4 B-Tree and B+ Tree:...5
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationOLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni
OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni Agenda Database trends for the past 10 years Era of Big Data and Cloud Challenges and Options Upcoming database trends Q&A Scope
More informationDistributed Data Management
Introduction Distributed Data Management Involves the distribution of data and work among more than one machine in the network. Distributed computing is more broad than canonical client/server, in that
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationBitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse
Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Zainab Qays Abdulhadi* * Ministry of Higher Education & Scientific Research Baghdad, Iraq Zhang Zuping Hamed Ibrahim Housien**
More informationOracle Database 11g: SQL Tuning Workshop
Oracle University Contact Us: + 38516306373 Oracle Database 11g: SQL Tuning Workshop Duration: 3 Days What you will learn This Oracle Database 11g: SQL Tuning Workshop Release 2 training assists database
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001
ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationFacebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
More informationDatabase Design Patterns. Winter 2006-2007 Lecture 24
Database Design Patterns Winter 2006-2007 Lecture 24 Trees and Hierarchies Many schemas need to represent trees or hierarchies of some sort Common way of representing trees: An adjacency list model Each
More informationHadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationInnovative technology for big data analytics
Technical white paper Innovative technology for big data analytics The HP Vertica Analytics Platform database provides price/performance, scalability, availability, and ease of administration Table of
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationCapacity Planning Process Estimating the load Initial configuration
Capacity Planning Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationPetabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics data 4
More information