Capacity Planning: Process, Estimating the Load, Initial Configuration




Capacity Planning

Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting the extended sizes without unacceptable performance loss, or without the load window growing to the point where it affects the use of the system.

Process

The capacity plan for a data warehouse is defined within the technical blueprint stage of the process. The business requirements stage should have identified the approximate sizes for data, users, and any other issues that constrain system performance. One of the most difficult decisions you will have to make about a data warehouse is the capacity required by the hardware.

It is important to have a clear understanding of the usage profiles of all users of the data warehouse. For each user or group of users you need to know the following:

# the number of users in the group;
# whether they use ad hoc queries frequently;
# whether they use ad hoc queries occasionally at unknown intervals;
# whether they use ad hoc queries occasionally at regular and predictable times;
# the average size of query they tend to run;
# the maximum size of query they tend to run;
# the elapsed login time per day;
# the peak time of daily usage;
# the number of queries they run per peak hour;
# the number of queries they run per day.

These usage profiles will probably change over time, and need to be kept up to date. They are useful for growth predictions and capacity planning. The profiles in themselves are not enough; you also require an understanding of the business.

Estimating the load

When choosing the hardware for the data warehouse there are many things to consider, such as hardware architecture, resilience, and so on. The data warehouse will probably grow rapidly from its initial configuration, so it is not sufficient to consider only the initial size of the data warehouse.
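The usage-profile fields listed under Process can be captured as a simple per-group record. A minimal sketch in Python; every field name and the example figures are assumptions for illustration, not part of any real tool:

```python
from dataclasses import dataclass

@dataclass
class UsageProfile:
    """One usage profile per user group, mirroring the list above."""
    group_name: str
    num_users: int
    adhoc_usage: str            # "frequent", "occasional-unknown", "occasional-regular"
    avg_query_mb: float         # average size of query (data scanned, MB)
    max_query_mb: float         # maximum size of query
    login_hours_per_day: float  # elapsed login time per day
    peak_hour: int              # peak time of daily usage (0-23)
    queries_per_peak_hour: int
    queries_per_day: int

# Hypothetical profile for a finance group, kept up to date as usage changes.
finance = UsageProfile("finance", 40, "occasional-regular",
                       200.0, 2000.0, 8.0, 9, 15, 60)
print(finance.queries_per_day)  # 60
```

Keeping the profiles in a structured form like this makes the later growth predictions repeatable rather than ad hoc.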
There are a number of different elements that need to be considered, but the decisions all come down to how much CPU, how much memory, and how much disk you will need. If your sizing calls for more than the budget can afford, do not allow the required capacity to be chopped back. If the budget cannot be increased, some of the functionality will need to be pared back and the capacity re-estimated. In the following pages we shall attempt to outline the rules and guidelines that we follow when sizing a system for a data warehouse.

Initial configuration

When sizing the initial configuration you will have no historical information or statistics to work with, and the sizing will need to be done on the predicted load. Estimating this load is difficult, because there is an ad hoc element to it.

All you can do is estimate the configuration based on the known requirements. This is why the business requirements phase is so important. When deciding on the initial configuration you will need to allow some contingency. This is particularly important in a data warehouse project, because the requirements are often difficult to pin down accurately, and the load can quickly vary from the expected. Otherwise, the sizing exercise is the same irrespective of the phase of the data warehouse that you are trying to size.

How much CPU bandwidth?

To start with you need to consider the distinct loads that will be placed on the system. There are many aspects to this, such as query load, data load, backup and so on, but essentially the load can be divided into two distinct phases:

Daily processing
# user query processing

Overnight processing
# data transformation and load
# aggregation and index creation
# backup

Daily processing

The daily processing is centered on the user queries. To estimate the CPU requirements you need to estimate the time that each query will take. As much of the query load will be ad hoc, it is impossible to estimate the requirement of every query; therefore another approach has to be found. The first thing to do is estimate the size of the largest likely common query. It is possible that some user will want to query across every piece of data in the data warehouse, but this will probably not be a common requirement. It is more likely that the users will want to query the most recent week's or month's worth of data. Having established the likely period that will be queried, you will know the volume of data that will be involved. As you cannot assume that a relevant index will be in place, you must assume the query will perform a full table scan of the fact data for that period. So we now have a measure of the volume of data, let us say F megabytes, that will be accessed.
To progress any further we need to know the I/O characteristics of the devices that the data will reside on. This allows us to calculate the scan rate S at which the fact data can be read. This will depend on the disk speeds and on the throughput ratings of the controllers. Clearly it also depends on the size of F itself: if F is many gigabytes then it will certainly be spread across multiple disks, and probably across multiple controllers. You can now calculate S assuming a reasonable spread of the data. Remember that if the database is designed correctly and the large queries are controlled properly you should not get much contention for the disks, so you should get a reasonable throughput.

Using S and F you can calculate T, the time in seconds to perform a full table scan of the period in question:

    T = F/S ---------- 1.1

In fact you should calculate a number of times, T1 to Tn, which depend on the degree of parallelism that you are using to perform the scan. Therefore we get

    T1 = F/S1  ...  Tn = F/Sn ---------- 1.2

where S1 is the scan speed of a single disk or striped disk set, and Sn is the aggregate scan speed of all n disks or disk sets that F is spread across. You may be able to get slightly higher throughput than Sn with higher degrees of parallelism, but you will bottleneck on I/O at degrees of parallelism much above n.

Now you can take the query response time requirements specified in the service level agreement and pick the appropriate T value, Tp say. This gives you Sp, the required scan rate, and hence the number of disks or disk sets you will need to spread the data across. It also gives you P, the required degree of parallelism to meet query response times for a single query.

Next you need to estimate Pc, the number of parallel scanning threads that a single CPU will support. This will vary from processor to processor. The processors currently on the market will support from two to four scanners, but chip technology is moving on quickly, and this number will change over time. If possible, you should establish this by experiment. Now you can estimate your CPU requirement to support a single query with:

    Cs = Roundup(2P/Pc) ---------- 1.3

You need to use 2P to allow for other query overheads, and for queries that involve sorts. To calculate the minimum number of CPUs required overall, use the following formula:

    Ct = nCs + 1 ---------- 1.4

where n is the number of concurrent queries that will be allowed to run. This should not be confused with the number of concurrent users, because unless a user is running a query they are not likely to be doing anything heavier than editing. Note that the additional 1 is added to the total to allow for operating system overheads and for all other user processing.

Overnight processing

The first point to note about the nightly processing is that the operations listed at the beginning of this section are, for the most part, serialized. This is because each operation usually relies on the previous operation's completing before it can begin.
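Taken together, equations 1.1 to 1.4 give a back-of-envelope CPU estimate for the daily query load. A minimal Python sketch; every input figure below is assumed for illustration, not a measured value:

```python
import math

# Assumed inputs (illustrative only):
F = 20_000     # MB of fact data scanned by the largest common query
S1 = 100       # MB/s scan rate of one disk or striped disk set
Pc = 3         # parallel scanning threads one CPU supports (establish by test)
n = 4          # concurrent queries allowed
target_s = 60  # response-time target from the service level agreement

# Eqs. 1.1-1.2: raise the degree of parallelism P until the scan time
# F / (P * S1) meets the response-time target.
P = 1
while F / (P * S1) > target_s:
    P += 1

Cs = math.ceil(2 * P / Pc)  # eq. 1.3: CPUs per query (2P covers sorts etc.)
Ct = n * Cs + 1             # eq. 1.4: total CPUs, +1 for the OS and other work
print(P, Cs, Ct)            # 4 3 13
```

With these figures a 20 GB scan needs 4-way parallelism to finish inside 60 seconds, 3 CPUs per query, and 13 CPUs to sustain four concurrent queries.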
The CPU bandwidth required for the data transformation will depend on the amount of data processing that is involved. Unless there is an enormous amount of data transformation, it is unlikely that this operation will require more CPU bandwidth than the aggregation and index creation operations. The same applies to the backup, although you should bear in mind that backing up large quantities of data in a short period of time will place a major load on the CPU. If the backup is spread over more hours, the degree of parallelism will come down, and its CPU bandwidth requirement will drop. The data load is another task that can use massive parallelism to speed up its operation. As with backup, if you use fewer parallel streams it will use less CPU bandwidth and will take longer to run.

Having established what you are going to use as your baseline, you then need to estimate how much CPU capacity that operation requires to complete in the allowed time. It is not safe to assume that you will have more than 10 hours overnight to achieve all the processing, even if the user day is only 8 or 9 hours long. Delays in data arrival can cause you significant problems, and you must make sure that you can complete the overnight processing without running over into the business day.

As every data warehouse is different, it is impossible here to give explicit estimates for these operations. It is not even possible to give firm guidelines, because each aggregation is a different complex query and/or update, plus an intensive write operation.

How much memory?

There are a number of things that affect the amount of memory required. First, there are the database requirements. The database will need memory to cache data blocks as they are used; it will also need memory to cache parsed SQL statements, and so on. You will also need memory for sort space. Secondly, each user connected to the system will use an amount of memory: how much will depend on how they are connected to the system and what software they are running. Finally, the operating system itself will require an amount of memory.

How much disk?

The disk requirement can be broken down into the following categories:

Database requirements
# administration
# fact and dimension data
# aggregations

Non-database requirements
# operating system requirements
# other software requirements
# user requirements

Database sizing

There are a number of aspects to the database sizing that need to be considered. First, there are the system administration requirements of the database. There is the data dictionary and the journal files, plus any rollback space that is required. These will all be small by comparison with the temporary or sort area. If you can gauge the size of the largest transaction that will realistically be run, you can use this to size the temporary requirements. If not, the best you can do is tie it to the size of a partition. If you do this, make allowance for multiple queries running at a given time, and set the temporary space to

    T = (2n + 1)P ---------- 1.5

where n is the number of concurrent queries allowed, and P is the size of a partition. If you use different-sized partitions for current data and for older data, use the largest partition size in the calculation above.
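Equation 1.5 is trivial to apply once the two inputs are fixed. A sketch, with both figures assumed for illustration:

```python
# Eq. 1.5: temporary/sort space tied to partition size.
n = 4        # concurrent queries allowed (assumed)
P_gb = 3     # largest partition size in GB (assumed)

T_gb = (2 * n + 1) * P_gb  # each query may need ~2 partitions of sort space,
                           # plus one partition of headroom
print(T_gb)                # 27
```

Note how sensitive T is to the concurrency limit: doubling n to 8 would raise the temporary space from 27 GB to 51 GB with the same partition size.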
Then there is the fact and dimension data. This is the one piece of data that you can actually size; everything else will be sized off it. To do this sizing exercise you will need to know the database schema. Clearly, when performing the original size estimates for a business case, much of this information will be missing. In this situation the sizing will need to be based purely on original estimates of the base data size. When sizing the fact or dimension data you will have the record definitions, with each field type and size specified. Note, however, that the size specified for a field will be the maximum size. When calculating the actual size you will need to know:

# the average size of the data in each field;

# the percentage occupancy of each field;
# the position of each data field in the table;
# the RDBMS storage format of each data type;
# any table, row, and column overheads.

Another factor that may affect our calculations is the database block or page size. A database block will normally have some header information at the top of the block. This is space that cannot be used by data, and can amount to 100+ bytes. Thus the difference between using a 2 KB block size and using a 16 KB block size is of the order of 700 bytes of extra data space in every 16 KB, because eight block headers are replaced by one.

You will also need to estimate the size of the index space required for the fact and dimension data. Fact data should generally be only lightly indexed, with indexes occupying between 15% and 30% of the space occupied by the fact data; the cost in terms of index maintenance would otherwise be extremely heavy. The final determinant of the ultimate size of the fact data is the amount of data that you intend to keep online. When this is known, you can decide on your partitioning strategy. This needs to be taken into account in the sizing, because it is unlikely that your data will load exactly into every partition.

You also need to size the aggregations. For the initial system you will probably be able to size the actual aggregations that are planned. All the factors discussed above apply equally to the aggregations. As a rule of thumb, you should allow the same amount of space for aggregations as you will have fact data online. You will also need to allow space for indexes on the aggregations. These summarized tables are likely to be heavily indexed, and it is usual to assume 100% indexation. In other words, allow as much space again for indexing as for the aggregates themselves.
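As an illustration of how these factors combine, here is a rough row-level and block-level size estimate. Every figure below (field sizes, occupancy, overheads, row count) is an assumption for the sketch, not a real RDBMS storage format:

```python
# Assumed record definition: average stored bytes per field, and the
# fraction of rows in which each field is actually populated.
avg_field_bytes = [4, 8, 8, 2, 6]
occupancy       = [1.0, 1.0, 0.9, 0.7, 1.0]
row_overhead = 5            # per-row bookkeeping bytes (assumed)
rows = 50_000_000           # fact rows to be kept online (assumed)

# Average stored row size, weighting each field by its occupancy.
row_bytes = row_overhead + sum(b * o for b, o in zip(avg_field_bytes, occupancy))
raw_bytes = rows * row_bytes

# Block-header overhead: only (block_size - header) bytes hold data.
block_size = 16 * 1024      # 16 KB blocks
header = 100                # block header bytes, per the text
blocks = raw_bytes / (block_size - header)
total_gb = blocks * block_size / 1024**3
print(f"{row_bytes:.1f} bytes/row, {total_gb:.2f} GB")
```

The same calculation with 2 KB blocks (eight headers per 16 KB instead of one) shows why the block size choice matters for large fact tables.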
So, to summarize, the space required by the database will be

    Space required = F + Fi + D + Di + A + Ai + T + S ---------- 1.6

where F is the size of the fact data (all the fact data that will be kept online); Fi is the size of the fact data indexation; D is the size of the dimension data; Di is the size of the dimension data indexation; A is the size of the aggregations; Ai is the size of the aggregation indexation; T is the size of the database temporary or sort area; and S is the database system administration overhead.

If you want a quick upper bound on the database size, equation 1.6 can be reduced as follows:

    Space required = F + Fi + D + Di + A + Ai + T + S
                   = 3F + Fi + D + Di + T + S    as A = Ai = F
                   < 3.3F + D + Di + T + S       as Fi <= 30% of F
                   < 3.5F + T + S                as D = Di <= 10% of F
                   <= 3.5F + T                   as S << T and S << F   ---------- 1.7

If F is sized accurately, this formula will give a reasonable estimate of the ultimate system size. To show a worked example, suppose the fact data is calculated to be 36 GB per year, and 4 years' worth of data are to be kept online. This means that F would be 144 GB. Then, using eq. 1.7, we get

    Space required = 3.5F + T = (3.5 * 144) + T ---------- 1.8
                   = 504 + T GB

Now suppose the data is to be partitioned by month; that would give a partition size P of 3 GB. If four concurrent queries are to be allowed, using eq. 1.5 we can now estimate the size of the temporary space T:

    T = (2n + 1)P = [(2 * 4) + 1] * 3 = 27 GB

This gives a total database size of 531 GB for the full-sized system. If it is intended to keep 3 years' worth of data online, the formula above will represent the size of the database after 3 years' worth of data has been loaded. To size the initial system for 6 months' worth of data you can say

    Initial space = (3.5F + T) / {[(n * 12)/6] + 1} ---------- 1.9

where n is the number of years' data you intend to keep online. Note the +1 in the divisor: this accounts for the dimension data being a bigger percentage of the fact data initially. With n = 3, eq. 1.9 reduces to

    Initial space = (3.5F + T)/7 ---------- 1.10

which is your initial sizing. One final word: remember that every data warehouse is different. These figures are only guidelines.
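The worked example can be recomputed end to end as a quick sanity check. This sketch simply replays the chapter's own figures (36 GB/year of fact data, monthly partitions, four concurrent queries, and n = 3 years in the eq. 1.9 divisor, as the text uses it):

```python
F = 36 * 4                # 144 GB of fact data online
P = F / 48                # 3 GB per monthly partition (48 months)
T = (2 * 4 + 1) * P       # eq. 1.5: 27 GB of temporary space
full = 3.5 * F + T        # eq. 1.8: upper-bound full-system size

n = 3                                 # years kept online, per eq. 1.10
initial = full / ((n * 12 / 6) + 1)   # eq. 1.9: divisor of 7
print(full, round(initial, 1))        # 531.0 75.9
```

So the full-sized system comes to 531 GB, and the initial 6-month configuration to roughly 76 GB; both are guideline figures, not precise sizes.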