Capacity Planning

Any data warehouse solution will grow over time, sometimes quite dramatically. It is essential that the components of the solution (hardware, software, and database) are capable of supporting the extended sizes without unacceptable performance loss, or without the load window growing to a point where it affects the use of the system.

Process

The capacity plan for a data warehouse is defined within the technical blueprint stage of the process. The business requirements stage should have identified the approximate sizes for data, users, and any other issues that constrain system performance.

One of the most difficult decisions you will have to make about a data warehouse is the capacity required of the hardware. It is important to have a clear understanding of the usage profiles of all users of the data warehouse. For each user or group of users you need to know the following:

# the number of users in the group;
# whether they use ad hoc queries frequently;
# whether they use ad hoc queries occasionally at unknown intervals;
# whether they use ad hoc queries occasionally at regular and predictable times;
# the average size of query they tend to run;
# the maximum size of query they tend to run;
# the elapsed login time per day;
# the peak time of daily usage;
# the number of queries they run per peak hour;
# the number of queries they run per day.

These usage profiles will probably change over time, and need to be kept up to date. They are useful for growth predictions and capacity planning. The profiles in themselves are not enough; you also require an understanding of the business.

Estimating the load

When choosing the hardware for the data warehouse there are many things to consider, such as hardware architecture, resilience, and so on. The data warehouse will probably grow rapidly from its initial configuration, so it is not sufficient to consider only the initial size of the data warehouse.
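The checklist above can be captured as a simple record per user group, so profiles can be kept up to date and fed into growth predictions. The sketch below is a hypothetical Python representation; all field names and the example figures are our own, not from any particular tool or methodology.

```python
from dataclasses import dataclass

@dataclass
class UsageProfile:
    """Capacity-planning profile for one group of warehouse users.

    Field names are illustrative; adapt them to your own templates.
    """
    group_name: str
    num_users: int
    ad_hoc_frequent: bool        # frequent ad hoc queries?
    ad_hoc_predictable: bool     # occasional queries at predictable times?
    avg_query_mb: float          # data volume scanned by an average query
    max_query_mb: float          # largest query they are likely to run
    login_hours_per_day: float
    peak_hour: int               # hour of day (0-23) of heaviest usage
    queries_per_peak_hour: int
    queries_per_day: int

# Example: a hypothetical group of 25 analysts querying the latest month
analysts = UsageProfile("analysts", 25, True, False,
                        500.0, 4000.0, 8.0, 10, 12, 60)
print(analysts.num_users * analysts.queries_per_day)  # total daily queries
```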
There are a number of different elements that need to be considered, but the decisions all come down to how much CPU, how much memory, and how much disk you will need. If your sizing calls for more capacity than the budget can afford, do not allow the required capacity to be chopped back: instead, pare back some of the functionality and then re-estimate the capacity. In the following pages we shall attempt to outline the rules and guidelines that we follow when sizing a system for a data warehouse.

Initial configuration

When sizing the initial configuration you will have no historical information or statistics to work with, and the sizing will need to be done on the predicted load. Estimating this load is difficult, because there is an ad hoc element to it.
All you can do is estimate the configuration based on the known requirements. This is why the business requirements phase is so important. When deciding on the initial configuration you will need to allow some contingency. This is particularly important in a data warehouse project, because the requirements are often difficult to pin down accurately, and the load can quickly vary from the expected. Otherwise, the sizing exercise is the same irrespective of the phase of the data warehouse that you are trying to size.

How much CPU bandwidth?

To start with you need to consider the distinct loads that will be placed on the system. There are many aspects to this, such as query load, data load, backup, and so on, but essentially the load can be divided into two distinct phases:

Daily processing
# user query processing

Overnight processing
# data transformation and load
# aggregation and index creation
# backup

Daily processing

The daily processing is centered on the user queries. To estimate the CPU requirements you need to estimate the time that each query will take. As much of the query load will be ad hoc, it is impossible to estimate the requirement of every query; therefore another approach has to be found. The first thing to do is estimate the size of the largest likely common query. It is possible that some user will want to query across every piece of data in the data warehouse, but this will probably not be a common requirement. It is more likely that the users will want to query the most recent week's or month's worth of data. Having established the likely period that will be queried, you will know the volume of data that will be involved. As you cannot assume that a relevant index will be in place, you must assume the query will perform a full table scan of the fact data for that period. We now have a measure of the volume of data, let us say F megabytes, that will be accessed.
To progress any further we need to know the I/O characteristics of the devices that the data will reside on. This allows us to calculate the scan rate S at which the fact data can be read. This will depend on the disk speeds and on the throughput ratings of the controllers. Clearly it also depends on the size of F itself: if F is many gigabytes, then it will certainly be spread across multiple disks and probably across multiple controllers. You can now calculate S assuming a reasonable spread of the data. Remember that if the database is designed correctly and the large queries are controlled properly you should not get much contention for the disks, so you should get a reasonable throughput. Using S and F you can calculate T, the time in seconds to perform a full table scan of the period in question:

T = F/S    (1.1)

In fact you should calculate a number of times, T1 to Tn, which depend on the degree of parallelism that you are using to perform the scan. Therefore we get

T1 = F/S1, ..., Tn = F/Sn    (1.2)
where S1 is the scan speed of a disk or striped disk set, and Sn is the scan speed of all the disks or disk sets that F is spread across. You may be able to get slightly higher throughput than Sn with higher degrees of parallelism, but you will bottleneck on I/O at degrees of parallelism much above n.

Now you can take the query response time requirements specified in the service level agreement and pick the appropriate T value, Tp say. This gives you Sp, the required scan rate, and hence the number of disks or disk sets you will need to spread the data across. It also gives you P, the required degree of parallelism to meet the response time for a single query. Now you need to estimate Pc, the number of parallel scanning threads that a single CPU will support. This will vary from processor to processor. The processors currently on the market will support from two to four scanners, but chip technology is moving on quickly, and this number will change over time. If possible, you should establish this figure by experiment. Now you can estimate the CPU requirement to support a single query:

Cs = Roundup(2P/Pc)    (1.3)

You need to use 2P to allow for other query overheads, and for queries that involve sorts. To calculate the minimum number of CPUs required overall, use the following formula:

Ct = nCs + 1    (1.4)

where n is the number of concurrent queries that will be allowed to run. This should not be confused with the number of concurrent users, because unless a user is actually running a query they are not likely to be doing anything heavier than editing. Note that the additional 1 is added to the total to allow for the operating system overheads and for all other user processing.

Overnight processing

The first point to note about the nightly processing is that the operations listed at the beginning of this section are, for the most part, serialized. This is because each operation usually relies on the previous operation's completing before it can begin.
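The daily-processing steps above (eqs. 1.1 to 1.4) can be strung together in a short script. The figures below for scan rate, response time target, threads per CPU, and concurrent queries are purely illustrative assumptions, not recommendations:

```python
import math

F = 20_000         # MB of fact data scanned by the largest common query - assumed
S1 = 25            # MB/s scan rate of one disk or striped disk set - assumed
target_secs = 100  # query response time from the service level agreement - assumed
Pc = 3             # parallel scan threads one CPU supports - establish by experiment
n_queries = 4      # concurrent queries allowed - assumed

# Eqs. 1.1/1.2: with parallelism p the scan time is Tp = F / (p * S1).
# Invert to find P, the parallelism needed to hit the target response time.
P = math.ceil(F / (S1 * target_secs))

# Eq. 1.3: CPUs for one query; 2P allows for sorts and other query overheads.
Cs = math.ceil(2 * P / Pc)

# Eq. 1.4: minimum total CPUs, +1 for the OS and all other processing.
Ct = n_queries * Cs + 1

print(P, Cs, Ct)
```

With these assumed numbers, a single query needs 8-way parallel scanning, which maps to 6 CPUs per query and a 25-CPU minimum overall.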
The CPU bandwidth required for the data transformation will depend on the amount of data processing that is involved. Unless there is an enormous amount of data transformation, it is unlikely that this operation will require more CPU bandwidth than the aggregation and index creation operations. The same applies to the backup, although you should bear in mind that backing up large quantities of data in a short period of time will place a major load on the CPU. If the backup is spread over more hours, the degree of parallelism will come down, and its CPU bandwidth requirement will drop. The data load is another task that can use massive parallelism to speed up its operation. As with backup, if you use fewer parallel streams it will use less CPU bandwidth and will take longer to run.

Having established what you are going to use as your baseline, you then need to estimate how much CPU capacity that operation requires to complete in the allowed time. It is not safe to assume that you will have more than 10 hours overnight to achieve all the processing, even if the user day is only 8 or 9 hours long. Delays in data arrival can cause you significant problems, and you must make sure that you can complete the overnight processing without running over into the business day.
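The trade-off between backup parallelism and elapsed time comes down to simple arithmetic: the shorter the window, the more parallel streams (and hence CPU) you need. The volume, per-stream throughput, and window below are assumed figures for illustration only:

```python
import math

backup_gb = 2000         # nightly backup volume - an assumed figure
stream_mb_per_sec = 40   # throughput of a single backup stream - assumed
window_hours = 3         # share of the overnight window given to backup - assumed

# Aggregate throughput needed to finish inside the window, and the number
# of parallel streams (and so the CPU load) required to deliver it.
required_mb_per_sec = backup_gb * 1024 / (window_hours * 3600)
streams = math.ceil(required_mb_per_sec / stream_mb_per_sec)
print(streams)
```

Doubling the window roughly halves the required streams, which is the point made above: spreading the backup over more hours reduces its CPU bandwidth requirement.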
As every data warehouse is different, it is impossible here to give explicit estimates for these operations. It is not even possible to give firm guidelines, because each aggregation is a different complex query and/or update plus an intensive write operation.

How much memory?

There are a number of things that affect the amount of memory required. First, there are the database requirements. The database will need memory to cache data blocks as they are used; it will also need memory to cache parsed SQL statements, and so on. You will also need memory for sort space. Secondly, each user connected to the system will use an amount of memory: how much will depend on how they are connected to the system and what software they are running. Finally, the operating system will require an amount of memory.

How much disk?

The disk requirement can be broken down into the following categories:

Database requirements
# administration
# fact and dimension data
# aggregation

Non-database requirements
# operating system requirements
# other software requirements
# user requirements

Database sizing

There are a number of aspects to the database sizing that need to be considered. First, there are the system administration requirements of the database. There are the data dictionary and the journal files, plus any rollback space that is required. These will all be small by comparison with the temporary or sort area. If you can gauge the size of the largest transaction that will realistically be run, you can use this to size the temporary requirements. If not, the best you can do is tie it to the size of a partition. If you do this, make allowance for multiple queries running at a given time, and set the temporary space to

T = (2n + 1)P    (1.5)

where n is the number of concurrent queries allowed, and P is the size of a partition. If you use different-sized partitions for current data and for older data, use the largest partition size in the calculation above.
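Equation 1.5 is simple enough to encode directly; a minimal sketch (the helper name and example figures are our own):

```python
def temp_space_gb(n_concurrent_queries: int, partition_gb: float) -> float:
    """Eq. 1.5: T = (2n + 1) * P, with P the largest partition size."""
    return (2 * n_concurrent_queries + 1) * partition_gb

# Four concurrent queries against 3 GB monthly partitions
print(temp_space_gb(4, 3))  # 27 GB of temporary space
```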
Then there is the fact and dimension data. This is the one piece of data that you can actually size; everything else will be sized off it. To do this sizing exercise, you will need to know the database schema. Clearly, when performing the original size estimates for a business case, much of this information will be missing. In that situation, the sizing will need to be based purely on original estimates of the base data size. When sizing the fact or dimension data you will have the record definitions, with each field type and size specified. Note, however, that the size specified for a field will be the maximum size. When calculating the actual size you will need to know:

# the average size of the data in each field;
# the percentage occupancy of each field;
# the position of each data field in the table;
# the RDBMS storage format of each data type;
# any table, row, and column overhead.

Another factor that may affect the calculations is the database block or page size. A database block will normally have some header information at the top of the block. This is space that cannot be used by data, and it can amount to 100+ bytes per block. With a 2 KB block size there are eight such headers in every 16 KB, against a single header with a 16 KB block size, so the larger block size yields of the order of 700 bytes of extra data space in every 16 KB.

You will also need to estimate the size of the index space required for the fact and dimension data. Fact data should generally be only lightly indexed, with indexes occupying between 15% and 30% of the space occupied by the fact data; the cost in terms of index maintenance would be extremely heavy otherwise. The final determinant of the ultimate size of the fact data is the amount of data that you intend to keep online. When this is known, you can decide on your partitioning strategy. This needs to be taken into account in the sizing, because it is unlikely that you will get data to load exactly into every partition.

You also need to size the aggregations. For the initial system you will probably be able to size the actual aggregations that are planned. All the factors discussed above apply equally to the aggregations. As a rule of thumb, you should allow the same amount of space for aggregations as you will have fact data online. You will also need to allow space for indexes on the aggregations. These summarized tables are likely to be heavily indexed, and it is usual to assume 100% indexation. In other words, allow as much space again for indexing as for the aggregates themselves.
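The per-field factors above can be combined into a rough table-size estimator. This is a sketch only: the helper name, the 5-byte row overhead, the 100-byte block header, and the example field list are all assumptions for illustration, and real row formats vary by RDBMS.

```python
import math

def table_size_mb(rows, fields, row_overhead=5, block_size=16 * 1024,
                  block_header=100):
    """Estimate table size from per-field (avg_bytes, occupancy) pairs.

    Overhead figures here are illustrative; check your RDBMS documentation
    for the real row and block formats.
    """
    row_bytes = row_overhead + sum(avg * occ for avg, occ in fields)
    usable = block_size - block_header          # data space per block
    rows_per_block = math.floor(usable / row_bytes)
    blocks = math.ceil(rows / rows_per_block)
    return blocks * block_size / (1024 * 1024)

# 10 million fact rows: an 8-byte key and two 8-byte measures always
# present, plus a 40-byte field that is only 20% occupied.
fields = [(8, 1.0), (8, 1.0), (8, 1.0), (40, 0.2)]
print(round(table_size_mb(10_000_000, fields)))
```

Note how the occupancy factor shrinks the 40-byte field to an effective 8 bytes; sizing from maximum field widths instead would roughly double this estimate.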
So, to summarize, the space required by the database will be

Space required = F + Fi + D + Di + A + Ai + T + S    (1.6)

where F is the size of the fact data (all the fact data that will be kept online); Fi is the size of the fact data indexation; D is the size of the dimension data; Di is the size of the dimension data indexation; A is the size of the aggregations; Ai is the size of the aggregations indexation; T is the size of the database temporary or sort area; and S is the database system administration overhead.

If you want a quick upper bound on the database size, equation 1.6 can be reduced as follows:

Space required = F + Fi + D + Di + A + Ai + T + S
              = 3F + Fi + D + Di + T + S    as A = Ai = F
              < 3.3F + D + Di + T + S       as Fi <= 30% of F
              < 3.5F + T + S                as D = Di <= 10% of F
              <= 3.5F + T                   as S << T and S << F    (1.7)

If F is sized accurately, this formula will give a reasonable estimate of the ultimate system size. To show a worked example, suppose the fact data is calculated to be 36 GB of data per year, and 4 years' worth of data are to be kept online. This means that F would be 144 GB. Then using eq. 1.7 we get
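Equations 1.6 and 1.7 translate directly into two small helpers; the component sizes in the example are invented purely to show that the quick bound dominates the itemized figure under the stated assumptions:

```python
def db_space_gb(F, Fi, D, Di, A, Ai, T, S):
    """Eq. 1.6: total database space from its individual components."""
    return F + Fi + D + Di + A + Ai + T + S

def db_space_bound_gb(F, T):
    """Eq. 1.7: quick upper bound, assuming A = Ai = F, Fi <= 0.3F,
    D = Di <= 0.1F, and S negligible beside T and F."""
    return 3.5 * F + T

# Invented component sizes consistent with those assumptions:
exact = db_space_gb(F=100, Fi=25, D=8, Di=8, A=100, Ai=100, T=20, S=1)
bound = db_space_bound_gb(100, 20)
print(exact, bound)
```

The itemized total (362 GB here) stays under the 3.5F + T bound (370 GB), as the derivation promises whenever the assumptions hold.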
Space required = 3.5F + T
              = (3.5 * 144) + T    (1.8)
              = 504 + T GB

Now suppose the data is to be partitioned by month; that gives a partition size P of 3 GB. If four concurrent queries are to be allowed, using eq. 1.5 we can now estimate the size of the temporary space T:

T = (2n + 1)P
  = [(2 * 4) + 1] * 3
  = 27 GB

This gives a total database size of 531 GB for the full-sized system. As 4 years' worth of data are to be kept online, this figure represents the size of the database after 4 years' worth of data has been loaded. To size the initial system for 6 months' worth of data you can say

Initial space = (3.5F + T) / ([(n * 12)/6] + 1)    (1.9)

where n is the number of years' data you intend to keep online. Note the +1 in the denominator: this accounts for the dimension data's being a bigger percentage of the fact data initially. For n = 4, eq. 1.9 reduces to

Initial space = (3.5F + T)/9    (1.10)

which is your initial sizing. One final word: remember that every data warehouse is different. These figures are only guidelines.
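The worked example above can be checked end to end in a few lines; the figures are exactly those in the text, and the helper name is our own:

```python
def initial_space_gb(fact_gb: float, temp_gb: float, years_online: int) -> float:
    """Eq. 1.9: space for the first 6 months' worth of data.

    The +1 in the denominator allows for dimension data forming a larger
    share of the total early on.
    """
    return (3.5 * fact_gb + temp_gb) / ((years_online * 12) / 6 + 1)

# Worked example from the text: 36 GB/year kept for 4 years -> F = 144 GB;
# monthly 3 GB partitions and four concurrent queries -> T = 27 GB (eq. 1.5).
F = 36 * 4
T = (2 * 4 + 1) * 3
full_size = 3.5 * F + T
print(full_size)                  # 531.0 GB full-sized system
print(initial_space_gb(F, T, 4))  # 59.0 GB initial 6-month system
```

Checking estimates this way costs almost nothing and catches arithmetic slips before they reach a hardware order.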