PARALLEL PROCESSING AND THE DATA WAREHOUSE


BY W. H. Inmon

One of the essences of the data warehouse environment is the accumulation and management of large amounts of data. Indeed, it is said that if you manage large amounts of data well in the data warehouse environment, all other aspects of data warehouse design and usage come easily. And if you do not manage large amounts of data well, nothing else really matters, because you will fail. The management of large amounts of data is the first and most critical success factor in the building and using of the data warehouse.

There are many design approaches and techniques for the management of large amounts of data in the warehouse environment, such as:

- storing data on multiple storage media,
- summarizing data when detail becomes obsolete,
- storing data relationships in terms of artifacts (see the tech topic on data relationships in the data warehouse for an in-depth discussion of this topic),
- encoding and referencing data where appropriate,
- partitioning data for independent management of the different partitions,
- choosing levels of granularity and summarization properly for the data warehouse,
- and so forth.

While all of these design and architecture techniques are valid and should be employed in any hardware environment, there is another approach to the management of large volumes of data for the data warehouse: selecting technology that can manage data in parallel. Parallel technology is sometimes known as data base machine technology.

THE TOPIC OF DISCUSSION

This discussion concerns the management of large amounts of data warehouse data in a parallel environment. At this point the reader should be very cautious about one aspect of this discussion: it is for the data warehouse only. Occasionally a developer will try to use data base machine or parallel technology for operational processing. This discussion is not about that kind of environment.
Also, occasionally a designer will try to use data base machine technology for an environment that attempts to do both operational transaction processing and data warehouse processing at the same time, on the same machine, on the same data. This discussion is not about that environment either. (For an in-depth discussion of the mixed operational and data warehouse environment, refer to the tech topic on doing both operational and DSS processing on a single database.) This discussion is for the data warehouse environment only, where parallel technology has been selected as the (or one of the) primary storage and access methods.

THE APPEAL OF PARALLEL TECHNOLOGY

Parallel technology is technology in which different machines are tightly coupled together but work independently. Each machine manages its own collection of data independently; the spread of data across the machines in the data warehouse is disjoint, with no overlap between the data owned by different processors.

Copyright 2000 by William H. Inmon, all rights reserved

Figure 1 shows the basic configuration of processors working together and managing data independently. In Figure 1 there are five basic components of interest: the transaction (or request), the queue the transaction goes into, the network connecting the processors, the processor, and the data controlled by the processor. A request enters the system, and the processor to which the request needs to be channeled determines the queue the request is routed to. In the case of a large transaction, the transaction may be broken into a series of requests for data and processing, which in turn are routed to the appropriate processors. The request enters the queue, and the processor starts the execution of the request. Upon the completion of the work done for the request, the results are sent to the requestor.

While one processor is servicing the requests that operate on data that belongs to it, another processor can be servicing the requests that have been channeled to it, independently of the work done by other processors. It is this independence of processing that has great appeal to the data warehouse architect, because it means that managing large amounts of data is technologically possible. To manage large amounts of data merely requires harnessing multiple processors together. Said another way, in order to manage more data warehouse data, the data warehouse architect merely needs to add more processors to the tightly networked configuration, as shown by Figure 2.
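The routing just described, where a request is channeled to the queue of the processor that owns the data it touches, can be sketched as follows. This is a minimal illustration only; the per-processor queues, the key names, and the processor count are assumptions for the sketch, not details from any particular data base machine.

```python
# A minimal sketch of request routing in a parallel configuration:
# each processor owns a queue, and a request is routed to the queue of
# the processor that owns the data it touches. A large request is first
# broken into per-processor sub-requests.
from collections import deque

NUM_PROCESSORS = 4
queues = [deque() for _ in range(NUM_PROCESSORS)]

def owner(key: str) -> int:
    """Which processor owns the data for this key (hash spread)."""
    return hash(key) % NUM_PROCESSORS

def route(request_keys):
    """Split a request by owning processor and enqueue each piece."""
    for key in request_keys:
        queues[owner(key)].append(key)

# A large request touching three records becomes up to three
# sub-requests, each queued for its owning processor.
route(["acct-17", "acct-18", "acct-19"])
```

Each processor can then drain its own queue without coordinating with the others, which is the independence of processing the text describes.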

In nonparallel environments, adding large amounts of data to an already large environment can cause tremendous difficulties at the basic operating system level. Nonparallel environments have a threshold of data beyond which they operate inefficiently. Once this threshold is reached, there is nothing that can be done except to go to another technology. And transferring data and processing to another technology can be a disruptive, expensive, complex experience, to be avoided if at all possible. Parallel processing offers the possibility that the technology itself can be extended almost ad infinitum, avoiding the conversion of the data warehouse from one technology to another.

The independence of processing in the parallel environment leads to the observation that the speed of access of data is proportional to the number of processors the data is spread over. Suppose a parallel configuration has m independent processors with data spread evenly and optimally over the processors. Now suppose it takes a single large (nonparallel) processor n units of time to service a request. The elapsed time required for the parallel configuration to execute the same service is n/m. Figure 3 shows this difference.

The savings in elapsed time gained by going to a parallel environment can be expressed:

n - (n/m) = elapsed time differential

It is worthy of note that the work done by the systems, either parallel or nonparallel, is the same in terms of I/O. The real difference is not in the total amount of work done, but in the elapsed time required to do that work.

Another observation about the parallel environment is that the marginal improvement in elapsed time decreases as processors are added. In other words, when the number of processors in the parallel environment increases from one to two, there is an enormous improvement in elapsed time. When the number of processors increases from two to three, there is a significant improvement in elapsed time. But as the number of processors continues to increase, the improvement grows smaller. For example, increasing the number of processors from twenty to twenty-one may make no noticeable improvement in elapsed time at all.

The choice between a parallel approach to technology in the data warehouse and a standard centralized approach usually revolves around volumes of data. For small to modest-sized data warehouses, a centralized approach makes economic and technological sense. But after a point, when the data warehouse starts to contain a very large volume of data, a parallel approach becomes economically and technologically advantageous. The choice always boils down to both technological and economic considerations.
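The n/m elapsed time and its diminishing marginal improvement can be illustrated with a quick calculation. The figure of 1200 time units is hypothetical, chosen only to make the arithmetic visible:

```python
# Elapsed time for an evenly spread workload over m processors is n/m,
# so the gain from adding one more processor shrinks as m grows.

def elapsed(n, m):
    """Elapsed time when n units of work are spread evenly over m processors."""
    return n / m

def marginal_gain(n, m):
    """Improvement in elapsed time from adding the m-th processor."""
    return elapsed(n, m - 1) - elapsed(n, m)

n = 1200  # hypothetical single-processor elapsed time

gain_1_to_2 = marginal_gain(n, 2)    # 600.0 -- an enormous improvement
gain_20_to_21 = marginal_gain(n, 21) # about 2.9 -- barely noticeable
```

The total I/O is the same in every case; only the elapsed time changes, exactly as the text observes.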

PHYSICAL ORGANIZATION

There are many ways that the components of parallel technology can be arranged. The following is a discussion of the most common ways, but it is hardly intended to describe all the possibilities. It is worth noting that each configuration and arrangement of the components of parallel technology has its associated tradeoffs.

Figure 4 illustrates the (typical) dynamics of the inner workings of the components of the parallel environment. The request can be a singular request for data that is routed to the appropriate processor, or it can be a general request that is broken down into a series of specific requests that are individually routed to individual processors. The queue that the request goes into can be a single large queue that has access to all the processors, or it can be a series of queues, each of which is unique to a different processor. When the queues are unique to different processors, the request must be assigned to a specific processor prior to execution.

The designation of data to a processor can be done by means of a hashing algorithm or an index (or ostensibly by both means). When a hashing algorithm is used, the data is divided across the different processors in a random manner based on the primary key of the record. When data is assigned to a processor by means of an index, data is usually (although not necessarily) assigned to a processor in groups. Once the data arrives at the processor to which it is assigned, it is placed on disk storage and an index keeps track of its assignment. The data is stored in physical blocks, which hold tables, which are made up of rows (or records), which contain columns (or fields).
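The hashing approach to data placement can be sketched in a few lines: the primary key alone determines which processor owns a record, which spreads records across processors in an effectively random manner. The key values, hash function, and processor count below are illustrative assumptions, not taken from any particular product.

```python
# A minimal sketch of hash-based data placement: the primary key of a
# record is hashed, and the hash picks the owning processor. MD5 is
# used here only because it gives a stable, well-spread value.
import hashlib

NUM_PROCESSORS = 4

def assign_processor(primary_key: str) -> int:
    """Hash the primary key to pick the processor that owns the record."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PROCESSORS

# Spread a handful of hypothetical records across the processors.
partitions = {p: [] for p in range(NUM_PROCESSORS)}
for key in ("cust-0001", "cust-0002", "cust-0003", "cust-0004"):
    partitions[assign_processor(key)].append(key)
```

Because the placement is a pure function of the key, any processor (or router) can later recompute where a record lives without consulting an index.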

As stated, there are many variations of the arrangement of the components of a parallel environment, each with its own strengths and weaknesses.

HOT SPOTS

The appeal of parallel technology is quite strong, given the need to manage large amounts of data and the ability to add processing resources in an incremental, nondisruptive fashion. At first glance it appears that the parallel approach is the answer to the data architect's prayers insofar as managing the volumes of data found in the data warehouse. However, there are occasions where the parallel approach to the management of large amounts of data yields worse results in terms of performance than the traditional single processor approach.

The performance and the efficient utilization of the parallel approach depend on the data being spread evenly across the different processors, so that the corresponding workload is likewise spread evenly over the processors. When there is an even and equitable spread of data and processing, the parallel approach to the management of large amounts of data works quite well. However, if there ever is an imbalance in the spread of data across the parallel processors, and there is a corresponding imbalance in the workload spread across the processors, then what is known as a "hot spot" develops, and the effectiveness of the parallel environment is compromised.

Figure 5 shows a hot spot. The workload in Figure 5 is imbalanced: some processors have no work at all, and one processor has the majority of the work piled on it. In this case there might as well be central management of data. (Indeed, in this case a central approach to the management of data is much more effective and efficient than a parallel approach.)
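A hot spot of the kind Figure 5 depicts can be detected by comparing each processor's share of the workload against a perfectly even spread. The workload figures and the threshold below are hypothetical, chosen only to illustrate the idea:

```python
# A small sketch of hot spot detection: flag any processor carrying far
# more than its fair share of the total workload.

def hot_spots(work_per_processor, threshold=2.0):
    """Return indices of processors carrying more than `threshold`
    times their fair share of the total workload."""
    total = sum(work_per_processor)
    fair_share = total / len(work_per_processor)
    return [i for i, w in enumerate(work_per_processor)
            if w > threshold * fair_share]

# One processor has most of the work; the other three are nearly idle.
skewed = hot_spots([900, 50, 30, 20])   # -> [0]
balanced = hot_spots([250, 250, 250, 250])  # -> []
```

In the skewed case the configuration behaves like a single central processor, which is exactly the text's point.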

One of the problems associated with the data warehouse environment (and the world of DSS in general) is that the patterns of access of data in a warehouse are unpredictable. Both the rate of access and the specific records to be accessed are highly variable. This implies that for DSS processing, hot spots are the norm.

Of course hot spots can be remedied. Remedying a hot spot requires the redistribution of data to other or to more processors. Figure 6 shows the remedying of hot spots. The problem with remedying hot spots in the data warehouse environment is that any remedy depends on a foreknowledge of the usage of data. Just because data has shown a pattern of usage in the past does not mean that it will exhibit the same pattern of usage in the future. Therefore trying to identify and remedy hot spots in the parallel environment is a difficult task.

OPERATIONAL/DSS DIFFERENCES

The parallel environment can be used for the purposes of operational processing or data warehouse (DSS) processing, but not both at the same time. There are several reasons why the two environments do not mix in the parallel environment. Figure 7.1, Figure 7.2, Figure 7.3 and Figure 7.4 illustrate some of those differences.
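The redistribution step in Figure 6 can be sketched as repeatedly moving a partition of data from the busiest processor to the least busy one. The partition names and workload figures are hypothetical, and a real data base machine would also have to move the data itself, which this sketch ignores:

```python
# A sketch of remedying a hot spot by redistributing data: move one
# partition from the busiest processor to the least busy one. Note that
# this rebalances based on PAST workload, which is exactly the
# foreknowledge problem the text describes.

def rebalance(assignments):
    """assignments: {processor: [(partition, workload), ...]}.
    Move one partition from the busiest to the least busy processor."""
    load = {p: sum(w for _, w in parts) for p, parts in assignments.items()}
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    moved = assignments[busiest].pop()  # pick a partition to relocate
    assignments[idlest].append(moved)
    return assignments

# Processor 0 is a hot spot; processors 1 and 2 are idle.
assignments = {0: [("part-a", 900), ("part-b", 50)], 1: [], 2: []}
rebalance(assignments)
```

If tomorrow's queries hit a different processor, the rebalanced spread is no better than the original one, which is why the text calls hot spot remediation a difficult task.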


The first difference between the two environments is in the type of transaction that is being run. The operational environment runs many small transactions, each of which attaches its processing to a single processor. The data warehouse environment, on the other hand, has a very different transaction operating in it: it contains a few very large transactions, which operate on data spread all over the parallel environment.

The second major difference between the two environments lies in the internal structure of the data. The data warehouse environment contains data whose structure is optimized for massive sequential, non-update processing. The operational environment is structured for access of a limited amount of data, where the data can be updated. In addition, the operational environment typically groups together data of different types so that a transaction does not have to look into different locations in order to find the data. Data warehouse data, on the other hand, is stored homogeneously.

Another important difference between the operational environment and the data warehouse environment is in the logging of data. Since operational processing involves the update (or potential update) of data, a certain amount of overhead is required. Logging is one type of overhead that comes with update processing. But data warehouse data does not require a log, because no update is done. Therefore the basic system characteristics of the operating system are quite different. These, then, are the basic reasons why the operational environment and the data warehouse environment do not mix, even in the face of a parallel management of data.

THE LEVELS OF THE WAREHOUSE

The main benefit of the data warehouse residing on a parallel technology is that data can be accessed quickly and randomly. The details of data managed this way are (relatively!) easy to get to. But parallel management of data is expensive. Most organizations try to position the current detail level of the data warehouse in the parallel environment, and let other levels of the data warehouse reside on other technologies. Figure 8 shows this arrangement.

For a variety of reasons, economic and technological, the management of current detailed data by parallel technology and of other data by other technologies is a very good solution.

METADATA IN THE PARALLEL DATA WAREHOUSE ENVIRONMENT

Metadata is one of the most important aspects of the data warehouse environment. The fact that the data warehouse resides on a parallel technology neither diminishes nor enhances the role of metadata. Typically the metadata stored with a data warehouse includes: data content, data structure, the mapping of data from the operational environment, the history of extracts, versioning, and so forth.

PHYSICAL DESIGN IN THE PARALLEL DATA WAREHOUSE ENVIRONMENT

In its early phases, the design of a data warehouse in the parallel environment proceeds exactly the same as the design of a data warehouse in the non-parallel environment. The activities of defining the data model, defining the major subject areas, defining the system of record, and so forth are the same for both environments. The major difference in the design of a data warehouse in the parallel environment comes when the physical design is created.

The spread of the data over the different processors is a major design issue. The first issue is how many processors there will be. The next issue is how the data will be spread over the processors. Some of the relevant factors affecting this decision are: what the pattern of growth of the data will be, how much data there will be initially, what the pattern of access of the data is, and so forth.

Another important design issue is what the primary key of the data ought to be. The primary key affects the physical spread of the data over the different parallel processors, in that the primary key is the discriminator that allows the data to be spread in the first place. A related design issue is the placement of the secondary key of the data, for units of data not directly related to the primary key. Secondary data may be placed randomly over the parallel processors, or it may be forced into the same physical location as the data relating directly to the primary key. The usage of the data dictates which is the better choice.

Partitioning of data is as important in the parallel environment as it is in the centralized data warehouse environment. Partitioning allows you to index data independently, restructure data independently, manage data independently, and so forth. The assignment of data to a parallel processor by means of the definition of keys is a very important design aspect in the data warehouse environment, because the physical placement of data profoundly affects the pattern of access of data, which in turn has a profound effect on the effectiveness of the parallel management of data. Said another way, if the data is not spread properly over the parallel processors, the benefit of parallel processing is lost, and the data may as well be managed by a single, central processor.

A second important physical design aspect is the identification and support of derived (i.e., summary) data in the data warehouse environment.
The storage of summary data makes sense when that data is used often, and/or when an "official" calculation of data needs to be done and there is concern that if the calculation is done more than once it will not be consistent. Under these circumstances, summarization and storage of data in the parallel environment make sense.

An important physical design technique in the parallel data warehouse environment is the prejoining of data when it is known that the data will be joined as a regular matter of course. If it is known that data will be joined, it is much more efficient to join the data at the moment of load than it is to join the data dynamically.

Another physical design technique is to create artifacts of relationships in the data warehouse. Data relationships are important in the data warehouse. However, their implementation is quite different from that found in the operational environment.
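Prejoining at load time can be sketched as follows: instead of joining customer and order rows at query time, the join is materialized once while the warehouse is loaded, and the joined rows are stored as-is. The table and column names are illustrative assumptions, not from the paper.

```python
# A sketch of prejoining at load time: the customer/order join is
# performed once, during the load, rather than dynamically at query
# time. The data shown is hypothetical.

customers = {"C1": {"name": "Acme"}, "C2": {"name": "Globex"}}
orders = [
    {"order_id": "O1", "cust_id": "C1", "amount": 100},
    {"order_id": "O2", "cust_id": "C2", "amount": 250},
]

def prejoin(orders, customers):
    """Materialize the join once; each stored row carries the joined
    customer attributes alongside the order attributes."""
    return [dict(o, customer_name=customers[o["cust_id"]]["name"])
            for o in orders]

prejoined = prejoin(orders, customers)  # loaded into the warehouse as-is
```

The tradeoff is storage for elapsed time: every stored row is wider, but no query ever pays the cost of the join, which matters most when the joined data is spread across many processors.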

Because data is much more quickly accessible in the parallel data warehouse environment, there is a temptation to store as much detailed data as possible, on the theory that you can never tell when you will need a scrap of data. There is, however, a cost of storing data in the warehouse, even in the face of a parallel technology. The following rules of thumb for the management of data hold true:

- if the data will not be used for DSS processing, it has no place in the data warehouse,
- if the data is very old, it should be considered for placement in "deep freeze" bulk storage, and
- if the level of detail is so granular that it is unlikely to be used, the data should be summarized.

SUMMARY

The data warehouse can be managed by a parallel approach to technology. The parallel approach uses multiple processors which operate on data independently and manage data independently. Because of the independent management of data, processors can be added linearly and independently. The question of whether to use a parallel approach or a centralized approach depends on the volume of data to be managed and the access to that data. Even though the parallel approach offers a powerful alternative for the management of data, physical design issues are still very important in the design of the data warehouse.


More information

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May 2014. Copyright 2014 Permabit Technology Corporation

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May 2014. Copyright 2014 Permabit Technology Corporation Top Ten Questions to Ask Your Primary Storage Provider About Their Data Efficiency May 2014 Copyright 2014 Permabit Technology Corporation Introduction The value of data efficiency technologies, namely

More information

Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server

Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server Symantec Backup Exec 10d System Sizing Best Practices For Optimizing Performance of the Continuous Protection Server Table of Contents Table of Contents...2 Executive Summary...3 System Sizing and Performance

More information

Application of Predictive Analytics for Better Alignment of Business and IT

Application of Predictive Analytics for Better Alignment of Business and IT Application of Predictive Analytics for Better Alignment of Business and IT Boris Zibitsker, PhD [email protected] July 25, 2014 Big Data Summit - Riga, Latvia About the Presenter Boris Zibitsker

More information

Evaluator s Guide. McKnight. Consulting Group. McKnight Consulting Group

Evaluator s Guide. McKnight. Consulting Group. McKnight Consulting Group NoSQL Evaluator s Guide McKnight Consulting Group William McKnight is the former IT VP of a Fortune 50 company and the author of Information Management: Strategies for Gaining a Competitive Advantage with

More information

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771 ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced

More information

The Teradata Scalability Story

The Teradata Scalability Story Data Warehousing The Teradata Scalability Story By: Carrie Ballinger, Senior Technical Advisor, Teradata Development Table of Contents Executive Summary 2 Introduction 4 Scalability in the Data Warehouse

More information

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing Chapter 13 Disk Storage, Basic File Structures, and Hashing Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files

More information

Capacity Plan. Template. Version X.x October 11, 2012

Capacity Plan. Template. Version X.x October 11, 2012 Template Version X.x October 11, 2012 This is an integral part of infrastructure and deployment planning. It supports the goal of optimum provisioning of resources and services by aligning them to business

More information

WHAT IS ENTERPRISE OPEN SOURCE?

WHAT IS ENTERPRISE OPEN SOURCE? WHITEPAPER WHAT IS ENTERPRISE OPEN SOURCE? ENSURING YOUR IT INFRASTRUCTURE CAN SUPPPORT YOUR BUSINESS BY DEB WOODS, INGRES CORPORATION TABLE OF CONTENTS: 3 Introduction 4 Developing a Plan 4 High Availability

More information

Database Schema Management

Database Schema Management Whitemarsh Information Systems Corporation 2008 Althea Lane Bowie, Maryland 20716 Tele: 301-249-1142 Email: [email protected] Web: www.wiscorp.com Table of Contents 1. Objective...1 2. Topics Covered...2

More information

V:Drive - Costs and Benefits of an Out-of-Band Storage Virtualization System

V:Drive - Costs and Benefits of an Out-of-Band Storage Virtualization System V:Drive - Costs and Benefits of an Out-of-Band Storage Virtualization System André Brinkmann, Michael Heidebuer, Friedhelm Meyer auf der Heide, Ulrich Rückert, Kay Salzwedel, and Mario Vodisek Paderborn

More information

HP Smart Array Controllers and basic RAID performance factors

HP Smart Array Controllers and basic RAID performance factors Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

More information

Speed and Persistence for Real-Time Transactions

Speed and Persistence for Real-Time Transactions Speed and Persistence for Real-Time Transactions by TimesTen and Solid Data Systems July 2002 Table of Contents Abstract 1 Who Needs Speed and Persistence 2 The Reference Architecture 3 Benchmark Results

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

TPCalc : a throughput calculator for computer architecture studies

TPCalc : a throughput calculator for computer architecture studies TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University [email protected] [email protected]

More information

Oracle Database In-Memory The Next Big Thing

Oracle Database In-Memory The Next Big Thing Oracle Database In-Memory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle do this Oracle Database In-Memory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Module 14: Scalability and High Availability

Module 14: Scalability and High Availability Module 14: Scalability and High Availability Overview Key high availability features available in Oracle and SQL Server Key scalability features available in Oracle and SQL Server High Availability High

More information

The Top 20 VMware Performance Metrics You Should Care About

The Top 20 VMware Performance Metrics You Should Care About The Top 20 VMware Performance Metrics You Should Care About Why you can t ignore them and how they can help you find and avoid problems. WHITEPAPER BY ALEX ROSEMBLAT Table of Contents Introduction... 3

More information

A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems*

A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* Junho Jang, Saeyoung Han, Sungyong Park, and Jihoon Yang Department of Computer Science and Interdisciplinary Program

More information

Nimble Storage Best Practices for Microsoft SQL Server

Nimble Storage Best Practices for Microsoft SQL Server BEST PRACTICES GUIDE: Nimble Storage Best Practices for Microsoft SQL Server Summary Microsoft SQL Server databases provide the data storage back end for mission-critical applications. Therefore, it s

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

In-House vs. Software as as Service (SaaS)

In-House vs. Software as as Service (SaaS) In-House vs. Software as as Service (SaaS) A Lifestyle Cost of Ownership Comparison Ensenta Corporation Copyright 2011 Ensenta Corporation 2 In-House vs. SaaS A common decision facing users of mission-critical

More information

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP

Colgate-Palmolive selects SAP HANA to improve the speed of business analytics with IBM and SAP selects SAP HANA to improve the speed of business analytics with IBM and SAP Founded in 1806, is a global consumer products company which sells nearly $17 billion annually in personal care, home care,

More information

EMC Unified Storage for Microsoft SQL Server 2008

EMC Unified Storage for Microsoft SQL Server 2008 EMC Unified Storage for Microsoft SQL Server 2008 Enabled by EMC CLARiiON and EMC FAST Cache Reference Copyright 2010 EMC Corporation. All rights reserved. Published October, 2010 EMC believes the information

More information

SUN ORACLE EXADATA STORAGE SERVER

SUN ORACLE EXADATA STORAGE SERVER SUN ORACLE EXADATA STORAGE SERVER KEY FEATURES AND BENEFITS FEATURES 12 x 3.5 inch SAS or SATA disks 384 GB of Exadata Smart Flash Cache 2 Intel 2.53 Ghz quad-core processors 24 GB memory Dual InfiniBand

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server White Paper EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

arxiv:1112.0829v1 [math.pr] 5 Dec 2011

arxiv:1112.0829v1 [math.pr] 5 Dec 2011 How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform

SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform INTRODUCTION Grid computing offers optimization of applications that analyze enormous amounts of data as well as load

More information

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Table of Contents Capacity Management Overview.... 3 CapacityIQ Information Collection.... 3 CapacityIQ Performance Metrics.... 4

More information

Enterprise Intelligence - Enabling High Quality in the Data Warehouse/DSS Environment. by Bill Inmon. INTEGRITY IN All Your INformation

Enterprise Intelligence - Enabling High Quality in the Data Warehouse/DSS Environment. by Bill Inmon. INTEGRITY IN All Your INformation INTEGRITY IN All Your INformation R TECHNOLOGY INCORPORATED Enterprise Intelligence - Enabling High Quality in the Data Warehouse/DSS Environment by Bill Inmon WPS.INM.E.399.1.e Introduction In a few short

More information

Moving Virtual Storage to the Cloud

Moving Virtual Storage to the Cloud Moving Virtual Storage to the Cloud White Paper Guidelines for Hosters Who Want to Enhance Their Cloud Offerings with Cloud Storage www.parallels.com Table of Contents Overview... 3 Understanding the Storage

More information

Public Cloud Partition Balancing and the Game Theory

Public Cloud Partition Balancing and the Game Theory Statistics Analysis for Cloud Partitioning using Load Balancing Model in Public Cloud V. DIVYASRI 1, M.THANIGAVEL 2, T. SUJILATHA 3 1, 2 M. Tech (CSE) GKCE, SULLURPETA, INDIA [email protected] [email protected]

More information

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations A Dell Technical White Paper Database Solutions Engineering By Sudhansu Sekhar and Raghunatha

More information

Whitepaper: performance of SqlBulkCopy

Whitepaper: performance of SqlBulkCopy We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

STORAGE CENTER. The Industry s Only SAN with Automated Tiered Storage STORAGE CENTER

STORAGE CENTER. The Industry s Only SAN with Automated Tiered Storage STORAGE CENTER STORAGE CENTER DATASHEET STORAGE CENTER Go Beyond the Boundaries of Traditional Storage Systems Today s storage vendors promise to reduce the amount of time and money companies spend on storage but instead

More information

Windows Server 2008 R2 Hyper-V Live Migration

Windows Server 2008 R2 Hyper-V Live Migration Windows Server 2008 R2 Hyper-V Live Migration Table of Contents Overview of Windows Server 2008 R2 Hyper-V Features... 3 Dynamic VM storage... 3 Enhanced Processor Support... 3 Enhanced Networking Support...

More information

Scaling Microsoft SQL Server

Scaling Microsoft SQL Server Recommendations and Techniques for Scaling Microsoft SQL To support many more users, a database must easily scale out as well as up. This article describes techniques and strategies for scaling out the

More information

The Benefits of POWER7+ and PowerVM over Intel and an x86 Hypervisor

The Benefits of POWER7+ and PowerVM over Intel and an x86 Hypervisor The Benefits of POWER7+ and PowerVM over Intel and an x86 Hypervisor Howard Anglin [email protected] IBM Competitive Project Office May 2013 Abstract...3 Virtualization and Why It Is Important...3 Resiliency

More information

The Curious Case of Database Deduplication. PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle

The Curious Case of Database Deduplication. PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle The Curious Case of Database Deduplication PRESENTATION TITLE GOES HERE Gurmeet Goindi Oracle Agenda Introduction Deduplication Databases and Deduplication All Flash Arrays and Deduplication 2 Quick Show

More information

How to analyse your business sales 80/20 rule

How to analyse your business sales 80/20 rule 10 Minute Guide How to analyse your business sales 80/20 rule Membership Services Moor Hall, Cookham Maidenhead Berkshire, SL6 9QH, UK Telephone: 01628 427500 www.cim.co.uk/marketingresources The Chartered

More information

FAWN - a Fast Array of Wimpy Nodes

FAWN - a Fast Array of Wimpy Nodes University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed

More information

Deploying and Optimizing SQL Server for Virtual Machines

Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Deploying and Optimizing SQL Server for Virtual Machines Much has been written over the years regarding best practices for deploying Microsoft SQL

More information

Azure VM Performance Considerations Running SQL Server

Azure VM Performance Considerations Running SQL Server Azure VM Performance Considerations Running SQL Server Your company logo here Vinod Kumar M @vinodk_sql http://blogs.extremeexperts.com Session Objectives And Takeaways Session Objective(s): Learn the

More information

FAS6200 Cluster Delivers Exceptional Block I/O Performance with Low Latency

FAS6200 Cluster Delivers Exceptional Block I/O Performance with Low Latency FAS6200 Cluster Delivers Exceptional Block I/O Performance with Low Latency Dimitris Krekoukias Systems Engineer NetApp Data ONTAP 8 software operating in Cluster-Mode is the industry's only unified, scale-out

More information

RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1

RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1 RAID HARDWARE On board SATA RAID controller SATA RAID controller card RAID drive caddy (hot swappable) Anne Watson 1 RAID The word redundant means an unnecessary repetition. The word array means a lineup.

More information

W H I T E P A P E R E X E C U T I V E S U M M AR Y S I T U AT I O N O V E R V I E W. Sponsored by: EMC Corporation. Laura DuBois May 2010

W H I T E P A P E R E X E C U T I V E S U M M AR Y S I T U AT I O N O V E R V I E W. Sponsored by: EMC Corporation. Laura DuBois May 2010 W H I T E P A P E R E n a b l i n g S h a r e P o i n t O p e r a t i o n a l E f f i c i e n c y a n d I n f o r m a t i o n G o v e r n a n c e w i t h E M C S o u r c e O n e Sponsored by: EMC Corporation

More information

Intelligent Log Analyzer. André Restivo <[email protected]>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

NoSQL Database Options

NoSQL Database Options NoSQL Database Options Introduction For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is quite commonly used in the industry. I chose Cassandra because it has

More information