Application Scalability in Proactive Performance & Capacity Management
Bernhard Brinkmoeller, SAP AGS IT Planning
Work in progress
What is Scalability?
How would you define scalability? In the context of PPCM, is scalability a characteristic of the load or the hardware? How would you define scalable load? How would you define scalable hardware?
2013 SAP AG. All rights reserved.
What is Scalability? Definition from Wikipedia
In electronics (including hardware, communication and software), scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth. Scalability, as a property of systems, is generally difficult to define, and in any particular case it is necessary to define the specific requirements for scalability on those dimensions that are deemed important. It is a highly significant issue in electronics systems, databases, routers, and networking. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system. An algorithm, design, networking protocol, program, or other system is said to scale if it is suitably efficient and practical when applied to large situations.
The general definition of scalability depends strongly on the context in which it is used. Even in a given context it is questionable whether it is precise enough to form the basis for defining concrete work packages for a scalability analysis. It is very important that we reach a common understanding of what we want to achieve in Proactive Performance & Capacity Management before we start.
Content
- Definition of Scalability
- Proactive Performance & Capacity Management according to ITIL
- Scalability of load with
  - the amount of business data processed in a step
  - the number of parallel processes
  - the size of the DB
- Scalability of service time with
  - the number of CPUs available
  - the capacity of the I/O subsystem
  - database locks
- Non-scalability introduced by application server buffering
- Consequences for risk assessment and quality control
- Consequences for monitoring
Capacity Management Service According to ITIL
(ITIL Service Delivery v2.1, published for OGC by TSO)
According to the Information Technology Infrastructure Library (ITIL), a capacity management service consists of three sub-processes. The output with the highest value is obtained when the results of the sub-processes are brought together. While the entry points of the sub-processes are different, all of them aim at establishing a connection from the business requirements via the services (reports and transactions) to the resource consumption (CPU, memory, disk).
What is Scalability? Definition of Scalability in PPCM 1/3
Business Capacity Management: business volume for process X
Load is scalable when the resource consumption of the services necessary to run the business process depends linearly on the (business) volume and there are no unexpected load drivers.
Service Capacity Management: resource consumption for service Y
Hardware is scalable when it is capable of providing the necessary resources for the required number of services in a given time interval without a degradation of service times.
Resource Capacity Management: service time for resource Z
What is Scalability? Definition of Scalability in PPCM 2/3
Load is scalable when the consumption of expected resources depends linearly on the (business) volume and there are no unexpected load drivers.
Hardware is scalable when it is capable of providing the necessary resources for the required number of services in a given time interval without a degradation of service times.
Examples:
The response time of processing an order should depend linearly on the number of items in the order. Signs of non-scalability are:
- Quadratic dependence on the number of line items in the order: use sorted tables and READ ... BINARY SEARCH in ABAP.
- Dependence on the network latency between the front end and the server: the number of communication steps has to be so small that the network latency can be neglected.
- Dependence on the amount of data stored in the DB: read only new data from chronologically sorted indices.
The throughput of order processing should depend linearly on the CPU capacity provided by the infrastructure. Signs of non-scalability are:
- Dependence on the length of the critical path of DB locks: avoid long critical paths for updates with a large likelihood of lock collisions.
- I/O bottlenecks caused by high redo volume: avoid unproductive database changes (e.g. by using SET UPDATE TASK LOCAL).
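The quadratic line-item symptom above can be made concrete with a small counting model. This is an illustrative Python sketch, not SAP code: it counts comparisons for a per-item linear scan of an internal table versus a per-item binary search, the pattern that ABAP's READ TABLE ... BINARY SEARCH on a sorted table corresponds to.

```python
import math

def cost_linear_scan(n_items):
    """Comparisons when each line item triggers a linear scan of an
    internal table holding n_items entries: O(n^2) overall."""
    return sum(i + 1 for i in range(n_items))   # n*(n+1)/2

def cost_binary_search(n_items):
    """Comparisons when each line item is looked up via binary search
    in a sorted table: O(n log n) overall."""
    return n_items * max(1, math.ceil(math.log2(n_items)))

# 10x more line items: the scan gets ~100x as expensive,
# the binary search only ~14x, so the order processing stays near-linear.
ratio_scan = cost_linear_scan(1000) / cost_linear_scan(100)
ratio_bsearch = cost_binary_search(1000) / cost_binary_search(100)
```

A tenfold order size thus separates the two patterns by almost an order of magnitude in cost growth, which is exactly the signal to look for in single user measurements.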
What is Scalability? Definition of Scalability in PPCM 3/3
On a detailed level, scalability describes the relationships between:
- the volumes of all business processes supported by a system,
- the consumption of the various resources provided by the system,
- the service request times the system is capable of providing.
A system is scalable up to the required limit when, even under high load, the contribution of non-scalable load to the overall resource consumption and service times remains below an acceptable limit, and the hardware can provide the required resources at peak time without unacceptable degradation of service request times.
For very large systems the acceptable contribution of non-scalable load to the resource consumption is typically set at about 20%. For smaller systems it is much higher, as it is cheaper to provide more hardware.
A system is scalable when the load constituting about 80% of the resource consumption is proven to be scalable and the hardware can provide the required resources at peak time with degradations of service request times of less than 20%. (Limits are debatable.)
A scalability analysis is always restricted to the (expected) top load contributors.
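The 20% rule of thumb above can be written down as a simple check. A minimal sketch with invented contributor data; the threshold parameter mirrors the debatable limit from the text.

```python
def system_is_scalable(load_shares, scalable_flags, threshold=0.20):
    """True when the non-scalable share of the total resource
    consumption stays below the threshold (about 20% for very large
    systems; a higher value is acceptable for small ones).
    load_shares: resource consumption per top load contributor.
    scalable_flags: whether each contributor was proven scalable."""
    total = sum(load_shares)
    non_scalable = sum(share for share, ok
                       in zip(load_shares, scalable_flags) if not ok)
    return non_scalable / total < threshold

# Four top load contributors; only the smallest one is non-scalable.
ok = system_is_scalable([50, 30, 15, 5], [True, True, True, False])
# One large non-scalable contributor pushes the system over the limit.
bad = system_is_scalable([50, 30, 20], [True, False, True])
```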
The Size of the DB: Principal Scalability Patterns
[Chart: scaling behaviour with DB size; factor of load increase vs. DB table growth for three patterns: independent, buffer hit ratio depends on table size, amount of data read depends on table size]
The following scaling behavior of the load with the DB size can be observed:
1. Constant resource consumption, independent of table size: all fully indexed accesses that have a high likelihood of touching only data blocks in the buffer are independent of the table size (the small dependency on the depth of the B-tree of the index can be neglected).
2. Decrease of the buffer hit ratio with table size: in case the chance that a data block is found in the buffer decreases with the index or table size, a weak linear dependency of the resource consumption on the table size can be observed.
3. Directly proportional to the table size: in case the number of data blocks that need to be read increases with the table size, a strong linear dependency of the resource consumption can be observed.
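The three patterns can be sketched as a toy cost model. The constants (a buffer that exactly covers the table at growth 1, a buffer miss costing three times a buffer hit) are assumptions chosen for illustration, not measured values.

```python
def relative_load(pattern, growth):
    """Relative resource consumption of one access when the table has
    grown by `growth`x, for the three scaling patterns above."""
    if pattern == "independent":
        return 1.0                           # fully buffered, indexed access
    if pattern == "buffer_hit":
        hit_ratio = min(1.0, 1.0 / growth)   # fixed buffer, growing table
        miss_cost = 3.0                      # assumed: a miss costs 3x a hit
        return hit_ratio + (1.0 - hit_ratio) * miss_cost
    if pattern == "data_read":
        return float(growth)                 # blocks read grow with the table
    raise ValueError(pattern)

# At 5x table growth: 1.0 (independent) < 2.6 (buffer hit) < 5.0 (data read)
```

The weak dependency of pattern 2 saturates at the assumed miss cost once the buffer becomes negligible, while pattern 3 keeps growing with the table.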
The Size of the DB: Amount of Data Read Depends on Table Size 1/3
In the cursor cache, statements that create a load proportional to the size of the DB can be identified by a large (and growing) number of Bgets/row or Rproc/exec.
In case Rproc/exec is large, the most common technical issue is a SELECT ... FOR ALL ENTRIES with an empty selection table. This always needs to be checked. In case this is not the cause, it has to be checked how the processes can be changed to reduce the number of records read.
In case Bgets/row is large, the index layout has to be checked. In case the access is to a single table, correct indexing will always allow reducing the Bgets/row to < 6.
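A screening of cursor cache figures along these lines can be sketched as follows. The statement data and the Rproc/exec threshold are invented for illustration; the Bgets/row limit of 6 is the indexing rule of thumb quoted above.

```python
def screen_cursor_cache(statements, max_bgets_per_row=6.0,
                        max_rproc_per_exec=1000.0):
    """Flag statements whose cost likely grows with the DB size.
    statements: name -> (buffer_gets, rows_processed, executions)."""
    findings = {}
    for name, (bgets, rows, execs) in statements.items():
        issues = []
        if rows and bgets / rows > max_bgets_per_row:
            issues.append("high Bgets/row: check index layout")
        if execs and rows / execs > max_rproc_per_exec:
            issues.append("high Rproc/exec: check FOR ALL ENTRIES / "
                          "number of records read")
        if issues:
            findings[name] = issues
    return findings

sample = {
    "SELECT_A": (100_000, 100, 10),           # 1000 Bgets/row: bad indexing
    "SELECT_B": (500, 100, 100),              # 5 Bgets/row, 1 Rproc/exec: fine
    "SELECT_C": (1_000_000, 5_000_000, 100),  # 50000 Rproc/exec: too many rows
}
flagged = screen_cursor_cache(sample)
```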
The Size of the DB: Amount of Data Read Depends on Table Size 2/3
In case the large number of buffer gets is seen for a join with distributed selectivity, it is not always possible to improve the situation with technical means. The most prominent and frequently seen example of such a join is the selection of material movements, either in the standard or, as seen here, in customer coding. The main issue here is the distributed selectivity: the date on MKPF and all other fields on MSEG. In this special case the only stable solution is described in SAP Note 1598760 (FAQ: MSEG extension and MB51/MB5B redesign).
The changes necessary to avoid such non-scalability are very complex, as it is not only necessary to change coding but also the table layout. In many cases it is therefore not possible to implement a solution, and most customers refuse to implement the changes. In that case, knowledge of the non-scalability can be used to estimate the largest allowed residence time for archiving to stay within acceptable performance limits.
The Size of the DB: Amount of Data Read Depends on Table Size 3/3
A nice example of non-scalable load with an ever increasing amount of data read can be found in customer systems with long running delivery contracts, typical for the automotive industry.
Such a statement is the select from EKBE, the second most time consuming in the snapshot of the cursor cache, with already more than a billion recorded disk reads. It is a select with specified EBELN and POSNR, so it looks quite harmless. Rproc/exec is not that high, as it is diluted by many accesses caused by simple purchase orders, but the huge number of disk reads triggered is suspicious. EKBE is the order history containing all deliveries made on behalf of a contract. Using JIT, this might be one delivery every 3 minutes for more than a year for each position of a contract. As more and more old data is touched, this drives the I/O load for this access dramatically.
A solution for this special issue is the use of transaction ME87, which needs to be run regularly to summarize the order history (see SAP Note 417933 for details).
The example shows once again that it is more important to understand the business processes associated with the top resource consuming statements in order to find relevant performance improvements.
Example VAPMA-VBUK: Non-Scalable Runtime Increases with the Number of Orders in the DB
The database uses index VAPMA~Z01 to access the data; this way, each time all entries belonging to one plant will be read. The runtime of this statement will increase with the number of orders in the system. It is necessary to change the access so that the number of open orders determines the runtime. This is most securely done by selecting from VBUK first, or by introducing Oracle hints to use index VBUK~Z02.
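The effect of the two access paths can be modeled in a few lines of Python. The order records are invented; the functions only count the rows an index range scan would touch under each driving condition, with names loosely mirroring the SAP tables.

```python
def rows_touched_via_plant(orders, plant):
    """Access path via VAPMA~Z01: all orders of the plant are read,
    so the cost grows with the number of orders in the system."""
    return sum(1 for o in orders if o["plant"] == plant)

def rows_touched_via_open_status(orders, plant):
    """Access path driven by the open-order status (VBUK first): only
    the open orders are read, independent of how much history the DB holds."""
    return sum(1 for o in orders
               if o["status"] == "open" and o["plant"] == plant)

# 1000 orders for one plant, of which only 10 are still open
orders = [{"plant": "P1", "status": "open" if i < 10 else "closed"}
          for i in range(1000)]
```

As the closed-order history keeps growing, the first path degrades continuously while the second stays bounded by the open-order count.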
The Size of the DB: Buffer Hit Ratio Depends on Table Size 1/2
Less obvious than the dependencies discussed before are cases with fully indexed access that reads only necessary data. However, among the most expensive statements in the cursor cache there are statements, executed in huge but justified numbers, which have a rather bad ratio of disk reads to buffer gets compared to the overall buffer quality. Very often this is caused by access via a non-chronologically sorted index.
The theory behind this is elaborated in more detail in:
Data Archiving Improves Performance - Myth or Reality?
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d0b0de48-0701-0010-fcb1-fb99d43920e3?quicklink=index&overridelayout=true&5003637390373
or, in more detail, in Performance Aspects of Data Archiving
https://websmp108.sap-ag.de/~sapidb/011000358700005070382005e/da_and_performance_11_en.pdf
The main principle is easy to understand: if the data accessed is randomly distributed over the full width of an index or even a DB table, the buffer hit ratio will depend heavily on the ratio of index/table size vs. buffer size. Assuming fixed buffer sizes and growing index/table sizes, the buffer hit ratio will go down. This is not the case when the access is concentrated on a small part of the index/table that does not grow with the DB size.
The Size of the DB: Buffer Hit Ratio Depends on Table Size 2/2
Data is touched by the DB when it resides in the same data block as data that is needed to fulfill a request. Therefore the DB load can only be scalable when old data that is not needed any more does not reside in data blocks that also contain new data needed to fulfill a request.
Compare the insertion points of new data in a chronologically sorted index with those in an index that is not chronologically sorted. In a chronologically sorted index the amount of data touched for all business transactions remains constant and is independent of the number of entries in the table. If the index is not chronologically sorted, this is not the case: the number of data blocks that are touched increases, as the fraction of new data per index block gets smaller and smaller, until the growth of old data is stopped, for instance by data archiving.
Classifying the indices used for the access to data with respect to chronology allows a very good estimate of the scalability of an application, even from single user measurements.
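This principle can be reproduced with a small LRU buffer simulation. The block counts are invented; the point is only the contrast between inserts concentrated at the end of a chronologically sorted index and entry points scattered over the whole index range.

```python
import random
from collections import OrderedDict

def hit_ratio(block_accesses, buffer_blocks):
    """Fraction of block accesses served from an LRU buffer that
    holds at most buffer_blocks blocks."""
    lru, hits = OrderedDict(), 0
    for block in block_accesses:
        if block in lru:
            hits += 1
            lru.move_to_end(block)
        else:
            lru[block] = True
            if len(lru) > buffer_blocks:
                lru.popitem(last=False)   # evict the least recently used block
    return hits / len(block_accesses)

random.seed(0)
n_blocks = 10_000     # the index has grown far beyond the 100-block buffer
# Chronologically sorted index: new entries always land in the last block.
chronological = [n_blocks - 1] * 5_000
# Non-chronological index: entry points scattered over the whole range.
scattered = [random.randrange(n_blocks) for _ in range(5_000)]
```

With a fixed buffer, the chronological access pattern stays near a 100% hit ratio regardless of table growth, while the scattered pattern degrades towards the buffer-to-index size ratio.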
The Size of the DB: Tools to Check
The most important tools to check this are the SQL trace in single user measurements and the cursor cache after go-live. There may be several reasons for this kind of non-scalability. In the cursor cache, statements need to be checked for a large number of Bgets/exec, a large number of Rproc/exec, and even for a worse than average buffer hit ratio. In a single user trace it is necessary to check indexing and table design, with special attention given to the explicit or implicit time constraints in the WHERE clause of each statement and how they are handled in the index. Especially the different buffer quality for old and new data is neglected in many table and index designs and very often makes the decisive difference between scalable and non-scalable load.
Praxis Check: Indices of DFKKOP
Table DFKKOP (items in contract account document) is typically the largest and most important table of FI-CA, with billions of entries in customer systems. In the standard, the primary key dfkkop~0 (MANDT, OPBEL, OPUPW, OPUPK, OPUPZ) and six secondary indices dfkkop~1 to dfkkop~6 are defined for this table. The fields used in these indices are:
- OPBEL: number of contract accounts receivable & payable document
- OPUPW: repetition item in contract account document
- OPUPK: item number in contract account document
- OPUPZ: subitem for a partial clearing in document
- AUGST: clearing status
- GPART: business partner
- BUKRS: company code
- XMANL: exclude item from dunning run
- AUGBL: clearing document or printed document
- ABWBL: number of the substitute FI-CA document
- WHGRP: repetition group
- VKONT: contract account number
- VTREF: reference specifications from contract
- AUGDT: clearing date
- ABWKT: alternative contract account for collective bills
None of the indices is explicitly chronologically sorted. Specifying a time as the last field of an index (AUGDT in indices ~5 and ~6) only creates a chronological order for entries with equal VTREF and BUKRS, which does not prevent the mixture of new and old data in one block. All of the indices are implicitly chronologically sorted, by the use of either a document number (ascending with time) or the clearing status (open = new; closed = old).
Note: the clearing status was explicitly chosen as the second field of all indices that do not contain a document number, to achieve a separation between old and new data and enhance the scalability of access to new data with status open.
Note also: there is no chronological order among the closed records for indices ~1, ~4, ~5 and ~6. Any access to the closed records via one of the indices ~1, ~4, ~5 or ~6 creates a non-scalable load.
Praxis Check: Access to DFKKOP
Insert of new records into DFKKOP: when new records are inserted into DFKKOP, the status is open. Insertion points are concentrated locally for new items.
Access to open items: the use of all indices guarantees local access to new items only.
Clearing run: the change of the clearing status distributes the entry points equally over the complete range of indices ~1, ~4, ~5 and ~6, forcing access to the complete range of these indices.
Open item list for settlement day: to determine recently closed items it is necessary to access all of index ~1 and all of the table.
Expected Performance Impact of a HANA Migration
Insert of new records into DFKKOP: any insert is just done into the L1-delta. While the merge will be very resource intensive, the insert itself should be fast.
Access to open items: being column based, HANA has a disadvantage in principle here, which will result in higher access times.
Clearing run: the update of the records again is just an insert into the L1-delta.
Open item list for settlement day: while this touches only recent data (as long as the report is executed a short time after the settlement day), the amount of data that needs to be read is large enough that this disadvantage is offset by the efficient access to the data in the column store.
Example: DB Cursor Cache Analysis
[Chart: resource consumption of the top 20 SQL statements, as ratio to the top contributor in %, shown for duration, disk reads, buffer reads and rows read]
Importance of the Top 20 Resource Consumers
[Chart: relative cost of the top resource consuming statements; buffer gets normalized to the top statement for the top 1-20 statements sorted by buffer gets, shown for SYS1 (time 1), SYS1 (time 2), Z2L SYS2 Jun 2012 (time 1) and Z2L SYS2 Dec 2012 (time 2)]
Efficient optimization concentrates on the largest resource consumers. The longer and the more extensively this approach is followed, the smaller the relative importance of the top n resource consumers becomes compared to the rest. The effect of each optimization becomes smaller and less significant for overall sizing. SYS2 has reached a state where this approach does not show any significant potential for improvement any more.
This can be seen very clearly using the example of the shared cursor cache analysis: shown above are the top 20 statements with respect to the number of buffer gets from SYS2 at time 1 and time 2, together with an example of another customer system before and after optimization. The slope of the curves for SYS2 is so small that the top 20 are virtually meaningless for the overall resource consumption.
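The concentration argument can be quantified directly from the buffer-get statistics. The two statement profiles below are invented, standing in for a fresh system and a long-optimized one like SYS2.

```python
def top_n_share(buffer_gets, n=20):
    """Share of the total buffer gets caused by the top-n statements.
    A small share means statement-level tuning has little leverage
    on the overall resource consumption."""
    ranked = sorted(buffer_gets, reverse=True)
    return sum(ranked[:n]) / sum(ranked)

# Fresh system: a few statements dominate, so tuning the top 20 pays off.
fresh = [1000, 800, 600] + [10] * 500
# Long-optimized system: a flat profile, the top 20 are almost meaningless.
optimized = [12, 11, 11] + [10] * 500
```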