Performance Optimization for an Operational Risk Management Application on the Azure Platform
Ashutosh Sabde, TCS
www.cmgindia.org
Contents
- Introduction
- Functional Requirements
- Non-Functional Requirements
- Business Architecture
- Solution Architecture
- Technology Stack
- Performance Measurement / Tools Used / Findings
- Performance Tuning (analysis / recommendations): View Layer, Controller Layer, Model Layer, Database Layer, Server Configurations, Cloud Configurations
- Result
Introduction
- This paper describes the performance tuning of a Java-based operational risk management (ORM) application.
- The following perspectives were considered: business and technology architecture, environmental setup, database setup, workflow, content management, and reports and analytics.
- The methodology is applicable to all cloud-based web applications.
Functional Requirements
- Streamline and automate control assessment programs to reduce operational risk and loss.
- Migrate from paper-based manual processes to electronic, automated processes.
- Help top-level management perform better risk and control assessment by providing:
  - a single repository of enterprise-wide risks and controls
  - an adequate mechanism for documentation of evidence
  - traceability and reusability of controls tested
  - assurance of timely assessment of controls
- Generate various analytic reports.
Non-Functional Requirements
The product should cater to non-functional requirements in the following categories:
- User Experience
- Performance
- Security
- Infrastructure
- Open source software license compliance
Business Architecture - Module & Features
Business Architecture - Workflow
Solution Architecture
Technology Stack

Software | Old Version | New Version
Web Server | Apache Httpd Server 2.0.64 | Apache Httpd Server 2.0.64
App Server | Apache Tomcat 6.0 | JBoss EAP 6.2
DBMS | Postgres 9.13 | MySQL 5.5.32
DAO | Hibernate 3.2 | Hibernate 4.2.6 / JPA 2.0
Servlet Controller | Spring MVC 3.0.3 | Spring MVC 3.2.6
Security | Custom built | Spring Security 3.2.0
JVM | JRE 6.0 | JRE 1.7
JavaScript API | | jQuery v1.10.2
GUI | Custom CSS | Bootstrap 3.0.2
Reporting Engine | Jasper Reports 2.0.5 | Jasper Reports 4.7.0 / D3 Charts 3.3.5
Content Management | Custom built | Apache JackRabbit 2.4.2
Web Services | - | RESTful
Tools Used - SLAs

Tools Used for Performance Measurement
1. Code scan for VAPT using HP Fortify (Static Application Security Testing and Dynamic Application Security Testing)
2. Code scan for VAPT using CORE IMPACT PROFESSIONAL for the OWASP (Open Web Application Security Project) top ten categories
3. Code scan for open source software license compliance using Protex from Black Duck Software
4. Code profiling using JProfiler
5. Performance testing using JMeter
6. GUI performance testing using Google PageSpeed

Agreed Service Level Agreements
Item | Value
Total number of users of each type (whether using the system or not), i.e. the expected population | 1000
Users concurrently performing a transaction within 1 minute or 1 hour | 50 / hr
Number of pages in the application | 60
Average response time | 5 sec
Tools Settings

JMeter settings
Item (unit) | Value
Max user load (users) | 100
Ramp-up per user (sec) | 2
Thread delay (sec) | 2
Test duration (HH:MM:SS) | 01:00:02

JBoss EAP 6.2 settings before the performance improvement initiative:
JAVA_OPTS="-Xms1303m -Xmx1303m -XX:MaxPermSize=256m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:NewRatio=3 -Djava.net.preferIPv4Stack=true -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:Morse2_gc.log -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"
Findings before Performance Improvement
(Charts: Average Response Time for Critical Requests; 90th Percentile Response Time for Critical Requests - Top 10)
Findings before Performance Improvement
(Charts: High Response Time Resources; Top Bandwidth-Consuming Resources)
Analysis & Recommendations

Analysis
1. 80% of the performance issues are caused by front-end components, in particular jQuery DataTables.
2. The problem is also attributable to JavaScript (.js), cascading style sheet (.css), and image (.png, .jpg, .gif) files that are not compressed.
3. The Hibernate components need further analysis and fine tuning.

Recommendations
The performance improvement activity should be carried out in multiple phases across the following application layers:
1. View Layer (jQuery, CSS, AJAX)
2. Controller Layer (Spring MVC, Security, Services)
3. Model Layer (DAO, Hibernate, JPA)
4. Database Layer
5. Web and Application Server Configurations
6. Cloud (MS Azure) Layer
End Result
(Charts: Average Response Time for Critical Requests; 90th Percentile Response Time for Critical Requests)
Additional Slides on Recommendation Details
View Layer - Recommendations

1. (Phase I) When rendered, '.js', '.css', and image ('.png', '.jpg', '.gif') files should be minified (compressed) using industry-standard compressors (e.g. http://developer.yahoo.com/yui/compressor/ or http://dean.edwards.name/packer/ ).
2. (Phase I) Reduce the number of '.js' files by consolidating them; many browsers do not download '.js' files concurrently but queue them.
3. (Phase I) Ensure that JavaScript functions called from a '.jsp' page are defined in a '.js' file that is actually included in that page.
4. (Phase II) Pagination was not properly implemented: on every page the entire data set was pulled from the database.
5. (Phase II) Data was processed on the client side rather than the server side for jQuery DataTables - write server-side code (.java) rather than client-side scripting (.js), as shown in the sketch after this list.
6. (Phase II) Wherever client-side data processing is unavoidable, use deferred rendering in jQuery DataTables.
7. (Phase II) Use OSIV (Open Session In View) in the web application, since it loads data only when / if it is needed.
8. (Phase II) Use the Internet Explorer Developer Tools - pressing F12 in IE opens the developer tools, which we used to monitor JavaScript objects, cookies, etc. and tune the application. Similarly, in Google Chrome use PageSpeed to evaluate performance.
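A minimal Spring MVC sketch of recommendations 4 and 5 (server-side processing for jQuery DataTables): the browser requests only one page of rows, identified by the standard DataTables parameters (draw, start, length). The RiskService and Risk types, URL, and parameter handling are illustrative assumptions, not the product's actual controller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class RiskTableController {

    @Autowired
    private RiskService riskService; // hypothetical service backed by a paginated DAO query

    @RequestMapping(value = "/risks/page", method = RequestMethod.GET)
    @ResponseBody
    public Map<String, Object> riskPage(
            @RequestParam("draw") int draw,       // DataTables request counter, echoed back
            @RequestParam("start") int start,     // offset of the first row to return
            @RequestParam("length") int length) { // page size selected by the user

        // Only 'length' rows are loaded and serialized, not the entire result set.
        List<Risk> rows = riskService.findPage(start, length);
        long total = riskService.countAll();

        Map<String, Object> response = new HashMap<String, Object>();
        response.put("draw", draw);
        response.put("recordsTotal", total);
        response.put("recordsFiltered", total);
        response.put("data", rows);
        return response;
    }
}
```

On the client side, the DataTable would be configured with serverSide: true and pointed at this URL, so each page change issues one small request instead of shipping the full table to the browser.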
Controller Layer - Recommendations (Spring MVC, Security, Services)

1. (Phase I) Always use a finally clause in each method to clean up resources (see the sketch after this list).
2. (Phase I) Design transaction usage correctly.
3. (Phase I) Put business logic in the right place.
4. (Phase I) Avoid common errors that can result in memory leaks.
5. (Phase I) Avoid creating objects or performing operations that may not be used.
6. (Phase I) Replace Hashtable and Vector with HashMap, ArrayList, or LinkedList where possible.
7. (Phase I) Reuse objects instead of creating new ones where possible.
8. (Phase I) Use StringBuffer / StringBuilder instead of String concatenation.
9. (Phase I) Release JDBC ResultSet, Statement, and Connection objects back to the pool.
10. (Phase II) Release failures usually occur in error conditions; use a finally block to make sure these objects are released appropriately.
11. (Phase II) Release instance or resource objects that are stored in static tables.
12. (Phase II) Never rely on the garbage collector to manage any resource other than memory; discard objects that are no longer in use immediately.
13. (Phase II) Reduce the use of static variables.
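A minimal sketch combining recommendations 1, 8, 9 and 10: JDBC resources are borrowed from a pooled DataSource, released in a finally block even when an exception occurs, and a StringBuilder is used instead of repeated String concatenation. The table and column names are assumptions for illustration, not the application's real schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ControlNameDao {

    private final DataSource dataSource; // pooled DataSource supplied by the container

    public ControlNameDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Returns a comma-separated list of control names for one assessment (hypothetical query). */
    public String controlNamesFor(long assessmentId) throws SQLException {
        Connection con = null;
        PreparedStatement ps = null;
        ResultSet rs = null;
        StringBuilder names = new StringBuilder(); // avoids repeated String concatenation
        try {
            con = dataSource.getConnection();      // borrowed from the pool, not newly created
            ps = con.prepareStatement("SELECT name FROM control WHERE assessment_id = ?");
            ps.setLong(1, assessmentId);
            rs = ps.executeQuery();
            while (rs.next()) {
                if (names.length() > 0) {
                    names.append(", ");
                }
                names.append(rs.getString("name"));
            }
            return names.toString();
        } finally {
            // Release JDBC resources even on error paths, so connections always return to the pool.
            if (rs != null) try { rs.close(); } catch (SQLException ignored) { }
            if (ps != null) try { ps.close(); } catch (SQLException ignored) { }
            if (con != null) try { con.close(); } catch (SQLException ignored) { }
        }
    }
}
```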
Database Layer - Recommendations

1. (Phase I) Identify the columns that are searched most often and index them.
2. (Phase I) Use staging tables: for some of the complex reports, instead of performing complex logic on the fly, we designed batch jobs that run at a pre-defined frequency and populate a staging table; the online report then picks up data from the staging table and displays it. This introduces some lag between real-time data and report data, but that was acceptable to the business team.
3. (Phase I) Use a connection pool / data source for database connections.
4. (Phase I) Use batch statements instead of individual statement execution wherever possible (see the sketch after this list).
5. (Phase I) Use PreparedStatements in place of Statements for repeated reads.
6. (Phase I) Avoid resource leaks by:
   - closing all database connections after use;
   - cleaning up objects once finished with them, especially when an object with a long life cycle refers to a number of objects with short life cycles (a potential memory leak);
   - avoiding poor exception handling where connections do not get closed and clean-up code never gets called - put clean-up code in a finally {} block;
   - handling and propagating exceptions correctly, deciding between checked and unchecked (i.e. runtime) exceptions.
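The staging-table batch job (recommendation 2) together with batch statements and a pooled data source (recommendations 3 and 4) could look roughly like the sketch below. The report_staging table and the ReportRow type are hypothetical placeholders, not the product's actual schema or job.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class ReportStagingJob {

    private final DataSource dataSource; // pooled connections, not DriverManager

    public ReportStagingJob(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Populates the staging table in one batched round trip, run at a pre-defined frequency. */
    public void refresh(List<ReportRow> rows) throws SQLException {
        Connection con = null;
        PreparedStatement ps = null;
        try {
            con = dataSource.getConnection();
            ps = con.prepareStatement(
                    "INSERT INTO report_staging (risk_id, score) VALUES (?, ?)");
            for (ReportRow row : rows) {          // ReportRow is a hypothetical value object
                ps.setLong(1, row.getRiskId());
                ps.setInt(2, row.getScore());
                ps.addBatch();                    // queue the statement instead of executing it
            }
            ps.executeBatch();                    // one round trip for the whole batch
        } finally {
            if (ps != null) try { ps.close(); } catch (SQLException ignored) { }
            if (con != null) try { con.close(); } catch (SQLException ignored) { }
        }
    }
}
```

The online report then reads from report_staging instead of recomputing the complex logic per request, trading a little data freshness for a large reduction in query time.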
Server Configuration Recommendations (JBoss: JAVA_OPTS, etc.)

1. (Phase III) Set the web container threads used to process incoming HTTP requests. The minimum size should be tuned to handle the average load of the container and the maximum to handle the peak load. The maximum size should be less than or equal to the number of threads in the web server.
2. (Phase III) Application servers maintain a pool of JDBC resources so that a new connection does not need to be created for each transaction; they can also cache prepared statements to improve performance. Tune the minimum and maximum sizes of these pools.
3. (Phase III) Tune the initial heap size of the JVM so that the garbage collector runs at a suitable interval and does not cause unnecessary overhead. Adjust the value as required to improve performance.
4. (Phase III) Set the session manager settings appropriately, based on the following guidelines (see the sketch below):
   - set an appropriate value for the in-memory session count;
   - reduce the session size;
   - do not enable session persistence unless the application requires it;
   - invalidate sessions when finished with them and set an appropriate session timeout.
5. (Phase III) If a servlet or JSP file is called frequently with identical URL parameters, it can be dynamically cached to improve performance.
6. (Phase III) Turn application server tracing off unless it is required for debugging.

(Phase III) Recommended JAVA_OPTS:
JAVA_OPTS="-Xms2048m -Xmx2048m -XX:MaxPermSize=384m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:NewRatio=3 -Djava.net.preferIPv4Stack=true -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:TRACS_gc.log -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000"
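Recommendation 4 can be sketched at the application level using the standard Servlet API, as below: keep sessions small, give them an explicit timeout, and invalidate them when the user is done. The 15-minute timeout and the attribute stored are illustrative assumptions, not the product's configured values.

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

public class SessionHousekeeping {

    /** Keep sessions small and short-lived so the in-memory session count stays low. */
    public void prepareSession(HttpServletRequest request) {
        HttpSession session = request.getSession(true);
        session.setMaxInactiveInterval(15 * 60); // 15-minute timeout; value is an assumption
        // Store only small identifiers in the session, never full result sets.
        session.setAttribute("userId", request.getRemoteUser());
    }

    /** Invalidate the session explicitly when the user is finished (e.g. on logout). */
    public void finish(HttpServletRequest request) {
        HttpSession session = request.getSession(false);
        if (session != null) {
            session.invalidate(); // frees server memory immediately instead of waiting for the timeout
        }
    }
}
```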
MS Azure Layer - Recommendations (Cloud Platform)

1. (Phase I) We ensured that the cloud servers are not physically too far away from the client machines, since excessive distance increases latency.
2. (Phase I) We created an application server cluster and enabled auto-scaling in Azure; this ensures the same performance during peak and slack times. During peak hours, when user volumes are high, new servers are added automatically, and they are withdrawn during slack periods. This keeps software licensing cost to the bare minimum.
3. (Phase I) We had two options when selecting the cloud setup: a VM installed on a base OS (hosted) or a VM installed on bare metal (native). As the performance of the second option is better than the first on the same hardware, we chose the second option.
4. (Phase I) All cloud vendors, including Azure, have a hard upper limit for scaling up; one Azure instance had an upper limit of 8 CPU cores and 14 GB RAM as of May 2012. We therefore designed the web application so that, if the number of users increases, we can scale out rather than scale up, i.e. add more VM instances rather than increasing the configuration of a single VM. For the database we used partitioning to support scale-out (see the sketch after this list).
5. (Phase I) As the cloud supports multi-tenancy, ensure that other applications running on the same cloud are not heavyweight; although applications are virtually separated onto different servers, they share the same hardware, so there is a possibility that one application may consume resources intended for another application while that application is idle.
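One possible shape of the database scale-out mentioned in recommendation 4 is simple partition (shard) routing in the data access layer: each partition gets its own pooled DataSource and a router picks the partition for a given key. The modulo-on-tenant-id scheme below is purely an assumption for illustration, not the partitioning scheme actually used by the product.

```java
import java.util.List;
import javax.sql.DataSource;

/**
 * Minimal sketch of routing database work to one of several partitions so the
 * data tier can scale out alongside the web tier.
 */
public class ShardRouter {

    private final List<DataSource> shards; // one pooled DataSource per database partition

    public ShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    /** Picks the partition that owns this tenant; adding partitions increases total capacity. */
    public DataSource shardFor(long tenantId) {
        int index = (int) (tenantId % shards.size());
        return shards.get(index);
    }
}
```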
MS Azure Layer - Recommendations (Cloud Platform), Contd.

6. (Phase III) Avoid multiple AJAX calls on a single page. In a cloud environment latency is somewhat higher than with physical servers, so multiple AJAX calls to the server can bring application performance down drastically. For example, suppose a page contains a 50 x 10 table and a JavaScript call is made to the server for each cell. The locally hosted page would take 500 ms (page execution) + 25 ms (latency) + 500 (cells) x [5 ms (lookup execution) + 25 ms (latency)] = 525 + 500 x 30 = 15,525 ms, roughly 15 seconds. The Azure-hosted version would take 500 ms (page execution) + 100 ms (latency) + 500 (cells) x [5 ms (lookup execution) + 100 ms (latency)] = 600 + 500 x 105 = 53,100 ms, about 53 seconds - nearly a minute. We used Azure diagnostics to measure server performance and optimize it (see the sketch after this slide for batching such lookups).
7. (Phase III) In the beginning we started with a 2-core, 4 GB RAM VM instance, but performance was poor even with very few users. We anticipated that scaling out would not suffice, so we increased the VM size to 4 cores and 12 GB RAM, which improved performance by 400%.
8. (Phase III) We used the Internet Explorer Developer Tools - pressing F12 in IE opens the developer tools, which we used to monitor the network and detect many Azure performance issues (especially latency).
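To avoid the per-cell AJAX calls described in recommendation 6, the 500 lookups can be collapsed into a single batched request, so the roughly 100 ms Azure round trip is paid once instead of once per cell. The endpoint, LookupService, and key format below are hypothetical; this is a sketch of the idea, not the application's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class LookupController {

    @Autowired
    private LookupService lookupService; // hypothetical service wrapping the per-cell lookup

    /**
     * One POST carrying all cell keys replaces hundreds of individual AJAX calls:
     * the network latency is incurred once and the lookups are resolved server-side.
     */
    @RequestMapping(value = "/lookups/batch", method = RequestMethod.POST)
    @ResponseBody
    public Map<String, String> lookupBatch(@RequestBody List<String> cellKeys) {
        Map<String, String> values = new HashMap<String, String>();
        for (String key : cellKeys) {
            values.put(key, lookupService.valueFor(key)); // resolved without extra round trips
        }
        return values;
    }
}
```

The page script then makes a single call with all keys and fills the table from the returned map, which keeps the Azure-hosted cost close to the single-round-trip case in the calculation above.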