Scaling Sitecore for Load

introduction

In 2012, SolutionSet rebuilt the California Lottery website from the ground up and learned a great deal about building a Sitecore infrastructure to withstand targeted bursts of traffic. As one of the largest state-run lotteries in the world, with heavy spikes in traffic driven by large jackpots, the California Lottery needed a platform that would remain steady and stable through the ebbs and flows. Before implementing Sitecore, the California Lottery website would go down every time user traffic became too heavy. Their traffic profile was somewhat atypical: while their baseline traffic level was reasonably high, jackpot nights presented an exponentially higher level of traffic. We needed to build a platform that could handle this multifaceted scalability challenge: the site not only had to scale for high-traffic days, but also for the precise high-traffic times and high-traffic pages. In this white paper we share the factors key to successfully scaling with Sitecore and the lessons we learned along the way, in the hope that this will serve as a framework for thinking about scaling a website for bursts in traffic.

May 2013
Robert Balmaseda, Eric Bradford, Alex Kaplinsky, Peter Montgomery
1. comprehensive caching is vital

Sitecore provides a very sophisticated and flexible caching infrastructure with four primary layers of caching: Prefetch Caching, which pulls specified segments of raw data from the database; the Data Cache, which converts that data to a structured representation in memory when required; the Item Cache, which converts those structures to Sitecore items in memory; and Output Caching, which caches the rendered output from the renderings or sub-layouts that reference those items. You can specify maximum sizes for each of these layers, but we focused our cache tuning efforts on the first and last layers.

When we started going through the exercise of modifying the California Lottery's cache settings in web.config, we quickly realized that the total size of the Lottery's Sitecore content was at a level that let us prefetch the entire content tree without taxing the memory on our origin Content Delivery servers. So we set our maximum cache sizes and load factor at levels that let us load all site content into memory when the site first spun up (a sketch of these settings follows Figure 1). This means that regardless of the load placed on the servers, the database activity on our Sitecore web databases is limited to system operations, such as checking the event queue.

Figure 1: Sitecore's Caching Layers. A REQUEST is satisfied from the highest layer possible, falling through to the DATABASE only when no cache can serve it:

Output Cache: Contains rendered markup as invoked under various conditions, which are defined in the caching configuration for each rendering. For a given request, if an entry is found for a rendering meeting that request's conditions (i.e., language, querystring, etc.), the rendering is served from the Output Cache. Otherwise, the rendering engine looks to the Item Cache. Effective use of output caching can greatly reduce CPU utilization.

Item Cache: Contains a single version of an item in a single language, stored as an instance of the Sitecore.Data.Items.Item class. If a cache entry exists for a request for an item, either by the rendering engine or directly from code, the cached item will be returned. Otherwise, the framework will attempt to create a corresponding cache entry derived from the Data Cache.

Data Cache: Contains an XML representation of one version of an item. A cache entry will be provided in response to a request from the Item Cache if possible; otherwise the framework will attempt to create a corresponding entry derived from the Prefetch Cache.

Prefetch Cache: Contains data underlying items in the content hierarchy. By default, this cache is populated from the database on application startup, as specified in a dedicated configuration file. If the Prefetch Cache cannot provide an entry in response to a request from the Data Cache, it will create one after retrieving the appropriate data from the database. The Prefetch Cache can significantly alleviate database usage when configured correctly.
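The cache ceilings live in Sitecore's web.config. The fragment below is a minimal sketch of the kind of settings involved; the element names follow the standard Sitecore web.config schema of that era, but the values are illustrative, not the Lottery's actual configuration:

    <!-- Illustrative only: raise the cache ceilings high enough that
         the entire web-database content tree fits in memory. -->
    <cacheSizes hint="setting">
      <databases>
        <web>
          <data>300MB</data>              <!-- Data Cache -->
          <items>300MB</items>            <!-- Item Cache -->
          <paths>5MB</paths>
          <standardValues>5MB</standardValues>
        </web>
      </databases>
      <sites>
        <website>
          <html>100MB</html>              <!-- Output (HTML) Cache -->
        </website>
      </sites>
    </cacheSizes>

The Prefetch Cache is sized and populated separately, in the dedicated configuration file discussed below.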
This is a beautiful thing because it eliminates the database as a potential bottleneck, regardless of traffic patterns. If the volume of content on your site doesn't lend itself to being fully preloaded, the prefetch configuration file lets you specify that the data underlying the most heavily visited pages should be prefetched; when you dig into it, you might find that you can simply exclude old press releases, for example, and preload everything else (a sketch of such a prefetch file appears at the end of this section). If you can take the database out of the equation and set aside memory consumption considerations around session usage (which has well-documented solutions that are not specific to Sitecore), the scalability challenge is essentially reduced to pushing down CPU utilization, most of which in our case was driven by output rendering. This is where the rubber met the road for us, because there were sections of our pages whose freshness was independent of Sitecore publishes. This meant that we couldn't take advantage of Sitecore's output cache for a few small, but important, page sections on our key pages.

We made heavy use of a Content Delivery Network (CDN), to the point where only the base pages were being served from the origin servers, and everything else (images, CSS, JavaScript, everything) came off of the CDN. First, this gave us better predictability: if a user uploads and links to an enormous .pdf file from the winning numbers page an hour before a big jackpot, it's not going to bog down the site; it's just going to make the CDN bill skyrocket. It also allowed us to use single-threaded virtual users (VUs) in load testing: because a real user's browser dedicates only one thread to the origin server, a single-threaded VU load test let us model real-world usage. This made identifying issues through load testing much less complex, and if you're using a third-party service, it's also a lot less expensive.

We then had a situation where we served only markup from our origin servers, and each server served its content from memory without a secondary dependency on a database. This meant that we were able to scale linearly: if we knew that a single server, in this case a VM, could spit out key pages at 250 pages per second (pps), and we had a sense of what a particularly heavy night would bring in terms of traffic, say 1,500 pps, then we knew that we could spin up three additional VMs on top of the three we typically had running and handle the traffic for that night.
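As mentioned above, prefetching is driven by dedicated configuration files (App_Config/Prefetch/*.config in Sitecore installations of this era). The fragment below is a hedged sketch of what selective preloading can look like; the entries and zeroed-out IDs are placeholders for illustration, not the Lottery's actual configuration:

    <configuration>
      <!-- Total size of the prefetch cache for this database. -->
      <cacheSize>300MB</cacheSize>
      <!-- Preload an item and its children at startup
           (placeholder ID; a real file references actual items). -->
      <children desc="site root">{00000000-0000-0000-0000-000000000000}</children>
      <!-- Preload every item based on a heavily used template, e.g.
           the winning-numbers page template (placeholder ID). -->
      <template desc="winning numbers">{00000000-0000-0000-0000-000000000000}</template>
    </configuration>

With entries scoped this way, archive content such as old press releases is simply left out of the file and is loaded from the database only if someone actually requests it.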
2. know your bottlenecks: profile, profile, profile

profile against the code

We used BrowserMob (now called Neustar) for load testing. We maintained a staging instance of the site that was identical in every way to production; we did this for a variety of reasons, but one of the more important ones was that it allowed us to get reliable load test results at any time without bogging down the main site. BrowserMob offered a simple interface for firing off load tests and the ability both to view high-level results of the tests and to extract very detailed information using a SQL-ish query language. Knowing full well that we weren't going to be able to infer every trick in the book from the documentation, we worked with Sitecore on scalability tuning. It can be tempting to make assumptions and profile pages based on a hunch, but using empirical data is critical. We presented Sitecore with our load test results at that point, and together we went through the process of profiling the site using Red Gate's ANTS performance profiler and determining where our bottlenecks were.

Figure 2: Segment of ANTS Analysis Tree
Figure 2 shows a section of the ANTS analysis tree that gave us the processing time consumed by every call made during the process of rendering a page. These trees clearly showed where the current bottlenecks were for a given page, so you could optimize those, re-run the test, and focus on the next bottleneck. During this process we found a few items that were low-hanging fruit, such as disabling certain counters, optimizing inefficient content queries, and output caching page segments that had been overlooked. We were able to meet our site-wide targets after a couple of days of this, but we wanted to squeeze as much performance as we could out of the three key pages: the home, MEGA Millions, and winning numbers pages.

As mentioned earlier, we had certain sub-layouts that contained a mixture of Sitecore content and non-Sitecore data, such as winning numbers. This data needed to be displayed within seconds of its arrival for obvious reasons, so we could not rely on any of Sitecore's cache clearing mechanisms for sub-layouts that contained this type of data. Once we started drilling into the remaining performance bottlenecks on these key pages, all of which happened to contain a lot of this non-Sitecore data, we realized that we needed to cache individual controls, such as text controls. Sitecore's text controls appeared to support caching, in that they would accept a Cacheable attribute without complaint, but in reality they could not be output cached out of the box. This was also true for link controls and a few others. So we sub-classed the text control, overriding GetCacheKey to return a string identifying a unique instance of the control (a sketch of this technique follows below). This change allowed us to output cache every bit of markup that originated from Sitecore. This may seem like a really nitpicky level of optimization, but given the number of these mixed-content sub-layouts we have on the site, it made a big difference for us. At the end of a few days of work, we were successfully serving our key pages at 800-900 pps on our go-live hardware infrastructure without noticeable performance degradation.
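The white paper does not include the code itself, so the following is a minimal sketch of the technique under stated assumptions: we subclass Sitecore's Text field control and return a non-empty key from GetCacheKey so the rendering engine will store and reuse the control's output. The class name, namespace, and key format here are purely illustrative, not the Lottery's production code.

    using Sitecore.Data.Items;
    using Sitecore.Web.UI.WebControls;

    namespace Lottery.Web.Controls // hypothetical namespace
    {
        // Sitecore's Text control accepts Cacheable="true" but, out of
        // the box, returns no cache key, so nothing is ever stored in
        // the output (HTML) cache. Returning a non-empty key that
        // uniquely identifies this instance makes the control cacheable.
        public class CacheableText : Text
        {
            protected override string GetCacheKey()
            {
                if (!Cacheable)
                {
                    return string.Empty; // empty key = not output cached
                }

                Item item = GetItem();
                string itemKey = (item != null) ? item.ID.ToString() : "no-item";

                // Field name + item ID + language uniquely identify the
                // markup this instance produces; add further variations
                // (site, device, etc.) if your renderings need them.
                return "CacheableText_" + Field + "_" + itemKey + "_"
                       + Sitecore.Context.Language.Name;
            }
        }
    }

With a class like this registered under web.config's controlSources section, a sub-layout can declare something like <lottery:CacheableText Field="Headline" runat="server" Cacheable="true" /> (the lottery: tag prefix is likewise hypothetical) and have its markup participate in output caching; the same pattern applies to the link control and the others mentioned above.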
profile against user behavior

What we learned most from this detailed profiling experience is that it's always valuable to take a step back, dig into logs, and even talk to users to get a sense of whether assumptions about site usage hold true under typical and atypical usage patterns. In our case, there was a large MEGA Millions jackpot on March 27, just prior to our highest-traffic night, which allowed us to challenge some of our assumptions and adjust accordingly. The CA Lottery's MEGA Millions page appears in the middle of the first page of results if you Google "mega millions", so it was no surprise that this page saw the majority of the traffic on these high-traffic nights, and as such it invited renewed scrutiny. On this page there was a side module that asked, "Are your favorite numbers lucky?" It let you type numbers in and see how often they had won in the past; at least, that was the intent. When we looked at the logs after the first of our heavy-traffic nights, we noticed that this functionality, which by necessity executes a real-time database query, was getting hammered at exactly the times when people would be looking for that night's winning numbers. We realized that visitors were using this functionality to look for that night's winning numbers (hitting it once for every number they had played), and they were doing it within the window of heaviest traffic rather than checking once and then waiting for a few minutes. As a result, we removed the check-your-numbers box and tweaked our messaging, and our page-views-to-visits ratio dropped way off, as did, I'm sure, the frustration level of site users. There were also some non-essential areas on the page that couldn't be output cached; luckily, we had built a mechanism that allowed those modules to be removed temporarily and replaced after the party was over (a sketch of one way to build such a switch follows), so two of the three key pages were 100% output cached by the time things got rough.
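The paper does not describe how that removal mechanism was built, so the following is only a plausible sketch: a sub-layout base class that hides itself while a content-editable checkbox is set, letting editors pull non-cacheable modules from the page before a heavy night and restore them afterwards. The item path and field name are hypothetical.

    using System;
    using Sitecore.Data.Items;

    namespace Lottery.Web.Controls // hypothetical namespace
    {
        // Hypothetical "remove under load" switch: sub-layouts that
        // cannot be output cached inherit from this class instead of
        // UserControl directly.
        public class PeakAwareSublayout : System.Web.UI.UserControl
        {
            protected override void OnPreRender(EventArgs e)
            {
                // Hypothetical settings item holding the switch.
                Item settings = Sitecore.Context.Database.GetItem(
                    "/sitecore/content/Settings/Peak Traffic");

                // Sitecore checkbox fields store "1" when checked;
                // hide the module while the switch is on.
                this.Visible = settings == null
                    || settings["Hide Non-Essential Modules"] != "1";

                base.OnPreRender(e);
            }
        }
    }

One nice property of a switch like this is that the settings item itself is served from the item cache, so checking it adds no database traffic during the very peaks it is meant to protect.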
summary

In summary, we learned that having static pages and elements ready, and continuing to parse logs and talk to real users to find bottlenecks, are critical actions to take to ensure a website can scale for heavy bursts of traffic. In doing this, we enabled the California Lottery to handle a world-record $640 million MEGA Millions jackpot and the traffic that came with it (more than 3,000 requests per second at the peak) just one month after launch. To this day, we continue to fine-tune our caching, page structure, and hardware infrastructure as the site evolves to deliver up-to-the-second information to lottery players. To learn more about SolutionSet and how we can partner with you to build scalable platforms that serve your business needs, please contact robert.balmaseda@solutionset.com.

about solutionset

SolutionSet is an award-winning digital consultancy and a Sitecore Certified Solution Partner in CEP, CRM, and E-Commerce. We have 17+ Sitecore Certified Developers on staff who have completed 14 Sitecore implementations to date. We earned two North American Sitecore Site of the Year awards in 2012 for the California Lottery and American Express Global Corporate Payments website implementations. SolutionSet was built from the ground up to combine the thinking, creativity, and passion of a digital agency with the strategy, process, and engineering depth of a technology consultancy. We design and develop web, mobile, social, and digital marketing solutions that help leading companies better engage and serve customers. SolutionSet clients include American Express, California Lottery, Cisco, Cord Blood Registry, Dell, Duke University, and TXU Energy. Visit us at: solutionset.com