CAT: Azure SQL DB Premium Deep Dive and Mythbuster Ewan Fairweather Senior Program Manager Azure Customer Advisory Team Tobias Ternstrom Principal Program Manager Data Platform Group
Cloud & Enterprise Customer Advisory Team (CAT), Europe: Azure Applications, Azure Data, Azure Analytics
- Customer (45%): architecture guidance and technology expertise, i.e. patterns, practices and codification
- Platform / Engineering (45%): provide the end-to-end Azure customer story on how features work in customer project scenarios, based on learnings from the biggest deployments
- Community (10%): accelerate cloud adoption, i.e. white papers, events, frameworks and code
Agenda Persistent data options in Azure Azure SQL DB Premium Deep Dive Sizing and capacity planning Customer experience and learnings Summary
Persistent Data Options in Azure
The Application Journey
Azure Storage Options
- Platform as a Service: Azure SQL Database (managed databases); publish and run; shared environment
- Infrastructure as a Service: SQL Server running in a Windows Azure VM, or any other database you have bits for; full control and insight; more administrative effort
- Azure Storage (Tables, Blobs, Queues): not relational; cheap storage; optimized for density and scale-out
Resources: three different ways to run SQL (high friction/control to low)
- SQL Server on raw iron: dedicated, scale-up, full h/w control, roll-your-own HA/DR/scale
- SQL Server in IaaS: virtualized machine, 100% of the API, roll-your-own HA/DR/scale
- SQL Database (PaaS), including Azure SQL DB Premium: virtualized database, shared, auto HA and fault tolerance, self-provisioning, management at scale
Decision Points

Commonly going to WA Storage (point lookups, minimal relational needs):
- Telemetry logs, append workloads, primarily key-value lookups
- Blobs alongside WA SQL DB (lower costs, reduce DB size under the 150 GB limit)

Commonly going to SQL Server in a VM (lift and shift, DW):
- Applications needing features not currently in SQL DB (example: full-text search)
- Light DW workloads

Commonly going to SQL DB (OLTP):
- Applications that do not want to manage their databases
- Applications that need massive horizontal scale (Internet-facing SaaS ISVs)
- New OLTP applications

Premium DB extends Azure SQL DB's capabilities
Typical Performance Factors

Factor / why it matters:
1. Latency: greater than on-premises; higher variance
2. Establishing connections: the initial login goes to the gateway; connections are unreliable and will fail
3. Multi-tenancy: unpredictable performance; soft throttling; hard throttling; shared log, max transaction size

Writes are the most expensive resource in this system
SQL DB Web/Business Performance Variance
- Web/Business editions provide elastic scale without a performance SLA
- There is some variance in performance due to multi-tenancy; we will reduce the variance further over time
- SQL DB contains logic to move DBs around to balance load across each cluster and maximize average resources
- Databases can get different resources based on others' activity

[Chart: DB resources available over time]
Resource management in Azure SQL DB

SQL Database monitors usage of the shared resources to keep databases within resource limits. When resource usage exceeds limits, SQL DB can manage resource usage at the DB or node level by killing connections or denying requests. Throttling stages: soft (a subset of DBs) and hard (all DBs). The throttling error carries an encoded reason; decode it to find the throttling type and resource.

Resource limits and error codes:
- Database size: 150 GB, or less depending on the database quota (MAXSIZE) (error 40544)
- Transaction duration: state 1: 24 hours; state 2: 20 seconds if a transaction locks a resource required by an underlying system task (error 40549)
- Lock count: 1 million locks per transaction (error 40550)
- TempDB: state 1: 5 GB of tempdb space; state 2: 2 GB per transaction in tempdb; state 3: 20% of total log space in tempdb (error 40551)
- Transaction log space: state 1: 2 GB per transaction; state 2: 20% of total log space (error 40552)
- Memory: a 16 MB memory grant held for more than 20 seconds (error 40553)
- Worker thread governance: every database has a maximum worker thread concurrency limit (errors 10928, 10929)
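The error numbers above can drive client-side handling. A minimal Python sketch of that idea follows; the function name and the classification into "retryable" vs. "fix the workload" are illustrative, not part of any SDK:

```python
# Hypothetical helper that classifies Azure SQL DB error numbers from the
# limits listed above and suggests a coarse client-side action.

# Throttling/governance errors where a retry after backoff can succeed.
RETRYABLE = {
    40501,  # service busy (soft/hard throttling)
    10928,  # worker thread limit reached
    10929,  # worker thread governance
}
# Limit violations where retrying the same statement will fail again.
NEEDS_WORKLOAD_FIX = {
    40544,  # database size quota (MAXSIZE) exceeded
    40549,  # long-running transaction terminated
    40550,  # more than 1 million locks in a transaction
    40551,  # tempdb usage limit
    40552,  # transaction log space limit
    40553,  # excessive memory grant
}

def retry_decision(error_number: int) -> str:
    """Return a coarse action for a SQL DB error number."""
    if error_number in RETRYABLE:
        return "retry-with-backoff"
    if error_number in NEEDS_WORKLOAD_FIX:
        return "fix-workload"
    return "raise"  # not a governance error; surface it to the caller

print(retry_decision(40501))  # retry-with-backoff
print(retry_decision(40552))  # fix-workload
```

The point of the split: throttling errors are transient by design, while limit violations (log space, lock count, tempdb) indicate the workload itself must change.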
Azure SQL DB Premium: How it works
Edition Comparison
- Premium has reserved resources on all 3 nodes
- You can upgrade or downgrade a database
- You should decide sizing based on your resource needs

[Chart: DB resources available over time for P2, P1 and Web/Business]
Premium Edition

Some applications require guaranteed resources; Premium Edition was introduced for customers who need dedicated resources.

Common customer attributes:
- High throughput requirements
- Low latency requirements
- Low performance variance requirements

Premium Edition details:
- Dedicated resources (min = max) to avoid performance variance
- Different sizes (P1-P2) allow adjustment based on resource needs
- Currently in public preview
Premium Edition Reservation Sizes
- Reservations are made separately for each database
- Capacity is limited during public preview; customers can get 1-2 reservations based on availability
- Monthly price is USD $930 for P1 at GA; P2 is 2x. P3 and P4 are available at engineering discretion

Size  CPU Cores  Worker Threads  Active Sessions  Disk IO (IOPS)  Memory (GB)
P1    1          200             2000             150              8
P2    2          400             4000             300              16
Premium Database
Set Premium Service Objective
Checking Status of Azure SQL DB The DB will remain online aside from a few seconds during the final failover
Checking Current SLO
Checking Status of Move Lower- and upper-bound estimates vary between 15 minutes for an empty database and approximately 2 days for a 150 GB database
Premium DB or A Larger VM?

SQL DB Premium (GA monthly cost):
- P1 (M): 1 CPU core, 8 GB RAM, 150 IOPS: $930
- P2 (L): 2 CPU cores, 16 GB RAM, 300 IOPS: $1,860

SQL Server VM (Enterprise Edition, monthly cost):
- S (A1): 1 CPU core, 1.75 GB RAM, 2x500 IOPS: $1,629
- M (A2): 2 CPU cores, 3.5 GB RAM, 4x500 IOPS: $1,696
- L (A3): 4 CPU cores, 7 GB RAM, 8x500 IOPS: $1,830
- A6: 4 CPU cores, 28 GB RAM, 8x500 IOPS: $2,321
- XL (A4): 8 CPU cores, 14 GB RAM, 16x500 IOPS: $3,660
- A7: 8 CPU cores, 56 GB RAM, 16x500 IOPS: $4,642
Sizing and Capacity Planning
Sizing Databases

For a SINGLE database:
- Find the largest resource consumer
- Measure peak load over a time period
- Choose the appropriate reservation size to handle the peak load

Workload type matters:
- Batch processing: aim to achieve average throughput over time (do not size for peak)
- Interactive applications: size for the peak to preserve response times

[Chart: CPUAvgCoresUsedInHr over time]
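The peak-vs-average rule above can be sketched in a few lines of Python. This is an illustrative helper, not an official sizing tool; the P1 ~= 1 core and P2 ~= 2 core thresholds come from the reservation table earlier in this deck:

```python
# Illustrative sizing helper: given hourly avg-CPU-cores-used samples like
# the chart above, pick a Premium reservation size.

def pick_reservation(samples, interactive=True):
    """Interactive workloads size for peak; batch workloads for average."""
    target = max(samples) if interactive else sum(samples) / len(samples)
    if target <= 1.0:
        return "P1"   # ~1 core reserved
    if target <= 2.0:
        return "P2"   # ~2 cores reserved
    return "scale-out"  # beyond a single Premium reservation

hourly_cpu = [0.2, 0.4, 1.1, 0.9, 0.3]  # avg cores used per hour
print(pick_reservation(hourly_cpu, interactive=True))   # P2 (peak is 1.1)
print(pick_reservation(hourly_cpu, interactive=False))  # P1 (average is 0.58)
```

The same samples lead to different answers: the interactive case must cover the 1.1-core peak, while the batch case only needs to sustain the 0.58-core average.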
Peak Load Example

Weekly IO chart of a large customer on WA SQL DB. We actively work on the load each week:
- Query tuning
- Moving maintenance jobs to off-peak hours

We also do aggressive things:
- Split different functions out into different databases
- Rate-meter background jobs so they do not impact core workloads

[Chart: avg hourly physical write IOPS over 1 week, annotated: query tuning to reduce the daily peak; daily maintenance job moved to off-peak hours; weekly maintenance moved to Sunday]
Azure SQL Database DMV Surface Area

Health (master):
- sys.event_log
- sys.bandwidth_usage
- sys.database_connection_stats

Data access & usage:
- sys.dm_db_index_usage_stats
- sys.dm_db_missing_index_details
- sys.dm_db_missing_index_groups
- sys.dm_db_missing_index_group_stats
- sys.dm_exec_sessions

Performance:
- sys.dm_exec_query_stats
- sys.dm_exec_sql_text
- sys.dm_exec_query_plan
- sys.dm_exec_requests
- sys.dm_db_wait_stats

Resource usage (master):
- sys.resource_usage*
- sys.resource_stats*

Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted: http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
Capacity planning

Use sys.resource_stats (in preview) in the master db to determine your application's resource needs:

SELECT *
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())
Investigating resource usage

Avg and max resource usage:

SELECT
    avg(avg_cpu_cores_used) AS 'Average CPU Cores Used',
    max(avg_cpu_cores_used) AS 'Maximum CPU Cores Used',
    avg(avg_physical_read_iops + avg_physical_write_iops) AS 'Average Physical IOPS',
    max(avg_physical_read_iops + avg_physical_write_iops) AS 'Maximum Physical IOPS',
    avg(active_memory_used_kb / (1024.0 * 1024.0)) AS 'Average Memory Used in GB',
    max(active_memory_used_kb / (1024.0 * 1024.0)) AS 'Maximum Memory Used in GB',
    avg(active_session_count) AS 'Average # of Sessions',
    max(active_session_count) AS 'Maximum # of Sessions',
    avg(active_worker_count) AS 'Average # of Workers',
    max(active_worker_count) AS 'Maximum # of Workers'
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())

Percentage of time using more than 1 core:

SELECT
    (SELECT SUM(DATEDIFF(minute, start_time, end_time))
     FROM sys.resource_stats
     WHERE database_name = 'MyTestDB'
       AND start_time > DATEADD(day, -7, GETDATE())
       AND avg_cpu_cores_used > 1.0) * 1.0
    / SUM(DATEDIFF(minute, start_time, end_time)) AS percentage_more_than_1_core
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())
Managing DB Resource Growth

Assuming your application's resource needs grow over time, you need a plan to deal with that growth; in the on-premises ("box") world we are always sizing for a future peak. The cloud offers two elastic architectural approaches:
- Scale-up (limited): Web/Business -> P1 -> P2
- Scale-out: use more databases

Partitioning data by function or by tenant lets you adjust to growth in resource usage at the database level. Plan on actively monitoring and alerting on telemetry about resource use so you can adjust to growth before something breaks.
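Partitioning by tenant usually needs a routing layer that maps each tenant to its database. A minimal sketch, assuming hypothetical database names (real designs often use a lookup table instead of a pure hash so that tenants can be rebalanced later):

```python
# Tenant-based scale-out routing: a stable hash maps each tenant to one of
# several databases. The SHARDS names are illustrative placeholders.
import hashlib

SHARDS = ["tenantdb-0", "tenantdb-1", "tenantdb-2"]

def shard_for(tenant_id: str) -> str:
    """Stable routing: the same tenant always lands on the same database."""
    h = int(hashlib.sha256(tenant_id.encode("utf-8")).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Every call for the same tenant returns the same shard.
assert shard_for("contoso") == shard_for("contoso")
print(shard_for("contoso") in SHARDS)  # True
```

A hash scheme is the simplest starting point; a directory/lookup map trades a little complexity for the ability to move a hot tenant to its own database without rehashing everyone.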
Cost Optimization

Two paths to improve your cloud service:
- Spend more money (purchase more capacity)
- Optimize/tune (more operations in the capacity you have)

The cloud model lets you choose: if you have development resources available, you might choose to tune; if you are on a deadline, you might just scale up instead. This model also works well for seasonal demand changes. Example: add capacity before the holiday sales season, remove it after (~$32 per day for a P1).
Customer Experience and Learnings
What's different with data access in the cloud? Two key areas of attention:
- Connection management issues: less reliable connection state due to multiple layers and network hops; retry logic is mandatory to implement reliable communication between the application and the database server
- Higher latency between the app tier and the database tier compared to an on-premises deployment (firewalls, load balancers, gateways); this amplifies the impact of chatty application behaviors

We will talk more about this in our 11:45 session
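The mandatory retry logic mentioned above boils down to "retry transient failures with increasing delays". A minimal sketch, assuming your data access layer raises an exception for transient connection failures (the exception type and delays here are illustrative):

```python
# Retry-with-exponential-backoff sketch for unreliable cloud connections.
import time

def with_retries(operation, retries=4, base_delay=0.1,
                 transient=(ConnectionError,)):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return operation()
        except transient:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Demo: an operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

Production code should retry only errors known to be transient (see the throttling error codes earlier in this deck) and cap the total retry time.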
Batching inserts

App tier:
1. Application logic performs asynchronous inserts
2. The data access layer buffers and groups items
3. A batch/bulk insert is sent to Azure SQL Database

A time (t) or size (n) window approach can result in the loss of up to t seconds or n rows of data
Takeaways
1. Reliability: plain ADO.NET single inserts with full retry logic
2. Density: asynchronous and buffered approach
Workload tuning options: how can I improve density?
- Introduce batching: reduce application round-trips, improve insert performance
- Leverage an asynchronous approach: buffer across time and number of insertions
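The buffered-batch pattern from the previous slides can be sketched as follows. This is an illustrative sketch: `sink` stands in for the actual bulk-insert call, and the flush thresholds are the t/n window from the batching slide, so at most t seconds or n rows of data are at risk:

```python
# Time-or-size buffered batching: group items and flush either when n rows
# have accumulated or t seconds have passed, to cut round-trips.
import time

class BatchBuffer:
    def __init__(self, sink, max_rows=100, max_seconds=1.0):
        self.sink, self.max_rows, self.max_seconds = sink, max_rows, max_seconds
        self.rows, self.first_at = [], None

    def add(self, row):
        if not self.rows:
            self.first_at = time.monotonic()  # start of the time window
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows or
                time.monotonic() - self.first_at >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(self.rows)  # one bulk insert instead of len(rows) calls
            self.rows = []

batches = []
buf = BatchBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add(i)
buf.flush()     # drain the remainder on shutdown
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A real implementation would also run a background timer so a partially filled buffer flushes even when no new rows arrive, and would wrap the sink call in the retry logic discussed earlier.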
Scale-Up vs. Scale-Out
- P1-P2 supported during the public preview period; additional sizes may be introduced by GA
- With a scale-up approach you may lose some flexibility, e.g. it requires planning for the worst case / peaks
- Premium lets you scale up/down between P1 and P2 at most once a day
- Scale-up may not fit all cost/business models: unpredictable workloads, multiple-database deployments
Easyjet Seat Selection System
- 70/30 read/write, very efficient workload (<200 ms max execution time)
- The majority of queries benefited from switching
- Reduced and more stable response times for both reads and writes after the switch
Customer experience: Easyjet
- Reduced impact of 40501, 10928 and 10929 errors
- Remaining exceptions have mostly been due to application issues: a broken build, a major ticket sale
Another customer experience
- Availability has greatly improved after the switch (less than 2 minutes of downtime per month)
- Growing trend in CPU usage: around 2 cores on average, with spikes up to 5
- No major errors related to resource issues; sporadic throttling for high log IO waits
Application-Tier Caching

App-tier caching is a very effective way to reduce data-tier load, and Azure has several caching solutions available to you. For load spikes, this can often significantly reduce peak load.

Example: Azure SQL DB was used in the last US presidential election. Few writes, massive reads all at once; app-tier caching was used to remove reads from the database. CPU graph for the core reporting DB:
- First 10 seconds: 44K page views/second (est. ~450K DB calls/sec)
- Next 20 seconds: 10K page views/sec (est. ~100K DB calls/sec; DB calls mostly removed due to caching)
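The read-heavy pattern above maps naturally onto a small time-to-live cache in the app tier. A minimal sketch, where the `load` callback stands in for the actual SQL query (names are illustrative):

```python
# Minimal app-tier TTL cache: repeated reads within the TTL are served from
# memory, so only the first read per key hits the database.
import time

class TTLCache:
    def __init__(self, ttl_seconds=5.0):
        self.ttl, self.store = ttl_seconds, {}

    def get(self, key, load):
        hit = self.store.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]          # served from cache, no DB call
        value = load(key)          # cache miss: one DB call
        self.store[key] = (value, now)
        return value

db_calls = {"n": 0}
def query_db(key):
    db_calls["n"] += 1
    return f"row-for-{key}"

cache = TTLCache(ttl_seconds=60)
for _ in range(1000):
    cache.get("results", query_db)
print(db_calls["n"])  # 1: a thousand reads cost one database call
```

The TTL bounds staleness: with few writes and massive reads, even a short TTL removes almost all read load from the database, which is the effect described in the election example.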
Summary
Summary
- Premium DB provides predictable performance and elasticity
- We offer you a mixture of scale-up and scale-out approaches
- The elastic nature of these options allows you to deal with peaks differently than on-premises
Resources
- Premium Preview for SQL Database Guidance: http://msdn.microsoft.com/en-us/library/jj853352.aspx
- Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted: http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
Resources
- Cloud Service Fundamentals in Windows Azure wiki: http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-service-fundamentals.aspx
- Best practices on: scale-out architecture, design for operations, telemetry solution, reliable architecture
THANK YOU! For attending this session and PASS SQLRally Nordic 2013, Stockholm