SSIS - enterprise ready ETL
By: Oz Levi, BI Solution Architect, Matrix BI

Agenda
- SSIS Best Practices
- What's New in SSIS 2012?
- High Data Quality Using SQL Server 2012 Data Quality Services
- SSIS advanced topics
- Q&A
SSIS Best Practices
- Performance
- Logging
- Scalability
- Design
Will be addressed later today.

Performance: Tips & Tricks
- Use NOLOCK to remove locking overhead. It improves the speed of large table scans (use it only where dirty reads are acceptable).
- SELECT only the columns that you need.
- Use shared lookups and the data cache whenever possible. In 2012 you can share them between the packages in your project!
Performance: Consider your network environment
- Try to minimize network overhead.
- When working in distributed environments, change the network packet size on the connection manager to 32K (32767). Higher values generally produce faster throughput!

Q: What does the MaxConcurrentExecutables property do?
Performance: Be precise!
- Choose the data types for your columns with care.
- Avoid performing excessive casting. It increases memory usage and degrades package performance.
- Make your data types as small as you can. Your package will take less RAM and therefore run faster.
- When using numeric data types such as decimal, watch for precision issues.

Performance: Plan from the bottom up
- Avoid traditional UPSERT.
- If a large portion of the table has changed, just reload it.
- If UPSERT is required, use the SQL MERGE statement.
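Sketch for the UPSERT slide: a small Python helper that picks between a full reload and a MERGE, and builds the T-SQL MERGE text. The table names, column names and the 50% reload threshold are hypothetical; the slide only says "a large portion", so tune the threshold to your own system.

```python
def choose_load_strategy(changed_rows, total_rows, reload_threshold=0.5):
    """Return 'reload' when a large share of the table changed, else 'merge'.

    reload_threshold is an assumed value, not one given in the deck.
    """
    if total_rows == 0:
        return "reload"
    return "reload" if changed_rows / total_rows >= reload_threshold else "merge"


def build_merge_sql(target, source, key, columns):
    """Build a T-SQL MERGE statement that updates matches and inserts new rows."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join([key] + list(columns))
    insert_vals = ", ".join(f"s.{c}" for c in [key] + list(columns))
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED BY TARGET THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql("dbo.DimCustomer", "stg.Customer", "CustomerId", ["Name", "City"])
```

The generated statement would typically run from an Execute SQL task after the data flow has loaded the staging table.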
Performance: Plan from the bottom up
- Use minimally logged operations.
- TRUNCATE, don't DELETE.
- SWITCH partitions or implement sliding-window mechanisms.
- Work in BULK mode in your data flow.
- Consider using trace flag 610, but be careful.

Q: What is a HEAP table?
Performance: SQL target
- Minimize index usage.
- Inserting into a HEAP* is typically faster than inserting into a table with a clustered index.
- If indexes are needed, drop and re-create them when 30% or more of the table is changed.
- SWITCH partitions! It is considerably faster.
- When inserting into a HEAP, set the commit size to 0. It is fastest because everything is committed in a single transaction (avoid this for transactions of more than 500MB).
- If 0 is not available, set the commit size to the highest possible value.
* A heap table, by definition, is a table that doesn't have a clustered index.

Performance: SSIS is strong, but SQL Server is stronger!
Let the RDBMS do what it is good at:
- Try to JOIN directly at the source (if it is relational).
- Use the GROUP BY clause. Aggregate functions and window functions are a better alternative to aggregating in the data flow.
- Use the ORDER BY clause directly on your source component.
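Sketch for the "SQL target" rules of thumb: two small Python helpers that encode the 30% index rule and the 500MB commit-size rule from the slide. The function names are made up for illustration, and returning None for large loads is an assumption; the deck does not say which batch size to use in that case.

```python
def should_rebuild_indexes(changed_rows, total_rows, threshold=0.30):
    """True when 30% or more of the table changes, so drop and re-create indexes."""
    if total_rows <= 0:
        return False
    return changed_rows / total_rows >= threshold


def heap_commit_size(estimated_load_mb, single_commit_limit_mb=500):
    """Return 0 (one single commit) for loads up to the limit.

    Returns None for larger loads, meaning: choose an explicit batch size yourself.
    """
    return 0 if estimated_load_mb <= single_commit_limit_mb else None
```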
Performance: Plan from the bottom up
Let the RDBMS do what it is good at (cont.):
- When updating (or upserting), use a set-based update and not a row-by-row OLE DB operation.
- If possible, join and filter at the data source level.
- Aggregations: GROUP BY, SUM, etc.
- Use MERGE, SWITCH and DROP when handling partitions.
- Sort only if you must.

Performance: Parallelism
- Create parallelism wherever possible, but plan to avoid bottlenecks.
- Know your system's I/O limitations.
- To avoid SQL Server waits (CXPACKET, IO_COMPLETION, etc.), design your DWH for parallel writing (multiple storage arrays, filegroups).
- Partition your tables.
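Sketch for "set-based update, not row by row": the example below uses Python's built-in sqlite3 module as a stand-in for SQL Server, so that it runs anywhere. The tables and values are invented. The point is the shape of the statement: one UPDATE driven by the staging table, instead of one command per row as an OLE DB Command transformation would issue.

```python
import sqlite3


def set_based_update():
    """Apply all staged changes to the target table with a single UPDATE."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE target  (id INTEGER PRIMARY KEY, amount INTEGER);
        CREATE TABLE staging (id INTEGER PRIMARY KEY, amount INTEGER);
        INSERT INTO target  VALUES (1, 10), (2, 20), (3, 30);
        INSERT INTO staging VALUES (2, 200), (3, 300);
        """
    )
    # One statement updates every matching row; no per-row round trips.
    conn.execute(
        """
        UPDATE target
        SET amount = (SELECT s.amount FROM staging AS s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
        """
    )
    rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
    conn.close()
    return rows


updated_rows = set_based_update()
```

On SQL Server the same idea is usually written as UPDATE ... FROM with a JOIN, or as a MERGE.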
What's New in SSIS 2012? New Look & Feel
New features and improvements to the UI:
- Getting started window.
- New package visualization.
- Zoom slider (as opposed to Ctrl + mouse scroll).
- Undo.
- The SSIS toolbox has new features.
- Data flow source/destination wizard.
- Sort packages by name.
- Grouping inside the data flow.
What's New in SSIS 2012? New Look & Feel

Q: What is backpressure in SSIS?
What's New in SSIS 2012? Data flow changes
Inside the flow:
- The column mapping dialog is all new.
- Merge and Merge Join now have improved backpressure support.
- The Pivot and Row Count components get a UI.
- Components in a data flow can be grouped / ungrouped.

What's New in SSIS 2012? Programmability
Script task
- The script task and script component now support .NET 4.0.
- Breakpoints are supported in the script component.
Custom components
- When developing custom components, there is better backpressure support: the SupportsBackPressure property and the IsInputReady and GetDependentInputs methods.
- .NET API and PowerShell.
What's New in SSIS 2012? New + Expression + Task
Expression task
- No need to use a script task to re-assign variables. An expression task can be used instead to modify variables at a specific point in the control flow.
- The 4,000-character expression length limit is lifted.
New expression language keywords
- LEFT, as syntactic sugar for SUBSTRING(,1,).
- TOKEN and TOKENCOUNT, for shredding strings.
- REPLACENULL.

What's New in SSIS 2012? All-new VS2010 project
- SSIS can work in the new project mode or in the old package mode.
- In the new project mode:
  - The project becomes the unit of deployment.
  - Deployment can only be done to SQL Server (not to MSDB, but to a newly introduced catalog called SSISDB).
  - Logging is automated (to SSISDB).
- The project can be converted between deployment types.
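Sketch for the new expression keywords: a rough Python emulation of what LEFT, TOKEN, TOKENCOUNT and REPLACENULL do, to make their behavior concrete. This is not the SSIS implementation. In particular it assumes that every character of the delimiter string is a separate delimiter, that empty tokens are skipped, and that an out-of-range occurrence yields NULL (None here); check those edge cases against the SSIS documentation.

```python
def left(text, length):
    """LEFT(text, length): same result as SUBSTRING(text, 1, length)."""
    return text[:length]


def _tokens(text, delimiters):
    """Split text on any delimiter character, skipping empty tokens (assumed behavior)."""
    tokens, current = [], ""
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def token(text, delimiters, occurrence):
    """TOKEN(text, delimiters, occurrence): the occurrence-th token, 1-based."""
    tokens = _tokens(text, delimiters)
    return tokens[occurrence - 1] if 1 <= occurrence <= len(tokens) else None


def token_count(text, delimiters):
    """TOKENCOUNT(text, delimiters): number of tokens in text."""
    return len(_tokens(text, delimiters))


def replace_null(value, default):
    """REPLACENULL(value, default): default when value is NULL, else value."""
    return default if value is None else value
```

For example, token("a little white lie", " ", 2) picks the second word, which in SSIS would be written as TOKEN("a little white lie", " ", 2).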
What's New in SSIS 2012? Deployment models
Package deployment model (what we know today)
- Deploy DTSX files to the file system or MSDB.
Project deployment model (a new type of deployment model)
- Deploy ISPAC files to SSISDB.

What's New in SSIS 2012?
Characteristic | Package | Project
Unit of deployment | Package | Project
Deployment location | File system or MSDB database | Integration Services catalog
Run-time property value assignment | Configurations | Parameters
Environment-specific values for use in property values | Configurations | Environment variables
Package validation | Just before execution, using DTExec or managed code | Independent of execution, using the SQL Server Management Studio interface, a stored procedure or managed code
Package execution | DTExec, DTExecUI | SQL Server Management Studio interface, stored procedure, managed code
Logging | Configure a log provider or implement custom logging | No configuration required
Scheduling | SQL Server Agent job | SQL Server Agent job
CLR integration | Not required | Required
DEMO: What's new

What's New in SSIS 2012? Parameters
- Package scope.
- Project scope.
- Once assigned, a parameter is read-only: values are set when the package starts and cannot be changed at run time.
- Can also be set from SSDT.
- Do not replace the use of variables.
- Default values can be configured.
What's New in SSIS 2012? (Real) shared connection managers
- Defined at the project level and automatically available to every package in the project!
- Can be parameterized.
- An in-memory data cache is also available: you can cache data in one package and use it in another.

What's New in SSIS 2012? Under the hood
Get to know the new catalog: SSISDB
- Not created automatically. It needs to be created before deployment.
- Managed via SSMS.
- Stores all SSIS-related content.
- Allows running, monitoring and managing SSIS projects and packages via SSMS.
What's New in SSIS 2012? Under the hood
Source control
- DTSX files are more readable and can now be merged in TFS (the XML is sorted, filtered and pretty-printed).

What's New in SSIS 2012? Under the hood
Data taps
- Can be added at run time on the server itself!
- No need to open or modify a production package!
- Writes the data to disk instead of visualizing it.
- Hands on.
DEMO: Setting up a data tap

What's New in SSIS 2012? CDC (change data capture)
- An incremental load loads all rows that have changed since the last load.
- How do we know what has changed?
  - Compare every source row with every destination row.
  - A last-modified date and a trigger to maintain it.
  - Change tracking.
  - Change data capture!
Data Quality Services (DQS)
- Will be reviewed later.
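Sketch for the "last-modified date" option on the incremental load slide: a minimal Python illustration that filters rows against a stored watermark. The row layout and the `modified` column name are hypothetical. CDC and change tracking make this comparison unnecessary, because the engine records the changes for you.

```python
from datetime import datetime


def rows_changed_since(rows, last_load):
    """Return the rows whose last-modified timestamp is later than the previous load."""
    return [row for row in rows if row["modified"] > last_load]


def next_watermark(rows, last_load):
    """The new watermark is the latest modification time seen so far."""
    return max([row["modified"] for row in rows] + [last_load])


# Hypothetical source rows, for illustration only.
source_rows = [
    {"id": 1, "modified": datetime(2012, 3, 1, 8, 0)},
    {"id": 2, "modified": datetime(2012, 3, 2, 9, 30)},
    {"id": 3, "modified": datetime(2012, 3, 3, 7, 15)},
]
last_load = datetime(2012, 3, 2, 0, 0)
delta = rows_changed_since(source_rows, last_load)
```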
What's New in SSIS 2012? Native ODBC support added
- Native ODBC was not supported in previous versions (ODBC was available only via ADO.NET).
- An ODBC source and destination are added in 2012 (essential for SQL Azure).
- ODBC is faster!

# of rows | ODBC (sec) | ADO.NET (sec) | % Diff
1,000 | 0.42 | 2.12 | 405%
10,000 | 4.91 | 7.84 | 60%
100,000 | 49.2 | 78.36 | 59%
1,000,000 | 481.65 | 781.28 | 62%
* Benchmark by Nico Jacobs (@sqlwaldorf)

DEMO: Get to know SQL Server Data Tools
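Sketch for reading the benchmark table: the "% Diff" column is consistent with the extra time ADO.NET takes, relative to the ODBC time. The Python below recomputes it from the timings on the slide; the formula is inferred from the numbers and is not stated in the deck.

```python
# (rows, ODBC seconds, ADO.NET seconds) as listed on the slide.
benchmark = [
    (1_000, 0.42, 2.12),
    (10_000, 4.91, 7.84),
    (100_000, 49.2, 78.36),
    (1_000_000, 481.65, 781.28),
]


def percent_diff(odbc_sec, ado_sec):
    """Extra time taken by ADO.NET, as a percentage of the ODBC time."""
    return (ado_sec - odbc_sec) / odbc_sec * 100


percentages = [round(percent_diff(odbc, ado)) for _, odbc, ado in benchmark]
```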
High Data Quality Using DQS
Bad data is bad business! Or: DQS in a nutshell
- A knowledge-driven data quality solution.
- People
- Processes
- Technology
High Data Quality Using DQS: Meet the lead actors
- Data stewards.
- Information workers.
- They are the real owners of the data!
- In charge of making the rules (creating knowledge) and driving the technology.

High Data Quality Using DQS: The flow
[Diagram labels: Knowledge management, Build, Knowledge base (KB), Source data, Map, Use, Data Quality project, Export, Processed data]
High Data Quality Using DQS

Q: When to use DQS?
High Data Quality Using DQS: Bad data is bad business!
Issue | Detail
Completeness | Is all information present?
Conformity | Is all data in the correct format?
Consistency | Do values represent the same meaning?
Accuracy | Do data objects represent their real-world values?
Validity | Do data values fall within acceptable ranges?
Duplication | Are there multiple copies of the same data?

High Data Quality Using DQS: DQS and SSIS
DQS integration via SSIS: the DQS Cleansing transformation
The DQS Cleansing component can be used when:
- Cleansing should be performed as a batch process.
- The cleansing functionality is used as part of a larger data integration scenario.
- The cleansing process has to be automated, or run periodically.
High Data Quality Using DQS: DQS and SSIS
[Diagram labels: Source, SSIS package, Cleansing task, DQS server, DQS KB, Correct records, Corrected records, Suggested records, Invalid records]

High Data Quality Using DQS: Bad data is bad business!
DQS Cleansing output statuses
Status | Description
Correct | The value was already correct, and was not modified.
Invalid | The value was marked as invalid for this domain.
Corrected | The value was incorrect, but DQS was able to correct it. The Corrected column will contain the modified value.
Unknown | The value wasn't in the current domain, and did not match any domain rules. DQS is unsure whether or not it is valid.
Suggestion | The value wasn't an exact match, but DQS has provided a suggestion. If you include the Confidence field, you could automatically accept rows above a certain confidence level, and redirect others to a separate table.
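Sketch for routing on the cleansing status: the Python below shows the decision a conditional split after the DQS Cleansing transformation could make. The 0.8 confidence threshold, the dictionary keys and the output names ("accept", "review") are assumptions for illustration; the deck only says that rows above "a certain confidence level" can be accepted automatically.

```python
def route_record(record, min_confidence=0.8):
    """Decide where a cleansed record goes, based on its DQS status.

    record is a dict with a 'status' key and, for suggestions, a 'confidence' key.
    """
    status = record["status"]
    if status in ("Correct", "Corrected"):
        return "accept"
    if status == "Suggestion":
        # Accept confident suggestions automatically, send the rest to a review table.
        return "accept" if record.get("confidence", 0.0) >= min_confidence else "review"
    # Invalid and Unknown values need a data steward's attention.
    return "review"
```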
DEMO: Setting up the DQS cleansing task in SSIS
SSIS advanced topics
- Understanding resource utilization of your packages
- SSIS design tips
- SSIS as an in-memory pipeline
- Monitoring SSIS execution
- SSIS in large organizations

SSIS advanced topics: Understanding resource utilization of packages
- The data flow process is a row-by-row operation.
- It happens in memory and at high speed.
- It is important to understand the resource utilization of your packages, i.e., their CPU, memory, I/O, and network utilization.
Q: What is a buffer?

SSIS advanced topics: Understanding resource utilization of packages
CPU
- SQL Server vs. SSIS on the same machine? SQL Server will probably win.
- Transformations will slow down because SSIS will write to disk if RAM gets low.
- Perfmon counter: Process / % Processor Time (Total).
SSIS advanced topics: Understanding resource utilization of packages
Network
- Your transformation is only as fast as your network packets flow.
- If a distributed environment is used, make sure that the network throughput is good enough.

SSIS advanced topics: Understanding resource utilization of packages
I/O
- Ensure that nothing is written to disk except when data is initially read or finally written.
- Make sure your server is designed for a high number of IOPS (I/O operations per second).
- If your I/O subsystem is not well designed, SQL Server will be stuck on IO_COMPLETION and CXPACKET waits.
SSIS advanced topics: Understanding resource utilization of packages
Memory
- Monitor DTEXEC.exe. It will tell you how much memory your package is consuming.
- Process / Private Bytes (DTEXEC.exe): the amount of memory currently in use by Integration Services. This memory cannot be shared with other processes.
- Process / Working Set (DTEXEC.exe): the total amount of memory allocated by Integration Services.

Q: What's the difference between a synchronous and an asynchronous component?
SSIS advanced topics: What to look for?
Leaking buffers
- On package completion, buffer usage returns to 0. If it does not, buffers are leaking.
- Use the "Buffers in use", "Flat buffers in use" and "Private buffers in use" counters to discover this.
Memory swapping
- "Buffers spooled" should always stay at 0. If it is above 0, your package is using I/O instead of RAM.
Execution progress
- "Rows read" and "Rows written" show how many rows the entire data flow has processed.

DEMO: Working with perfmon counters
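Sketch for the "what to look for" checks: a small Python function that applies the two rules from the slide to a snapshot of SSIS pipeline counter values. The dictionary format and the finding labels are invented for illustration; how the counter values are collected (perfmon export, logging, etc.) is left open.

```python
BUFFER_COUNTERS = ("Buffers in use", "Flat buffers in use", "Private buffers in use")


def diagnose(counters, package_finished):
    """Return a list of findings for one snapshot of pipeline counters."""
    findings = []
    # Any spooled buffer means the data flow is paging to disk instead of using RAM.
    if counters.get("Buffers spooled", 0) > 0:
        findings.append("memory swapping")
    # After the package completes, every buffer counter should be back at 0.
    if package_finished and any(counters.get(name, 0) != 0 for name in BUFFER_COUNTERS):
        findings.append("leaking buffers")
    return findings
```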
SSIS advanced topics: SSIS design tips
"It's only a data type change, how long can it take?"
A good package design
Development process
- Agile (iterative) design and implementation (don't try to boil the ocean).
Planning
- What are the limitations of my servers (I/O, RAM)?
- What sources am I working with?
- What is my process window?
Implementation
- Process modularity: logically distinct packages.
- Package modularity: sub-processes in containers.
- Component modularity: custom components, assemblies.
SSIS advanced topics: SSIS design tips
- Design your package to be modular.
- Keep a single repetitive design pattern (seen one, seen 'em all).
- Design your package to be re-runnable.
  - Use event handlers (on error, etc.).
  - Use checkpoints if needed.
  - Make sure that if the package fails, it will pick up where it left off.

SSIS advanced topics: SSIS design tips
- Sharpen, separate and re-use code if possible.
- Wrap complex queries in stored procedures or user-defined tables.
- Use custom components or custom assemblies to store the code for proprietary, frequently used operations.
- Avoid code duplication!
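Sketch for "pick up where it left off": the Python below imitates the idea behind checkpoints with a set of completed step names. It is a conceptual illustration only, not how the SSIS checkpoint file works internally; the step names and the simulated failure are made up.

```python
def run_with_checkpoint(steps, completed):
    """Run the steps in order, skipping those already recorded as completed."""
    for name, action in steps:
        if name in completed:
            continue
        action()             # may raise; the step is then NOT recorded
        completed.add(name)  # record success so a re-run skips this step


def demo():
    """Fail once in the middle, then re-run and resume after the last good step."""
    calls = []
    attempts = {"transform": 0}

    def extract():
        calls.append("extract")

    def transform():
        attempts["transform"] += 1
        if attempts["transform"] == 1:
            raise RuntimeError("simulated failure")
        calls.append("transform")

    def load():
        calls.append("load")

    steps = [("extract", extract), ("transform", transform), ("load", load)]
    completed = set()
    try:
        run_with_checkpoint(steps, completed)
    except RuntimeError:
        pass  # first run fails in 'transform'
    run_with_checkpoint(steps, completed)  # second run skips 'extract'
    return calls
```

In the demo, "extract" runs only once even though the process is started twice, which is the behavior a re-runnable package should have.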
SSIS advanced topics: SSIS design tips
- Keep conformed naming conventions.
- Repetitive blueprint design patterns.
- Presentable layout.
- Annotations (like the comments you use in code).
- Error logging.
- Configurations.

SSIS advanced topics: SSIS as an in-memory pipeline
Share a cache among multiple packages
- With the Cache connection manager.
- Use a shared lookup cache.
SSIS advanced topics: SSIS as an in-memory pipeline
Preparing a shared data cache
- Building the cache.
- Warming it up.
- Consuming the cache in an external package.
- Can be saved to a CAW file.
- If a file option is selected, the consuming package does not have to be a child package.

SSIS advanced topics: SSIS as an in-memory pipeline
Understand the data flow
SSIS advanced topics: SSIS as an in-memory pipeline
- Try to avoid asynchronous components.
  - They wait for all of the rows to flow in before starting to work (Aggregate, Sort, etc.).
  - They create a new buffer and give the rows a new lineage ID, so you are actually duplicating the data.
- The EngineThreads property is a hint.

DEMO: Using shared cache and lookups
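Sketch for synchronous vs. blocking components: Python generators make the difference visible. The synchronous transform emits its first output after pulling one row; the sort has to pull every row first. This is an analogy only. Real SSIS components work on buffers of rows rather than single rows, and the function names are invented.

```python
def source(rows, pulled):
    """Yield rows one by one and record each row that a downstream component pulls."""
    for row in rows:
        pulled.append(row)
        yield row


def uppercase(rows):
    """Synchronous style: each input row produces an output row immediately."""
    for row in rows:
        yield row.upper()


def sort_rows(rows):
    """Blocking style: all input must arrive before the first output row exists."""
    yield from sorted(rows)


def rows_pulled_before_first_output(transform, data):
    """Count how many source rows were consumed to produce the first output row."""
    pulled = []
    next(transform(source(data, pulled)))
    return len(pulled)


data = ["delta", "alpha", "charlie", "bravo"]
sync_cost = rows_pulled_before_first_output(uppercase, data)
blocking_cost = rows_pulled_before_first_output(sort_rows, data)
```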
Monitoring SSIS Execution
- SSISDB SQL queries.
- SSRS reports.
  - Built in.
  - External / your own:
    - SSIS Reporting Pack (search for it on CodePlex).
    - DTLoggedExec: a very nice project that lets you fully log and instrument package execution.
- Performance monitor: monitor the SSIS counters.
- Logging: event handlers and native logging can also help.
DEMO: Monitoring SSIS packages

SSIS in large organizations: Large telco company
- 500 cell towers.
  - Location.
  - Data connections and calls.
- ~50 Unix switches (one switch for every 5-10 towers).
- Each switch produces 500MB to 5GB flat files.
- 7~8 TB per day.
- 64 SSIS servers.
SSIS in large organizations: Large telco company
[Diagram labels: Unix switches, SSIS servers (Extract, Transform, Load), DWH]

SSIS in large organizations: Work pile pattern
[Diagram labels: Work pile / queue (P1, P2, P3 to Pn), Scheduler, SSIS servers running DTEXEC (1), DTEXEC (2) to DTEXEC (n), Shared resources, SAN, 2012]
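Sketch for the work pile pattern: the Python below models the queue of pending packages (P1 to Pn) and a fixed number of workers that each take the next package when they are free. Threads stand in for the SSIS servers, and the place where DTEXEC would be launched is only a comment; the package names and worker count are hypothetical.

```python
import queue
import threading


def run_work_pile(packages, worker_count=3):
    """Let worker_count workers drain a shared pile of packages."""
    pile = queue.Queue()
    for package in packages:
        pile.put(package)

    processed = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                package = pile.get_nowait()
            except queue.Empty:
                return  # pile is empty, this worker is done
            # A real worker would launch DTEXEC for the package here.
            with lock:
                processed.append((worker_id, package))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(1, worker_count + 1)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


result = run_work_pile([f"P{i}" for i in range(1, 11)], worker_count=3)
```

Because every worker pulls its next item only when it is idle, large and small files balance out across servers without a fixed assignment.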
DEMO: Work pile pattern in real-life size