David Dye Extract, Transform, Load
Extract, Transform, Load
  - Overview
  - SQL Tools
  - Load Considerations
Introduction
  - David Dye
  - derekman1@msn.com
  - HTTP://WWW.SQLSAFETY.COM
Overview
ETL Overview
  - Extract
    - Define the source of the data
    - Can be SQL Server, Oracle, Excel, XML, or text
  - Transform
    - Ensure the data is accurate and consistent
    - Apply business logic
    - Aggregate
  - Load
    - Define the destination
    - Load the transformed data
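The three stages above can be sketched in T-SQL. This is a minimal illustration, not a complete solution: the tables dbo.StageSales and dbo.SalesByRegion are hypothetical, the extract step is assumed to have already filled the staging table, and TRY_CONVERT assumes SQL Server 2012 or later.

```sql
-- Hypothetical staging table: raw text values as extracted from the source.
CREATE TABLE dbo.StageSales (
    Region VARCHAR(50) NULL,
    Amount VARCHAR(20) NULL
);

-- Hypothetical destination table for the transformed, aggregated data.
CREATE TABLE dbo.SalesByRegion (
    Region      VARCHAR(50)   NOT NULL PRIMARY KEY,
    TotalAmount DECIMAL(18,2) NOT NULL
);

-- Transform and load in one statement:
--   * normalize the region name (consistency)
--   * skip rows whose amount is not a valid decimal (accuracy)
--   * aggregate to one row per region
INSERT INTO dbo.SalesByRegion (Region, TotalAmount)
SELECT UPPER(LTRIM(RTRIM(Region))),
       SUM(TRY_CONVERT(DECIMAL(18,2), Amount))
FROM dbo.StageSales
WHERE Region IS NOT NULL
  AND TRY_CONVERT(DECIMAL(18,2), Amount) IS NOT NULL
GROUP BY UPPER(LTRIM(RTRIM(Region)));
```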
SQL Tools
SQL Tools
  - T-SQL
  - MERGE
  - Bulk copy program (bcp)
  - BULK INSERT
  - OPENROWSET(BULK)
T-SQL
  - Data Manipulation Language (DML) can be used to
    - INSERT
    - UPDATE
    - DELETE
  - T-SQL code can be compartmentalized using
    - Views
    - Stored procedures
    - Functions
MERGE
  - Available in all editions of SQL Server beginning with SQL Server 2008
  - T-SQL statement that can INSERT, UPDATE, and DELETE, all within a single statement
  - Syntax
    - MERGE: the target table that rows will be inserted into, updated in, or deleted from
    - USING: the source table, joined back to the target with an ON predicate
    - WHEN MATCHED: the action to take when a source row and a target row satisfy the join predicate
    - WHEN NOT MATCHED: the action to take when the join predicate is not satisfied (BY TARGET for rows found only in the source, BY SOURCE for rows found only in the target)
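As an illustration of the clauses above, here is a minimal sketch that synchronizes a target with a source. The tables dbo.DimCustomer and dbo.StageCustomer are hypothetical and exist only for this example.

```sql
-- Hypothetical target and source tables.
CREATE TABLE dbo.DimCustomer (
    CustomerID   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);
CREATE TABLE dbo.StageCustomer (
    CustomerID   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

MERGE dbo.DimCustomer AS tgt
USING dbo.StageCustomer AS src
    ON tgt.CustomerID = src.CustomerID
-- Row exists in both tables and the name changed: update the target.
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
-- Row exists only in the source: insert it into the target.
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName)
    VALUES (src.CustomerID, src.CustomerName)
-- Row exists only in the target: remove it.
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- MERGE must be terminated with a semicolon
```

The WHEN NOT MATCHED BY SOURCE branch deletes every target row that is missing from the source, so omit it when the source holds only a partial (incremental) extract.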
When to Use MERGE
  - Because MERGE handles all INSERTs, UPDATEs, and DELETEs in a single statement, and therefore as a single atomic operation, it is often more efficient than separate statements
  - This is a general statement, not a guarantee
  - Validate that MERGE is actually more efficient than separate INSERT, UPDATE, and DELETE statements for your workload
bcp Utility
  - Command-line tool to import data into, or export data from, SQL Server
  - Import can be done from a file in a user-specified format
  - A bcp data file does not include any information about the data
    - Table structure
    - Data types
    - Constraints
  - A format file is used to hold this metadata
  - The format file is optional
    - Can be used to ease importing or exporting
    - The -f switch specifies the format file to use
    - bcp can optionally create a format file
    - When -f is used with in or out, bcp requires an existing format file
BULK INSERT
  - Options similar to bcp, but implemented as T-SQL
  - Runs in the SQL Server process
  - Can be executed within
    - Stored procedures
    - User-defined transactions
  - Supports
    - CHECK_CONSTRAINTS
    - FIRE_TRIGGERS
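A minimal sketch of BULK INSERT inside a user-defined transaction follows. The table dbo.StageProduct, the file path, and the file layout (comma-delimited with a header row) are assumptions made for illustration; the path is resolved on the server because the statement runs in the SQL Server process.

```sql
-- Hypothetical staging table with a check constraint.
CREATE TABLE dbo.StageProduct (
    ProductID   INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL,
    ListPrice   DECIMAL(10,2) NOT NULL
        CONSTRAINT CK_StageProduct_ListPrice CHECK (ListPrice >= 0)
);

SET XACT_ABORT ON;  -- roll back the whole transaction on a run-time error

BEGIN TRANSACTION;

BULK INSERT dbo.StageProduct
FROM 'C:\etl\products.csv'      -- hypothetical path on the server
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,        -- skip the assumed header row
    CHECK_CONSTRAINTS,          -- constraints are ignored unless requested
    FIRE_TRIGGERS               -- insert triggers do not fire unless requested
);

COMMIT TRANSACTION;
```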
OPENROWSET Function
  - Allows access to remote data sources through an OLE DB provider
    - Ad hoc access is disabled by default
  - Offers a BULK provider for imports from files
  - Implemented as T-SQL and used in the FROM clause
  - Supports special table hints
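The BULK provider can be sketched as an INSERT ... SELECT with a table hint on the target. Everything named here is hypothetical: the table dbo.StageSupplier, both file paths, and a format file that is assumed to describe two columns named SupplierID and SupplierName.

```sql
-- Hypothetical staging table.
CREATE TABLE dbo.StageSupplier (
    SupplierID   INT           NOT NULL,
    SupplierName NVARCHAR(100) NOT NULL
);

-- OPENROWSET(BULK ...) appears in the FROM clause like any other rowset,
-- so the rows can be filtered or transformed before they are loaded.
INSERT INTO dbo.StageSupplier WITH (TABLOCK) (SupplierID, SupplierName)
SELECT src.SupplierID,
       LTRIM(RTRIM(src.SupplierName))
FROM OPENROWSET(
         BULK 'C:\etl\suppliers.dat',            -- hypothetical data file
         FORMATFILE = 'C:\etl\suppliers.fmt'     -- hypothetical format file
     ) AS src
WHERE src.SupplierID IS NOT NULL;
```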
Demonstration
  - bcp
  - BULK INSERT
  - OPENROWSET
Load Considerations
Check Constraints
  - Business logic can be incorporated in the transformation to ensure the data satisfies the constraint logic
  - Constraints can be disabled during the load and re-enabled afterward
    - By default, existing values are not validated when the constraint is re-enabled
    - To validate existing values, re-enable with WITH CHECK CHECK CONSTRAINT
  - Once enabled, with or without WITH CHECK, the constraint validates all incoming values
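The disable, load, and re-enable cycle might look like the following sketch. The table dbo.OrderLine and its constraint are hypothetical.

```sql
-- Hypothetical table with a check constraint.
CREATE TABLE dbo.OrderLine (
    OrderLineID INT NOT NULL PRIMARY KEY,
    Quantity    INT NOT NULL,
    CONSTRAINT CK_OrderLine_Quantity CHECK (Quantity > 0)
);

-- 1. Disable the constraint before the load.
ALTER TABLE dbo.OrderLine NOCHECK CONSTRAINT CK_OrderLine_Quantity;

-- 2. Load (a single row stands in for the real load here).
INSERT INTO dbo.OrderLine (OrderLineID, Quantity) VALUES (1, 5);

-- 3. Re-enable AND validate existing rows.
--    The first CHECK (WITH CHECK) validates existing data;
--    the second (CHECK CONSTRAINT) turns the constraint back on.
ALTER TABLE dbo.OrderLine WITH CHECK CHECK CONSTRAINT CK_OrderLine_Quantity;

-- 4. Confirm the constraint is enabled and trusted.
SELECT name, is_disabled, is_not_trusted
FROM sys.check_constraints
WHERE parent_object_id = OBJECT_ID('dbo.OrderLine');
```

Re-enabling with plain CHECK CONSTRAINT (no WITH CHECK) leaves is_not_trusted set to 1, which means existing rows were never validated.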
Foreign Keys
  - Like check constraints, referential integrity can be verified during the transformation
  - Foreign keys can be disabled during the load and re-enabled after the load
  - Like check constraints, existing values are not validated when the foreign key is re-enabled
    - Validating them requires WITH CHECK CHECK CONSTRAINT
  - Once enabled, all incoming values are validated for referential integrity
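The same cycle applies to a foreign key. This is a sketch using hypothetical tables dbo.Customer and dbo.SalesOrder.

```sql
-- Hypothetical parent and child tables.
CREATE TABLE dbo.Customer (
    CustomerID INT NOT NULL PRIMARY KEY
);
CREATE TABLE dbo.SalesOrder (
    SalesOrderID INT NOT NULL PRIMARY KEY,
    CustomerID   INT NOT NULL,
    CONSTRAINT FK_SalesOrder_Customer
        FOREIGN KEY (CustomerID) REFERENCES dbo.Customer (CustomerID)
);

-- Disable the foreign key for the load.
ALTER TABLE dbo.SalesOrder NOCHECK CONSTRAINT FK_SalesOrder_Customer;

-- Load parent and child rows (order no longer matters while disabled).
INSERT INTO dbo.SalesOrder (SalesOrderID, CustomerID) VALUES (100, 1);
INSERT INTO dbo.Customer (CustomerID) VALUES (1);

-- Re-enable and validate existing rows; this fails if any orphaned
-- child rows were loaded.
ALTER TABLE dbo.SalesOrder WITH CHECK CHECK CONSTRAINT FK_SalesOrder_Customer;

-- is_not_trusted = 0 confirms that existing rows were validated.
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID('dbo.SalesOrder');
```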
Primary Keys
  - Primary keys can be disabled by disabling the underlying index
  - Re-enabling a primary key requires rebuilding the index
  - During the index rebuild ALL values are validated
  - Disabling a primary key also disables all foreign keys that reference it
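A minimal sketch of disabling and rebuilding a primary key follows. The table dbo.LoadTarget is hypothetical, and the key is declared NONCLUSTERED on purpose: disabling a clustered index makes the table's data inaccessible, so a load could not run against it.

```sql
-- Hypothetical table with a nonclustered primary key.
CREATE TABLE dbo.LoadTarget (
    ID      INT         NOT NULL,
    Payload VARCHAR(50) NULL,
    CONSTRAINT PK_LoadTarget PRIMARY KEY NONCLUSTERED (ID)
);

-- Disable the primary key by disabling its index.
ALTER INDEX PK_LoadTarget ON dbo.LoadTarget DISABLE;

-- Load; uniqueness is NOT enforced while the index is disabled.
INSERT INTO dbo.LoadTarget (ID, Payload) VALUES (1, 'a'), (2, 'b');

-- Re-enable by rebuilding. All rows are validated here, so the rebuild
-- fails if duplicate key values were loaded.
ALTER INDEX PK_LoadTarget ON dbo.LoadTarget REBUILD;

-- Foreign keys that were disabled along with the primary key are not
-- re-enabled by the rebuild; re-enable each one with
-- ALTER TABLE ... WITH CHECK CHECK CONSTRAINT ...
```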
Unique Constraints
  - Unique constraints can be disabled by disabling the underlying index
  - Re-enabling a unique constraint requires rebuilding the index
  - During the index rebuild ALL values are validated
  - Disabling a unique constraint also disables all foreign keys that reference the unique constraint
Indexes
  - Both clustered and non-clustered indexes are maintained transactionally
    - As rows are inserted, updated, and deleted, the index(es) must be updated if the key column(s) are affected
  - Disabling the index(es) can speed up the load and reduce logging
  - Enabling an index requires rebuilding it
    - By default the index is unavailable while it is being rebuilt
    - Enterprise edition can rebuild online
      - Uses tempdb
  - Once disabled, the index(es) will not be available
    - This can dramatically affect query performance
    - Often offset by the resources saved during the load with the indexes disabled
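Disabling a non-clustered index for the duration of a load can be sketched as follows. The table dbo.FactSales and the index name are hypothetical.

```sql
-- Hypothetical fact table with one non-clustered index.
CREATE TABLE dbo.FactSales (
    SalesID   INT           NOT NULL PRIMARY KEY,
    OrderDate DATE          NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
    ON dbo.FactSales (OrderDate);

-- Disable the index so the load does not have to maintain it.
ALTER INDEX IX_FactSales_OrderDate ON dbo.FactSales DISABLE;

-- Load (a single row stands in for the real load here).
INSERT INTO dbo.FactSales (SalesID, OrderDate, Amount)
VALUES (1, '2012-01-15', 250.00);

-- Rebuild to re-enable; queries cannot use the index until this completes.
ALTER INDEX IX_FactSales_OrderDate ON dbo.FactSales REBUILD;

-- Check the state of every index on the table.
SELECT name, type_desc, is_disabled
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.FactSales');
```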
Indexes and Constraints
  - Constraints
    - Foreign keys and check constraints are both constraints
    - By default an index is not created
    - WITH CHECK CHECK must be used to validate that existing values meet the constraint
  - Indexes
    - Primary key
      - Enforces uniqueness for all values
      - Does not accept any NULL values
    - Unique constraint and unique index are the same thing
      - Implemented as an index
      - Requires all values to be unique
      - Allows a single NULL value (ANSI allows multiple NULL values)
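The single-NULL behavior of a unique constraint can be seen with a short sketch; the table dbo.UniqueDemo is hypothetical.

```sql
-- Hypothetical table with a unique constraint on a nullable column.
CREATE TABLE dbo.UniqueDemo (
    Code INT NULL,
    CONSTRAINT UQ_UniqueDemo_Code UNIQUE (Code)
);

INSERT INTO dbo.UniqueDemo (Code) VALUES (10);    -- succeeds
INSERT INTO dbo.UniqueDemo (Code) VALUES (NULL);  -- succeeds: first NULL
INSERT INTO dbo.UniqueDemo (Code) VALUES (NULL);  -- fails: SQL Server treats
                                                  -- a second NULL as a duplicate
```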
Locking
  - Locking occurs automatically in SQL Server to ensure the ACID properties of a database
    - A: Atomicity
      - Each transaction is all or nothing; if one part of the transaction fails, the entire transaction fails
    - C: Consistency
      - Any transaction brings the database from one valid state to another
    - I: Isolation
      - Concurrent execution of transactions results in the system state that would be obtained if the transactions were executed one after the other
    - D: Durability
      - Once a transaction is committed it remains committed, even after a power loss, crash, or error
Minimizing Locking
  - SQL Server manages locks using lock escalation
    - Lower-level locks are generated first: page, index, range, etc.
    - Every lock requires resources, but lower-level locks increase concurrency
  - Lock escalation trades many fine-grained locks for fewer coarse-grained locks
    - This reduces the resources required, but also reduces concurrency
    - Ex. trading many page locks for a single table lock
  - The import process can use TABLOCK, which will speed up the load
    - Although the load is faster, concurrency is reduced
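Requesting the table lock up front might look like this sketch, in which dbo.StageShipment and dbo.FactShipment are hypothetical tables.

```sql
-- Hypothetical staging and destination tables.
CREATE TABLE dbo.StageShipment (
    ShipmentID INT          NOT NULL,
    ShippedOn  DATE         NOT NULL,
    Weight     DECIMAL(9,2) NOT NULL
);
CREATE TABLE dbo.FactShipment (
    ShipmentID INT          NOT NULL,
    ShippedOn  DATE         NOT NULL,
    Weight     DECIMAL(9,2) NOT NULL
);

-- TABLOCK takes one table-level lock at the start instead of acquiring
-- many row or page locks and escalating later. Other sessions are blocked
-- from the destination table until the statement finishes.
INSERT INTO dbo.FactShipment WITH (TABLOCK) (ShipmentID, ShippedOn, Weight)
SELECT ShipmentID, ShippedOn, Weight
FROM dbo.StageShipment;
```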
Minimizing Logging
  - Logging can be minimized by changing the recovery model
    - Reduced logging speeds up the load and reduces disk I/O during the load
  - The BULK_LOGGED recovery model provides minimal logging for bulk operations
  - The SIMPLE recovery model ensures that the transaction log is truncated after checkpoints occur
  - Trade-offs
    - Changing the recovery model to SIMPLE removes the ability to restore to a point in time
    - BULK_LOGGED recovery can prevent point-in-time restore of a log backup that contains bulk operations
  - For a data warehouse you can quite often completely reload the database with the existing ETL solution
    - This can be considered, but is prohibitive for a VLDB
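Switching the recovery model around a load can be sketched as below. The database name SalesDW and the backup path are hypothetical, and the sketch assumes the database normally runs in FULL recovery with an existing full backup.

```sql
-- Switch to BULK_LOGGED so qualifying bulk operations are minimally logged.
ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED;

-- ... run the bulk load here (bcp, BULK INSERT, or OPENROWSET(BULK)) ...

-- Switch back to FULL once the load completes.
ALTER DATABASE SalesDW SET RECOVERY FULL;

-- Back up the log right away so point-in-time restore is available
-- again for everything that happens after this backup.
BACKUP LOG SalesDW
TO DISK = N'C:\backup\SalesDW_afterload.trn';  -- hypothetical path

-- Confirm the current recovery model.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'SalesDW';
```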
Demonstration
  - Check constraints
  - Foreign keys
  - Disabling and re-enabling check constraints and foreign keys
  - Unique constraints (indexes)
  - Primary keys