Joining tables with SAS: does it have to be fast? Henrik Dorf, Chief Consultant, PS Commercial
Joining tables
  If it has to be fast, it takes:
  Options
  Knowledge
  Experiments
History
  One of the first things I heard about the SAS DATA step was: "SAS can mix, interleave, and combine data in every way you can think of, including all the ways the new SQL offers. And even more..." (1985)
The choice between DATA step and SQL
  SAS DATA step (1972)
    Fourth-generation tool for data processing, developed by SAS
    Often used with large data volumes
    Requires fewer computer resources
    Prerequisite: keep data sorted
  Structured Query Language (1979)
    Language designed for relational databases
    Tailored to transaction-based systems
    Exists in several variants
    Often requires more resources, but this depends on the implementation
Which joins exist?
  Type          SAS DATA step               SQL
  Concatenate   SET                         UNION
  Interleave    SET with BY                 UNION with ORDER BY
  Combine       MERGE                       JOIN
  Update        MERGE with BY, UPDATE,      JOIN, LEFT JOIN, RIGHT JOIN,
                MODIFY                      FULL JOIN, UPDATE + INSERT
  Split         DATA output1-n              CREATE TABLE table1
  Table lookup  Formats, keyed SET with     Subselect, join
                KEY=, hash object
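The table lookup row mentions the hash object without showing it. A minimal sketch of such a lookup, assuming small example tables A and B shaped like the ones used on the later slides; the output table name LOOKUP is made up for the illustration, and this is one possible pattern rather than the presenter's own code.

```sas
/* Hypothetical example data, shaped like the tables on the later slides */
data a;
  input k a c;
  datalines;
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
;
run;

data b;
  input k b c;
  datalines;
1 1 2
2 2 2
5 5 2
6 6 2
;
run;

/* Look up b for every row of A; neither table has to be sorted */
data lookup;
  if 0 then set b(keep=k b);        /* define k and b without reading any row */
  if _n_ = 1 then do;
    declare hash h(dataset: 'b');   /* load B into memory once */
    h.defineKey('k');
    h.defineData('b');
    h.defineDone();
  end;
  set a;
  /* find() returns 0 on a hit; on a miss, clear the retained value of b */
  if h.find() ne 0 then call missing(b);
run;
```

The result keeps every row of A, so it behaves like a left join on k. The lookup table has to fit in memory, which is the main trade-off against MERGE.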
Concatenation
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   2 2 2
  2   2 2 1      2   3 3 2
  3   3 3 1      3   4 4 2
  DATA step
    data D;
      set A B;
    run;
  SQL
    proc sql;
      create table S as
      select * from a
      union
      select * from b;
    quit;
  Result D (SET)       Result S (UNION)
  Obs k a c b          Obs k a c
  1   1 1 1 .          1   1 1 1
  2   2 2 1 .          2   2 2 1
  3   3 3 1 .          3   2 2 2
  4   2 . 2 2          4   3 3 1
  5   3 . 2 3          5   3 3 2
  6   4 . 2 4          6   4 4 2
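The two results differ because SQL UNION lines columns up by position and removes duplicate rows, whereas SET matches variables by name. A minimal sketch of a closer SQL counterpart, assuming the tables A and B from this slide: the OUTER UNION set operator with the CORR keyword overlays same-named columns and keeps every row. The table name S_CORR is made up for the illustration.

```sas
/* Tables A and B as on the concatenation slide */
data a;
  input k a c;
  datalines;
1 1 1
2 2 1
3 3 1
;
run;

data b;
  input k b c;
  datalines;
2 2 2
3 3 2
4 4 2
;
run;

proc sql;
  /* CORR matches columns by name; columns found in only one
     table (a, b) are kept and filled with missing values */
  create table s_corr as
  select * from a
  outer union corr
  select * from b;
quit;
```

With this form the column b survives instead of being folded into a, which is what the DATA step result D shows.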
Interleave
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   2 2 2
  2   2 2 1      2   3 3 2
  3   3 3 1      3   4 4 2
  DATA step
    data D;
      set A B;
      by K;
    run;
  SQL
    proc sql;
      create table S as
      select * from a
      union
      select * from b
      order by k;
    quit;
  Result D (SET with BY)   Result S (UNION with ORDER BY)
  Obs k a c b              Obs k a c
  1   1 1 1 .              1   1 1 1
  2   2 2 1 .              2   2 2 1
  3   2 . 2 2              3   2 2 2
  4   3 3 1 .              4   3 3 1
  5   3 . 2 3              5   3 3 2
  6   4 . 2 4              6   4 4 2
Merge / join: match merge
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   3 3 2
  4   4 4 1      4   4 4 2
  5   5 5 1      5   5 5 2
  DATA step
    data D1;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S1 as
      select *
      from a join b
        on a.k=b.k;
    quit;
  Result D1 (merge)    Result S1 (join)
  Obs k a c b          Obs k a b c
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 2 1
  3   3 3 2 3          3   3 3 3 1
  4   4 4 2 4          4   4 4 4 1
  5   5 5 2 5          5   5 5 5 1
Merge / join: keys do not all match
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a join b
        on a.k=b.k;
    quit;
  Result D (merge)     Result S (join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   5 5 1 5
  4   4 4 1 .
  5   5 5 2 5
  6   6 . 2 6
Which join?
  "Join" covers many things:
  (Inner) join
  Right join
  Left join
  Full join
  Cross join
  They all give different results
Join: inner
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a inner join b
        on a.k=b.k;
    quit;
  Result D (merge)     Result S (inner join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   5 5 1 5
  4   4 4 1 .
  5   5 5 2 5
  6   6 . 2 6
Join: inner (default)
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a,b
      where a.k=b.k;
    quit;
  Result D (merge)     Result S (inner join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   5 5 1 5
  4   4 4 1 .
  5   5 5 2 5
  6   6 . 2 6
Join: right
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a right join b
        on a.k=b.k;
    quit;
  Result D (merge)     Result S (right join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   5 5 1 5
  4   4 4 1 .          4   . . . 6
  5   5 5 2 5
  6   6 . 2 6
Join: left
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a left join b
        on a.k=b.k;
    quit;
  Result D (merge)     Result S (left join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   3 3 1 .
  4   4 4 1 .          4   4 4 1 .
  5   5 5 2 5          5   5 5 1 5
  6   6 . 2 6
Join: full
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a full join b
        on a.k=b.k;
    quit;
  Result D (merge)     Result S (full join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 2
  3   3 3 1 .          3   3 3 1 .
  4   4 4 1 .          4   4 4 1 .
  5   5 5 2 5          5   5 5 1 5
  6   6 . 2 6          6   . . . 6
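The full join comes closest to the plain MERGE, but `select *` keeps k and c from A only, so the row that exists only in B loses its key. A minimal sketch of how the merge result could be reproduced in SQL, assuming tables A and B as on this slide and assuming c is never missing in B; the table name S_MERGE is made up for the illustration.

```sas
/* Tables A and B as on the join slides */
data a;
  input k a c;
  datalines;
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
;
run;

data b;
  input k b c;
  datalines;
1 1 2
2 2 2
5 5 2
6 6 2
;
run;

proc sql;
  create table s_merge as
  select coalesce(a.k, b.k) as k,   /* take the key from whichever side has it */
         a.a,
         coalesce(b.c, a.c) as c,   /* MERGE lets the last table listed win */
         b.b
  from a full join b
    on a.k = b.k
  order by 1;                       /* sort on the first output column, k */
quit;
```

COALESCE is only an approximation of MERGE here: a missing c in a matching row of B would overwrite the value from A in the DATA step, but not in this query.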
Join: cross
  Table A        Table B
  Obs k a c      Obs k b c
  1   1 1 1      1   1 1 2
  2   2 2 1      2   2 2 2
  3   3 3 1      3   5 5 2
  4   4 4 1      4   6 6 2
  5   5 5 1
  DATA step
    data D;
      merge a b;
      by k;
    run;
  SQL
    proc sql;
      create table S as
      select *
      from a cross join b;
    quit;
  NOTE: The execution of this query involves performing one or more Cartesian product joins that cannot be optimized
  Result D (merge)     Result S (cross join)
  Obs k a c b          Obs k a c b
  1   1 1 2 1          1   1 1 1 1
  2   2 2 2 2          2   2 2 1 1
  3   3 3 1 .          3   3 3 1 1
  4   4 4 1 .          4   4 4 1 1
  5   5 5 2 5          5   5 5 1 1
  6   6 . 2 6          6   1 1 1 2
                       7   2 2 1 2
                       8   3 3 1 2
                       9   4 4 1 2
                       10  5 5 1 2
                       11  1 1 1 5
                       12  2 2 1 5
                       13  3 3 1 5
                       14  4 4 1 5
                       15  5 5 1 5
                       16  1 1 1 6
                       17  2 2 1 6
                       18  3 3 1 6
                       19  4 4 1 6
                       20  5 5 1 6
Cartesian product
  Very resource-intensive in terms of memory (and paging)
  NOTE: The execution of this query involves performing one or more Cartesian product joins that cannot be optimized
Optimization: I/O
  Resources used are:
  Input tables (A, B)
  Output table (S)
  Utility files
Optimization: what can we measure?
  Space usage of:
  Input tables (A, B)
  Output table (S)
  Utility files
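Alongside disk space, SAS can report time and memory for each step in the log. A minimal sketch, assuming tables A and B as on the join slides: the FULLSTIMER system option extends the log notes after every step with real time, CPU time, and memory figures. It does not show the size of utility files, so it complements the space measurement rather than replacing it.

```sas
/* Tables A and B as on the join slides */
data a;
  input k a c;
  datalines;
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
;
run;

data b;
  input k b c;
  datalines;
1 1 2
2 2 2
5 5 2
6 6 2
;
run;

options fullstimer;     /* detailed resource notes in the log for each step */

data d;                 /* DATA step variant */
  merge a b;
  by k;
run;

proc sql;               /* SQL variant */
  create table s as
  select *
  from a full join b
    on a.k = b.k;
quit;

options nofullstimer;   /* back to the short timing notes */
```

On tables this small the figures are dominated by overhead; the comparison only becomes meaningful with realistic data volumes.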
Optimization: special monitoring
  Input tables (A, B)
  Output table (S)
  Utility files
Optimization: working set measurement
  Developed for the purpose
  Measures the size of folders or files
  Writes a log file:
    1:,Count,Date,Time,D:\SASWORK\_TD6600,D:\SASUTIL\sas_util0001000019C8_kohdoxp5,D:\SASWORK\_TD6600\a.sas7bdat*,D:\SASWORK\_TD6600\s.sas7bdat.lck
    2:,0,24052010,180228,570215682,0,162542592,0
    2:,1,24052010,180229,570215682,0,162542592,0
    2:,2,24052010,180230,570215682,0,162542592,0
    2:,3,24052010,180231,570215682,0,162542592,0
    2:,4,24052010,180232,570215682,0,162542592,0
    2:,5,24052010,180233,570215682,0,162542592,0
  Stops when the last measurement point disappears: D:\SASWORK\_TD6600\s.sas7bdat.lck
Optimization: demo
Workspace SQL: Join
  [Chart; series: WORK.S, SAS Utility, Work.a]
Workspace DATA step: Merge
Workspace SQL: Join + DATA step: Merge
Workspace SQL: Union
Workspace DATA step: Set
Workspace SQL: Union and DATA step: Set
SAS emulates SQL
  On observations
  Inner join
    data D1;
      merge a(in=f1) b(in=f2);
      by k;
      if F1 and F2;
    run;
  Left join
    data D1;
      merge a(in=f1) b(in=f2);
      by k;
      if F1;
    run;
  Right join
    data D1;
      merge a(in=f1) b(in=f2);
      by k;
      if F2;
    run;
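The same IN= flags cover two further cases that the slide does not list. A minimal sketch, assuming tables A and B as on the join slides and both sorted by k; the output names D_FULL and D_ONLY_A are made up for the illustration.

```sas
/* Tables A and B as on the join slides, already in key order */
data a;
  input k a c;
  datalines;
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
;
run;

data b;
  input k b c;
  datalines;
1 1 2
2 2 2
5 5 2
6 6 2
;
run;

/* Full join: every key found in either table */
data d_full;
  merge a(in=f1) b(in=f2);
  by k;
  if f1 or f2;          /* always true in a MERGE; written out for symmetry */
run;

/* Rows of A that have no partner in B */
data d_only_a;
  merge a(in=f1) b(in=f2);
  by k;
  if f1 and not f2;
run;
```

These steps select the same observations as the SQL joins, but the shared column c still takes its value from B wherever B contributes, as the merge results on the earlier slides show.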
Split
  Table A
  Obs k a c
  1   1 1 1
  2   2 1 1
  3   3 1 1
  4   4 1 1
  5   5 1 1
  DATA step
    data D1 D2 D3;
      set A;
      if K=1 then output D1;
      else if K=2 then output D2;
      else output D3;
    run;
  SQL
    proc sql;
      create table D1 as select * from a where K=1;
      create table D2 as select * from a where K=2;
      create table D3 as select * from a where K>2;
    quit;
  [Output listings of D1, D2 and D3 for both methods]
Update
  SQL requires thorough preparation
  Table A        Table B (transactions)
  Obs k a c      Obs k a c
  1   1 1 1      1   2 6 .
  2   2 1 1      2   4 6 2
  3   3 1 1      3   7 6 .
  4   4 1 1
  5   5 1 1
  DATA step
    data A;
      update A B;
      by k;
    run;
  SQL
    proc sql;
      update ... ;
      insert ... ;
    quit;
  Result A
  Obs k a c
  1   1 1 1
  2   2 6 1
  3   3 1 1
  4   4 6 2
  5   5 1 1
  6   7 6 .
Update test
  Easy to test the SAS program
  SQL requires a new table
  Table A        Table B (transactions)
  Obs k a c      Obs k a c
  1   1 1 1      1   2 6 .
  2   2 1 1      2   4 6 2
  3   3 1 1      3   7 6 .
  4   4 1 1
  5   5 1 1
  DATA step
    data A_test;
      update A B;
      by k;
    run;
  Result A_test
  Obs k a c
  1   1 1 1
  2   2 6 1
  3   3 1 1
  4   4 6 2
  5   5 1 1
  6   7 6 .
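Once the update has been written to a test table, the change can be inspected before the master table is touched. A minimal sketch, assuming the tables A and B from this slide: PROC COMPARE with an ID statement lists the values that differ and the keys that exist in only one of the two tables. Using PROC COMPARE for this check is a suggestion, not something the slides prescribe.

```sas
/* Master table A and transaction table B as on the update slides */
data a;
  input k a c;
  datalines;
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
;
run;

data b;
  input k a c;
  datalines;
2 6 .
4 6 2
7 6 .
;
run;

/* Write the updated rows to a test table; A itself stays untouched */
data a_test;
  update a b;
  by k;
run;

/* Report what the update would change, matched on the key k */
proc compare base=a compare=a_test;
  id k;
run;
```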
Summary
  No method is universal
  SQL
    Is quick to construct
    Is the standard for RDBMSs
    Contains implicit functionality
    Can be faster
  DATA step
    Has its own notation
    Is more flexible
    Can be faster
Always remember
  Keep things in order (= keep data sorted)
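In practice, keeping data sorted means sorting each table once on the join key so that later BY-group steps can read it sequentially. A minimal sketch with made-up, deliberately unsorted versions of tables A and B:

```sas
/* Hypothetical unsorted input */
data a;
  input k a c;
  datalines;
3 3 1
1 1 1
2 2 1
;
run;

data b;
  input k b c;
  datalines;
2 2 2
1 1 2
4 4 2
;
run;

/* Sort both tables on the key once */
proc sort data=a; by k; run;
proc sort data=b; by k; run;

/* MERGE with BY needs both inputs in key order;
   unsorted input stops the step with an error */
data d;
  merge a b;
  by k;
run;
```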
More information
  http://support.sas.com/rnd/scalability
  http://support.sas.com/rnd
Henrik Dorf henrik.dorf@sdk.sas.com