Teradata Unified Big Data Architecture
Agenda
Recap the challenges of Big Analytics
The 2 analytical gaps for most enterprises
Teradata Unified Data Architecture
 - How we bridge the gaps
 - The 3 core elements of the architecture
 - Teradata's solutions in the architecture
Bring it all together: Teradata, Teradata Aster, and Hadoop
Recap of the Big Data Analytics Challenge
New and Emerging Sources of Data
[Figure: data sources plotted against data volume, from Megabytes through Gigabytes and Terabytes to Petabytes, in bands labelled ERP, CRM, Web and BIG DATA.]
Sources shown: Purchase detail, Purchase record, Payment record, Segmentation, Offer details, Customer Touches, Support Contacts, Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search marketing, Behavioral Targeting, Dynamic Funnels, User Generated Content, User Click Stream, Mobile Web, Sentiment, Social Network, External Demographics, Business Data Feeds, HD Video, Speech to Text, Product/Service Logs, SMS/MMS
And using an RDBMS/SQL alone is difficult or impossible.
So it's the data, right? Yes.
So it's the analytics, right? Yes.
So it's the need for iterative visualisation? Yes.
Or is it just that it cannot be expressed in SQL? Yes.
Big Data Analytics
MORE Analytics on ALL the data
Enabling All Users, All Tools and Any Data, from Capture to Analysis
Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Layers: Discover and Explore; Reporting and Execution in the Enterprise; Capture, Store and Refine
Data: Audio/Video, Images, Docs, Text, Web & Social, Machine Logs, CRM, SCM, ERP
The Big Data Architecture Today Has Gaps
Users: Engineers, Data Scientists, Quants, Business Analysts
Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Gap 1: Analysts
Database and Analytic Processing Layer: MapReduce (Processing), Data Warehouse
Gap 2: File system lacks optimisers, data locality, indexes
Data Storage and Refining
Data: Audio/Video, Images, Text, Web and Social, Machine Logs, CRM, SCM, ERP
Teradata Unified Big Data Architecture for the Enterprise
Users: Engineers, Data Scientists, Quants, Business Analysts
Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualisation, etc.
Aster MapReduce Portfolio: Discovery Platform (SQL-H)
Teradata SQL Analytics Portfolio: Integrated Data Warehouse (SQL-H)
Capture, Store, Refine
Data: Audio/Video, Images, Text, Web and Social, Machine Logs, CRM, SCM, ERP
Teradata Aster Discovery Platform 5.10
Fastest path to big data apps and new business insights
Users: Analysts, Customers, Business, Data Scientists
Interactive & Visual Big Data Analytic Apps
Develop
 - Data Acquisition Module: SQL-H, Teradata, RDBMS
 - Data Preparation Module: Unpack, Pivot, Apache Log Parser
 - Analytics Module: Pathing, Graph, Statistical
 - Viz Module: Flow Viz, Hierarchy Viz, Affinity Viz
 - Partner & Add-On Modules: Attensity, Zementis, SAS, R
Growing the Development Bucket
 - 70+ pre-built functions for data acquisition, preparation, analysis & visualization
 - Richest Add-On Capabilities: Attensity, Zementis, SAS, R
 - Visual IDE & VM-based dev environment: develop apps very fast
Process
 - SQL, SQL-MapReduce
 - Platform Services (e.g. query planning, dynamic workload management, security)
 - SQL-MapReduce framework: analyze both multi-structured complex and relational data
Store
 - Row Store, Column Store
 - Integrated hardware and software appliance
 - Relational-data architecture can be extended for non-relational types and procedural M-R analytics
Big Data Apps in Days, not Weeks or Months
DATA SOURCES: Hadoop Data, Multi-Structured Data, Structured Data, OLTP DBMSs, More
ASTER DISCOVERY PORTFOLIO
 - Data Acquisition Module: Hadoop access, Teradata access, RDBMS access
 - Data Preparation Module: Data Adaptors, Data Transformers (JSON, XML, Apache, etc.)
 - Analytics Module: Statistical, Pattern Matching, Pathing, Graph Algorithms, Text
 - Visualisation Module: Flow Visualizer, Hierarchy, Flow, Sankey, Affinity, More
PACKAGED BIG ANALYTICS APPS and CUSTOM BIG ANALYTICS APPS
Users: Analysts, Customers, Business, Data Scientists
MapReduce vs. SQL - Reduce Function
Data output from Mass Spectrometer (columns: mz, intensity)
    335.2094368    0
    335.2105961    0
    335.2117553    0
    335.2129146    53.024086
    335.2140739    184.1607361
    335.2152332    264.3601074
    335.2163925    259.6187134
    335.2175518    239.7870178
    335.2187111    313.8243713
    335.2198704    490.8760071
    335.2210297    634.064209
    335.222189     589.8432007
    335.2233483    351.9743347
    335.2245077    65.21440887
    335.225671     0
    336.890869     0
    336.892037     75.75605011
    336.893205     179.8110657
    336.894373     247.535553
    336.895541     225.6489563
    336.8967091    140.6246338
    337.1257588    0
    337.1280972    86.48993683
    337.1292664    170.0835876
    337.1304357    215.8146362
    337.1316049    188.9733276
    337.1327741    110.2854233
    337.1912444    0
    337.192414     0
    337.1935835    143.2112122
    337.1947531    357.401123
    337.1959227    467.1167297
    337.1970923    411.569458
    337.1982619    245.5514221
    337.1994315    80.80451202
Detecting the centroids of peaks is highly complex using SQL, as it is not a set-based operation.
Almost 800 lines of complex SQL
[Slide shows a dense, multi-column excerpt of the SQL needed for peak detection. The columns are interleaved in this text and only fragments are legible: chained window functions such as SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING); nested CASE expressions that derive the Ind, Mark, CurveID and CurveNum columns; a staging table built from dd_stg.mzml WITH DATA PRIMARY INDEX (mz); and a self-join of DD_STG.S2_WEIGHTED_CURVE (A.ren_tm = B.ren_tm, A.CurveNum <> B.CurveNum, B.max_i > (0.66667 * A.max_i)) followed by a LEFT JOIN to DD_TAB.CHARGE_STATES.]
Procedural code declared to Aster as a new MapReduce function called PeakPick

    while (inputiterator.advancetonextrow()) {
        currintensity = inputiterator.getdoubleat(5);
        maxintensity = 0.0;

        // Initialise Temp Array
        for (int i = 0; i <= 50; i++) {
            curvearray[0][i] = 0;
            curvearray[1][i] = 0;
        }
        if (overlapflag == 1) {
            count = 1;
        } else {
            count = 0;
        }

        // Find start of Curve: lastintensity is 0,
        // or previous lastintensity is higher than lastintensity:
        // overlapping peaks (double peak curve)
        if (currintensity > 0 && lastintensity == 0 || overlapflag == 1) {
            // Populate Temp Array with Curve points and find maxintensity to derive threshold
            while (currintensity > 0) {
                if (maxintensity < currintensity) maxintensity = currintensity;
                if (overlapflag == 1) {
                    overlapflag = 0;
                    curvearray[0][count-1] = overlapmz;
                    curvearray[1][count-1] = overlapintensity;
                    PI = overlapintensity;
                }
                currintensity = inputiterator.getdoubleat(5);
                curvearray[0][count] = inputiterator.getdoubleat(4);
                curvearray[1][count] = inputiterator.getdoubleat(5);
                count++;
                inputiterator.advancetonextrow();
                PI2 = PI;
                PI = currintensity;
                currintensity = inputiterator.getdoubleat(5);
                if (currintensity > PI && PI2 > PI) {
                    // Overlapping Peak found, store MZ and Intensity
                    // and start new Curve for next Iteration
                    overlapflag = 1;
                    overlapmz = inputiterator.getdoubleat(4);
                    overlapintensity = inputiterator.getdoubleat(5);
                    break;
                }
            }
        }

        // Process Temp Array to create intermediate metrics
        while (curvearray[1][curvecount] > 0) {
            if (curvearray[1][curvecount] > intensitythreshold) {
                if (maxmz < curvearray[0][curvecount]) {
                    maxmz = curvearray[0][curvecount];
                }
                if (minintensity > curvearray[1][curvecount] || minintensity == 0) {
                    minintensity = curvearray[1][curvecount];
                }
                if (minmz > curvearray[0][curvecount] || minmz == 0) {
                    minmz = curvearray[0][curvecount];
                }
                sumintensity = sumintensity + curvearray[1][curvecount];
                summz = summz + curvearray[0][curvecount];
                summzbyintensity = summzbyintensity + (curvearray[0][curvecount] * curvearray[1][curvecount]);
                curvepoints++;
            }
            curvecount++;
        }
        // (the slide excerpt ends here)
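The row-by-row logic in the PeakPick excerpt can be tried outside Aster as plain Java. What follows is a minimal sketch under stated assumptions, not the PeakPick function itself: the class name PeakPickSketch, its methods, and the thresholdFraction parameter are hypothetical, and the input is a pair of arrays already ordered by mz rather than an Aster row iterator. It splits the scan into curves at zero-intensity rows and at valleys between overlapping peaks, then returns each curve's intensity-weighted centroid, using only points above a fraction of the curve's maximum intensity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch of peak picking; it does not use the Aster SQL-MR API. */
final class PeakPickSketch {

    /**
     * Returns the intensity-weighted centroid (in mz) of every curve in a scan.
     * mz and intensity are parallel arrays ordered by mz.
     * thresholdFraction is an assumed parameter in [0, 1): only points whose
     * intensity exceeds thresholdFraction * (curve maximum) contribute.
     */
    static List<Double> centroids(double[] mz, double[] intensity, double thresholdFraction) {
        if (mz.length != intensity.length) {
            throw new IllegalArgumentException("mz and intensity must have the same length");
        }
        List<Double> result = new ArrayList<>();
        int start = -1; // index where the current curve began; -1 when outside a curve
        for (int i = 0; i <= mz.length; i++) {
            boolean zero = (i == mz.length) || intensity[i] <= 0;
            // Valley: the previous point is lower than both of its neighbours,
            // which indicates overlapping peaks (a double peak curve).
            boolean valley = !zero && start >= 0 && i - 2 >= start
                    && intensity[i] > intensity[i - 1]
                    && intensity[i - 2] > intensity[i - 1];
            if (start >= 0 && (zero || valley)) {
                result.add(centroid(mz, intensity, start, i, thresholdFraction));
                start = -1; // close the curve covering [start, i)
            }
            if (!zero && start < 0) {
                start = i; // a new curve begins at this point
            }
        }
        return result;
    }

    /** Centroid of the points in [from, to) that lie above the derived threshold. */
    private static double centroid(double[] mz, double[] intensity,
                                   int from, int to, double thresholdFraction) {
        double maxIntensity = 0.0;
        for (int i = from; i < to; i++) {
            maxIntensity = Math.max(maxIntensity, intensity[i]);
        }
        double threshold = maxIntensity * thresholdFraction;
        double sumIntensity = 0.0;
        double sumMzByIntensity = 0.0;
        for (int i = from; i < to; i++) {
            if (intensity[i] > threshold) {
                sumIntensity += intensity[i];
                sumMzByIntensity += mz[i] * intensity[i];
            }
        }
        return sumMzByIntensity / sumIntensity;
    }

    public static void main(String[] args) {
        // One curve taken from the mass spectrometer output shown earlier in the deck.
        double[] mz = {336.890869, 336.892037, 336.893205, 336.894373,
                       336.895541, 336.8967091, 337.1257588};
        double[] intensity = {0, 75.75605011, 179.8110657, 247.535553,
                              225.6489563, 140.6246338, 0};
        System.out.println(centroids(mz, intensity, 0.5));
    }
}
```

The state carried from one row to the next (where the curve started, whether a valley was just passed) is what makes this awkward to express as a set-based SQL query and straightforward as a procedural function.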
SQL MapReduce Reduce Function
In Teradata Aster, the SQL-MR code run by the analyst becomes trivial:
    SELECT * FROM PeakPick (ON SELECT * FROM STG.MassSpecLoad)
Parameters can easily be included in the function and exposed to the analyst.
In Hadoop, the command-line interface means engineers are involved at all times.
TERADATA UNIFIED DATA ARCHITECTURE
Users: Data Scientists, Quants, Customers / Partners, Front-Line Workers, Engineers, Business Analysts, Executives, Operational Systems
Tools: LANGUAGES, MATH & STATS, DATA MINING, BUSINESS INTELLIGENCE, APPLICATIONS
Big Data Analytics: DISCOVERY PLATFORM
Enterprise Analytics: INTEGRATED DATA WAREHOUSE
Big Data Management: CAPTURE, STORE, REFINE
Data: AUDIO & VIDEO, IMAGES, TEXT, WEB & SOCIAL, MACHINE LOGS, CRM, SCM, ERP
The Integrated Data Warehouse
Single View of the Business, Cross-Functional
Users: Business Analysts, Knowledge Workers, Customers/Partners, Marketing, Executives, Front-line Workers, Operational Systems
Characteristics: SQL based; Structured schema; Productionised Analytics; Active
Workloads: BUSINESS INTELLIGENCE, DATA MINING, APPLICATIONS; Complex mixed workloads
Service: Highest service level goals; Highest resilience; 1000 users
INTEGRATED DATA WAREHOUSE
The Discovery Environment
Project-led view of data approach for big analytics
Users: Business Analysts, Data Scientists, Power Analysts
Characteristics: Rules Discovery; Big Analytics using SQL-MR; Schema-Lite; Interactive Discovery Analytics; Load fast, act fast, fail fast analytical workload
Workloads: SQL AND MAP-REDUCE, BIG ANALYTICS, DATA VISUALISATION
Service: Interactive; Limited service levels; Resilience; 10s of users
DISCOVERY PLATFORM
Hadoop Big Data Management
Lowest Cost Storage footprint
NoSchema design, load raw files
Users: Power Analysts, Data Scientists, IT Professionals, Single use Systems
Characteristics: MapReduce based; Deep history and 1st level data transformations
Workloads: SPECIAL PURPOSE, ANALYTIC TRANSFORMATIONS, REGULATORY; Simple single use workloads; Batch and open source analytics
Service: High Data Availability service level goal; High resilience
CAPTURE, STORE, REFINE
Unified Data Architecture
Give Any User Any Analytic on Any Data
To leverage Big Data, you must give all the business analysts in your organization the right analytical tool on all the existing and new data available.
Unified Data Architecture: an architecture that applies the right technology to the right analytical problems, using best-of-breed technologies.
Big Data Analytics: Teradata and Aster harness the business value of Big Data. Every company needs both a Data Warehouse and a Discovery Platform.
Big Data Management: Hadoop for landing, storing, and refining data.
Democratise Big Data and Maximise Enterprise Adoption