A Data Warehouse Approach to Analyzing All the Data All the Time. Bill Blake Netezza Corporation April 2006



Similar documents
Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Netezza and Business Analytics Synergy

2009 Oracle Corporation 1

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

IBM Netezza High-performance business intelligence and advanced analytics for the enterprise. The analytics conundrum

SQL Server Parallel Data Warehouse: Architecture Overview. José Blakeley Database Systems Group, Microsoft Corporation

Main Memory Data Warehouses

Scaling Your Data to the Cloud

Dell Microsoft Business Intelligence and Data Warehousing Reference Configuration Performance Results Phase III

IBM Netezza High Capacity Appliance

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

ORACLE DATABASE 10G ENTERPRISE EDITION

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Information management software solutions White paper. Powerful data warehousing performance with IBM Red Brick Warehouse

White Paper February IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario

IBM PureData Systems. Robert Božič 2013 IBM Corporation

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture

Inge Os Sales Consulting Manager Oracle Norway

Oracle Exadata Database Machine for SAP Systems - Innovation Provided by SAP and Oracle for Joint Customers

EMC Unified Storage for Microsoft SQL Server 2008

SQL Server PDW. Artur Vieira Premier Field Engineer

HP Oracle Database Platform / Exadata Appliance Extreme Data Warehousing

The Netezza FAST Engines Framework

Overview: X5 Generation Database Machines

James Serra Sr BI Architect

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

Real Life Performance of In-Memory Database Systems for BI

Oracle Database In-Memory The Next Big Thing

Maximizing SQL Server Virtualization Performance

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

PureSystems: Changing The Economics And Experience Of IT

Cost-Effective Business Intelligence with Red Hat and Open Source

Benchmarking Cassandra on Violin

High performance ETL Benchmark

<Insert Picture Here> Best Practices for Extreme Performance with Data Warehousing on Oracle Database

Advanced In-Database Analytics

EMC/Greenplum Driving the Future of Data Warehousing and Analytics

NextGen Infrastructure for Big DATA Analytics.

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HP reference configuration for entry-level SAS Grid Manager solutions

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Innovative technology for big data analytics

Big Fast Data Hadoop acceleration with Flash. June 2013

White Paper - GPU-Based SQL Database. SQream Technologies. SQream DB GPU-Based SQL Database Technical Overview White Paper

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

SUN ORACLE EXADATA STORAGE SERVER

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Big Data and Its Impact on the Data Warehousing Architecture

Oracle Maximum Availability Architecture with Exadata Database Machine. Morana Kobal Butković Principal Sales Consultant Oracle Hrvatska

Maximum performance, minimal risk for data warehousing

How To Use Exadata

Bringing Big Data into the Enterprise

High Availability Databases based on Oracle 10g RAC on Linux

Summary: This paper examines the performance of an XtremIO All Flash array in an I/O intensive BI environment.

SUN ORACLE DATABASE MACHINE

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Big Data Challenges in Bioinformatics

In Memory Accelerator for MongoDB

EMC BACKUP MEETS BIG DATA

HP Enterprise Data Warehouse Deep Dive. Steve Tramack, Sr. Engineering Manager, I2A Solutions, HP

SQL Server Business Intelligence on HP ProLiant DL785 Server

DEPLOYING IBM DB2 FOR LINUX, UNIX, AND WINDOWS DATA WAREHOUSES ON EMC STORAGE ARRAYS

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

TUT NoSQL Seminar (Oracle) Big Data

Microsoft SQL Server 2014 Fast Track

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

The HP Neoview data warehousing platform for business intelligence

Next Generation Data Warehousing Appliances

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

Can the Elephants Handle the NoSQL Onslaught?

Evolving Solutions Disruptive Technology Series Modern Data Warehouse

Microsoft Analytics Platform System. Solution Brief

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

DBMS / Business Intelligence, SQL Server

Best Practices for Data Sharing in a Grid Distributed SAS Environment. Updated July 2010

An Oracle White Paper June A Technical Overview of the Oracle Exadata Database Machine and Exadata Storage Server

Exadata Database Machine

Safe Harbor Statement

Safe Harbor Statement

MaxDeploy Ready. Hyper- Converged Virtualization Solution. With SanDisk Fusion iomemory products

Real-time Data Replication

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Capacity Management for Oracle Database Machine Exadata v2

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation

Transcription:

A Data Warehouse Approach to Analyzing All the Data All the Time Bill Blake Netezza Corporation April 2006

Sometimes A Different Approach Is Useful The challenge of scaling up systems where many applications need to access large data in global parallel file systems is well documented At the Multi Terabyte scale, It is hard to move the data from where it is stored to where it is processed But if moving data to processing is so difficult, why not try an approach where the application owns the data and processing is moved to where the data is stored? The application in this case is the relational database, a very useful tool for data intensive computing 2

Operational vs. Analytical RDBMS Extraction /Cleansing Transformation Loading Reports OLTP Databases Processing Millions Of Transactions/sec OLAP Databases aka Data Warehouse Processing Hundreds of Terabytes/hour 3

The Legacy Focus: Transaction Processing Applications Client Database Server Storage DATA SQL DATA SQL Optimizer Logging Lock Mgr CACHE DSS Query FastLoad Partitioning Pre Mat Views KAIO Raid I/O I/O I/O Raw/Cook Vol Mgr Lock Mgr Lock Mgr I/O I/O CACHE CACHE Local Applications 4

Netezza s s Data Warehouse Appliance BI Applications Client ODBC 3.X JDBC Type 4 SQL/99 RDBMS + Server + Storage Local Applications Netezza Performance Server 5

The Challenges Driving Us At Netezza Forces driving disruptive change Sub-transactional data in a fullyconnected world Ever-increasing need for speed Increasing regulatory requirements Market mandate for operational simplicity Need for actionable intelligence from unlimited data at real-time speeds This Need Cannot be Met by Today s Systems Linux cluster scaling limited by network performance & system management complexity Scaling with large NUMA SMP servers limited by I/O, network performance & operating system complexity 6

Not All Computing Tasks Fit into The Analytic DB Challenge There are benefits to scaling up analytic DBs: Transactional and referential integrity High level query language (with parallel run time optimization performed by application s query planner) Operation on sets of records in tables (vs sequential access of records in files) Database standards have matured and are now consistent across the industry Data volumes have grown from gigabytes to hundreds of terabytes Disk storage is now less than $1 per Gigabyte! 7

For Perspective (1980 s) The relational database was invented on a system that merged server, storage and database It was called a mainframe! CPU IOP IOP 8

By The 1990 s, Rules Changed Mainframes attacked by killer micros! grew large I/O became weak System costs dropped Storage moved off to the network CPU CPU CPU CPU CPU CPU CPU CPU Very Large I/O Storage Area Network 9

Capacity Was Added By Clustering CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU I/O I/O I/O I/O I/O Storage Area Network CPU CPU CPU CPU CPU CPU CPU CPU SAN limits Moving Data to the Processors 10

Data Flow The Traditional Way Applications Client $$$ Processing $$$ Fiber Channels $$$ Storage SMP HOST 1 $$$ C6 C12 F7 F13 C38 F39 G13 G22 SMP HOST 2 $$$ SMP HOST N Local Applications Hours or Days 11

Moving Processing to the Data Active Disk architectures > Integrated processing power and memory into disk units > Scaled processing power as the dataset grew Decision support algorithms offloaded to Active Disks to support key decision support tasks > Active Disk architectures use stream-based model ideal for the software architecture of relational databases In Netezza s NPS System: Snippet Processing Units take streams as inputs and generate streams as outputs 12

SQL Query Flow Diagram Table D Table A Table B Table C Join Scan 13

Streaming Data Flow BI Applications Client ODBC 3.X JDBC Type 4 SQL/92 Fast Loader/ Unloader Netezza Performance Server C6 C12 F7 F13 C38 F39 G13 G22 SMP Host SPU C6 C7 SPU C12 F12 G12 C13 F13 G13 C14 F14 G14 SPU C21 F21 G21 C22 F22 G22 C23 F23 G23 SPU F6 F7 G6 G7 C30 F30 G30 C31 F31 G31 SPU C38 F38 G38 C39 F39 G39 C40 F40 G40 A B C D E F G H 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 Streaming data, joins, & aggs @ 50MB/sec Local Applications Bulk data movement: 250 GB/hour - uncompressed (1 TB/hour Target) Up to 120 TB/hour Cross Section BW per 8650 system 14

Active Disks as Intelligent Storage Nodes Netezza Performance Server Netezza added: Highly optimized query planning Code generation Stream processing Snippet Processing Unit (SPU) Result: 10X to 100X performance speedup over existing systems A compute node for directly processing SQL queries on tables 15

Asymmetric Massively Parallel Processing BI Applications Client ODBC 3.X JDBC Type 4 SQL/92 Netezza Performance Server Front End SQL Compiler Query Plan Optimize Admin Execution Engine DBOS Gigabit Ethernet 1 2 3 Snippet Processing Unit (SPU) Processor + streaming DB logic Processor + streaming DB logic Processor + streaming DB logic High-Performance Database Engine Streaming joins, aggregations, sorts, etc. Local Applications Fast Loader/Unloader Linux SMP Host 1000+ Processor + streaming DB logic Massively Parallel Intelligent Storage Move processing to the data (maximum I/O to a single table) 16

Packaging For High Density And Low Power Full GigE to each SPU 1 GB RAM Socketed DIMM Enterprise SATA Disk Drive 150 GB / 400 GB 440GX Power PC Dual NICs 1M Gate AFTER Higher Performance Greater Scalability Higher Reliability 17

Binary Compiled Queries Executed on a Massively Parallel Grid select c_name, sum(o_totalprice) price from customer, orders select c_name, sum(o_totalprice) price from customer, orders where o_orderkey in (select l_orderkey from lineitem2 where where o_orderkey in (select l_orderkey from lineitem2 where o_orderkey=l_orderkey and l_shipdate>='01-01-1995' and o_orderkey=l_orderkey and l_shipdate>='01-01-1995' and l_shipdate<='01-31-1995') and c_custkey=o_custkey group by l_shipdate<='01-31-1995') /********* Code and **********/ c_custkey=o_custkey group by c_name;" test_tim /********* >test.out Code **********/ c_name price c_name;" test_tim >test.out c_name price --------------------+----------- void GenPlan1(CPlan *plan, char *bufstarts,char *bufends, bool --------------------+----------- void GenPlan1(CPlan *plan, char *bufstarts,char *bufends, bool Customer#000000796 318356.97 lastcall) { Customer#000000796 318356.97 lastcall) { Customer#000001052 293680.56 // Customer#000001052 293680.56 // Customer#000001949 215280.98 // Setup for next loop (nodes 00..07) Customer#000001949 215280.98 // Setup for next loop (nodes 00..07) Customer#000002093 282531.93 // Customer#000002093 282531.93 // Customer#000005656 335297.31 // node 00 (TScanNode) Customer#000005656 335297.31 // node 00 (TScanNode) Customer#000005861 233691.03 TScanNode *node0 = (TScanNode*)plan->m_nodeArray[0]; Customer#000005861 233691.03 TScanNode *node0 = (TScanNode*)plan->m_nodeArray[0]; Customer#000006002 267000.92 // For ScanNode: Customer#000006002 267000.92 // For ScanNode: Customer#000006343 595819.82 TScan0 *Scan0 = BADPTR(TScan0*); Customer#000006343 595819.82 TScan0 *Scan0 BADPTR(TScan0*); Customer#000006532 442254.91 CTable *tscan0 = plan->m_nodearray[0]->m_result; Customer#000006532 442254.91 CTable *tscan0 = plan->m_nodearray[0]->m_result;. char *nullsscan0p = BADPTR(char *);. char *nullsscan0p = BADPTR(char *); real 0m0.552s // node 01 (TRestrictNode) real 0m0.552s // node 01 (TRestrictNode) user 0m0.010s TRestrictNode *node1 = (TRestrictNode*)plan->m_nodeArray[1]; user 0m0.010s TRestrictNode *node1 = (TRestrictNode*)plan->m_nodeArray[1]; sys 0m0.000s // node 02 (TProjectNode) sys 0m0.000s // node 02 (TProjectNode) TProjectNode *node2 = (TProjectNode*)plan->m_nodeArray[2]; TProjectNode *node2 = (TProjectNode*)plan->m_nodeArray[2]; // node 03 (TSaveTempNode) // node 03 (TSaveTempNode) TSaveTempNode *node3 = (TSaveTempNode*)plan->m_nodeArray[3]; TSaveTempNode *node3 = (TSaveTempNode*)plan->m_nodeArray[3]; // For SaveTemp Node: // For SaveTemp Node: TSaveTemp3 *SaveTemp3 = BADPTR(TSaveTemp3*); TSaveTemp3 *SaveTemp3 = BADPTR(TSaveTemp3*); CTable *tsavetemp3 = node3->m_result; CTable *tsavetemp3 = node3->m_result; CRecordStore *recstore3 = tsavetemp3->m_recstore; CRecordStore *recstore3 = tsavetemp3->m_recstore; // node 04 1011010101010101010111110101010100100101010111010101001011110101 (THashNode) // node 04 1011010101010101010111110101010100100101010111010101001011110101 (THashNode) 0100101011110110100101010101110101011001010101010111110100100101 0100101011110110100101010101110101011001010101010111110100100101 0101010101010101010010101001111110101010101010101001010101010010 0101010101010101010010101001111110101010101010101001010101010010 1001011010011111111010101010100110100101010101001010101010100101 1001011010011111111010101010100110100101010101001010101010100101 01010101010010101010100111010101010101010101010 01010101010010101010100111010101010101010101010 18

A Look Inside the SPU Netezza Performance Server Gigabit Ethernet Snippet Processing Unit (SPU) Snippet Queue PowerPC Query Engine Joining Sorting Grouping Main Streaming Record Processor Project Restrict Primary SPU Swap Transaction/Lock Manager Replication Manager Mirror 19

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob 20

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob name address city state zip sex age dob First things first. The table is distributed amongst all of the SPU s in the system so that is can be processed in parallel. When the table is read, your scan speed is the SUM of the speed of all of the disk drives combined. 21

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob name address city state zip sex age dob Record Level Operations Extract fields of interest PROJECTION On each SPU, the / disk controller SELECTs just the columns of interest. 22

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob name address city state zip sex age dob Record Level Operations Extract fields of interest from records of interest RESTRICTION The is also responsible for choosing the records of interest applying the conditions of the WHERE clause. 23

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob name address city state zip sex age dob Record Level Operations Extract fields of interest from records of interest Set Operations Data is joined, sorted, grouped, aggregated Processor N I C Processor N I C Processor N I C Processor N I C Processor N I C 24

SELECT count ( * ), sex, age FROM emp WHERE state = VA and age > 18 GROUP BY sex, age ORDER BY age ; name address city state zip sex age dob name address city state zip sex age dob Record Level Operations Extract fields of interest from records of interest Set Operations Data is joined, sorted, grouped, aggregated Processor N I C Processor N I C Results are consolidated Processor Processor N I C N I C SMP Host End User Applications Processor N I C 25

What about scientific data and non-sql heuristics? BLAST is a widely used tool for finding similar sequences in large databases of sequences Netezza has integrated the BLAST heuristic algorithms into a new type of SQL Join: The syntax is an extension of the SQL92 generalized join syntax: SQL92: SELECT <cols> FROM <t1> <jointype> <t2> ON <join-condition> The blast join syntax where the controls is a literal string is: SELECT <cols> FROM <haystack> [ALIGN <needles>] [WITH <controls>] ON BLASTX(<haystack.seq>,<needles.seq>,<controls.args>) Thus a simple literal protein blast looks like: SELECT <cols> FROM haystack ON BLASTP(haystack.seq, 'ZZAADEDAAM', '-e.001') 26

Netezza Performance Server Family NPS 8000z-Series High-Performance Products 8000z Series: 1-33 TB 8050z 8150z 8250z 8450z 8650z Processors 56 112 224 448 672 User Space 2.75 TB 5.5 TB 5.5 11 TB 5.5 22 TB 5.5 33 TB Continued Innovation in the 8000 Family Built & Priced for PERFORMANCE Enhanced performance, reliability and system capacity Simple, scalable capacity expansion across the product range 27

Netezza Performance Server Family NPS 10000-Series High-Density Products 10000 Series: Up to 100 TB 0.9 Terabyte Total DRAM 0.3 Petabyte Total Storage Processors User Space 10400 HD 448 10 50 TB 10800 HD 896 10 100 TB Introducing the 10000 Product Family HIGH PERFORMANCE & HIGH DATA DENSITY in a single NPS appliance Simple, cost-effective, scalable capacity expansion Up to 12.5 TB of user space per rack 28

Some of Our Customers by Vertical Market Retail Telecom Online Analytic Services Financial Services Healthcare Other 29

Proven Results: Leading Food & Drug Retailer 400 300 Minutes 100 0 14 min 8 secs 18 min 22 min Teradata 4 min Query 1 Query 2 Query 3 Query 4 Query 5 Query 6 *Netezza results based on an NPS 8150. Teradata queries run on a 40-node system (5200 &5300) Netezza 8150 55 min Situation Competed against a Teradata system 25x more expensive Total amt of data loaded: 3.5TB Time to load: 28 hours 118GB/hour Queries included market basket penetration, Y/Y comparisons, top UPC by movements, price optimization tracking, etc. Running SQL Results NPS system went from loading dock to installed, configured and running in three hours Queries were run substantially faster, including one that was over five times faster 30

Proven Results: Report Execution from a Government POC Situation Competed against a very large Teradata system (96 nodes) Total amount of data loaded: 2.5 TB 200+GB/hour load rates (single stream) Representative set of resource-intensive production MicroStrategy reports Impact NPS system went from loading dock to installed, configured and running in five hours Queries showed substantial improvement on Netezza 15 times faster on average! Total execution time (13 reports) was ~7 ½ hours on TD vs. only 47 min on Netezza 4000.00 3000.00 40x faster! Seconds 2000.00 51x faster! 1000.00 0.00 1 2 3 4 5 6 7 8 9 10 11 12 13 Queries Teradata Netezza 8050 * Netezza results based on an NPS 8250. Teradata queries run on a 96-node system (52xx and 53xx) 31

Proven Results: Analytic Service Provider Time (sec) 497x slower 10000 7020 1000 100 10 1 2-Way Table Join 46x slower 660 268 19x slower Open Source DB Traditional RDMBS 1 Traditional RDMBS 2 Netezza 8150 14 POC Performance 2-way Cartesian Join Mixed Read/Write Test 37.6M rows with 122.9M rows Performance Improvement with Netezza > 497x v. Open Source DB > 46x v. Traditional RDBMS 1 > 19x v. Traditional RDBMS 2 32

Proven Results: E-Business Customer 70,000 Seconds Red Brick / HP (3.2 Billion) Netezza 8250 (5.4 Billion) Sequential Situation Red Brick: 3.2 billion rows NPS: 5.4 billion rows 6 queries load, expansion and test Business Objects and SQL 40,000 10,000 19 secs 362 secs 6 secs 205 secs 691 secs 6 secs Query Performance NPS system handled 69% more data volume but was able to complete the total workload in 21 minutes vs. 50 hours, 143x faster! 1 2 3 4 5 6 Queries Load Performance 140+ GB/hr *Netezza results based on an NPS 8250. Red Brick results on HP SuperDome 32 CPU/32GB RAM and EMC SAN 33

Netezza s DW System Architecture Evolution Unix Large SMP CPU CPU CPU CPU Very Large I/O Netezza Asymmetric MPP Optimal Grid for DW CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU I/O SMP Host Gigabit Ethernet 1 2 3 1000+ µprocessor + streaming DB logic µprocessor + streaming DB logic µprocessor + streaming DB logic µprocessor + streaming DB logic Massively Parallel Intelligent Storage CPU IOP IOP Storage Area Network CPU CPU CPU CPU I/O I/O I/O CPU CPU CPU CPU I/O I/O CPU CPU CPU CPU Storage Area Network Netezza Optimal Grid of Grids reaching Petabyte sizes within or across locations Cluster of Linux Small SMP Mainframe 34

Thank You!