
Beyond Conventional Data Warehousing. Florian Waas, Greenplum Inc.

Takeaways
- The basics: Who is Greenplum? What is Greenplum Database?
- The problem: data growth and other recent trends in DWH; a look at different customers and their requirements
- The solution: teaching an old dog new tricks, i.e. using an RDBMS for massively parallel data processing
- Conclusion

Greenplum Company Snapshot
- What: high-performance database software for Business Intelligence and Data Warehousing
- Where: based in San Mateo, CA
- When: founded in June 2003; product GA in February 2006
- Who: technical pioneers in data warehousing (Teradata, Microsoft, Oracle, Informix, Tandem, PostgreSQL, ...)
- Strategic partner: powers the Sun Data Warehouse Appliance

Greenplum Database
- Highly scalable, fault-tolerant, high-performance DBMS
- Based on Postgres
- Shared-nothing architecture on commodity hardware (see the sketch below)
- Currently supported on Solaris and Linux
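In a shared-nothing setup, each table is physically distributed across the segment servers, and queries against it are planned as parallel plans without user intervention. As a minimal, hedged sketch of how this looks to the user (the table and its columns are invented for illustration; DISTRIBUTED BY is the Greenplum extension for choosing the distribution key):

    -- Hypothetical table, hash-distributed across all segments on call_id.
    CREATE TABLE call_detail_records (
        call_id     bigint,
        caller      varchar(32),
        callee      varchar(32),
        started_at  timestamp,
        duration_s  integer
    )
    DISTRIBUTED BY (call_id);

    -- Ordinary SQL; the planner parallelizes the scan and the aggregation
    -- across segments automatically.
    SELECT caller, sum(duration_s) AS total_seconds
    FROM   call_detail_records
    GROUP  BY caller;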

Architecture

The Sun Data Warehouse Appliance: Sample Hardware Configuration
- Sun Fire X4500 Data Server
- 2 dual-core AMD processors
- 48 Hitachi Deskstar SATA II 7200 rpm 500 GB drives
- 6 Marvell 8-port Serial-ATA 2.0 storage controllers
- Leverages the HyperTransport architecture to achieve high-performance I/O

Trends in DWH: Data Growth
- Growth of the customer base, e.g. phone carriers in Asia
- Additional data sources, e.g. click-stream and ad-impression data
- Data processing breeds "data bunnies", e.g. intermediate results of analysis, aggregated/expanded data
- Data will continue to outpace Moore's Law

Trends in DWH: Customers
- New customers are not your typical DB customers: no pre-existing DB infrastructure; atypical data such as logs, click-stream, etc.
- The weight of an individual record is less significant, e.g. CDRs (call detail records) or click-stream vs. sales/transaction records
- Often the data reflects behavior rather than deliberate purchase decisions (bankers vs. teens)
- Analysis as a service

Trends in DWH: Analysis
- Turn-around on reporting: similar or identical requirements despite the increased data volume
- Automated/on-line decision-making processes, e.g. ad placement in social network applications
- Advanced data analysis over massive amounts of data, e.g. Bayesian classification (see the sketch below)
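Such analyses can run inside the database rather than after exporting the data. As a hedged illustration only, the per-class conditional frequencies needed by a naive Bayes classifier can be computed in a single aggregation pass; the table and columns below are hypothetical and not taken from the talk:

    -- Hypothetical sketch: per-class, per-feature-value counts and
    -- conditional probabilities for a naive Bayes model, computed in-database.
    SELECT  class_label,
            feature_name,
            feature_value,
            count(*) AS n,
            count(*)::float8
              / sum(count(*)) OVER (PARTITION BY class_label, feature_name)
              AS conditional_prob
    FROM    training_events
    GROUP   BY class_label, feature_name, feature_value;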

Requirements
- Petabyte-scale storage
- High-performance query processing
- Fault-tolerance / high availability
- Constant loading activity
- Richer processing capabilities: leverage parallelism automatically; cannot move the data (size, privacy concerns)
- Integration with existing programming environments (not strictly a DWH requirement)

Leveraging Greenplum Database
- GPDB is designed for scalability, high-performance query processing, and fault-tolerance
- How to address the processing needs?

How to Use GPDB for Data Processing
- Typical installation: 10s to 100s of CPU cores, 100s of GB of memory, 100s of TB of disk space
- Often the largest individual system in the data center
- Slack resources during off-peak times

Example: ETL (1)
- Customer's system: 40 nodes, 160 CPU cores, 1 TB main memory, 3.6 TB/h load rate
- ETL jobs: 18 hours to process one day's worth of data in 5 serial streams; load time < 1 hour

Example: ETL (2)
- ETL is crucial in daily processing; it is mainly data cleaning: string manipulations, conversions, etc. (see the sketch below)
- Hard to parallelize effectively and to load-balance
- Hard to recover if falling behind, e.g. after glitches in the ETL logic or data contamination
- Desired run time < 4 hours
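Most of this cleaning is expressible as row-wise SQL expressions. A purely illustrative sketch, with invented table and column names, of the kind of logic involved:

    -- Hypothetical cleaning expressions: trimming, normalization,
    -- type conversion, and filtering out malformed rows.
    SELECT  trim(both ' ' from caller_msisdn)              AS caller,
            upper(coalesce(nullif(region_code, ''), 'XX')) AS region,
            to_timestamp(call_start, 'YYYYMMDDHH24MISS')   AS started_at,
            cast(duration_str AS integer)                  AS duration_s
    FROM    raw_cdr
    WHERE   duration_str ~ '^[0-9]+$';   -- keep only numeric durations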

Example: ETL (3)
- Load the dirty raw data directly into GPDB (trade-off: the raw data is bulkier)
- Rewrite the ETL logic in SQL: a cleaner program
- Run the SQL statements on GPDB: automatic parallelization, fully transparent, maximum degree of parallelism
- Run time < 3 hours (see the sketch below)
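Structurally, the resulting ELT pipeline is a parallel bulk load into a staging table followed by set-oriented SQL. A minimal, hedged sketch under the assumption of pipe-delimited raw files; the object names, host, and path are placeholders, and Greenplum's external-table/gpfdist mechanism is assumed:

    -- Hypothetical staging of raw files; segments read them in parallel.
    CREATE EXTERNAL TABLE raw_cdr_ext (line text)
    LOCATION ('gpfdist://etl-host:8081/cdr/*.dat')
    FORMAT 'TEXT';

    -- 1. Load: bulk-insert the dirty data as-is.
    CREATE TABLE raw_cdr_stage AS
    SELECT line FROM raw_cdr_ext
    DISTRIBUTED RANDOMLY;

    -- 2. Transform: one set-oriented statement replaces the serial ETL job;
    --    every segment cleans its own slice of the data.
    CREATE TABLE cdr_clean AS
    SELECT  split_part(line, '|', 1)                                   AS caller,
            split_part(line, '|', 2)                                   AS callee,
            to_timestamp(split_part(line, '|', 3), 'YYYYMMDDHH24MISS') AS started_at,
            cast(split_part(line, '|', 4) AS integer)                  AS duration_s
    FROM    raw_cdr_stage
    DISTRIBUTED BY (caller);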

Solution
- Leverage the existing query processing infrastructure
- Rewrite procedural logic in SQL and enjoy the benefits of SQL, in particular automatic parallelization
- Add UDFs and UDAggs in other languages as needed, e.g. Java, C#, etc. (see the sketch below)
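Logic that does not map cleanly onto plain SQL can be packaged as a user-defined function or aggregate and then used like any built-in. A hedged sketch using the Postgres-style DDL that Greenplum inherits; the function names, bodies, and the final query are invented for illustration:

    -- Hypothetical UDF: strip everything but digits from a phone number.
    CREATE FUNCTION normalize_msisdn(raw text) RETURNS text AS $$
        SELECT regexp_replace($1, '[^0-9]', '', 'g');
    $$ LANGUAGE sql IMMUTABLE;

    -- Hypothetical UDAgg: a toy product aggregate built from a
    -- state-transition function.
    CREATE FUNCTION prod_step(state float8, x float8) RETURNS float8 AS $$
        SELECT state * x;
    $$ LANGUAGE sql IMMUTABLE;

    CREATE AGGREGATE product(float8) (
        SFUNC    = prod_step,
        STYPE    = float8,
        INITCOND = '1.0'
    );

    -- Both are then usable in ordinary queries (illustrative only):
    SELECT normalize_msisdn(caller) AS caller,
           product(duration_s::float8) AS toy_score
    FROM   cdr_clean
    GROUP  BY 1;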

Challenges
- "Query processing" does not mean read-only
- Database technology suffers from an image problem
- SQL is perceived as difficult: declarative programming is seen as non-intuitive, SQL dialects cause portability issues, and its power can be overwhelming
- Requires a special skillset/expertise

Summary
- Database technology for DWH addresses scalability, fault-tolerance, and performance needs
- Users are looking for additional mileage from their large-scale DWH installations
- ELT, and tools like UDFs and UDAggs, become more attractive
- Existing database technology can be revamped into a massively parallel data processing engine

Beyond Conventional Data Warehousing. Florian Waas, Greenplum Inc.