D1 Solutions AG, a Netcetera Company
Real Life Performance of In-Memory Database Systems for BI
10th European TDWI Conference, Munich, June 2010

Authors: Dr. Andreas Hauenstein, Dr. Simon Hefti, Dr. Andrej Vckovski

In-Memory Database Systems
- Buzzwords: column orientation, in-memory, shared nothing
- Meaning: looks like Oracle/DB2/SQL Server from the outside, just much faster
- We are talking about relational systems, queryable in SQL
- We are not talking about client-side caching (MicroStrategy or QlikView do this)
- There is a new generation of DB systems, for example MonetDB, Exasol, Greenplum, LucidDB

Business Intelligence: Data Warehouse
- We are not looking at transactional systems
- Any DB of an online shop or any DB driving a web site is transactional
- Typically, BI applications are driven by a non-transactional data store that is bulk loaded at intervals by an ETL process. This is called a data warehouse.
- Next-generation DB systems also exist for transactional systems. An example is Oracle TimesTen. This is a different subject.
[Diagram: DB systems specialized for transactions (e.g. TimesTen); DB systems specialized for analytics (e.g. Teradata); general-purpose DB systems (e.g. Oracle, SQL Server)]

Business Intelligence: Generated SQL
- Tools with a GUI that generate SQL statements
- Examples: Business Objects, OBIEE, MicroStrategy, Cognos
- No SQL tuning possible
- Bad SQL
- Non-technical users
- Frequently changing queries
- Lots of averages and sums, groupings, consolidation

Real Life Problem (1)
- Consolidation of numbers along a hierarchy
- Use a parent-child table with a bridge table to do this in a relational DB

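The parent-child plus bridge table approach can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a five-node hierarchy; the table names follow the dim_org / dim_org_flat naming used elsewhere in this talk, but the column names and sample data are hypothetical and not taken from the original test schema.

```python
import sqlite3


def build_demo_db():
    """Create a tiny hierarchy, its bridge table, and a few facts (hypothetical data)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_org (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
        CREATE TABLE dim_org_flat (ancestor_id INTEGER, descendant_id INTEGER);
        CREATE TABLE t_facts (org_id INTEGER, amount REAL);
    """)
    # Parent-child table: 1 is the root, 2 and 3 are its children, 4 and 5 sit under 2.
    con.executemany("INSERT INTO dim_org VALUES (?, ?, ?)", [
        (1, None, "Company"), (2, 1, "Sales"), (3, 1, "IT"),
        (4, 2, "Sales North"), (5, 2, "Sales South"),
    ])
    # Bridge table: one row per (ancestor, descendant) pair, including each node itself.
    parents = dict(con.execute("SELECT id, parent_id FROM dim_org"))
    pairs = []
    for node in parents:
        ancestor = node
        while ancestor is not None:
            pairs.append((ancestor, node))
            ancestor = parents[ancestor]
    con.executemany("INSERT INTO dim_org_flat VALUES (?, ?)", pairs)
    con.executemany("INSERT INTO t_facts VALUES (?, ?)",
                    [(3, 10.0), (4, 20.0), (5, 30.0)])
    return con


def consolidate(con):
    """Sum the facts of every node over all of its descendants with plain joins."""
    rows = con.execute("""
        SELECT o.name, SUM(f.amount)
        FROM dim_org o
        JOIN dim_org_flat b ON b.ancestor_id = o.id
        JOIN t_facts f ON f.org_id = b.descendant_id
        GROUP BY o.id, o.name
        ORDER BY o.id
    """).fetchall()
    return dict(rows)


totals = consolidate(build_demo_db())
for name, amount in totals.items():
    print(name, amount)
```

The bridge table replaces a recursive walk of the hierarchy with an ordinary join, which is why a SQL-generating BI tool can use it. The price is that every fact row is joined once per ancestor.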
Real Life Problem (2)
- Every company has this sort of problem
- The most important people (CEO) experience the worst performance
- OLAP tools exist because this sort of query is traditionally slow on relational systems
- At one customer, 6 GB of data resulted in a 20-minute wait for the CEO
- Even pre-calculating all reports overnight became difficult

The Data Model
[Diagram: data model with a bridge table. Table size labels: 400 K rows, 300 K rows, 500 K rows. Hierarchy labels: 8191 nodes, 12 levels, 4096 leaves.]

Size of the Data

Table               Blocks        Rows
DIM_ACCOUNTING       9 780     532 067
DIM_BUSINESSTYPE        10         181
DIM_CLIENT          29 819     453 392
DIM_MEASURE              6          81
DIM_ORG                123       8 916
DIM_ORG_FLAT           118      53 248
DIM_PRODUCT         11 875     344 380
DIM_TIME                11         501
DIM_TRANS               77       3 001
DIM_UNIT                 5          81
T_FACTS            723 739  16 019 518
Total              775 561  17 415 366

775 561 blocks * 8 192 bytes = 6 GB

- Quite small data volume
- Bad performance on several platforms
- Realistic scenario

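The size estimate on this slide is simple arithmetic, reproduced here as a short sketch. The block count and block size are the figures stated on the slide; the distinction between decimal gigabytes and binary gibibytes is an addition for illustration.

```python
BLOCK_SIZE_BYTES = 8192   # block size stated on the slide
TOTAL_BLOCKS = 775_561    # total block count stated on the slide

total_bytes = TOTAL_BLOCKS * BLOCK_SIZE_BYTES
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.2f} GB = {total_gib:.2f} GiB")
```

Either way of counting rounds to the 6 GB quoted on the slide.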
Data Generation

    create_dim(
      p_bf    => 2,
      p_depth => 12,
      p_name  => 'org',
      p_cols  => 'org01,org02,org03,org04,org05,org06,org07,org08,org09,org10',
      p_types => 't10,t10,t10,t10,t10,t10,t10,t10,t10,t10'
    );

- One function call creates the complete dimension table dim_org
- Generates the id column, the parent pointer, and the bridge table dim_org_flat
- Generated from a helper table with just integers and random numbers
- A similar function generates the fact table
- Started out as PL/SQL, now a Perl script that works with any DB
- It is easy to model any scenario with this tool

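What such a generator has to produce can be sketched in a few lines of Python. This is not the authors' PL/SQL or Perl tool: the function name create_dim_sketch and the tuple layouts are hypothetical. The sketch assumes that p_bf is the branching factor, that p_depth counts levels below the root, and that the bridge table pairs each leaf with all of its ancestors including itself.

```python
def create_dim_sketch(bf, depth):
    """Build a complete tree and return (nodes, bridge).

    nodes:  list of (id, parent_id, level); the root has parent_id None and level 0
    bridge: list of (ancestor_id, leaf_id) pairs; each leaf is also paired with itself
    """
    nodes = [(1, None, 0)]
    parent_of = {1: None}
    frontier = [1]          # nodes of the level generated last
    next_id = 2
    for level in range(1, depth + 1):
        new_frontier = []
        for parent in frontier:
            for _ in range(bf):
                nodes.append((next_id, parent, level))
                parent_of[next_id] = parent
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier

    # After the loop the frontier holds exactly the leaves.
    bridge = []
    for leaf in frontier:
        ancestor = leaf
        while ancestor is not None:
            bridge.append((ancestor, leaf))
            ancestor = parent_of[ancestor]
    return nodes, bridge


nodes, bridge = create_dim_sketch(bf=2, depth=12)
leaves = [n for n in nodes if n[2] == 12]
print(len(nodes), len(leaves), len(bridge))
```

Under these assumptions a branching factor of 2 and a depth of 12 give 8191 nodes, 4096 leaves, and 53 248 bridge rows, figures that also appear on the data model and data size slides.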
The Test Query
- Generated by BI tool

Initial Tests on Oracle and SQL Server

Query times by number of aggregated fact rows (16 million, 1 million, 3500):

Machine | OS | DBMS | 16 million | 1 million | 3500 | Description
IBM 9117-570, 8 GB RAM, 1.9 GHz, 4 CPUs | AIX | Oracle 10g | 1200 sec | 168 sec | 167 sec | Expensive production server
Dell Dimension E521, 4 GB RAM | Windows 2003 Server | Oracle 10g | 1023 sec | 205 sec | 159 sec | Home PC
Dell Dimension E521, 4 GB RAM | Windows 2003 Server | MS SQL Server 2005 | 741 sec | 699 sec | 293 sec |
HP DL 380 ProLiant Server, 0.5 GB RAM, Intel Xeon 3.2 GHz | Red Hat Linux | Oracle 10g | 1432 sec | 413 sec | 386 sec | Linux with little RAM

- All the same order of magnitude
- Adding RAM does not help a traditional DB
- PCs are better than you think

A New Generation DB System

Query times by number of aggregated fact rows (16 million, 1 million, 3500):

Machine | OS | DBMS | 16 million | 1 million | 3500 | Description
IBM 9117-570, 8 GB RAM, 1.9 GHz, 4 CPUs | AIX | Oracle 10g | 1200 sec | 168 sec | 167 sec | Expensive production server
Dell Dimension E521, 4 GB RAM | Windows 2003 Server | Oracle 10g | 1023 sec | 205 sec | 159 sec | Home PC
Dell Dimension E521, 4 GB RAM | Windows 2003 Server | MS SQL Server 2005 | 741 sec | 699 sec | 293 sec |
HP DL 380 ProLiant Server, 0.5 GB RAM, Intel Xeon 3.2 GHz | Red Hat Linux | Oracle 10g | 1432 sec | 413 sec | 386 sec | Linux with little RAM
Exasol test system, 2 quad-core Intel CPUs, 32 GB RAM, 2 nodes | Exacluster (Linux microkernel) | Exasol | 22 sec | 2 sec | 0 sec | In-memory DB

- The in-memory DB is a factor of 30-50 faster
- That's the speed of sound relative to a bicycle
- With generic Intel hardware
- Worth looking at several of these new systems

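The speedup can be recomputed from the 16 million row column of the table on this slide. The sketch below only divides the published times; the dictionary keys are shortened labels chosen for this example.

```python
# Times in seconds for the 16 million row query, copied from the table on this slide.
traditional_sec = {
    "IBM 9117-570 / Oracle 10g": 1200,
    "Dell E521 / Oracle 10g": 1023,
    "Dell E521 / SQL Server 2005": 741,
    "HP DL 380 / Oracle 10g": 1432,
}
exasol_sec = 22

# Speedup = time on the traditional system divided by time on Exasol.
speedups = {name: secs / exasol_sec for name, secs in traditional_sec.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

For this query size the ratios come out between roughly 34 and 65, depending on which traditional system is used as the baseline.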
A New Generation DB System (chart)
[Bar chart of the query times above. y-axis from 0 to 1600; bar labels: DD SQL, DD CRA, HP, IBM, Exa.]

The Contenders
- Oracle 11g
- MySQL
- MonetDB
- LucidDB
- Greenplum (their own hardware)
- Exasol (their own hardware)

The Test Server
- Intel Dual Xeon E5205
- 16 GB RAM
- 2 x 250 GB SATA disk
- 64-bit Debian Linux

Interesting DB Systems That Were Not Tested
- Teradata
- Oracle Exadata
- Netezza
- Vertica
- Infobright
- Kognitio
The field is very active, and new products and approaches keep entering the market.

MonetDB
- Origin: result of research at CWI in the Netherlands
- Open source: yes
- Free of charge: yes
- Remarks:
  - Recent publicity through a paper in Communications of the ACM: "Breaking the Memory Wall in MonetDB"
  - Constantly changing as research progresses
  - Easy to get into direct contact with the developers
- Quote from the website: "MonetDB is an open-source database system for high-performance applications in data mining, OLAP, GIS, XMLQuery, text and multimedia retrieval."

LucidDB
- Origin: formerly part of LucidEra in San Mateo, California
- Open source: yes
- Free of charge: yes
- Remarks:
  - Emphasizes ease of configuration and maintenance
  - Mostly written in Java
- Quote from the website: "LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence. It is based on architectural cornerstones such as column-store, bitmap indexing, hash join/aggregation, and page-level multiversioning."

Greenplum
- Origin: located in San Mateo, California. Postgres based.
- Open source: based on open-source technology
- Free of charge: no
- Remarks:
  - Based on a hardware architecture similar to Exasol's
  - Highly configurable and tunable, lots of features
  - Column store is an option, default is row store
- Quote from the website: "Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture that has been designed from the ground up for BI and analytical processing using commodity hardware. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect -- there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture)."

Exasol
- Origin: developed from scratch in Nürnberg, Germany
- Open source: no
- Free of charge: no
- Remarks:
  - Based on a hardware architecture similar to Greenplum's
  - Pure column store DB
  - Emphasizes ease of administration
  - No need to create indexes or gather statistics
  - Imitates some Oracle-isms for compatibility
- Quote from the website: "The database has been specially developed for analysis and is being used successfully for data warehousing, Web analytics, data mining applications and more. In contrast with universal databases, this specialization means that the data to be analyzed can be made available to analysis tools virtually in real time."

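Several of the systems above are described as column stores. The following sketch illustrates the general idea only and makes no claim about how Exasol, LucidDB, or MonetDB lay out data internally; the sample rows and all names are hypothetical. It contrasts a row layout with a column layout for a typical BI aggregation over a single attribute.

```python
from array import array

# Row layout: every record carries all of its attributes.
# Tuples are (org_id, product_id, amount); hypothetical sample data.
row_store = [
    (1, 7, 10.0),
    (2, 7, 20.0),
    (3, 9, 30.0),
    (4, 9, 40.0),
]

# Column layout: one contiguous typed array per attribute.
column_store = {
    "org_id": array("q", (r[0] for r in row_store)),
    "product_id": array("q", (r[1] for r in row_store)),
    "amount": array("d", (r[2] for r in row_store)),
}


def sum_amount_row_store(rows):
    """Has to walk every complete record to reach the one attribute it needs."""
    return sum(amount for _org_id, _product_id, amount in rows)


def sum_amount_column_store(columns):
    """Scans only the 'amount' array; the other columns are never touched."""
    return sum(columns["amount"])


print(sum_amount_row_store(row_store), sum_amount_column_store(column_store))
```

Both functions return the same total. The difference is how much data has to be read, which matters for queries that aggregate a few columns of a wide fact table.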
Typical Shared-Nothing Node
- Combine many of these, connected by Gigabit Ethernet

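How a shared-nothing cluster answers an aggregation query can be sketched as follows. This is a simplified illustration with hypothetical function names, not Greenplum's or Exasol's implementation: rows are assigned to segments by key, each segment aggregates only its own rows, and a coordinator merges the small partial results.

```python
from collections import defaultdict


def partition(rows, n_segments):
    """Distribute rows over segments by key so each segment owns a distinct slice."""
    segments = [[] for _ in range(n_segments)]
    for key, group, amount in rows:
        segments[key % n_segments].append((key, group, amount))
    return segments


def local_aggregate(segment):
    """Each segment sums its own rows without touching any other segment's data."""
    partial = defaultdict(float)
    for _key, group, amount in segment:
        partial[group] += amount
    return dict(partial)


def merge(partials):
    """The coordinator only combines the small per-segment partial results."""
    total = defaultdict(float)
    for partial in partials:
        for group, amount in partial.items():
            total[group] += amount
    return dict(total)


# Hypothetical fact rows: (key, group, amount) for keys 1..100.
rows = [(i, "even" if i % 2 == 0 else "odd", float(i)) for i in range(1, 101)]
segments = partition(rows, n_segments=4)
result = merge(local_aggregate(segment) for segment in segments)
print(result)
```

Only the partial sums cross the network, which is why such systems can work with a commodity interconnect between nodes.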
Results With 16 Million Rows in the Fact Table
[Bar chart: query time per system, y-axis from 0 to 2500. Systems: Oracle, MySQL, LucidDB, MonetDB, Greenplum, Exasol. Bar values as extracted: 2280, 460, 226, 31, 13, 10; their assignment to the systems is not preserved.]
- Oracle on a new 64-bit box is 4 times faster than on an average 32-bit box
- Both Oracle and LucidDB were twice as fast after dropping all indexes on the fact table (those are the times in the chart)
- We did not manage to tune MySQL to get acceptable performance
- For a free system, LucidDB has good performance and little hassle
- MonetDB needed a fix in the optimizer before it could cope with the query
- Next-generation in-memory DBs are at least one order of magnitude faster

Performance Scaling
[Line chart: query time in seconds, y-axis from 0 to 400, x-axis ticks 16, 160, 320. Series: Exasol (public demo system), Exasol (untuned, comparable hardware), Exasol (local dimensions, comparable hardware), Greenplum. Data labels as extracted: 364, 288, 183, 133, 210, 105, 97, 26, 54, 13, 6, 3; their assignment to the series is not preserved.]
- Both systems scale linearly
- It is possible to query at least ten times the data volume efficiently
- The vendors claim unlimited linear scaling by adding commodity hardware

Conclusion

Big Lessons
- Database technology is in upheaval at the moment
- By adopting the new technologies, you can totally revolutionize the way you access your data
- Prices will fall rapidly. This is like the PC revolution.

Small Lessons
- If you run Oracle on a 32-bit system, move to a 64-bit architecture. It will give you a factor of 4 without any pain
- If your table scans are slow, drop all indexes
- If you move to a new technology, you will get a factor of 50
- The commercial systems are worth their money. Their SQL is more compatible, and they are more stable