Big Data Landscape for Databases

Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015
Typical Database Workloads

Operational workloads:

- OLTP Applications
  - Typical databases: MySQL, Oracle
  - Use cases: ERP, CRM, supply chain
  - Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

- Real-Time Web, Mobile, and IoT Applications
  - Typical databases: MongoDB, Cassandra
  - Use cases: web, mobile, social, IoT
  - Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

- Real-Time, Operational Reporting
  - Typical databases: MySQL, Oracle
  - Use cases: operational datastores, Crystal Reports
  - Workload strengths: real-time updates; canned, parameterized reports; range queries

Analytical workloads:

- Ad-Hoc Analytics
  - Typical databases: MySQL, Oracle, Greenplum, ParAccel
  - Use cases: exploratory analytics, data mining
  - Workload strengths: complex queries requiring full table scans

- Enterprise Data Warehouses
  - Typical databases: Netezza, Teradata, Oracle, Sybase IQ
  - Use cases: enterprise reporting
  - Workload strengths: append-only; parameterized reports against historical data
Recent History of RDBMSs

RDBMS definition:
- Relational with joins
- ACID transactions
- Secondary indexes
- Typically row-oriented
- Operational and/or analytical workloads

By the early 2000s:
- Limited innovation
- It looked like Oracle and Teradata had won
Hadoop Shakes Up Batch Analytics

- Data processing framework
- Cheap distributed file system
- Brute-force, batch processing through MapReduce
- Great for batch analytics
- Great place to dump data to look at later
NoSQL Shakes Up Operational DBs

The NoSQL wave:
- Companies like Google, Amazon, and LinkedIn needed greater scalability and schema flexibility
- New databases developed by app developers, not database people
- Provided scale-out, but lost SQL
- Worked well at web startups because:
  - In some cases, the use cases did not need ACID
  - They were willing to handle exceptions at the app level
Convoluted Evolution of Databases

(Chart: functionality vs. scalability, from scale-up to scale-out.)
- 1960s: indexed files (ISAM)
- 1970s: hierarchical/network databases
- 1980s-2000s: traditional RDBMSs (scale up)
- 2005: Hadoop (scale out)
- 2010: NoSQL databases
- 2013: scale-out SQL databases
Mainstream User Changes

- Driven by web, social, mobile, and the Internet of Things
- Major increases in scale: 30% annual data growth
- Significant requirements for semi-structured data, though relatively little unstructured data
- Technology adoption continuum:
  - "What is it?": scale-out SQL DBs for operational apps
  - "Should I use it?": NoSQL for web apps; Hadoop technologies for analytics
  - "Why wouldn't I use it?": cloud
Schema on Ingest vs. Schema on Read

(Diagram: a data stream reaches the application either through schema on ingest or through schema on read.)

- Structured data should always remain structured
- Use schema on read if you only use the data a few times a year
- Add schema if the data is used regularly
- Even "schemaless" MongoDB requires schema: in "10 Things You Should Know About Running MongoDB at Scale," Asya Kamsky, Principal Solutions Architect at MongoDB, lists "have a good schema and indexing strategy" as item #1

(A short code sketch contrasting the two approaches follows.)
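The trade-off is easy to see in code. A minimal Java sketch (the Reading class and CSV layout are hypothetical) contrasting schema on ingest, which parses and types each record once at load time, with schema on read, which keeps raw text and re-derives the structure on every access:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SchemaTradeoff {
        // Schema on ingest: a typed record, parsed once when the data arrives
        static final class Reading {
            final String deviceId;
            final long timestampMillis;
            final double temperature;
            Reading(String deviceId, long timestampMillis, double temperature) {
                this.deviceId = deviceId;
                this.timestampMillis = timestampMillis;
                this.temperature = temperature;
            }
        }

        static Reading parseOnIngest(String csvLine) {
            String[] f = csvLine.split(",");
            return new Reading(f[0], Long.parseLong(f[1]), Double.parseDouble(f[2]));
        }

        // Schema on read: keep the raw line; every consumer re-parses it on each access
        static double temperatureOnRead(String rawCsvLine) {
            return Double.parseDouble(rawCsvLine.split(",")[2]);
        }

        public static void main(String[] args) {
            List<String> raw = Arrays.asList("sensor-7,1431388800000,21.5");
            List<Reading> ingested = new ArrayList<>();
            for (String line : raw) {
                ingested.add(parseOnIngest(line)); // parsing cost paid once; bad records surface early
            }
            System.out.println(ingested.get(0).temperature);   // typed access thereafter
            System.out.println(temperatureOnRead(raw.get(0))); // re-parsed on every read
        }
    }

If the data is read regularly, the repeated parsing and the late discovery of bad records are exactly the costs the slide warns about.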
Scale-Out Is the Future of Databases

(Diagram: "How do I scale?" branches into scale up and scale out.)

- Scale up: traditional RDBMSs
- Scale out: NoSQL, NewSQL, RDBMSs on Hadoop, MPP databases, and SQL-on-Hadoop analytic engines
NoSQL

Pros:
- Easy scale-out
- Flexible schema
- Easier web development with hierarchical data structures (MongoDB)
- Cross-data center replication (Cassandra)

Cons:
- No SQL: requires retraining and app rewrites
- No joins, i.e., no cross-row/document dependencies
- No reliable updates through transactions across rows/tables
- Eventual consistency (Cassandra)
- Not designed to do the aggregations required for analytics
NewSQL

Pros:
- Easy scale-out
- ANSI SQL eliminates retraining and app rewrites
- Reliable updates through ACID transactions
- RDBMS functionality
- Strong cross-data center replication (NuoDB)

Cons:
- Proprietary scale-out, unproven into petabytes
- Must manage another distributed infrastructure beyond Hadoop
- Cannot leverage the Hadoop ecosystem of tools
NewSQL In-Memory

Pros:
- Easy scale-out
- High performance because everything is in memory
- ACID transactions within nodes

Cons:
- Memory is 10-20x more expensive
- Limited SQL
- Limited cross-node transactions
- Proprietary scale-out, unproven into petabytes
- Must manage another distributed infrastructure beyond Hadoop
- Cannot leverage the Hadoop ecosystem
Operational RDBMS on Hadoop

Pros:
- Easy scale-out
- Scale-out infrastructure proven into petabytes
- ANSI SQL eliminates retraining and app rewrites
- Reliable updates through ACID transactions
- Leverages the Hadoop distributed infrastructure and tool ecosystem

Cons:
- Full table scans slower than MPP DBs, though faster than traditional RDBMSs
- Existing HDFS data must be reloaded through the SQL interface
MPP Analytical Databases

Pros:
- Easy scale-out
- Very fast performance for full table scans
- Highly parallelized, shared-nothing architectures
- May have columnar storage (Vertica)
- No maintenance of indexes (Netezza)

Cons:
- Poor concurrency models prevent support of real-time apps
- Poor performance for range queries
- Need to redistribute all data to add nodes (hash partitioning)
- May require specialized hardware (Netezza)
- Proprietary scale-out: cannot leverage the Hadoop ecosystem of tools
SQL-on-Hadoop Analytical Engines

Pros:
- Easy scale-out
- Scale-out proven into petabytes
- Leverages the Hadoop distributed infrastructure
- Can leverage the Hadoop ecosystem of tools

Cons:
- Relatively immature, especially compared to MPP DBs
- Limited SQL
- Poor concurrency models prevent support of real-time apps
- No reliable updates through transactions
- Intermediate results must fit in memory (Presto)
Future: Hybrid In-Memory Architectures

- Memory cache with disk: unsophisticated memory management
- Pure in-memory: very expensive
- Hybrid in-memory: flexible and cost-effective; controlled by the optimizer; in-memory materialized views?
Summary: Future of Databases

Predicted trends:
- Scale-out dominates databases
- Developers stop worrying about data size and develop new data-driven apps
- Hybrid in-memory architecture becomes mainstream

Predicted winners:
- Hadoop becomes the de facto distributed file system
- NoSQL used for simple web apps
- Scale-out SQL RDBMSs replace traditional RDBMSs
Questions?

Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015
Powering Real-Time Apps on Hadoop

Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015
Who Are We?

THE ONLY HADOOP RDBMS: power operational applications on Hadoop.

- Affordable, scale-out: commodity hardware
- Elastic: easy to expand or scale back
- 10x better price/performance
- Transactional: real-time updates and ACID transactions
- ANSI SQL: leverage existing SQL code, tools, and skills
- Flexible: supports operational and analytical workloads
What People Are Saying

Recognized as a key innovator in databases.

Quotes:
- "Scaling out on Splice Machine presented some major benefits over Oracle... automatic balancing between clusters... avoiding the costly licensing issues."
- "An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
- "The unique claim of Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."

Awards (logos shown on the original slide)
Advisory Board

The Advisory Board includes luminaries in databases and technology:

- Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; founder of Apache Spark
- Roger Bamford: former Principal Architect at Oracle; father of Oracle RAC
- Marie-Anne Neimat: co-founder, TimesTen database; former VP, Database Engineering at Oracle
- Ken Rudin: Head of Analytics at Facebook; former GM of Oracle Data Warehousing
Combines the Best of Both Worlds

Hadoop:
- Scale-out on commodity servers
- Proven to 100s of petabytes
- Efficiently handles sparse data
- Extensive ecosystem

RDBMS:
- ANSI SQL
- Real-time, concurrent updates
- ACID transactions
- ODBC/JDBC support
Focused on OLTP and Real-Time Workloads

Operational workloads (Splice Machine's focus):

- OLTP Applications
  - Typical databases: MySQL, Oracle
  - Use cases: ERP, CRM, supply chain
  - Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

- Real-Time Web, Mobile, and IoT Applications
  - Typical databases: MongoDB, Cassandra
  - Use cases: web, mobile, social, IoT
  - Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

- Real-Time, Operational Reporting
  - Typical databases: MySQL, Oracle
  - Use cases: operational datastores, Crystal Reports
  - Workload strengths: real-time updates; canned, parameterized reports; range queries

Analytical workloads:

- Ad-Hoc Analytics
  - Typical databases: MySQL, Oracle, Greenplum, ParAccel
  - Use cases: exploratory analytics, data mining
  - Workload strengths: complex queries requiring full table scans

- Enterprise Data Warehouses
  - Typical databases: Netezza, Teradata, Oracle, Sybase IQ
  - Use cases: enterprise reporting
  - Workload strengths: append-only; parameterized reports against historical data
OLTP Campaign Management: Harte-Hanks

Overview:
- Digital marketing services provider
- Unified Customer Profile
- Real-time campaign management
- OLTP environment with BI reports

Challenges:
- Oracle RAC too expensive to scale
- Queries too slow, taking up to ½ hour
- Getting worse: expecting 30-50% data growth
- Looked for 9 months for a cost-effective solution

(Solution diagram: cross-channel campaigns, real-time personalization, real-time actions.)

Initial results:
- 10-20x price/performance with no application, BI, or ETL rewrites
- ¼ the cost with commodity scale-out
- 3-7x faster through parallelized queries
Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems.

(Diagram: OLTP systems such as ERP, CRM, supply chain, and HR send stream or batch updates into the operational data lake; the data lake serves operational reports and analytics and real-time, event-driven apps, and feeds the data warehouse via ETL for executive business reports, ad-hoc analytics, and datamarts.)
Streamlining the Structured Data Pipeline in Hadoop

Traditional Hadoop pipeline: source systems (ERP, CRM) feed Sqoop; data is stored as flat files; an inferred schema is applied by SQL query engines before reaching BI tools.

Streamlined Hadoop pipeline: source systems (ERP, CRM) feed an existing ETL tool; data is stored in the same schema and queried directly by BI tools.

Advantages:
- Reduced operational costs with less complexity
- Reduced processing time and errors with fewer translations
- Real-time updates for data cleansing
- Better SQL support
Complementing Existing Hadoop-Based Data Lakes

Optimizes storage and querying of structured data as part of ELT or Hadoop query engines.

(Diagram: OLTP systems such as ERP, CRM, supply chain, and HR feed structured data alongside unstructured data, unified through HCatalog.)

1. Schema on ingest: streamlined, structured-to-structured integration
2. Schema before read: repository for structured data or metadata from the ELT process on unstructured data
3. Schema on read: ad-hoc Hadoop queries (e.g., Pig) across structured and unstructured data
Proven Building Blocks: Hadoop and Derby

Apache Derby:
- ANSI SQL-99 RDBMS
- Java-based
- ODBC/JDBC compliant

Apache HBase/HDFS:
- Auto-sharding
- Real-time updates
- Fault tolerance
- Scalability to 100s of PBs
- Data replication
HBase: Proven Scale-Out

- Auto-sharding
- Scales with commodity hardware
- Cost-effective from GBs to PBs
- High availability through failover and replication
- LSM-trees
Splice Optimizations to HBase

- Storage: Splice storage is optimized over raw HBase. Bitmap indexes store data in packed byte arrays, giving a much smaller footprint than traditional HBase; with a TPC-H schema, we measured a 10x reduction in data size. This requires far less hardware and fewer resources to perform the same workload.
- Asynchronous write pipeline: HBase writes (puts) are not pipelined and block while the call is being made. Splice's write pipeline reaches speeds of over 100K writes per second per HBase node, enabling extremely high ingest rates without more hardware or custom code. (A write-batching sketch using the stock HBase client follows.)
- Transactions: as scalability increases, so does the likelihood of failures. Snapshot isolation ensures that a failure does not corrupt existing data.
- RDBMS capabilities: SQL instead of custom scans, an optimizer that chooses the best access path to the data, and core data management functions (indexes, constraints, typed columns, etc.).
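Splice's write pipeline is proprietary, but the blocking-put problem it addresses is visible in the stock HBase client API. A hedged sketch (the table name "events" and column family "d" are assumptions) using HBase's BufferedMutator, which batches mutations client-side instead of issuing one blocking RPC per put:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IngestSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            byte[] family = Bytes.toBytes("d");     // hypothetical column family
            byte[] qualifier = Bytes.toBytes("v");  // hypothetical column
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 BufferedMutator mutator =
                         conn.getBufferedMutator(TableName.valueOf("events"))) {
                for (int i = 0; i < 100_000; i++) {
                    Put put = new Put(Bytes.toBytes("row-" + i));
                    put.addColumn(family, qualifier, Bytes.toBytes("value-" + i));
                    mutator.mutate(put); // queued locally, flushed to region servers in batches
                }
                mutator.flush(); // push any remaining buffered mutations
            }
        }
    }

This is only an approximation of a dedicated write pipeline, but it shows why batching, rather than per-row blocking calls, makes six-figure per-node write rates plausible.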
Distributed, Parallelized Query Execution

- Parallelized computation across the cluster
- Moves computation to the data (a server-side aggregation sketch follows)
- Utilizes HBase co-processors
- No MapReduce

(Diagram: query execution running inside the HBase server memory space via co-processors.)
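The execution engine itself is proprietary, but "moving computation to the data" can be illustrated with HBase's stock coprocessor machinery. A hedged sketch (assumes a table "events" with family "d" and the AggregateImplementation coprocessor enabled on it) in which each region server counts its own rows, so only subtotals cross the network:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ServerSideCount {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggregation = new AggregationClient(conf);
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("d")); // hypothetical column family
            // The count is computed inside each region's coprocessor, not in the client
            long rows = aggregation.rowCount(
                    TableName.valueOf("events"), new LongColumnInterpreter(), scan);
            System.out.println("rows counted server-side: " + rows);
        }
    }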
ANSI SQL-99 Coverage

- Data types: e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
- Conditional functions: e.g., CASE, searched CASE
- DDL: e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE
- Privileges: e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
- Predicates: e.g., IN, BETWEEN, LIKE, EXISTS
- Cursors: e.g., updatable, read-only, positioned DELETE/UPDATE
- DML: e.g., INSERT, DELETE, UPDATE, SELECT
- Joins: e.g., INNER JOIN, LEFT OUTER JOIN
- Query specification: e.g., SELECT DISTINCT, GROUP BY, HAVING
- Set functions: e.g., UNION, ABS, MOD, ALL, CHECK
- Transactions: e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, snapshot isolation
- Sub-queries
- Aggregation functions: e.g., AVG, MAX, COUNT
- String functions: e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
- Triggers
- User-defined functions (UDFs)
- Views, including grouped views
- Window functions: RANK, ROW_NUMBER, ...

(A sketch exercising several of these features follows.)
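A hedged sketch (the table and data are hypothetical, and the Connection is assumed to come from the JDBC setup shown later in the deck) combining DDL, DML, a BETWEEN predicate, a CASE expression, and GROUP BY/HAVING from the list above:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class Sql99Sample {
        static void run(Connection conn) throws SQLException {
            try (Statement st = conn.createStatement()) {
                st.executeUpdate(
                    "CREATE TABLE orders (id INTEGER, region CHAR(2), amount REAL)");
                st.executeUpdate(
                    "INSERT INTO orders VALUES (1, 'US', 120.0), (2, 'US', 80.0), (3, 'EU', 200.0)");
                try (ResultSet rs = st.executeQuery(
                        "SELECT region, COUNT(*) AS n, AVG(amount) AS avg_amt, " +
                        "       CASE WHEN AVG(amount) > 100 THEN 'high' ELSE 'low' END AS band " +
                        "FROM orders " +
                        "WHERE amount BETWEEN 50 AND 500 " + // predicate
                        "GROUP BY region " +                 // query specification
                        "HAVING COUNT(*) >= 1")) {
                    while (rs.next()) {
                        System.out.printf("%s n=%d avg=%.1f band=%s%n",
                                rs.getString("region"), rs.getInt("n"),
                                rs.getDouble("avg_amt"), rs.getString("band"));
                    }
                }
            }
        }
    }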
Window Functions (Advanced Analytics Functions)

- Enable analytics such as running totals, moving averages, and top-N queries
- Perform calculations across a set of table rows related to the current row in the window
- Similar to aggregate functions, with two significant differences:
  - Output one row for each input row they operate on
  - Group rows with window partitioning and frame clauses instead of GROUP BY
- Splice Machine currently supports: RANK, DENSE_RANK, ROW_NUMBER, AVG, SUM, COUNT, MAX, MIN

(A running-total example follows.)
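For instance, a running total and a per-partition rank (the orders schema is hypothetical; SUM and RANK are both on the supported list above). Note that every input row produces an output row, unlike GROUP BY:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class RunningTotal {
        static void print(Connection conn) throws SQLException {
            String sql =
                "SELECT customer_id, order_date, amount, " +
                // Running total: sums the current partition up to the current row
                "       SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total, " +
                // Top-N building block: rank rows within each customer by amount
                "       RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank " +
                "FROM orders";
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%d %s %.2f total=%.2f rank=%d%n",
                            rs.getLong("customer_id"), rs.getDate("order_date"),
                            rs.getDouble("amount"), rs.getDouble("running_total"),
                            rs.getInt("amount_rank"));
                }
            }
        }
    }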
Lockless, ACID Transactions

- Adds multi-row, multi-table transactions to HBase, with rollback
- Fast, lockless, high concurrency
- Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
- Patent-pending technology

(A JDBC transaction sketch follows.)
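From an application's point of view, these transactions surface through standard JDBC semantics. A minimal sketch (the accounts and transfers tables are hypothetical) of a multi-row, multi-table transfer that commits atomically or rolls back:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferSketch {
        // Both updates and the audit insert become visible together, or not at all
        static void transfer(Connection conn, long from, long to, double amount) throws SQLException {
            conn.setAutoCommit(false); // begin an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?");
                 PreparedStatement audit = conn.prepareStatement(
                     "INSERT INTO transfers (from_id, to_id, amount) VALUES (?, ?, ?)")) {
                debit.setDouble(1, amount);
                debit.setLong(2, from);
                debit.executeUpdate();
                credit.setDouble(1, amount);
                credit.setLong(2, to);
                credit.executeUpdate();
                audit.setLong(1, from);
                audit.setLong(2, to);
                audit.setDouble(3, amount);
                audit.executeUpdate();
                conn.commit();   // multi-row, multi-table atomic commit
            } catch (SQLException e) {
                conn.rollback(); // no partial state is left behind
                throw e;
            }
        }
    }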
Customer Performance Benchmarks

Typically a 10x price/performance improvement.

(Chart: customer-reported speedups of 3-7x, 20x, and 7x faster; price/performance gains of 10-20x, 10x, and 30x lower.)
Applications and BI/SQL Tool Support via ODBC/JDBC

(A connection sketch follows.)
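Since the SQL layer derives from Apache Derby, clients typically connect with a Derby-style network driver. A minimal connectivity sketch (the host, port, database name, and credentials are assumptions for illustration; the exact driver and URL depend on the Splice Machine version):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ConnectSketch {
        public static void main(String[] args) throws Exception {
            // Derby-style client URL; adjust for the actual deployment
            String url = "jdbc:derby://localhost:1527/splicedb;user=app;password=app";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement st = conn.createStatement();
                 // SYSIBM.SYSDUMMY1 is Derby's one-row dummy table, handy for smoke tests
                 ResultSet rs = st.executeQuery("SELECT CURRENT_DATE FROM SYSIBM.SYSDUMMY1")) {
                while (rs.next()) {
                    System.out.println(rs.getDate(1)); // verifies the round trip
                }
            }
        }
    }

Any ODBC/JDBC-capable application or BI tool can point at the same endpoint, which is what lets existing SQL tools work unchanged.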
Splice Machine Safe Journey Process

1. Initial Overview (1 day)
   - Splice Machine overview
   - Set the stage for the Rapid Assessment
2. Rapid Assessment (5 days, including prep)
   - Half-day workshop
   - Assess Splice Machine fit
   - Identify target use cases
   - Risk assessment of use cases
   - Agree upon success criteria
3. Proof of Concept (2 weeks)
   - Prove the client use case in a Splice Machine-hosted environment
   - Benchmark using customer queries and schema, on customer data or generated data that resembles it
4. Pilot Project (3-6 weeks)
   - Identify a paid pilot use case with limited change-management impact
   - Install Splice Machine in the client environment
   - Deploy the use case/application on client data
   - Prove Splice Machine against key requirements
5. Enterprise Implementation (3-10 months)
   - Kickstart, Requirements, Design/Dev, QA Test, Cutover, Hypercare
Safe Journey Enterprise Implementation Stages

- Kickstart: packaged 2-week program to get the new client off to a strong start on a solid foundation. Incorporates the Splice Architecture & Development courses, a Risk Assessment Workshop, and an Implementation Blueprint.
- Requirements: establish a clear functional and performance requirements document. Can be a refresh only if the project is a port of an existing app to Splice.
- Design/Dev: based on the Agile method; the phase is divided into 2-week sprints. Stories covering a set of required capabilities are assigned to each developer. A design doc is created, code is written, and unit tests are written and executed until they pass.
- QA Test: includes a performance test, an end-to-end system integration test, and a user acceptance test. Depending on the scale of the project, there may be multiple iterations of each test with break/fix cycles in between.
- Parallel Ops (optional): used when an existing system is being ported to Splice Machine from another database. The new Splice Machine-based system runs side by side with the old system for a period of time.
- Cutover: formal period in which the Splice-based solution goes live and the pre-existing system is deprecated.
- Hypercare (optional): period of onsite support during cutover and immediately following go-live.
Common Risks and Mitigation Strategies

Data migration
- Risk: clients are typically migrating very large data sets to Splice Machine. Issues with the migration of certain data types, such as dates, can waste a lot of time reloading large amounts of data.
- Solution: first migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database.

Changes to source schema during implementation
- Risk: changes to the schema of the source database during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project.
- Solution: all stakeholders agree up front to freeze the schema as of an agreed-upon date prior to the Design/Development stage.

Stored procedure conversion
- Risk: stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures may include significant amounts of procedural code as well as multiple SQL statements.
- Solution: carefully review the function and design of the stored procedures to be converted.
Common Risks and Mitigation Strategies (continued)

SQL compatibility
- Risk: even though Splice Machine conforms to the ANSI SQL-99+ standard, virtually every database has unique syntax, and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
- Solution: formal review of SQL syntax during the Requirements phase. Modify relevant queries during the Design/Dev phase. If a query is not modifiable, an enhancement request for Splice Machine to support the required syntax out of the box may be needed.

Indexing
- Risk: proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS.
- Solution: ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to achieve the optimal indexes.

Hadoop knowledge
- Risk: project stakeholders often have limited knowledge of Hadoop and the distributed computing paradigm. This can lead to confusion about the Splice Machine value proposition and the advantages of moving to a scale-out architecture.
- Solution: include the Splice Machine Kickoff Program at the beginning of the implementation project. This includes essential training on Hadoop and related fundamental concepts critical to realizing value from a Splice Machine deployment.
Summary

THE ONLY HADOOP RDBMS: power operational applications on Hadoop.

- Affordable, scale-out: commodity hardware
- Elastic: easy to expand or scale back
- 10x better price/performance
- Transactional: real-time updates and ACID transactions
- ANSI SQL: leverage existing SQL code, tools, and skills
- Flexible: supports operational and analytical workloads
Questions?

Bob Baran, Senior Sales Engineer
rbaran@splicemachine.com
May 12, 2015