Copyright 1991-2014 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.

Big Data: Big IT Party?
by Rick F. van der Lans
R20/Consultancy B.V.
Twitter: @rick_vanderlans
www.r20.nl

Rick F. van der Lans

Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, database technology, and data virtualization. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which data warehousing and data integration technology were applied.

Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty-five years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches.

He is the author of several books on computing, including his new Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. The popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL Developers.

As author for BeyeNetwork.com, writer of whitepapers, chairman for the annual European Enterprise Data and Business Intelligence Conference, and columnist for a few IT magazines, he has close contacts with many vendors.

R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl.

You can get in touch with Rick via:
Email: rick@r20.nl
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223

Do We Agree On What Big Data Is?
WikiBon, February 2014
Source: http://wikibon.org/wiki/v/big_data_vendor_revenue_and_market_forecast_2013-2017

Gartner: Big Data Market Forecast
Big data will drive $232 billion in spending through 2016. It will directly or indirectly drive $96 billion of worldwide IT spending in 2012, and is forecast to drive $120 billion of IT spending in 2013.

Gartner's Hype Cycle for Emerging Technologies, July 2013
McKinsey Global Institute: Benefits of Big Data
Big data has the potential
- to increase the value of the US health care industry by $300 billion
- to increase the industry value of Europe's public sector administration by EUR 250 billion
- to decrease manufacturing (development and assembly) costs by 50%
- to increase service provider revenue by $100 billion due to global personal location data
- to increase US retail's net margin by 60%

Big Data Exaggerations
- Big data: A revolution that will transform how we live, work and think
- Companies are being destroyed and created around big data
- Management of big data key to survival in the health care sector
- Big data has arrived and is shaping IT today
- The disruptive power of big data

It's All About Analytics
Analytical Challenges of Tomorrow
- Improve product development
- Optimize business processes
- Improve customer care
- Improve customer delight
- Improve pro-active customer care
- Personalize products

External Data: UK-based Retail Company
- A 10-degree rise in temperature means 300% more barbecue meat, 45% more lettuce, and 50% more coleslaw
- A city-center store will see an uplift in sandwiches (to eat outside) on a warm weekday, and almost no effect at all on a warm weekend
- Result: 6 million UK pounds less food wastage in the summer, 50 million less stock in warehouses

Social Media Data

Sensor Data

Internet of Things
Privacy?

Quantity
Quality
Databases are Boring!
Source: The 451 Group

SQL is Intergalactic DataSpeak! Or was?

Can We Exploit This?
Scale Up, Scale Out
- Scale up (vertical scaling) means adding more resources to one node in a system
- Scale out (horizontal scaling) means adding more nodes to a system
- Continuous availability/redundancy
- Cost/performance flexibility
- Contiguous upgrades
- Geographical distribution

Operations of a Query

    WITH FLIGHTPLAN(FLIGHTNO, PLAN_AIRPORTS, PLAN_FLIGHTS,
                    START_AIRPORT, END_AIRPORT, START_TIME, END_TIME,
                    DEPARTURE_AIRPORT, ARRIVAL_AIRPORT,
                    DEPARTURE_TIME, ARRIVAL_TIME, PRICE, STOPS) AS
      (SELECT FLIGHTNO,
              CAST(DEPARTURE_AIRPORT || '->' || ARRIVAL_AIRPORT AS VARCHAR(100)),
              CAST(RTRIM(CHAR(FLIGHTNO)) AS VARCHAR(100)),
              DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
              DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
              PRICE, 0
       FROM   FLIGHTS
       WHERE  DEPARTURE_AIRPORT = 'AMS'
       AND    CAST(DEPARTURE_TIME AS DATE) = '2007-03-01'
       UNION ALL
       SELECT P.FLIGHTNO,
              P.PLAN_AIRPORTS || '->' || F.ARRIVAL_AIRPORT,
              P.PLAN_FLIGHTS || '->' || RTRIM(CHAR(F.FLIGHTNO)),
              P.START_AIRPORT, F.ARRIVAL_AIRPORT, P.START_TIME, F.ARRIVAL_TIME,
              P.DEPARTURE_AIRPORT, P.ARRIVAL_AIRPORT, P.DEPARTURE_TIME, P.ARRIVAL_TIME,
              P.PRICE + F.PRICE, STOPS + 1
       FROM   FLIGHTPLAN AS P, FLIGHTS AS F
       WHERE  P.ARRIVAL_AIRPORT = F.DEPARTURE_AIRPORT
       AND    P.ARRIVAL_TIME < F.DEPARTURE_TIME
       AND    F.DEPARTURE_AIRPORT <> 'PHX'
       AND    LOCATE(F.ARRIVAL_AIRPORT, P.PLAN_AIRPORTS) = 0
       AND    STOPS < 1
       AND    P.ARRIVAL_TIME + 4 HOURS > F.DEPARTURE_TIME)
    SELECT PLAN_AIRPORTS, PLAN_FLIGHTS, START_AIRPORT, END_AIRPORT,
           START_TIME, END_TIME, PRICE
    FROM   FLIGHTPLAN
    WHERE  END_AIRPORT = 'PHX'
    ORDER BY PRICE ASC
    FETCH FIRST 1 ROW ONLY

Operations in this query:
- Analytical functions
- Recursive operations
- Joins
- Having filters
- Group by
- Complex scalar functions
- Projections and simple transformations
- Filters - selections

Parallel Database Architecture
Diagram labels: Application; Database server; Master; Worker 1, Worker 2, Worker 3; the same list of query operations (analytical functions, recursive operations, joins, having filters, group by, complex scalar functions, projections and simple transformations, filters - selections).

Effect of Partitions on Query Response
Chart labels: total throughput; number of partitions/processors; bottleneck.
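The master/worker layout of the Parallel Database Architecture slide can be illustrated with a small scatter-gather sketch. This is not how any particular product works; it only assumes that the simple operations (filters and partial aggregates) run on each worker's own partition and that the master merges the partial results. The flight rows, the function names, and the round-robin partitioning are all hypothetical, and Python threads stand in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical flight rows: (flightno, departure_airport, price)
FLIGHTS = [
    (101, "AMS", 420.0),
    (102, "AMS", 380.0),
    (103, "JFK", 510.0),
    (104, "AMS", 295.0),
    (105, "PHX", 150.0),
    (106, "AMS", 610.0),
]


def partition(rows, n):
    """Round-robin split of the rows over n partitions (one per worker)."""
    return [rows[i::n] for i in range(n)]


def worker(rows, airport):
    """Worker side: filter its own partition and return a partial aggregate."""
    prices = [price for _, dep, price in rows if dep == airport]
    return (len(prices), sum(prices), min(prices, default=None))


def master(rows, airport, n_workers=3):
    """Master side: scatter the query, then merge the partial results."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda p: worker(p, airport), parts))
    count = sum(c for c, _, _ in partials)
    total = sum(s for _, s, _ in partials)
    mins = [m for _, _, m in partials if m is not None]
    return {
        "count": count,
        "avg_price": total / count if count else None,
        "min_price": min(mins) if mins else None,
    }


result = master(FLIGHTS, "AMS")
```

The merge step on the master is the part that does not parallelize, which is one way to read the "bottleneck" label on the partitions chart.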
Internal Database Server Administration

The Market of Hadoop/NoSQL Products

NewSQL
Source: VoltDB / Michael Stonebraker

Categories of Database Servers
- All database servers
  - SQL database servers
    - Classic SQL database servers
    - Analytical SQL database servers
    - NewSQL database servers
  - NoSQL database servers
    - Key-value database servers
    - Document database servers
    - Column-family database servers
    - Graph database servers

Aggregate Data Model
Strong Consistency vs. Eventual Consistency
Diagram labels: Strong; Eventual.

SQL DBMS versus NoSQL Solution
Diagram labels: application and SQL database server; application and NoSQL solution.

Hadoop Components

The 2nd Generation of Hadoop
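The difference between strong and eventual consistency named on the slide above can be sketched with a toy store that keeps one primary and one replica copy. Everything here is an assumption made for illustration: the class, its method names, and the explicit sync step that stands in for background replication. Real products replicate in far more elaborate ways.

```python
class ReplicatedStore:
    """Toy key-value store with one primary and one replica copy."""

    def __init__(self, strong):
        self.strong = strong      # True: replicate synchronously on write
        self.primary = {}
        self.replica = {}
        self.pending = []         # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.strong:
            # Strong: the replica is updated before the write returns.
            self.replica[key] = value
        else:
            # Eventual: the replica catches up later.
            self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def sync(self):
        """Apply all pending writes, as background replication would."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


strong = ReplicatedStore(strong=True)
strong.write("seat_12A", "booked")
strong_read = strong.read_from_replica("seat_12A")

eventual = ReplicatedStore(strong=False)
eventual.write("seat_12A", "booked")
stale_read = eventual.read_from_replica("seat_12A")   # replica has not caught up yet
eventual.sync()
fresh_read = eventual.read_from_replica("seat_12A")
```

With strong consistency every reader sees the write immediately; with eventual consistency a reader of the replica may briefly see the old state, in exchange for writes that do not wait on replication.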
Hadoop 2.0

Examples of Complex Values (1)

Comma-separated value:

    "Anchorage Daily News","PO Box 149001","Anchorage","AK","99514-9001",
    "907-257-4200","907-258-2157","71","","82",
    "http://www.adn.com/",newsroom@adn.com

EDIFACT message:

    UNB+UNOA:1+005435656:1+006415160:1+060515:1434+00000000000778'XXXUNH+
    00000000000117+INVOIC:D:97B:UN'XXXBGM+380+342459+9'XXXDTM+
    3:20060515:102'XXXRFF+ON:521052'XXXNAD+BY+792820524::16++
    CUMMINSMIDRANGEENGINEPLANT'XXXNAD+SE+005435656::16++
    GENERALWIDGETCOMPANY'XXXCUX+1:USD'XXXLIN+1++157870:IN'XXXIMD+
    F++:::WIDGET'XXXQTY+47:1020:EA'XXXALI+US'XXXMOA+203:1202.58'XXXPRI+
    INV:1.179'XXXLIN+2++157871:IN'XXXIMD+F++:::DIFFERENTWIDGET'XXXQTY+
    47:20:EA'XXXALI+JP'XXXMOA+203:410'XXXPRI+INV:20.5'XXXUNS+S'XXXMOA+
    39:2137.58'XXXALC+C+ABG'XXXMOA+8:525'XXXUNT+23+00000000000117'XXXUNZ+
    1+00000000000778'

Example of Complex Value (2)

Weblog records (columns: datestamp, ip, request):

    6/1/2012 11:10:19 AM  107.1.187.170    GET /x.php?u=http://studio-5.financialcontent.com/synacor?page=quote&ticker=ddd HTTP/1.1
    6/1/2012 5:53:49 AM   107.1.2.180      GET /tv/3/player/vendor/chef%20tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
    6/1/2012 8:55:54 AM   107.34.51.63     GET /tv/3/search/content/the%20andy%20griffith%20show/s/the%20Andy%20Griffith%20Show HTTP/1.1
    6/1/2012 3:12:43 PM   107.5.115.117    GET /tv/3/search/content/kathie%20lee%20gifford's%20epic%20'today'%20gaffe/s/kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1
    6/1/2012 4:48:35 PM   108.225.132.245  GET /tv/3/search/content/deadliest%20catch/s/deadliest%20catch HTTP/1.1
    6/1/2012 10:25:12 AM  108.246.20.125   GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1
    6/1/2012 1:58:14 AM   108.246.25.117   GET /tv/3/player/vendor/chef%20tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1

Unraveling the Data Model

    Option  Write step       Data store          Read step
    1       Unravel & store  Classic database    Query
    2       Store            Classic database    Query & unravel
    3       Store            MapReduce database  Query & unravel
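What "unraveling" a complex value involves can be sketched for one of the weblog records above. This is a minimal illustration, not the method of any specific product. It assumes the record layout visible in the sample (date, time, AM/PM, ip, method, target, protocol, separated by single spaces) and a month/day/year date order; the function name and the output field names are hypothetical.

```python
from datetime import datetime
from urllib.parse import parse_qs, unquote, urlsplit


def unravel(record):
    """Split one raw weblog line into named fields (schema assigned at read time)."""
    # Assumed layout: date, time, AM/PM, ip, method, target, protocol
    date, time, ampm, ip, method, target, protocol = record.split(" ")
    parts = urlsplit(target)
    return {
        # Assumption: dates are month/day/year
        "datestamp": datetime.strptime(f"{date} {time} {ampm}", "%m/%d/%Y %I:%M:%S %p"),
        "ip": ip,
        "method": method,
        "path": unquote(parts.path),       # decode %20 and similar escapes
        "params": parse_qs(parts.query),   # query-string parameters, if any
        "protocol": protocol,
    }


search = unravel(
    "6/1/2012 4:48:35 PM 108.225.132.245 GET "
    "/tv/3/search/content/deadliest%20catch/s/deadliest%20catch HTTP/1.1"
)
quote = unravel(
    "6/1/2012 11:10:19 AM 107.1.187.170 GET "
    "/x.php?u=http://studio-5.financialcontent.com/synacor?page=quote&ticker=ddd HTTP/1.1"
)
```

In option 1 of the table this function would run once, before the data is stored; in options 2 and 3 it would run every time the data is queried.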
Schema-On-Write
- SoW = Data written to a database has a schema
- A schema is not optional
- Fixed schema-on-write
  - All records in a table have the same schema
  - For example, SQL systems
- Variable schema-on-write
  - When data is stored in the database, a schema is written together with the data itself
  - Different records in a table can have different schemas

Schema-On-Read
- SoR = Data is written to a database without a schema; a schema is assigned when the data is read
- Stored data has no schema
- Complex values or schema-less values
- Schema-on-application-read
  - The application assigns a schema to the schema-less data (unraveling)
- Schema-on-database-read
  - The database server assigns a schema to the schema-less data
  - The application receives data with a schema

Tyranny of Performance

The Balancing Act
- Performance, scalability, availability
- Productivity, maintainability, time-to-market
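The three variants from the Schema-On-Write and Schema-On-Read slides can be placed side by side in a short sketch. It uses an in-memory SQLite table for fixed schema-on-write, JSON documents for variable schema-on-write, and a raw comma-separated string for schema-on-application-read. These stand-ins, the table name, and the field names are assumptions chosen for illustration; the slides do not prescribe them.

```python
import csv
import io
import json
import sqlite3

# Fixed schema-on-write: the table definition applies to every record.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE paper (name TEXT, city TEXT, state TEXT)")
db.execute(
    "INSERT INTO paper VALUES (?, ?, ?)",
    ("Anchorage Daily News", "Anchorage", "AK"),
)
fixed = db.execute("SELECT name, state FROM paper").fetchone()

# Variable schema-on-write: each record carries its own field names,
# so two records in the same collection can have different schemas.
documents = [
    json.dumps({"name": "Anchorage Daily News", "state": "AK"}),
    json.dumps({"name": "Anchorage Daily News", "website": "http://www.adn.com/"}),
]
variable = [sorted(json.loads(d).keys()) for d in documents]

# Schema-on-application-read: the stored value is an opaque string and the
# application assigns field names when reading (hypothetical names below).
stored = '"Anchorage Daily News","PO Box 149001","Anchorage","AK","99514-9001"'
FIELDS = ["name", "address", "city", "state", "zip"]
values = next(csv.reader(io.StringIO(stored)))
on_read = dict(zip(FIELDS, values))
```

The trade-off is where the schema knowledge lives: in the table definition, in each stored record, or in every application that reads the data.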
The Classic Reporting Environment
Diagram labels: operational applications and databases; staging area; data warehouse; data marts; personal data store; executive and interactive reporting.

The Upcoming Analytical Labyrinth
Diagram labels: the classic reporting environment extended with unstructured data, external data, private data, predictive analytics, sandboxes, big data, and big data analytics.

Do We Want Analytical Silos?
Diagram labels: applications (self-service BI, iterative predictive analytics, mobile, predefined); databases; big data; unstructured data; sandboxes; private data; staging area; data warehouse & marts; social media; streaming databases; external data.

Heading for an Integration Labyrinth
Diagram labels: the same applications and data stores as on the previous slide.
Different Database Workloads
Diagram (horizontal axis: time):
- OLXP: XML database -> SQL database
- OLAP: OLAP database -> SQL database
- OLCP: OO database -> SQL database
- OLTP: pre-relational database -> SQL database

Hadoop APIs Too Technical?

Is Google Going SQL?
- 2012: Spanner supports general-purpose transactions, and provides a SQL-based query language.
- Google's motivation: "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."

Market of SQL-fication Products
- SQL-on-Hadoop engines
  Examples: Apache Hive, Cassandra CQL, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Simba, SpliceMachine
- Data virtualization and data federation servers
  Examples: Cirro, Cisco/Composite, Denodo, Informatica IDS, Red Hat JBoss Data Virtualization, Stone Bond
- SQL databases (polyglot persistence)
  Examples: EMC Greenplum UAP, Hadapt, Microsoft PolyBase, ParAccel, Teradata Aster Database (SQL-H)
CitusData CitusDB
Diagram labels: CitusDB; HDFS; MongoDB.
- Designed for analytical queries
- Characteristics
  - No use of MapReduce or Hive
  - Knows the location of data, which speeds up access
  - Based on PostgreSQL
  - Queries are pushed to the data nodes
  - Statistics are collected on the data
  - UDFs are supported

JethroData Jethro
Diagram labels: Jethro; HDFS.
- Designed for interactive queries
- Characteristics
  - Every column is indexed!!
  - Append-only inverted lists: index entries are appended
  - Inserts have no impact on reads
  - 30-40% extra storage
  - Columnar data store
  - ANSI-92 SQL: DDL + query
  - Supports joins

PivotalHD HAWQ
- HAWQ = Greenplum on HDFS
- Dual database strategy
- Uses the same file format as GemFire/SQLFire for transactions
- Greenplum = mature cost-based query optimizer
- HAWQ is compatible with Greenplum
- ACID compliant

PivotalHD HAWQ Architecture
Diagram labels: HBase; HAWQ; HDFS.
Data Virtualization Overview (1)
Diagram labels:
- Data consumers: application, analytics & reporting, internal portal, mobile app, website, dashboard
- Data Virtualization Server
- Data stores: databases, data warehouse & marts, streaming databases, applications, unstructured data, ESB, big data, social media data, private data, external data

Data Virtualization Overview (2)
Diagram labels:
- Consumer interfaces: ODBC/SQL, JDBC/SQL, XML/SOAP, REST/JSON, XQuery, MDX/DAX (for example a SQL statement, a SOAP message, or a JMS message)
- Data Virtualization Server
- Source interfaces: SQL, CICS, JMS, SQL+, XSLT, SOAP, Hive, proprietary, Excel, JSON
- Same data consumers and data stores as in overview (1)

Definition of Data Virtualization
Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.

The Market of Data Virtualization Servers
- Cirro Data Hub
- Cisco/Composite Information Server
- Denodo Platform
- IBM InfoSphere Federation Server
- Informatica Data Services
- Information Builders EII
- Oracle Data Services Integrator
- Progress Easyl
- Red Hat Teiid and JBoss Data Virtualization
- Stone Bond Enterprise Enabler
- Virtuoso
- And many more
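The definition of data virtualization given above (a unified view over a heterogeneous set of data stores) can be illustrated with a very small sketch. It is a toy under stated assumptions, not a model of any listed product: one source is an in-memory SQLite table, the other a JSON document such as a REST service might return, and the table, field, and function names are hypothetical.

```python
import json
import sqlite3

# Source 1: a SQL database (hypothetical customers table).
sql_source = sqlite3.connect(":memory:")
sql_source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
sql_source.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
)

# Source 2: a JSON document, e.g. as returned by a REST service.
json_source = json.dumps([
    {"customer_id": 1, "total": 120.5},
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 75.0},
])


def customer_totals():
    """Virtual view: combines both sources at query time; nothing is copied."""
    totals = {}
    for order in json.loads(json_source):
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["total"]
    rows = sql_source.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    return [{"id": cid, "name": name, "total": totals.get(cid, 0.0)} for cid, name in rows]


view = customer_totals()
```

The consumer calls one function and never sees that two different stores and two different formats sit behind it, which is the abstraction and encapsulation the definition refers to.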
Data Stays Where It's Collected
The data generated per day is more than can be moved across the network.

Network will look like this

Data Virtualization to the Rescue?
Diagram label: Data Virtualization Server.

Big Data is Too Big To Move
C-Level and Big Data
- 85% expect to gain substantial business and IT benefits from Big Data initiatives
- 85% have Big Data initiatives planned or in progress
- 70% report that these initiatives are enterprise-driven
- 85% of the initiatives are sponsored by a C-level executive or the head of a line of business
- 75% expect an impact across multiple lines of business

C-Level and Big Data
- 15% ranked their access to data as adequate or world-class
- 21% ranked their analytic capabilities as adequate or world-class
- 17% ranked their ability to use data and analytics to transform their business as more than adequate or world-class

Battle of Chancellorsville, 1863
- USA Army Strength: 133,000
- CSA Army Strength: 60,000
You Can't Hide From Big Data Anymore
- IT specialists?
- IT departments?
- Benelux / Europe?

Big IT Party??
- All database servers
  - SQL database servers
    - Classic SQL database servers
    - Analytical SQL database servers
    - NewSQL database servers
  - NoSQL database servers
    - Key-value database servers
    - Document database servers
    - Column-family database servers
    - Graph database servers

Recommended Books