Big Data and Big Data Modeling: The Age of Disruption. Robin Bloor, The Bloor Group. March 19, 2015. TP02
Presenter Bio. Robin Bloor, Ph.D. Robin Bloor is Chief Analyst at The Bloor Group. He has been an industry analyst and commentator on technology for 25 years, with expertise in software development, database, BI and associated technologies. He is a frequent keynote speaker at industry events and primary author of The Bloor Group's research reports.
Big Data and Big Data Modeling: The Age of Disruption. The Data Curve and the Data Warehouse; Disruption, Disruption, Disruption; A New Modeling Dynamic.
The Data Curve
The Visible Big Data Trend. Corporate data volumes grow exponentially, at about 55% per annum. Data has been growing at this rate for perhaps 40 years. There is nothing new about big data: it clings to an established exponential trend (which may be speeding up).
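The compounding behind the 55% figure can be sketched as follows. This is a minimal illustration that assumes a constant annual rate; the function names are hypothetical and not part of the presentation.

```python
import math

ANNUAL_GROWTH = 0.55  # 55% per annum, the figure quoted on the slide


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Total multiplication of data volume after `years` of compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Years needed for the volume to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.2f} years")
    for years in (1, 6, 10, 40):
        print(f"After {years:2d} years: x{growth_factor(years):,.0f}")
```

At this rate, volumes double roughly every 1.6 years, which is why a long-running trend can still feel like a sudden explosion.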
Technology Evolution (The Way We Were Bloor Curve)
And This Implies. Software architectures change: centralized, client/server, 3-tier/web, service-oriented architecture, etc. Applications migrate according to latencies. Dominant applications and software brands can die via the innovator's dilemma. Wholly new applications appear because of lower latencies, e.g., virtual machines and complex event processing (CEP).
The Invisible Data Trend: Moore's Law Cubed. The biggest databases are new databases. They grow at the cube of Moore's Law. Moore's Law = 10x every 6 years. VLDB: 1000x every 6 years. 1991/2: megabytes; 1997/8: gigabytes; 2003/4: terabytes; 2009/10: petabytes; 2015/16: exabytes.
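The arithmetic on this slide can be checked with a short sketch. The constants come from the slide; the function names, and the choice to clamp years outside the slide's timeline to its first and last entries, are illustrative assumptions.

```python
MOORE_FACTOR = 10                 # Moore's Law as stated on the slide: 10x per period
PERIOD_YEARS = 6                  # length of one period, from the slide
VLDB_FACTOR = MOORE_FACTOR ** 3   # "cubed": 1000x per period

# Scale of the biggest databases at the start of each 6-year period (from the slide).
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes", "exabytes"]
START_YEAR = 1991


def annual_rate(factor_per_period, period_years=PERIOD_YEARS):
    """Equivalent compound annual growth rate for a per-period factor."""
    return factor_per_period ** (1.0 / period_years) - 1.0


def vldb_scale(year):
    """Unit of the largest databases in `year`, following the slide's timeline.

    Years outside the timeline are clamped to its ends (an assumption).
    """
    periods = (year - START_YEAR) // PERIOD_YEARS
    periods = max(0, min(periods, len(UNITS) - 1))
    return UNITS[periods]


if __name__ == "__main__":
    print(f"Moore's Law per year: {annual_rate(MOORE_FACTOR):.1%}")
    print(f"VLDB growth per year: {annual_rate(VLDB_FACTOR):.1%}")
    for year in (1991, 1997, 2003, 2009, 2015):
        print(year, vldb_scale(year))
```

Expressed per year, 10x every 6 years is about 47% annual growth, while 1000x every 6 years is about 216%, so the largest databases more than triple each year.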
The Genesis of Hadoop. The old databases were having scaling problems. New databases appeared, but so did Hadoop. The number of data sources was exploding. Hadoop quickly became the staging area for these databases, even though it was immature.
The Evolution of Hadoop. From: serial batch workloads; MapReduce; versatile data storage; key-value access only; an island of processing. To: multiple concurrent workloads; multiple algorithms; optimized data storage; SQL, JSON and even SPARQL access; integrated processing.
The Data Warehouse: From/To (Bloor Group)
The Staging Workload (Bloor Group)
Disruption, Disruption, Disruption
Disruption in Several Dimensions: (1) at the hardware layer, (2) in software architecture, (3) in the data layer.
Parallelism: The Imp is Out of the Bottle. Multicore chips enabled parallelism. It has changed the whole performance equation. It enabled Big Data. Big Data is really Big Processing.
Technology Revolutions (tech revolution: architecture). Computer: batch. Online: centralized. PC: client/server. Internet: multi-tier. Mobile: service orientation. Internet of Things (IoT): event driven/big data/parallel/distributed.
Unprecedented Acceleration. Moore's Law regularly delivered a speed-up of 10x every 6 years. Implication: apps get faster every 6 years or so. Parallelism delivers an almost unlimited speed-up, assuming you can build the application with a scalable architecture. Implications: see later.
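The caveat about a scalable architecture can be made concrete with Amdahl's law. The law is not named in the presentation; it is brought in here only as an illustrative assumption about how the serial part of an application limits parallel speed-up. The function name and sample fractions are hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speed-up predicted by Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    workers: number of parallel processing units (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99, 1.0):
        row = ", ".join(
            f"{n} workers: {amdahl_speedup(fraction, n):.1f}x"
            for n in (8, 64, 1024)
        )
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

Under this model, only a fully parallel workload scales without limit. With 90% of the work parallel, the speed-up stays below 10x however many workers are added, which is one way to read "assuming you can build the application with a scalable architecture".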
Hardware Disruption: It's Over for Spinning Disk. Solid state drives are now on the Moore's Law curve. Disk is not and never was (with respect to seek time). All traditional databases were engineered for spinning disk and not for scale-out. This explains the new database management system (DBMS) products. (Bloor Group)
Hardware: In-Memory Disruption. Memory may gradually become the primary store for data (this impacts data flows). Almost all applications are poorly built for this. Memory is an accelerator, as is CPU cache. This is becoming a factor.
Hardware: The Memory Cascade. On-chip speed v RAM: L1 (32 KB) = 100x, L2 (256 KB) = 30x, L3 (8-20 MB) = 8.6x. RAM v SSD: RAM = 300x. SSD v Disk: SSD = 10x. Note: vector instructions and data compression.
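The ratios on this slide can be chained to compare every tier against spinning disk. The sketch below assumes that the quoted ratios simply multiply along the cascade, which is a simplification; the function and variable names are hypothetical.

```python
# Speed ratios quoted on the slide.
CACHE_VS_RAM = {"L1": 100.0, "L2": 30.0, "L3": 8.6}  # on-chip cache relative to RAM
RAM_VS_SSD = 300.0                                   # RAM relative to SSD
SSD_VS_DISK = 10.0                                   # SSD relative to spinning disk


def speed_vs_disk():
    """Return each tier's speed relative to disk, assuming the ratios compose."""
    ram = RAM_VS_SSD * SSD_VS_DISK
    table = {"Disk": 1.0, "SSD": SSD_VS_DISK, "RAM": ram}
    for level, ratio in CACHE_VS_RAM.items():
        table[level] = ratio * ram
    return table


if __name__ == "__main__":
    # Print the tiers from slowest to fastest.
    for tier, factor in sorted(speed_vs_disk().items(), key=lambda kv: kv[1]):
        print(f"{tier:>4}: {factor:,.0f}x disk")
```

Read this way, RAM is about 3,000 times faster than disk and L1 cache about 300,000 times faster, which is the gap that applications engineered for spinning disk leave unused.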
Hardware: Putting a SoC in IT. It's possible that the CPU/memory split will vanish (soon). This requires the emergence of the commodity System on a Chip (SoC). There are already Systems on a Chip that run Linux. Grids of Systems on a Chip could replace grids of servers. Graphic from Samsung Electronics.
Data Disruption: The Barriers are Down. Internal: server log files, network log files, unstructured sources, data streams, web data. External: mobile data, social media data, Internet of Things, web scavenging, data markets, external streams.
Data Flow: A Set of Principles. The data layer is one logical collection of data, both external and internal. The data flows from ingest, through a refining process, to a point of application. It is best if data doesn't flow much. Hadoop means corporate data staging. Beyond that, a database is required to manage workloads.
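The ingest, refine and apply stages described above can be sketched as a chain of generators, so that records pass through each stage once without intermediate copies. This is a toy illustration only: the record layout, the `key=value` log format and the function names are assumptions, not part of the presentation.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def ingest(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Ingest: capture each raw line with minimal metadata, no interpretation."""
    for line_no, line in enumerate(raw_lines, start=1):
        yield {"source_line": str(line_no), "raw": line}


def refine(records: Iterable[Record]) -> Iterator[Record]:
    """Refine: parse 'key=value' pairs and drop records that yield none."""
    for record in records:
        fields = dict(
            part.split("=", 1) for part in record["raw"].split() if "=" in part
        )
        if fields:
            yield {**record, **fields}


def apply_to_use(records: Iterable[Record], key: str) -> List[str]:
    """Point of application: collect one field for a downstream consumer."""
    return [record[key] for record in records if key in record]


if __name__ == "__main__":
    raw = ["user=ann action=login", "garbage", "user=bob action=logout"]
    print(apply_to_use(refine(ingest(raw)), "user"))
```

Each record keeps its source line number as it moves through the stages, a small nod to the provenance and lineage concerns raised later in the deck.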
The Corporate Data Flows. There need to be two data flows (at minimum). Currently we can distinguish between real-time/business-time applications and analytical applications. We will build specific architectures for this.
A New Modeling Dynamic
The Staging Workloads. Data mapping/modeling; metadata discovery; metadata management; master data management; data lineage and lifecycle. (Bloor Group)
The New World #1. The primary driver of the new world is that external data sources have expanded. Data is being captured without metadata knowledge or even relationship knowledge. Unstructured/semi-structured data is prevalent, even normal. The provenance of data has become an issue. The new dimensions: geography and time.
The New World #2. The single source of truth idea is dead. Master data management (MDM) will become about ontologies. Modeling will not die or even diminish, but we will explicitly model for context. Data flows will be modeled. There will be a metadata warehouse. There will be event-to-entity models. We will record data lineage. We may need to model data lifecycles.
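One possible reading of "event-to-entity models" with recorded lineage is sketched below: events from different sources are folded into entity records, and each attribute remembers which events contributed to it. The class names, fields and sample data are all hypothetical and only illustrate the idea; the presentation does not prescribe a design.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """A captured event: something observed about an entity by a source."""
    event_id: str
    entity_id: str
    source: str
    attribute: str
    value: str


@dataclass
class Entity:
    """Current view of an entity plus the lineage of each attribute."""
    entity_id: str
    attributes: Dict[str, str] = field(default_factory=dict)
    lineage: Dict[str, List[str]] = field(default_factory=dict)


def apply_events(events: Iterable[Event]) -> Dict[str, Entity]:
    """Fold events into entities; the latest event wins, lineage keeps all."""
    entities: Dict[str, Entity] = {}
    for ev in events:
        entity = entities.setdefault(ev.entity_id, Entity(ev.entity_id))
        entity.attributes[ev.attribute] = ev.value
        entity.lineage.setdefault(ev.attribute, []).append(
            f"{ev.source}:{ev.event_id}"
        )
    return entities


if __name__ == "__main__":
    events = [
        Event("e1", "cust-1", "crm", "city", "Austin"),
        Event("e2", "cust-1", "mobile", "city", "Dallas"),
        Event("e3", "cust-2", "web", "city", "Boston"),
    ]
    for entity in apply_events(events).values():
        print(entity.entity_id, entity.attributes, entity.lineage)
```

Keeping every contributing event in the lineage, rather than only the winning value, fits the point that there is no longer a single source of truth: the context of each value stays available to whoever uses it.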
Big Data and Big Data Modeling: The Age of Disruption. In Summary: The Data Curve and the Data Warehouse; Disruption, Disruption, Disruption; A New Modeling Dynamic.
Thank You for Attending! For any further questions, feel free to contact me following ERworld. Robin Bloor (email: robin.bloor@bloorgroup.com, twitter: @robinbloor, web: www.insideanalysis.com). Please enjoy the rest of your time at ERworld 2015!
Legal Notice. Copyright CA 2015. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. No unauthorized use, copying or distribution permitted. THIS PRESENTATION IS FOR YOUR INFORMATIONAL PURPOSES ONLY. CA assumes no responsibility for the accuracy or completeness of the information. TO THE EXTENT PERMITTED BY APPLICABLE LAW, CA PROVIDES THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT. In no event will CA be liable for any loss or damage, direct or indirect, in connection with this presentation, including, without limitation, lost profits, lost investment, business interruption, goodwill, or lost data, even if CA is expressly advised of the possibility of such damages.