Thomas Baumann Swiss Mobiliar Bern, Switzerland


WHAT IS BIG DATA (1/3): 3-5 V (1 + 1 = 11)
- Volume: from terabytes to exabytes to process
- Velocity: from milliseconds to minutes to respond
- Variety: from structured to unstructured to store and query
- Veracity: from ACID to inconsistent to manage
- Value: from data to insight to transform

WHAT IS BIG DATA (2/3): THE IT PERSPECTIVE
Distributed, scalable, fault-tolerant technologies:
- Query languages: Pig Latin, Hive QL, Impala, CQL, Cypher
- Process and resource managers: Map/Reduce, YARN, Cassandra kernel, Neo4j kernel
- Data stores: HDFS (Hadoop Distributed File System), Cassandra, Neo4j

WHAT IS BIG DATA (3/3) Big Data := Gaining actionable insights to create competitive advantage and to mitigate risks from combining new data sources by using scalable technologies. Th.Baumann 2015

WHAT IS BIG DATA (3/3): BIG DATA ECOSYSTEM
[Diagram: IoT/sensor events flow through event processing into a data "lake"; OLTP transactions produce OLTP data, which data processing (ETL) loads into the data warehouse; both paths yield actionable information and actionable insights.]

CONTENT
- Swiss Mobiliar Company Introduction
- What Is Big Data In More Detail
- Auditing Big Data: What Is Specific For Big Data?
- Using Big Data Tools and Technologies For Yourself

SWISS MOBILIAR
- Switzerland's most personal insurer; legal form of a cooperative association (mutual company)
- Switzerland's number one insurer for household contents, business and pure risk life insurance
- Close to customers throughout the country thanks to around 80 general agencies at 160 locations
- Over 1.7 million insured persons or firms
- 13x continuously
- Over 4,400 employees and 325 trainees

INSURANCE MARKET GROWTH IN SWITZERLAND
Close to 2/3 of market growth went to Swiss Mobiliar.
[Chart: Swiss Mobiliar growth vs. market growth, in CHF million. Source: Schweizerischer Versicherungsverband]

THE SPEAKER
- Born in 1963
- MSc from the Swiss Federal Institute of Technology (ETH Zurich): computer science combined with probability theory and statistics; these days we would call this mix Big Data or Data Science
- Focused on DBMS and performance since 1992
- Internationally recognized database expert and speaker at numerous conferences
- Minister of Performance at Swiss Mobiliar, dedicated to performance since 1963

CONTENT
- Swiss Mobiliar Company Introduction
- What Is Big Data In More Detail
- Auditing Big Data: What Is Specific For Big Data?
- Using Big Data Tools and Technologies For Yourself

NIST BIG DATA REFERENCE ARCHITECTURE Source: NIST National Institute of Standards and Technology, U.S. Department of Commerce

BIG DATA ARCHITECTURE PRINCIPLES
- Scale out: use commodity hardware (see example on next slide), scalable
- Redundancy: duplicate data to provide data safety; fault tolerance for both data and jobs
- Data locality: minimize the amount of network traffic

SAMPLE HARDWARE TYPE (COURTESY OF HP)

DATA PROCESSING ARCHITECTURE FOR BIG DATA
Incoming data feeds two paths:
- Master data: immutable, append-only, schema-on-read
- Precomputed data: completely re-calculated data
- Streaming data: real-time processing
- Real-time views: read/write database systems
- Query: merges precomputed data with real-time views
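The merge step of this lambda-style architecture can be sketched in a few lines of Python. All names here are illustrative assumptions, not taken from the slides:

```python
# Sketch of the query layer: precomputed (batch) views are periodically
# rebuilt in full from the immutable master data, real-time views absorb
# recent events, and a query merges both answers.

def build_batch_view(master_data):
    """Completely recompute per-key counts from the append-only master data."""
    view = {}
    for key in master_data:
        view[key] = view.get(key, 0) + 1
    return view

def query(batch_view, realtime_view, key):
    """Merge the precomputed view with the real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

master = ["claim", "claim", "quote"]      # immutable, append-only master data
batch = build_batch_view(master)          # recomputed in full, periodically
realtime = {"claim": 1}                   # recent event, not yet in the batch view
print(query(batch, realtime, "claim"))    # 3
```

The point of the split is that the batch view can always be thrown away and rebuilt from the master data, while the real-time view only bridges the gap since the last rebuild.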

COMMON TOOLS YOU MIGHT HAVE HEARD ABOUT
[Same architecture as the previous slide, annotated with tools; the query layer names Cypher, Gremlin, Spark MLlib, Spark GraphX, R and CQL.]

SCHEMA-ON-WRITE VS. SCHEMA-ON-READ
Traditional RDBMS is schema-on-write:
- Data persisted in a tabular, agreed and consistent form
- Structure must be decided before writing
- Data integration happens in ETL
Big Data is schema-on-read:
- Data persisted without any checking
- Interpretation of data captured in code by each program accessing the data
- Data quality depends on code quality
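The schema-on-read side of this contrast can be sketched in Python. The store, field layout and station codes below are made-up examples, assumed only for illustration:

```python
# Schema-on-read sketch: raw lines are stored unchecked, and each consumer
# interprets them at read time, so data quality depends on the parsing code.

raw_store = []                      # the "lake": nothing is validated on write

def write(line):
    raw_store.append(line)          # persisted as-is, no schema enforced

def read_temperatures():
    """One consumer's interpretation: 'station;temperature' records."""
    result = {}
    for line in raw_store:
        parts = line.split(";")
        if len(parts) == 2:         # each program must handle bad rows itself
            try:
                result[parts[0]] = float(parts[1])
            except ValueError:
                pass                # errors only surface at read time
    return result

write("ABO;17.5")
write("garbage row")                # accepted on write, filtered on read
print(read_temperatures())          # {'ABO': 17.5}
```

In a schema-on-write system the second `write` would have been rejected; here the bad row sits in the store and every reader must cope with it.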

NO-SQL DATABASE OVERVIEW
[Chart positioning key-value DB, wide-column DB, document store, RDBMS, graph DB and text search DB along two axes, data volume and data structure complexity; transactional properties range from ACID (RDBMS) to CAP trade-offs.]

SWEET SPOT FOR DBMS
[Chart with axes volume/velocity and degree of data relationship:]
- Graph DB: high degree of data relationship
- RDBMS incl. column store (IBM DB2 Analytics Accelerator, Oracle DB In-Memory, SAP HANA)
- NoSQL databases (Cassandra, Oracle NoSQL, Redis, Riak, HBase, etc.)
- Volume/velocity scale ranges from 100 TByte / 500 K inserts/sec to 1000 TByte / 3 M inserts/sec

CONTENT
- Swiss Mobiliar Company Introduction
- What Is Big Data In More Detail
- Auditing Big Data: What Is Specific For Big Data?
- Using Big Data Tools and Technologies For Yourself

ISSUES OF INTEREST
- Business-IT alignment
- Deployment model
- Privacy
- Backup and recovery
- Detecting data manipulation
- DLP (Data Loss Prevention)

WHY IS BIG DATA SECURITY DIFFERENT?
- Data might be gathered from different end points.
- Data search and selection can lead to privacy and security policy concerns.
- Privacy-preserving mechanisms are needed for Big Data, such as for Personally Identifiable Information (PII).
- Big Data is pushing beyond traditional definitions of information trust, openness, and responsibility.
- Information assurance and disaster recovery may require unique and emergent practices.
- Big Data creates targets of increased value.
- Risks have increased for de-anonymization and transfer of PII without consent traceability.
Source: NIST National Institute of Standards and Technology, U.S. Department of Commerce. Big Data Interoperability Framework, Volume 1: Definitions

DATA PRIVACY VERSUS BIG DATA
Data privacy principles:
- Targeted use of data gathered
- Consent required
- Transparent usage of data
- Limited amount of data stored
- Proven necessity of data store
Big Data principles:
- Analytics of heterogeneous sources
- Consent not traceable
- Undefined purpose of data store
- Unlimited data storage
- Data usable for future use

AUDITING BIG DATA OPERATIONS
A good source for operations, but also for auditors. Covers:
- Metadata
- Backup and recovery
- Tasks for security and availability
- Performance management and monitoring
- Patching
- Troubleshooting
Check whether these points are addressed in the target environment to be audited.

DETECTING DATA MANIPULATION
- Requirement for data(base) activity monitoring (DAM)
- Even more important than in the traditional world
- Fast data processing requires a short time to react: act before you have to react
How is DAM organized in your Big Data ecosystem?

CONTENT
- Swiss Mobiliar Company Introduction
- What Is Big Data In More Detail
- Auditing Big Data: What Is Specific For Big Data?
- Using Big Data Tools and Technologies For Yourself

MOTIVATION
- Big Data is about volume, velocity, variety and veracity of data
- Are these V's familiar to you in your daily work as an auditor? If yes, Big Data tools and technologies might help you in your job
- All tools and frameworks are open source
- Most of them are easy to use
- Cloud services are available (usually free while working on small data)
- Many of those tools are really cool; there is more out there than just Microsoft Excel
- The tools on the following pages are arbitrarily selected by the author and do not necessarily represent best-of-class tools

A VARIETY OF COMPANIES AND PRODUCTS Source: www.gigaom.com

USE CASES
Use case 1: Hadoop, Hive and Impala to analyze open data
- Are there any insurance claims for damages due to storm or strong winds, although meteo data shows the maximum wind was insufficient to cause damage?
Use case 2: Impact analysis using connected data with the Neo4j graph DB
- Suppose we immerse a large porous stone in a bucket of water. Will the center of the stone be wetted?
- Analogous problems where this algorithm applies: Which objects might an administrator reach? Spread of (computer) viruses? Impact of unavailability of a component? Were people involved in a claim known to each other before?

USE CASE 1
Hadoop, Hive and Impala to analyze open data: are there any insurance claims for damages due to storm or strong winds, although meteo data shows the maximum wind was insufficient to cause damage?
[Architecture slice: incoming data -> master data (immutable, append-only, schema-on-read) -> precomputed data (completely re-calculated data)]
Hands-on exercise/demo: step-by-step implementation.

USE CASE 1
LOAD INTERMED_METEODATA -> TRANSFORM -> INSERT METEODATA; join with the claims DB (station, measurement 1, measurement 2) to find claims caused by wind where the wind was insufficiently strong that day in that region.
*) http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.swissmetnet/vqha69.txt

USE CASE 1: HIVE/IMPALA IMPLEMENTATION

$ wget -O VQHA69.txt http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.swissmetnet/vqha69.txt
$ tail -n +4 VQHA69.txt > tmp.txt && mv tmp.txt VQHA69.txt

hive> use meteodaten;
hive> LOAD DATA LOCAL INPATH 'VQHA69.txt' OVERWRITE INTO TABLE intermed_meteodata;
hive> INSERT INTO TABLE meteodata
      SELECT REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){1}', 1) station,
             REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){2}', 1) timestamp_gmt,
             REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){3}', 1) temp,
             REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){5}', 1) regen,
             REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){8}', 1) luftdruck,
             REGEXP_EXTRACT(VALUE, '^(?:([^\ ]*)\.){9}', 1) wind
      FROM intermed_meteodata;

-- Which two days had the most rain? (faster with Impala)
hive> SELECT tag, SUM(regen) AS totalregen
      FROM (SELECT SUBSTR(timestamp_gmt,5,4) AS tag, regen, station FROM meteodata) AS T1
      GROUP BY tag ORDER BY totalregen DESC LIMIT 2;

-- Temperature readings for one station on one day
hive> SELECT timestamp_gmt, temp FROM meteodata
      WHERE station='abo' AND SUBSTR(timestamp_gmt,1,8)='20150708'
      ORDER BY timestamp_gmt;

USE CASE 2
Neo4j graph DB to analyze connected data: suppose we immerse a large porous stone in a bucket of water. Will the center of the stone be wetted?
Analogous problems where this algorithm applies:
- Which objects might an administrator reach?
- Spread of (computer) viruses?
- Impact of unavailability of a component?
Cypher Query Language:
MATCH allshortestpaths((a)-[*]-(z))
WHERE a.name="alpha" AND z.name="omega"
RETURN COUNT(*)
Hands-on exercise/demo: step-by-step implementation.

USE CASE 2
Is there any connection between a and W?
[Grid diagram with start node a and end node W]
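The connectivity question behind this use case is a plain graph traversal. A minimal breadth-first search sketch in Python, with a made-up bond list standing in for the percolation grid:

```python
from collections import deque

# Minimal breadth-first search for the percolation question:
# is there any path of bonds from the start node to the goal node?
# The edge list below is an illustrative example, not the slide's data.

def connected(edges, start, goal):
    graph = {}
    for a, b in edges:                      # bonds are undirected
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:             # visit each site at most once
                seen.add(nxt)
                queue.append(nxt)
    return False

bonds = [("alpha", "s1"), ("s1", "s2"), ("s2", "omega"), ("s3", "s4")]
print(connected(bonds, "alpha", "omega"))   # True
print(connected(bonds, "alpha", "s4"))      # False
```

This is essentially what the graph database does natively, which is why the Cypher solution on the following slides stays so short compared to the SQL variants.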

USE CASE 2: RDBMS SOLUTION
Example: 35x25 percolation matrix

create table percolation (x1 integer, y1 integer, x2 integer, y2 integer);

a) Recursive query (runs eternally):

with zw (n, m, depth) as (
  select x2, y2, 1 from percolation where x1 = 1
  union all
  select x2, y2, depth + 1
  from zw, percolation
  where (x1 = n and y1 = m and y2 = m + 1
      or x1 = n and y1 = m and x2 = n + 1
      or x1 = n and y1 = m and x2 = n - 1
      or x1 = n and y1 = m and y2 = m - 1)
    and depth < 875  -- max 25x35 iterations
)
select max(n) from zw;

b) Explicit joins (15-30 sec, dependent on graph density):
b1) Init:
insert into zw select x2, y2, 1 from percolation where x1 = 1;
b2) Loop (until n = 35 or convergence of min(level)):
insert into zw
select x2, y2, min(level)
from (select distinct x2, y2, depth + 1 as level
      from zw, percolation
      where x1 = n and y1 = m and y2 = m + 1
         or x1 = n and y1 = m and x2 = n + 1
         or x1 = n and y1 = m and x2 = n - 1
         or x1 = n and y1 = m and y2 = m - 1) t
group by x2, y2;
b3) Final analysis:
select * from zw where x1 = 35;

USE CASE 2: GRAPH DBMS SOLUTION
Node definitions:
CREATE (sx_y:site{name:"sx_y"})
CREATE (s0:start{name:"alpha"})
CREATE (s9:ende{name:"omega"})
Edge definitions:
MATCH (sx1_y1:site {name:'sx1_y1'}), (sx1_y2:site {name:'sx1_y2'})
CREATE (sx1_y1)-[:bond]->(sx1_y1+1), (sx1_y2)-[:bond]->(sx1_y2+1)
Query:
MATCH allshortestpaths((a)-[*]-(z))
WHERE a.name="alpha" AND z.name="omega"
RETURN COUNT(*)
(3-5 sec, dependent on graph density)

THANK YOU FOR YOUR ATTENTION Dress up and get ready for the Super Spy Event, buses leaving at 6:10 PM