Thomas Baumann
Swiss Mobiliar, Bern, Switzerland
WHAT IS BIG DATA (1/3): THE 3-5 Vs
Volume: from terabytes to exabytes to process
Velocity: from milliseconds to minutes to respond
Variety: from structured to unstructured to store and query
Veracity: from ACID to inconsistent to manage
Value: from data to insight to transform (1 + 1 = 11)
WHAT IS BIG DATA (2/3): THE IT PERSPECTIVE
Distributed, scalable, fault-tolerant technologies:
Query Languages: Pig Latin, Hive QL, Impala, CQL, Cypher
Processing: Map/Reduce
Process and Resource Managers: YARN, Cassandra Kernel, Neo4j Kernel
Data Stores: HDFS (Hadoop Distributed File System), Cassandra, Neo4j
WHAT IS BIG DATA (3/3)
Big Data := gaining actionable insights, to create competitive advantage and to mitigate risks, by combining new data sources and using scalable technologies. (Th. Baumann, 2015)
WHAT IS BIG DATA (3/3): BIG DATA ECOSYSTEM
[Diagram: IoT, sensors and events flow through event processing into the data «lake»; OLTP transactions produce OLTP data, which reaches the data warehouse via data processing (ETL); both paths feed actionable information and actionable insights]
CONTENT
Swiss Mobiliar Company Introduction
What Is Big Data In More Detail
Auditing Big Data: What Is Specific For Big Data?
Using Big Data Tools and Technologies For Yourself
SWISS MOBILIAR
Switzerland's most personal insurer; legal form: cooperative association (mutual company)
Switzerland's number one insurer for household contents, business and pure risk life insurance
Close to customers throughout the country thanks to around 80 general agencies at 160 locations
Over 1.7 million insured persons or firms
13x continuously
Over 4,400 employees and 325 trainees
INSURANCE MARKET GROWTH IN SWITZERLAND
Close to 2/3 of market growth goes to Swiss Mobiliar.
[Chart: Mobiliar growth vs. market growth, in CHF millions. Source: Schweizerischer Versicherungsverband]
THE SPEAKER
Born in 1963
MSc from the Swiss Federal Institute of Technology (ETH Zurich): computer science combined with probability theory and statistics (these days, we would call this mix Big Data or Data Science)
Has been focused on DBMS and performance since 1992
Internationally recognized database expert and speaker at numerous conferences
Minister of Performance at Swiss Mobiliar
"Dedicated to performance since 1963" also produces this search result: [screenshot]
CONTENT
Swiss Mobiliar Company Introduction
What Is Big Data In More Detail
Auditing Big Data: What Is Specific For Big Data?
Using Big Data Tools and Technologies For Yourself
NIST BIG DATA REFERENCE ARCHITECTURE
Source: NIST (National Institute of Standards and Technology), U.S. Department of Commerce
BIG DATA ARCHITECTURE PRINCIPLES
Scale out: use scalable commodity hardware (see example on next slide)
Redundancy: duplicate data to provide data safety; fault tolerance for both data and jobs
Data locality: minimize the amount of network traffic
SAMPLE HARDWARE TYPE (COURTESY OF HP)
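Returning to the redundancy principle: on a Hadoop cluster, the block replication factor determines how many copies of each data block are kept on different nodes. A minimal sketch of setting it from the shell; the path /data/masterdata and the factor 3 are illustrative assumptions, not values from the talk.

$ # Default replication is configured cluster-wide in hdfs-site.xml
$ # (property dfs.replication, commonly 3); it can also be set per path.
$ # -w waits until re-replication of existing blocks has completed.
$ hdfs dfs -setrep -w 3 /data/masterdata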
DATA PROCESSING ARCHITECTURE FOR BIG DATA
Incoming data feeds:
Master Data (immutable, append-only, schema-on-read)
Streaming Data (real-time processing)
Precomputed Data (completely re-calculated data)
Real-Time Views (read/write database systems)
Query (merges precomputed data with real-time views)
COMMON TOOLS YOU MIGHT HAVE HEARD ABOUT
The same architecture as on the previous slide, annotated with tools:
Master Data (immutable, append-only, schema-on-read)
Streaming Data (real-time processing)
Precomputed Data (completely re-calculated data)
Real-Time Views (read/write database systems)
Query (merges precomputed data with real-time views): Cypher, Gremlin, Spark MLlib, Spark GraphX, R, CQL
A sketch of the batch path follows below.
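A hedged HiveQL sketch of the batch path: the precomputed view is completely rebuilt from the immutable master data on every batch run. The table and column names (master_claims, claims_per_region, region) are illustrative assumptions, not objects from the talk.

hive> -- Completely re-calculate the batch view; INSERT OVERWRITE
hive> -- replaces the previous contents, so the view never drifts
hive> -- from the append-only master data it is derived from.
hive> INSERT OVERWRITE TABLE claims_per_region
      SELECT region, COUNT(*) AS claim_cnt
      FROM master_claims
      GROUP BY region;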
SCHEMA-ON-WRITE VS. SCHEMA-ON-READ
Traditional RDBMS is schema-on-write:
Data is persisted in a tabular, agreed and consistent form
Structure must be decided before writing
Data integration happens in ETL
Big Data is schema-on-read (see the sketch after this list):
Data is persisted without any checking
Interpretation of the data is captured in code, by each program accessing the data
Data quality depends on code quality
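A minimal HiveQL sketch of schema-on-read, assuming for illustration a pipe-delimited raw file under /data/meteo/ (path and delimiter are assumptions):

hive> -- Schema-on-read: the raw file in HDFS stays untouched; the table
hive> -- is only an interpretation, and dropping an EXTERNAL table drops
hive> -- the interpretation, never the data.
hive> CREATE EXTERNAL TABLE raw_meteodata (value STRING)
      LOCATION '/data/meteo/';
hive> -- Each program supplies its own parsing logic at read time;
hive> -- data quality therefore depends on this code.
hive> SELECT SPLIT(value, '\\|')[0] AS station
      FROM raw_meteodata;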
NO-SQL DATABASE OVERVIEW
[Chart: database types positioned by data volume vs. data structure complexity: Key-Value DB, Wide Column DB, Document Store, RDBMS, Graph DB, Text Search DB. Transactional properties range from ACID (RDBMS) to CAP trade-offs]
SWEET SPOT FOR DBMS
[Chart: DBMS sweet spots by degree of data relationship vs. volume and velocity:
Graph DB: highest degree of data relationship
RDBMS, incl. column stores (IBM DB2 Analytics Accelerator, Oracle DB In-Memory, SAP HANA): up to roughly 100 TB and 500 K inserts/sec
NoSQL databases (Cassandra, Oracle NoSQL, Redis, Riak, HBase, etc.): up to roughly 1000 TB and 3 M inserts/sec]
CONTENT
Swiss Mobiliar Company Introduction
What Is Big Data In More Detail
Auditing Big Data: What Is Specific For Big Data?
Using Big Data Tools and Technologies For Yourself
ISSUES OF INTEREST
Business IT Alignment
Deployment Model
Privacy
Backup and Recovery
Detecting Data Manipulation
DLP (Data Loss Prevention)
WHY IS BIG DATA SECURITY DIFFERENT?
Data might be gathered from different end points.
Data search and selection can lead to privacy and security policy concerns.
Privacy-preserving mechanisms are needed for Big Data, such as for Personally Identifiable Information (PII).
Big Data is pushing beyond traditional definitions for information trust, openness, and responsibility.
Information assurance and disaster recovery may require unique and emergent practices.
Big Data creates targets of increased value.
Risks have increased for de-anonymization and transfer of PII without consent traceability.
Source: NIST (National Institute of Standards and Technology, U.S. Department of Commerce), Big Data Interoperability Framework, Volume 1: Definitions
DATA PRIVACY VERSUS BIG DATA
Data Privacy Principles:
Targeted use of data gathered
Consent required
Transparent usage of data
Limited amount of data stored
Proven necessity of data store
Big Data Principles:
Analytics of heterogeneous sources
Consent not traceable
Undefined purpose of data store
Unlimited data storage
Data usable for future use
AUDITING BIG DATA OPERATIONS
A good source for operations, but also for auditors. Covers:
Metadata
Backup and Recovery
Tasks for Security and Availability
Performance Management and Monitoring
Patching
Troubleshooting
Check whether these points are addressed in the target environment to be audited.
DETECTING DATA MANIPULATION
Requirement for Data(base) Activity Monitoring (DAM)
Even more important than in the traditional world: fast data processing requires a short time to react
Act before you have to react
How is DAM organized in your Big Data ecosystem?
CONTENT
Swiss Mobiliar Company Introduction
What Is Big Data In More Detail
Auditing Big Data: What Is Specific For Big Data?
Using Big Data Tools and Technologies For Yourself
MOTIVATION
Big Data is about Volume, Velocity, Variety and Veracity of data. Are these V's familiar to you in your daily work as an auditor? If yes, Big Data tools and technologies might help you in your job.
All tools and frameworks are open source
Most of them are easy to use
Cloud services are available (usually free while working on small data)
Many of these tools are really cool
There is more out there than just Microsoft Excel
The tools on the following pages were arbitrarily selected by the author and do not necessarily represent best-of-class tools.
A VARIETY OF COMPANIES AND PRODUCTS
Source: www.gigaom.com
USE CASES
Use case 1: Hadoop, Hive and Impala to analyze open data. Are there any insurance claims for damages due to storm or strong winds, although the meteo data shows the maximum wind was insufficient to cause damage?
Use case 2: impact analysis on connected data with the Neo4j graph DB. Suppose we immerse a large porous stone in a bucket of water: will the center of the stone be wetted? Analogous problems where this algorithm applies:
Objects an administrator might reach?
Spread of (computer) viruses?
Impact of unavailability of a component?
Were people involved in a damage claim known to each other before?
USE CASE 1
Hadoop, Hive and Impala to analyze open data: are there any insurance claims for damages due to storm or strong winds, although the meteo data shows the maximum wind was insufficient to cause damage?
Architecture path: incoming data into Master Data (immutable, append-only, schema-on-read), then into Precomputed Data (completely re-calculated data)
Hands-on exercise/demo: step-by-step implementation
USE CASE 1
Data flow: LOAD the raw file *) into INTERMED_METEODATA, then TRANSFORM and INSERT into METEODATA (station, measurement 1, measurement 2, ...); join with the claims DB to find claims caused by wind where the wind was insufficiently strong on that day in that region. A sketch of the table definitions follows below.
*) http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.swissmetnet/vqha69.txt
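The slides do not show the DDL for the two tables used on the next slide; a minimal sketch, assuming a single-string staging column and string-typed measurement columns (all names and types are assumptions):

hive> -- Staging table: one raw input line per row (schema-on-read).
hive> CREATE TABLE intermed_meteodata (value STRING);
hive> -- Target table matching the columns extracted on the next slide
hive> -- (regen = rain, luftdruck = air pressure). Hive casts the STRING
hive> -- measurements implicitly when they are aggregated.
hive> CREATE TABLE meteodata (
        station       STRING,
        timestamp_gmt STRING,
        temp          STRING,
        regen         STRING,
        luftdruck     STRING,
        wind          STRING
      );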
USE CASE 1: HIVE/IMPALA IMPLEMENTATION
$ wget http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.swissmetnet/vqha69.txt
$ tail -n +4 vqha69.txt > tmp.txt && mv tmp.txt vqha69.txt
hive> use meteodaten;
hive> LOAD DATA LOCAL INPATH 'vqha69.txt' OVERWRITE INTO TABLE intermed_meteodata;
hive> INSERT INTO TABLE meteodata
      SELECT REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){1}', 1) AS station,
             REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){2}', 1) AS timestamp_gmt,
             REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){3}', 1) AS temp,
             REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){5}', 1) AS regen,
             REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){8}', 1) AS luftdruck,
             REGEXP_EXTRACT(value, '^(?:([^\ ]*)\.){9}', 1) AS wind
      FROM intermed_meteodata;
The two rainiest days (faster with Impala):
hive> SELECT tag, SUM(regen) AS totalregen
      FROM (SELECT SUBSTR(timestamp_gmt,5,4) AS tag, regen, station FROM meteodata) AS t1
      GROUP BY tag
      ORDER BY totalregen DESC
      LIMIT 2;
Temperature readings for one station and one day:
hive> SELECT timestamp_gmt, temp
      FROM meteodata
      WHERE station = 'abo' AND SUBSTR(timestamp_gmt,1,8) = '20150708'
      ORDER BY timestamp_gmt;
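The slides stop at the meteo side. As a hedged sketch of the audit question itself, the query below joins a hypothetical claims table (claim_id, claim_date in yyyymmdd format, cause) against the daily maximum wind from meteodata; the claims schema and the 75 km/h threshold are assumptions, not values from the talk.

hive> -- Storm claims on days where the maximum measured wind stayed
hive> -- below the assumed damage threshold.
hive> SELECT c.claim_id, c.claim_date, m.maxwind
      FROM claims c
      JOIN (SELECT SUBSTR(timestamp_gmt,1,8) AS tag,
                   MAX(CAST(wind AS DOUBLE)) AS maxwind
            FROM meteodata
            GROUP BY SUBSTR(timestamp_gmt,1,8)) m
        ON c.claim_date = m.tag
      WHERE c.cause = 'storm' AND m.maxwind < 75;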
USE CASE 2
Neo4j graph DB to analyze connected data: suppose we immerse a large porous stone in a bucket of water. Will the center of the stone be wetted? Analogous problems where this algorithm applies:
Objects an administrator might reach?
Spread of (computer) viruses?
Impact of unavailability of a component?
From incoming data to actionable insights with the Cypher query language:
MATCH p = allShortestPaths((a)-[*]-(z))
WHERE a.name = "alpha" AND z.name = "omega"
RETURN COUNT(*)
Hands-on exercise/demo: step-by-step implementation
USE CASE 2
[Diagram: percolation grid with start node a and end node W]
Is there any connection between a and W?
USE CASE 2: RDBMS SOLUTION
Example: 35x25 percolation matrix
create table percolation (x1 integer, y1 integer, x2 integer, y2 integer);
-- work table for intermediate results (assumed):
-- create table zw (n integer, m integer, depth integer);

a) Recursive query (runs eternally):
with zw (n, m, depth) as (
  select x2, y2, 1 from percolation where x1 = 1
  union all
  select x2, y2, depth + 1
  from zw, percolation
  where (   (x1 = n and y1 = m and y2 = m + 1)
         or (x1 = n and y1 = m and x2 = n + 1)
         or (x1 = n and y1 = m and x2 = n - 1)
         or (x1 = n and y1 = m and y2 = m - 1))
    and depth < 875  -- max 25x35 iterations
)
select max(n) from zw;

b) Explicit joins (15-30 sec, depending on graph density):
b1) Init:
insert into zw select x2, y2, 1 from percolation where x1 = 1;
b2) Loop (until n = 35 is reached or min(depth) converges):
insert into zw
select x2, y2, min(level)
from (select distinct x2, y2, depth + 1 as level
      from zw, percolation
      where (x1 = n and y1 = m and y2 = m + 1)
         or (x1 = n and y1 = m and x2 = n + 1)
         or (x1 = n and y1 = m and x2 = n - 1)
         or (x1 = n and y1 = m and y2 = m - 1)) t
group by x2, y2;
b3) Final analysis:
select * from zw where n = 35;
USE CASE 2: GRAPH DBMS SOLUTION
Node definitions (naming template, one CREATE per grid site):
CREATE (sx_y:site {name: "sx_y"})
CREATE (s0:start {name: "alpha"})
CREATE (s9:ende {name: "omega"})
Edge definitions (naming template: x_y+1 denotes the neighboring site):
MATCH (sx1_y1:site {name: 'sx1_y1'}), (sx1_y2:site {name: 'sx1_y2'})
CREATE (sx1_y1)-[:bond]->(sx1_y1+1), (sx1_y2)-[:bond]->(sx1_y2+1)
Query:
MATCH p = allShortestPaths((a)-[*]-(z))
WHERE a.name = "alpha" AND z.name = "omega"
RETURN COUNT(*)
3-5 sec, depending on graph density
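Since the slide uses naming templates rather than runnable statements, here is a hedged, concrete Cypher instantiation for a tiny 2x2 grid; all node names and the bond layout are illustrative assumptions.

// Hypothetical tiny grid: alpha - s1_1 - s1_2 - s2_2 - omega
// (s2_1 is created but left unconnected).
CREATE (a:start {name: "alpha"}),
       (s1_1:site {name: "s1_1"}), (s1_2:site {name: "s1_2"}),
       (s2_1:site {name: "s2_1"}), (s2_2:site {name: "s2_2"}),
       (z:ende {name: "omega"}),
       (a)-[:bond]->(s1_1),
       (s1_1)-[:bond]->(s1_2),
       (s1_2)-[:bond]->(s2_2),
       (s2_2)-[:bond]->(z);

// Count the shortest connections between alpha and omega; if MATCH
// finds no path, the count is 0 and the grid does not percolate.
MATCH p = allShortestPaths((a)-[*]-(z))
WHERE a.name = "alpha" AND z.name = "omega"
RETURN COUNT(p);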
THANK YOU FOR YOUR ATTENTION Dress up and get ready for the Super Spy Event, buses leaving at 6:10 PM