Elastic Enterprise Data Warehouse Query Log Analysis on a Secure Private Cloud




Elastic Enterprise Data Warehouse Query Log Analysis on a Secure Private Cloud
Data Warehouse and Business Intelligence Architect, Credit Suisse, Zurich
Joint research between Credit Suisse and ETH Zurich: Willy Lai, Maria Grineva, Maxim Grinev, Donald Kossmann, Georg Polzer, Kurt Stockinger
Date: Nov. 28, 2011, Slide 1

Agenda
- Auditing in Enterprises
- Traditional Data Warehouse Approach vs. Cloud Approach
- Query Log Analysis with Xadoop
- Results of Security Analysis on Real Data

The Challenge that Many Companies Face
Requirement: every operation against the core database (data warehouse) must be traceable and explainable.
Audits are performed at random points in time:
- Which user accessed attributes A1, A3 and A6 of tables T1 and T2?
- Which user deleted attribute A4 of view V2?
- Which user updated the value of attribute A5 of table T3?
Capacity and performance management:
- Which table partitions were never accessed over the last year? Candidates for archiving.
- What are the top 10 longest-running queries? Candidates for query optimization.

Data Warehouse Application Platform
Shared platform for integrating data from multiple internal and external sources and for developing, deploying and operating applications that implement reporting, analysis and data-mining functions.
Scope:
- Reporting and analysis: standard and ad-hoc reporting, Online Analytical Processing (OLAP) and data mining (in special areas such as CRM)
- Data from the last end-of-day processing (4,500 jobs per day)
- No operational/transactional reporting or direct initiation of business transactions
Key figures:
- Servers: 2 M9000, 25 M5000, several V490/T2000; > 1,000 CPUs, 4 TB main memory
- ~700 TB storage with a growth rate of 15-20 TB/month (overall ET, IT, UAT and P)
- Throughput between DWH production server and HDS: 10-20 TB/day (> 1 GB/s)
Users and applications:
- ~100 applications on the platform with some 14,000 users
- Accounting (Management Accounting, Financial Accounting)
- Legal & Compliance (Anti Money Laundering, Basel II, Swiss National Reporting)
- Customer Relationship Management (Front Support, Marketing)
- Operations (Credit-MIS, Credit Risk), etc.
- Systems Management (IT DWH)

Full Logging of Database Queries in Production
All queries against the databases of the data warehouse are logged via the Oracle Audit Option:
- One XML file per user session.
- Compressed at regular intervals.
- Many terabytes of uncompressed XML files per month.
- Data is kept for n days, then it is archived.
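To make the log format concrete, here is a minimal Python sketch of extracting the audited fields from one such record. The element names follow the <AuditRecord> example shown later in this deck; everything else (the inline sample record, the function name) is illustrative, and a real audit trail file contains many records per session.

```python
import xml.etree.ElementTree as ET

# Illustrative single audit record; element names follow the
# <AuditRecord> example in this deck.
record = """<AuditRecord>
  <Extended_Timestamp>2010-06-30T12:00:00</Extended_Timestamp>
  <DB_User>DEMOUSER</DB_User>
  <Sql_Text>select e.* from employees e</Sql_Text>
</AuditRecord>"""

def parse_audit_record(xml_text):
    # Pull out the fields needed for audit analysis.
    root = ET.fromstring(xml_text)
    return {
        "timestamp": root.findtext("Extended_Timestamp"),
        "db_user": root.findtext("DB_User"),
        "sql_text": root.findtext("Sql_Text"),
    }

entry = parse_audit_record(record)
print(entry["db_user"])  # DEMOUSER
```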

Possible Solutions: Data Warehouse vs. Cloud Approach
Build a data warehouse:
- Not cost-effective for a few queries per year.
Use a private cloud-computing approach based on Hadoop/PIG:
- Ad-hoc analysis at unknown time periods: compute resources are only needed for short periods.
- Hadoop is open-source software with a proven track record:
  - Parallel file system where data is split into several chunks.
  - Allows parallel analysis/querying of the data.
- PIG is a query language that runs on top of Hadoop and, extended with an XML loader, can process XML directly:
  - Query logs are already stored in XML files.
  - The Hadoop/PIG approach can be leveraged to analyze the queries in parallel.
  - The framework can easily be deployed.

Data Warehouse and Private Cloud Approach
- Audit required: shipping of XML-based query log files.
- Data warehouse processing (typical business analytics).
- Hadoop/PIG-based analysis of XML query logs (large XML file distributed over all cloud nodes).

Data: Audit Logs and Database Schema Snapshots
Audit logs:
- Contain all database accesses (queries) of the data warehouse.
- The information logged for a single access contains:
  - Audit type
  - Session, statement and entry id
  - Extended timestamp
  - DB user, OS user, user host
  - OS process
  - SQL text
Database schema snapshots:
- Generated once a day.
- Capture the entire schema of the data warehouse:
  - Table owners, names and configuration
  - View owners, names, SQL statement and configuration
  - View dependencies
  - Synonyms

Resolving/Matching Illustration: Query Processing Steps

Audit log entry:
<AuditRecord>
  <Extended_Timestamp>2010-06-30</Extended_Timestamp>
  <DB_User>DEMOUSER</DB_User>
  <Sql_Text>
    select e.*, het.salary, het.title
    from employees e, highsal_engineer_titles het
    where e.emp_no = het.emp_no
  </Sql_Text>
</AuditRecord>

Schema snapshot of views:
<VIEW_NAME>HIGHSAL_ENGINEER_TITLES</VIEW_NAME>
<TEXT_LENGTH>100</TEXT_LENGTH>
<TEXT>
  select e.emp_no, t.title, s.salary
  from employees e, salaries s, title t
  where e.emp_no = s.emp_no and e.emp_no = t.emp_no
    and s.salary >= 200000 and UPPER(t.title) LIKE '%ENGINEER%'
</TEXT>

Grouped & aggregated results:
tables:
  employees                        {(DEMOUSER, 2010-06-30)}
  salaries                         {(DEMOUSER, 2010-06-30)}
  title                            {(DEMOUSER, 2010-06-30)}
  highsal_engineer_titles          {(DEMOUSER, 2010-06-30)}
attributes:
  employees.*                      {(DEMOUSER, 2010-06-30)}
  employees.emp_no                 {(DEMOUSER, 2010-06-30)}
  title.emp_no                     {(DEMOUSER, 2010-06-30)}
  title.title                      {(DEMOUSER, 2010-06-30)}
  salaries.salary                  {(DEMOUSER, 2010-06-30)}
  salaries.emp_no                  {(DEMOUSER, 2010-06-30)}
  highsal_engineer_titles.salary   {(DEMOUSER, 2010-06-30)}
  highsal_engineer_titles.title    {(DEMOUSER, 2010-06-30)}
  highsal_engineer_titles.emp_no   {(DEMOUSER, 2010-06-30)}
user:
  DEMOUSER {(2010-06-30, {(employees),(salaries),(title)},
    {(employees.*),(employees.emp_no),(title.emp_no),(title.title),
     (salaries.salary),(salaries.emp_no),(highsal_engineer_titles.salary),
     (highsal_engineer_titles.title),(highsal_engineer_titles.emp_no)})}
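The resolution step illustrated above can be sketched compactly: given view definitions from the schema snapshot, view references in an audited query are expanded transitively until only base tables remain. A minimal Python illustration follows, using the table and view names from this slide; the naive FROM-clause scan is a stand-in for the project's real SQL parser and is purely illustrative.

```python
import re

# Schema snapshot: view name -> SQL text of the view definition
# (names follow the example on this slide).
views = {
    "highsal_engineer_titles":
        "select e.emp_no, t.title, s.salary "
        "from employees e, salaries s, title t "
        "where e.emp_no = s.emp_no and e.emp_no = t.emp_no",
}

def referenced_objects(sql):
    # Naive stand-in for a real SQL parser: grab the comma-separated
    # object names in the FROM clause and drop the aliases.
    m = re.search(r"from\s+(.*?)(?:\s+where\b|$)", sql, re.I | re.S)
    items = m.group(1).split(",") if m else []
    return {item.split()[0] for item in items if item.strip()}

def resolve_to_base_tables(sql):
    # Expand view references transitively until only base tables remain.
    pending, base = set(referenced_objects(sql)), set()
    while pending:
        obj = pending.pop()
        if obj in views:
            pending |= referenced_objects(views[obj])
        else:
            base.add(obj)
    return base

query = ("select e.*, het.salary, het.title "
         "from employees e, highsal_engineer_titles het "
         "where e.emp_no = het.emp_no")
print(sorted(resolve_to_base_tables(query)))  # ['employees', 'salaries', 'title']
```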

Architecture of Xadoop-Based Query Processing
[Architecture diagram]

Typical Audit Query: "Was TableX Accessed During Lunch Time?"
register ./pigxml.jar;
define DATECOMP ch.ethz.xadoop.udf.datecomp();
A = load '/user/drwho/querylogs1.xml'
    using ch.ethz.xadoop.loader.xmlloader() as (..);

Typical Audit Query: "Was TableX Accessed During Lunch Time?"
register ./pigxml.jar;
define DATECOMP ch.ethz.xadoop.udf.datecomp();
A = load '/user/drwho/querylogs1.xml'
    using ch.ethz.xadoop.loader.xmlloader() as (..);
B = filter A by sql_text matches '.*TableX.*'
    and DATECOMP((chararray)extended_timestamp, '2011-05-17T12:00:00.000000') > 0
    and DATECOMP((chararray)extended_timestamp, '2011-05-17T14:00:00.000000') < 0;

Typical Audit Query: "Was TableX Accessed During Lunch Time?"
register ./pigxml.jar;
define DATECOMP ch.ethz.xadoop.udf.datecomp();
A = load '/user/drwho/querylogs1.xml'
    using ch.ethz.xadoop.loader.xmlloader() as (..);
B = filter A by sql_text matches '.*TableX.*'
    and DATECOMP((chararray)extended_timestamp, '2011-05-17T12:00:00.000000') > 0
    and DATECOMP((chararray)extended_timestamp, '2011-05-17T14:00:00.000000') < 0;
C = foreach B generate db_user, sql_text, extended_timestamp;
dump C;
store C into '/user/drwho/analysis_querylogs1_2011_05_17.res';
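For readers unfamiliar with Pig, the same lunch-time filter can be sketched in plain Python over already-parsed audit entries. Field names, the 'TableX' pattern and the time window are taken from the Pig script above; the sample entries and function name are illustrative, and in the real system the filter runs in parallel over HDFS rather than over an in-memory list.

```python
from datetime import datetime

# Illustrative audit entries standing in for the parsed XML query log.
entries = [
    {"db_user": "DEMOUSER",
     "sql_text": "select * from TableX",
     "extended_timestamp": "2011-05-17T12:30:00"},
    {"db_user": "OTHER",
     "sql_text": "select * from TableY",
     "extended_timestamp": "2011-05-17T13:00:00"},
]

def accessed_during(entries, table, start, end):
    # Keep entries that mention the table and fall strictly inside
    # the [start, end] window, mirroring the two DATECOMP conditions.
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [e for e in entries
            if table in e["sql_text"]
            and lo < datetime.fromisoformat(e["extended_timestamp"]) < hi]

hits = accessed_during(entries, "TableX",
                       "2011-05-17T12:00:00", "2011-05-17T14:00:00")
print([e["db_user"] for e in hits])  # ['DEMOUSER']
```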

Synthetic Data and Queries at ETH Zurich: Measurement Results on 1 Node

Test set input   Input data size   Output data size (Tables / Attributes / Users)   Execution time
 1 000 000        1 215 MB           423 MB /  1 196 MB /   637 MB                   1.4 h
 2 500 000        3 038 MB         1 057 MB /  2 982 MB / 1 392 MB                   3.3 h
 5 000 000        6 075 MB         2 114 MB /  5 958 MB / 3 185 MB                   6.7 h
10 000 000       12 151 MB         4 226 MB / 11 620 MB / 6 369 MB                  13.6 h

Execution Times Measured at ETH Zurich
[Chart: execution time vs. input size; measured values 3.4 h, 6.7 h and 13.6 h]

Experiments inside Credit Suisse on Real Data #1
We ran our SQL parser on real audit log and schema view snapshot files and measured the number of audit records and view SQL statements that could be parsed. 8.63% of the view SQL statements could not be parsed, due to:
- Syntactically incorrect SQL statements.
- Issues in the generation of the schema snapshots (Oracle bug).

Experiments inside Credit Suisse on Real Data #2
We extracted object names (tables and views) from the successfully parsed SQL statements and measured how many objects in the audit log entries could be matched against the schema snapshots: success rates of 99.66% and 96.5%, respectively. Limitations:
- Newly created objects not yet present in the schema snapshot cannot be matched.
- Neither can table functions and DB links.

Dealing with Intraday Schema Modification: The Problem
Assume a malicious person who does not want anybody to know that he/she has accessed some critical data:
1. Create a view on the critical data: CREATE VIEW my_secret AS SELECT * FROM ...
2. Access the view on the critical data (table accessed 04.07.2011 12:00: my_secret?).
3. Delete the view on the critical data: DROP VIEW my_secret
Because the view is created and dropped between two daily schema snapshots (04.07.2011, 05.07.2011, 06.07.2011), my_secret appears in no snapshot and the access cannot be resolved.

Dealing with Intraday Schema Modification: The Counter-Measure
- Log all CREATE and DROP statements in a separate file.
- Output all tables and views that could not be matched with the schema.
- For all these objects, check whether they can be matched to some CREATE or DROP statements and resolve accordingly.
Example:
CREATE and DROP log:
  04.07.2011 05:30  CREATE VIEW my_secret AS ...
  04.07.2011 19:45  DROP VIEW my_secret
Unmatched objects log:
  04.07.2011 12:00  my_secret
The access to my_secret at 04.07.2011 12:00 now resolves against the CREATE/DROP log, even though the view is absent from the schema snapshots of 04.07.2011, 05.07.2011 and 06.07.2011.
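A minimal sketch of this resolution step, assuming in-memory versions of the two logs described above (the tuple formats and function names are illustrative): an unmatched access is considered resolved when the object was CREATEd before the access and not yet DROPped at access time.

```python
from datetime import datetime

# Hypothetical in-memory versions of the two logs from this slide.
create_drop_log = [
    ("2011-07-04 05:30", "CREATE", "my_secret"),
    ("2011-07-04 19:45", "DROP", "my_secret"),
]
unmatched = [("2011-07-04 12:00", "my_secret")]

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M")

def resolve(unmatched, log):
    # An access resolves if the most recent CREATE of the same object
    # before the access is not followed by a DROP before the access.
    resolved = []
    for when, name in unmatched:
        t = ts(when)
        created = [ts(w) for w, op, n in log
                   if n == name and op == "CREATE" and ts(w) <= t]
        dropped = [ts(w) for w, op, n in log
                   if n == name and op == "DROP" and ts(w) <= t]
        if created and (not dropped or max(dropped) < max(created)):
            resolved.append((when, name))
    return resolved

print(resolve(unmatched, create_drop_log))  # [('2011-07-04 12:00', 'my_secret')]
```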

Conclusions
- A prototype implementation was run on parts of the production query logs of the Credit Suisse data warehouse.
- The Xadoop-based analysis achieved above 95% accuracy and linear scalability.
- Several groups within Credit Suisse in Zurich and New York provided excellent feedback and are ready for further collaboration.
Next steps:
- Perform large-scale query log analysis against the warehouses, covering data volumes of several months.
- Implement more advanced use cases, based on machine-learning techniques, with several new stakeholders in Credit Suisse.