Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com
Big Data - Revisited Are the terms Big Data and Hadoop synonymous? What are the primary drivers for government agencies in addressing Big Data? What other types of tools are available to work with Big Data?
The Big Data Challenge Value from Volume, Variety, and Velocity. Data sources: social media, video, audio, email, texts, mobile, transactional data, machine/sensor data, documents, search engine data, images. Diverse users asking ad hoc questions at 1000x scale. New solutions are needed.
In Data, There is Gold What value are you looking to find in your data? How fast do you need to find the gold? Make sure you don't get fool's gold.
Mining for Gold in Big Data Approaches to finding gold in Big Data: analysis and reporting are not the same thing, and organizations should not equate reporting with analysis. Reporting environments (inflexible, predefined): select reports to run, execute reports, view results. Analysis (flexible, custom, focused on finding answers) is an interactive process: frame the research/investigation question, identify data requirements, analyze the data iteratively, interpret the results.
Reporting vs. Analytics Reporting: standard views of data, answers a standard set of questions, does not require a human, is inflexible. Analytics: an interactive process (correctly frame the problem, collect the data, analyze the data, interpret the results) that provides answers; customized, involves human interaction, flexible, real time.
Analytic Pain Points Low performance. Limited functionality. Complexity in deployment and use. Results not delivered in time to meet demand. No interoperability with other big data platforms (e.g., Hadoop). Skilled-labor requirements of newer technologies. Older technologies unable to answer big data challenges.
Hadoop Answers Many Big Data Challenges Varied data structures, varied data sources, large data volumes, a rich set of analytics, quick analysis of complex relationships, performance-enhanced queries, interactive analysis.
Hadoop Architectural Components Process layer (MapReduce): Map step creates key/value tuples; Reduce step receives sorted key/value tuples and runs a user-provided program. Job Manager manages jobs, which consist of tasks. Task Manager manages each individual task (one or more per node). Resource management across all nodes was added in the latest Hadoop release. Storage layer: HDFS is a cluster file system written in Java that sits on top of the host file system. Other storage options include Amazon S3, CloudStore, and the FTP filesystem; other distributed file systems are available through a file:// URL.
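The map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a conceptual stand-in for the framework, not Hadoop code: the map step emits key/value tuples, the framework sorts and groups them by key, and the reduce step runs a user-provided function over each group.

```python
from itertools import groupby
from operator import itemgetter

def map_step(line):
    # Map step: emit a (key, value) tuple for each word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reduce_step(key, values):
    # Reduce step: receives all values for one key and runs user logic (here: sum)
    return (key, sum(values))

def run_job(lines):
    # Shuffle/sort: collect and sort intermediate tuples by key,
    # as the framework would before handing them to the reducers
    tuples = sorted(t for line in lines for t in map_step(line))
    return [reduce_step(key, (v for _, v in group))
            for key, group in groupby(tuples, key=itemgetter(0))]

print(run_job(["big data big analytics", "big insights"]))
# -> [('analytics', 1), ('big', 3), ('data', 1), ('insights', 1)]
```

In real Hadoop the same roles are distributed: map tasks run next to the HDFS blocks holding their input, and the sorted groups are streamed to reduce tasks across the cluster.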
Hadoop Key-Value and Database Storage Systems (layered view): Client applications can be SQL, NoSQL, programs, etc. The key-value / database system may create files and indexes, depending on the Apache project, and may use the MapReduce framework. HDFS (or another distributed file system) provides the underlying distributed storage mechanism.
Choosing the Right Tool for the Job Vertica for interactive and real-time analytics; Hadoop for long-running batch analytics (fault tolerance). MapReduce works best when there is a large input data set of which only a small portion is required for the analysis.
A Platform Designed for Big Data Next-generation administration and design tools, columnar compression, concurrent load and query, elastic cluster, SQL analytics, user-defined analytics, optimized connectors, standard interfaces, a native and performance-optimized columnar RDBMS, high availability, real-time operation, and massively parallel processing.
What analytics can HP Vertica handle? SQL: SQL analytic functions. Beyond SQL: graph, Monte Carlo, statistical, and geospatial analytics. Extended SQL: sessionization, time series, pattern matching, event series joins. SDKs: C++ and R. Check out the extension packages: https://github.com/vertica/vertica
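To give a flavor of the "Monte Carlo" item above, here is a minimal Python sketch of a Monte Carlo estimate of pi: sample random points in the unit square and count the fraction falling inside the quarter circle. This is illustrative only; inside Vertica such a simulation would be expressed as SQL plus a UDx.

```python
import random

def monte_carlo_pi(samples, seed=42):
    # Draw `samples` random points in the unit square; the fraction that
    # lands inside the quarter circle of radius 1 approximates pi/4
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

print(monte_carlo_pi(100_000))  # close to 3.14
```

The same embarrassingly parallel structure is what makes Monte Carlo workloads a good fit for a massively parallel engine: each node can draw its own samples independently and only the counts need to be combined.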
SQL Analytics + Built for Big Data Features: time series gap filling and interpolation, event-based window functions and sessionization, pattern matching, event series joins, statistical functions, geospatial functions. Benefits: high performance (keep data close to the CPU), low cost (industry-standard building blocks), ease of use (automated and available). Use cases: tickstore data cleanup, CDR/VOD data analysis, clickstream sessionization, data aggregation and compression, Monte Carlo simulation, social graph analysis, sensor data, smart grid, predictive maintenance.
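Clickstream sessionization, listed above, boils down to a simple rule: start a new session whenever the gap since a user's previous event exceeds a timeout. A minimal Python sketch of that rule (the function name and the 30-minute default are illustrative assumptions, not Vertica's API):

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Assign a session id to each (user, timestamp) event: a new session
    starts whenever the gap since the user's previous event exceeds `timeout`."""
    last_seen, session_id, out = {}, {}, []
    for user, ts in sorted(events):          # order by user, then time
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_id[user] = session_id.get(user, 0) + 1
        last_seen[user] = ts
        out.append((user, ts, session_id[user]))
    return out

clicks = [
    ("alice", datetime(2013, 5, 1, 9, 0)),
    ("alice", datetime(2013, 5, 1, 9, 10)),
    ("alice", datetime(2013, 5, 1, 10, 0)),  # 50-minute gap -> new session
    ("bob",   datetime(2013, 5, 1, 9, 5)),
]
for row in sessionize(clicks):
    print(row)
```

In Vertica the same per-user, time-ordered grouping is expressed declaratively with event-based window functions, so the database parallelizes it rather than the application.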
User-Defined Extensions in R What is R? An open-source language for statistical computing, with a wide range of packages available for advanced data mining and statistical analysis. Advantages of a UDx in R: HP Vertica automatically parallelizes the execution of user-defined R code across the Vertica cluster, with optimized data transfer between HP Vertica and R.
UDx in R Example: K-Means Clustering Function

Setup and usage:

```sql
-- Define function
CREATE LIBRARY rlib AS '/path/rcode.r' LANGUAGE 'R';
CREATE TRANSFORM FUNCTION Kmeans AS LANGUAGE 'R'
    NAME 'kmeansclufactory' LIBRARY rlib;

-- Use function
CREATE TABLE point_data (x FLOAT, y FLOAT);
SELECT Kmeans(x, y) OVER () FROM point_data;
```

R source code:

```r
# Example: k-means (k = 5)
# Input: two-dimensional points
# Output: the point coordinates plus their assigned cluster
kmeansclu <- function(x) {
  cl <- kmeans(x, 5, 10)
  res <- data.frame(x[, 1:2], cl$cluster)
  res
}

kmeansclufactory <- function() {
  list(name = kmeansclu,
       udxtype = c("transform"),
       intype = c("float", "float"),
       outtype = c("float", "float", "int"),
       outnames = c("x", "y", "cluster"))
}
```
HP Vertica and Hadoop are Complementary HP Vertica: designed for performance, interactive analytics, a rich SQL ecosystem. Hadoop: designed for fault tolerance, batch analytics, a rich programming model. Both are purpose-built, scalable analytics platforms.
Hadoop + HP Vertica: Joint Use Cases Use Case 1: Hadoop for data integration, transformation, and data quality management; HP Vertica for structured analytics, traditional business intelligence, data warehousing, and analysis and reporting. Assumes a balanced mix of developers fluent in Hadoop and SQL. Use Case 2: Hadoop as an operational data store; HP Vertica for augmentation of the data in Hadoop. Assumes more SQL developers than Hadoop developers; leverages the strength of the team mix. Use Case 3: Data federation across Hadoop and HP Vertica, with a variety of user interfaces for data interaction and an analysis data store. HDFS for storage, HP Vertica + Hadoop for analytics: real-time analytics on HP Vertica (needs speed); long-running/exploratory analytics on Hadoop (needs fault tolerance); load from HDFS directly into HP Vertica; HP Vertica SQL access to HDFS.
HP Vertica - Hadoop Connector Allows flexibility and interoperability: integrates with Hadoop/MapReduce and Pig via an HP Vertica-aware extension to Hadoop, a specialized adapter for distributed streaming between Hadoop and HP Vertica. Developers need access to a fast DBMS that co-exists with Hadoop rather than being embedded in it; the two operate on different clusters, generally run by different groups of people, which allows customers to scale computation independently of the DBMS. (Diagram: MapReduce/Pig jobs reading HDFS blocks for ETL into HP Vertica, and for advanced analytics, with map and reduce tasks streaming data to and from HP Vertica nodes.)
Native Load and Query from HDFS External table: Goal: query data residing on HDFS directly from Vertica. Method: develop a User Defined Loader for HDFS data files and define an external table as a virtual table view of the HDFS data. Benefits: simple, direct integration with HDFS (no MapReduce); data remains in Hadoop, so no synchronization is required; queries access the latest information in HDFS. Native load: Goal: load data staged on HDFS into a BI schema in Vertica. Method: develop a User Defined Loader for HDFS data files and load the data directly into Vertica from HDFS. Benefits: simple, direct integration with HDFS (no MapReduce); data is stored in Vertica's query-optimized format for near real-time analysis and reporting.
Custom Connectors with User Defined Load Override any part of HP Vertica's normal load process: Source (stream data from any source), Filter (transform data to a new format), Parser (convert the data stream into database tuples). E.g., use a source and a filter to load audio data directly into Vertica:

```sql
COPY music (filename AS 'Sample', time_index, data filler int,
            L AS data, R AS data)
FIXEDWIDTH COLSIZES (17, 18)
WITH SOURCE ExternalSource(cmd='arecord -d 10')
FILTER ExternalFilter(cmd='sox --type wav - --type dat -');
```

Read: http://www.vertica.com/2012/07/09/on-the-trail-of-a-red-tailed-hawk-part-2/
QUESTIONS? Jim Campbell HP Vertica jimcampbell@hp.com or jcampbell@vertica.com P: 703-753-5970