Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com
Big Data - Revisited Are the terms Big Data and Hadoop synonymous? What are the primary drivers for government agencies in addressing Big Data? What other types of tools are available to work with Big Data?
The Big Data Challenge Value from Volume, Variety, and Velocity. Data sources: social media, video, audio, email, texts, mobile, transactional data, machine/sensor data, documents, search engine data, images. Diverse users asking ad hoc questions at 1000x scale. New solutions are needed.
In Data, There is Gold What value are you looking to find in your data? How fast do you need to find the gold? Make sure you don't get fool's gold.
Mining for Gold in Big Data Approaches to finding gold in Big Data: analysis and reporting are not the same thing, and organizations should not equate reporting with analysis. Reporting environments (inflexible, predefined): select reports to run, execute reports, view results. Analysis (flexible, custom, focused on finding answers) is an interactive process: frame the research/investigation question, identify data requirements, analyze the data iteratively, interpret the results.
Reporting vs. Analytics Reporting: standard views of data, answers a standard set of questions, does not require a human, is inflexible. Analytics: an interactive process (correctly frame the problem, collect the data, analyze the data, interpret the results) that provides answers; customized, involves human interaction, flexible, real time.
Analytic Pain Points Low performance. Limited functionality. Complexity in deployment and use. Results not delivered in time to meet demand. No interoperability with other big data platforms (e.g., Hadoop). Skilled-labor requirements of newer technologies. Older technologies unable to answer big data challenges.
Hadoop Answers Many Big Data Challenges Varied data structures, varied data sources, large data volumes, a rich set of analytics, quick analysis of complex relationships, performance-enhanced queries, interactive analysis.
Hadoop Architectural Components Process layer (MapReduce): Map step creates key/value tuples; Reduce step receives sorted key/value tuples and runs a user-provided program. Job Manager manages jobs, which consist of tasks. Task Manager manages each individual task (one or more per node). Resource management across all nodes was added in the latest Hadoop release. Storage layer: HDFS is a cluster file system written in Java that sits on top of the host file system. Other storage options include Amazon S3, CloudStore, and the FTP filesystem; other distributed file systems are available through a file:// URL.
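The map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a conceptual stand-in for the framework, not Hadoop code: the map step emits key/value tuples, the framework sorts and groups them by key, and the reduce step runs a user-provided function over each group.

```python
from itertools import groupby
from operator import itemgetter

def map_step(line):
    # Map step: emit a (key, value) tuple for each word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reduce_step(key, values):
    # Reduce step: receives all values for one key and runs user logic (here: sum)
    return (key, sum(values))

def run_job(lines):
    # Shuffle/sort: collect and sort intermediate tuples by key,
    # as the framework would before handing them to the reducers
    tuples = sorted(t for line in lines for t in map_step(line))
    return [reduce_step(key, (v for _, v in group))
            for key, group in groupby(tuples, key=itemgetter(0))]

print(run_job(["big data big analytics", "big insights"]))
# -> [('analytics', 1), ('big', 3), ('data', 1), ('insights', 1)]
```

In real Hadoop the same roles are distributed: map tasks run next to the HDFS blocks holding their input, and the sorted groups are streamed to reduce tasks across the cluster.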
Hadoop Key-Value and Database Storage Systems (layered view): Client applications can be SQL, NoSQL, programs, etc. The key-value / database system may create files and indexes, depending on the Apache project, and may use the MapReduce framework. HDFS (or another distributed file system) provides the underlying distributed storage mechanism.
Choosing the Right Tool for the Job Vertica for interactive and real-time analytics; Hadoop for long-running batch analytics (fault tolerance). MapReduce works best when there is a large input data set of which only a small portion is required for the analysis.
A Platform Designed for Big Data Next-generation administration and design tools, columnar compression, concurrent load and query, elastic cluster, SQL analytics, user-defined analytics, optimized connectors, standard interfaces, a native and performance-optimized columnar RDBMS, high availability, real-time operation, and massively parallel processing.
What analytics can HP Vertica handle? SQL: SQL analytic functions. Beyond SQL: graph, Monte Carlo, statistical, and geospatial analytics. Extended SQL: sessionization, time series, pattern matching, event series joins. SDKs: C++ and R. Check out the extension packages: https://github.com/vertica/vertica
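To give a flavor of the "Monte Carlo" item above, here is a minimal Python sketch of a Monte Carlo estimate of pi: sample random points in the unit square and count the fraction falling inside the quarter circle. This is illustrative only; inside Vertica such a simulation would be expressed as SQL plus a UDx.

```python
import random

def monte_carlo_pi(samples, seed=42):
    # Draw `samples` random points in the unit square; the fraction that
    # lands inside the quarter circle of radius 1 approximates pi/4
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

print(monte_carlo_pi(100_000))  # close to 3.14
```

The same embarrassingly parallel structure is what makes Monte Carlo workloads a good fit for a massively parallel engine: each node can draw its own samples independently and only the counts need to be combined.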
SQL Analytics + Built for Big Data Features: time series gap filling and interpolation, event-based window functions and sessionization, pattern matching, event series joins, statistical functions, geospatial functions. Benefits: high performance (keep data close to the CPU), low cost (industry-standard building blocks), ease of use (automated and available). Use cases: tickstore data cleanup, CDR/VOD data analysis, clickstream sessionization, data aggregation and compression, Monte Carlo simulation, social graph analysis, sensor data, smart grid, predictive maintenance.
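Clickstream sessionization, listed above, boils down to a simple rule: start a new session whenever the gap since a user's previous event exceeds a timeout. A minimal Python sketch of that rule (the function name and the 30-minute default are illustrative assumptions, not Vertica's API):

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Assign a session id to each (user, timestamp) event: a new session
    starts whenever the gap since the user's previous event exceeds `timeout`."""
    last_seen, session_id, out = {}, {}, []
    for user, ts in sorted(events):          # order by user, then time
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_id[user] = session_id.get(user, 0) + 1
        last_seen[user] = ts
        out.append((user, ts, session_id[user]))
    return out

clicks = [
    ("alice", datetime(2013, 5, 1, 9, 0)),
    ("alice", datetime(2013, 5, 1, 9, 10)),
    ("alice", datetime(2013, 5, 1, 10, 0)),  # 50-minute gap -> new session
    ("bob",   datetime(2013, 5, 1, 9, 5)),
]
for row in sessionize(clicks):
    print(row)
```

In Vertica the same per-user, time-ordered grouping is expressed declaratively with event-based window functions, so the database parallelizes it rather than the application.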
User-Defined Extensions in R What is R? An open-source language for statistical computing, with a wide range of packages available for advanced data mining and statistical analysis. Advantages of a UDx in R: HP Vertica automatically parallelizes the execution of user-defined R code across the Vertica cluster, with optimized data transfer between HP Vertica and R.
UDx in R Example: K-Means Clustering Function

Setup and usage:

```sql
-- Define function
CREATE LIBRARY rlib AS '/path/rcode.r' LANGUAGE 'R';
CREATE TRANSFORM FUNCTION Kmeans AS LANGUAGE 'R'
    NAME 'kmeansclufactory' LIBRARY rlib;

-- Use function
CREATE TABLE point_data (x FLOAT, y FLOAT);
SELECT Kmeans(x, y) OVER () FROM point_data;
```

R source code:

```r
# Example: k-means (k = 5)
# Input: two-dimensional points
# Output: the point coordinates plus their assigned cluster
kmeansclu <- function(x) {
  cl <- kmeans(x, 5, 10)
  res <- data.frame(x[, 1:2], cl$cluster)
  res
}

kmeansclufactory <- function() {
  list(name = kmeansclu,
       udxtype = c("transform"),
       intype = c("float", "float"),
       outtype = c("float", "float", "int"),
       outnames = c("x", "y", "cluster"))
}
```
HP Vertica and Hadoop are Complementary HP Vertica: designed for performance, interactive analytics, a rich SQL ecosystem. Hadoop: designed for fault tolerance, batch analytics, a rich programming model. Both are purpose-built, scalable analytics platforms.
Hadoop + HP Vertica: Joint Use Cases Use Case 1: Hadoop for data integration, transformation, and data quality management; HP Vertica for structured analytics, traditional business intelligence, data warehousing, and analysis and reporting. Assumes a balanced mix of developers fluent in Hadoop and SQL. Use Case 2: Hadoop as an operational data store; HP Vertica for augmentation of the data in Hadoop. Assumes more SQL developers than Hadoop developers; leverages the strength of the team mix. Use Case 3: Data federation across Hadoop and HP Vertica, with a variety of user interfaces for data interaction and an analysis data store. HDFS for storage, HP Vertica + Hadoop for analytics: real-time analytics on HP Vertica (needs speed); long-running/exploratory analytics on Hadoop (needs fault tolerance); load from HDFS directly into HP Vertica; HP Vertica SQL access to HDFS.
HP Vertica - Hadoop Connector Allows flexibility and interoperability: integrates with Hadoop/MapReduce and Pig via an HP Vertica-aware extension to Hadoop, a specialized adapter for distributed streaming between Hadoop and HP Vertica. Developers need access to a fast DBMS that co-exists with Hadoop rather than being embedded in it; the two operate on different clusters, generally run by different groups of people, which allows customers to scale computation independently of the DBMS. (Diagram: MapReduce/Pig jobs reading HDFS blocks for ETL into HP Vertica, and for advanced analytics, with map and reduce tasks streaming data to and from HP Vertica nodes.)
Native Load and Query from HDFS External table: Goal: query data residing on HDFS directly from Vertica. Method: develop a User Defined Loader for HDFS data files and define an external table as a virtual table view of the HDFS data. Benefits: simple, direct integration with HDFS (no MapReduce); data remains in Hadoop, so no synchronization is required; queries access the latest information in HDFS. Native load: Goal: load data staged on HDFS into a BI schema in Vertica. Method: develop a User Defined Loader for HDFS data files and load the data directly into Vertica from HDFS. Benefits: simple, direct integration with HDFS (no MapReduce); data is stored in Vertica's query-optimized format for near real-time analysis and reporting.
Custom Connectors with User Defined Load Override any part of HP Vertica's normal load process: Source (stream data from any source), Filter (transform data to a new format), Parser (convert the data stream into database tuples). E.g., use a source and a filter to load audio data directly into Vertica:

```sql
COPY music (filename AS 'Sample', time_index, data filler int,
            L AS data, R AS data)
FIXEDWIDTH COLSIZES (17, 18)
WITH SOURCE ExternalSource(cmd='arecord -d 10')
FILTER ExternalFilter(cmd='sox --type wav - --type dat -');
```

Read: http://www.vertica.com/2012/07/09/on-the-trail-of-a-red-tailed-hawk-part-2/
QUESTIONS? Jim Campbell HP Vertica jimcampbell@hp.com or jcampbell@vertica.com P: 703-753-5970