A Scalable Data Transformation Framework using the Hadoop Ecosystem


A Scalable Data Transformation Framework using the Hadoop Ecosystem
Raj Nair, Director, Data Platform
Kiru Pakkirisamy, CTO

AGENDA
- About Penton and Serendio Inc.
- Data Processing at Penton
- PoC Use Case
- Functional Aspects of the Use Case
- Big Data Architecture, Design and Implementation
- Lessons Learned
- Conclusion
- Questions

About Penton
- Professional information services company
- Provides actionable information to five core markets: Agriculture, Transportation, Natural Products, Infrastructure, and Industrial Design & Manufacturing
- Success stories:
  - EquipmentWatch.com: prices, specs, costs, rentals
  - Govalytics.com: analytics around government capital spending, down to the county level
  - SourceESB: vertical directory of electronic parts
  - NextTrend.com: identifies new product trends in the natural products industry

About Serendio
- Serendio provides Big Data Science solutions and services for data-driven enterprises.
- www.serendio.com

Data Processing at Penton

What got us thinking?
- Business units process data in silos
- Heavy ETL: hours to process, in some cases days
- Not even using all the data we want
- Not logging what we needed to
- Can't scale for future requirements

The Data Processing Pipeline
- Assembly-line processing: data moves through the pipeline to produce business value
- Outputs: new features, new insights, new products

Penton examples
- Daily inventory data, ingested throughout the day (tens of thousands of parts)
- Auction and survey data gathered daily
- Aviation fleet data, varying frequency
- Pipeline stages: ingest/store, clean/validate, apply business rules, map, analyze, report, distribute
- Various data formats, mostly unstructured
- Slow Extract, Transform and Load = frustration + missed business SLAs
- Won't scale for the future

Current Design
- Survey data loaded as CSV files
- Data needs to be scrubbed/mapped
- All CSV rows loaded into one table
- Once scrubbed/mapped, data is loaded into the main tables
- Not all rows are loaded; some may be used in the future

What were our options?
- Expand RDBMS options: expensive, complex
- Adopt the Hadoop ecosystem:
  - M/R: ideal for batch processing
  - Flexible storage
  - NoSQL: scale, usability and flexibility
- Technologies in the mix: HBase, Oracle, Drools, SQL Server

PoC Use Case

Primary Use Case
- Daily model data upload and map
- Ingest data, build buckets
- Map data (batch and interactive)
- Build aggregates (dynamic)
- Issue: mapping time

Functional Aspects

Data Scrubbing
- Standardized names for fields/columns
- Example: Country
  - "United States of America" -> USA
  - "United States" -> USA
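The talk uses Drools to drive this scrubbing step; as a plain-Java stand-in for the same idea, the minimal sketch below normalizes a country field through a synonym table (the class name, method, and synonyms are illustrative, not the actual rules):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the rule-based scrubbing step (the real
// system drives this with Drools rules). Names and synonyms are
// illustrative only.
public class CountryScrubber {

    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("united states of america", "USA");
        SYNONYMS.put("united states", "USA");
        SYNONYMS.put("usa", "USA");
    }

    /** Returns the standardized value, or the trimmed input if no rule matches. */
    public static String scrub(String raw) {
        if (raw == null) return null;
        String trimmed = raw.trim();
        return SYNONYMS.getOrDefault(trimmed.toLowerCase(), trimmed);
    }

    public static void main(String[] args) {
        System.out.println(scrub("United States of America")); // USA
        System.out.println(scrub("United States"));            // USA
    }
}
```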

Data Mapping
- Converting fields -> IDs
  - Manufacturer: Caterpillar -> 25
  - Model: Caterpillar/Front Loader -> 300
- Requires lookup tables and partial/fuzzy string matching
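A minimal sketch of the lookup-plus-fuzzy-matching idea, assuming a Levenshtein-distance fallback when no exact match exists (the table contents and the distance threshold are illustrative, not the actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Exact lookup first; if that misses, fall back to the closest table
// entry by edit distance, rejecting matches that are too far away.
public class FuzzyMapper {

    private final Map<String, Integer> lookup = new LinkedHashMap<>();

    public FuzzyMapper() {
        lookup.put("Caterpillar", 25);
        lookup.put("Caterpillar/Front Loader", 300);
    }

    public Integer mapToId(String field) {
        Integer exact = lookup.get(field);
        if (exact != null) return exact;
        Integer bestId = null;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : lookup.entrySet()) {
            int d = levenshtein(field.toLowerCase(), e.getKey().toLowerCase());
            if (d < bestDist) { bestDist = d; bestId = e.getValue(); }
        }
        // Only accept reasonably close matches (threshold is a guess).
        return bestDist <= 3 ? bestId : null;
    }

    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(new FuzzyMapper().mapToId("Caterpilar")); // 25 (one edit away)
    }
}
```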

Data Exporting
- Move scrubbed/mapped data to the main RDBMS

Key Pain Points
- The CSV data table continues to grow
- The table's large size impacts operations on the rows of a single file
- CSV data could grow rapidly in the future

Criteria for New Design
- Ability to store an individual file and manipulate it easily
- No joins/relationships across CSV files
- Solution should integrate well with the RDBMS
- Could possibly host the complete application in the future
- Technology stack should ideally have advanced analytics capabilities
- A NoSQL model would allow us to quickly retrieve an individual file and manipulate it

Big Data Architecture

Solution Architecture
- Data upload UI loads CSV files
- Data manipulation APIs exposed through a REST layer (CSV and rule management endpoints)
- Drools for rule-based data scrubbing
- HBase (on Hadoop HDFS) as the store for CSV files
- Operations on all files, or groups of files, run as MapReduce jobs
- Operations on individual files in the UI go through HBase Get/Put
- Accepted data is inserted into the master database of products/parts/survey data (the current Oracle schema), and updates are pushed to existing business applications

HBase Schema Design
- One file per HBase row, one CSV row per HBase cell:
  - One cell per column qualifier (simple; we started development with this approach)
  - One row per column qualifier (the more performant approach)

HBase Rowkey Design
- Composite row key: Created Date (YYYYMMDD) + User + FileType + GUID
- One-byte salt prefix for better region splitting
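For illustration, here is how such a composite, salted row key might be built with the HBase client's Bytes utility; the delimiter and the hash-based salting scheme are assumptions, not the talk's code:

```java
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative construction of the composite row key above: a one-byte
// salt, then created date (YYYYMMDD), user, file type and GUID.
public class RowKeys {

    private static final int SALT_BUCKETS = 16; // assumed bucket count

    public static byte[] rowKey(String createdDate, String user,
                                String fileType, String guid) {
        String logical = createdDate + "|" + user + "|" + fileType + "|" + guid;
        // Deriving the salt from the logical key keeps a file's key stable
        // while spreading writes across region splits.
        byte salt = (byte) ((logical.hashCode() & 0x7fffffff) % SALT_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(logical));
    }
}
```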

HBase Column Family Design
- Data separated from metadata into two or more column families
- One column family for mapping data (more later)
- One column family for analytics data (used by analytics coprocessors)

M/R Jobs
- Jobs: scrubbing, mapping, export
- Scheduling: manually from the UI, or on a schedule using Oozie
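A skeleton of what one of these jobs could look like, using HBase's TableMapper to scan the CSV store and write scrubbed values back; table, family, and qualifier names ("csvdata", "data", "country") are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

// Skeleton of a scrubbing job over the CSV store in HBase.
public class ScrubJob {

    static class ScrubMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            byte[] raw = value.getValue(Bytes.toBytes("data"), Bytes.toBytes("country"));
            if (raw == null) return;
            // The real job would invoke the Drools scrubbing rules here.
            String scrubbed = Bytes.toString(raw).trim();
            Put put = new Put(row.get());
            put.add(Bytes.toBytes("data"), Bytes.toBytes("country"), Bytes.toBytes(scrubbed));
            ctx.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-scrub");
        job.setJarByClass(ScrubJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner batches for MR throughput
        scan.setCacheBlocks(false);  // don't pollute the block cache from MR
        TableMapReduceUtil.initTableMapperJob("csvdata", scan,
                ScrubMapper.class, ImmutableBytesWritable.class, Put.class, job);
        // Null reducer: the Puts are written straight back to the table.
        TableMapReduceUtil.initTableReducerJob("csvdata", null, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```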

Sqoop Jobs
- One-time: FileDetailExport (current CSV), RuleImport (all current rules)
- Periodic: lookup table data import (Manufacturer, Model, State, Country, Currency, Condition, Participant)

Application Integration - REST
- Hides the HBase/Java APIs from the rest of the application
- Language independence for the PHP front-end
- REST APIs for CSV management and Drools rule management
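A minimal JAX-RS sketch of the kind of endpoint this implies, assuming a hypothetical CsvStore wrapper around the HBase client (the path and names are illustrative, not the talk's API):

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical shape of one endpoint: the PHP front-end sees only JSON
// over HTTP, never the HBase Java API.
@Path("/csv")
public class CsvResource {

    /** Abstraction over the HBase-backed store (Get/Put/Scan live behind it). */
    public interface CsvStore {
        String fetchFileAsJson(String fileId);
    }

    private final CsvStore store;

    public CsvResource(CsvStore store) {
        this.store = store;
    }

    @GET
    @Path("/{fileId}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getFile(@PathParam("fileId") String fileId) {
        return store.fetchFileAsJson(fileId);
    }
}
```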

Lessons Learned

Performance Benefits
- Mapping 20,000 CSV files (20 million records): time taken was 1/3 of the RDBMS processing time
- Metrics: < 10 secs (vs an Oracle materialized view)
- Upload a file: < 10 secs
- Delete a file: < 10 secs

HBase Tuning
- Heap size for RegionServer and MapReduce tasks
- Table compression: SNAPPY for the column family holding CSV data
- Table data caching: IN_MEMORY for lookup tables
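For illustration, the last two tuning choices expressed with the (pre-0.96) HBase admin API; table and family names are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Sketch: SNAPPY compression for the bulky CSV column family, and
// IN_MEMORY caching for a small lookup table.
public class TableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // CSV data table: compress the csv-holding column family.
        HTableDescriptor csv = new HTableDescriptor("csvdata");
        HColumnDescriptor data = new HColumnDescriptor("data");
        data.setCompressionType(Compression.Algorithm.SNAPPY);
        csv.addFamily(data);
        admin.createTable(csv);

        // Small lookup table: keep it cached in memory.
        HTableDescriptor lookup = new HTableDescriptor("lookup");
        HColumnDescriptor lk = new HColumnDescriptor("d");
        lk.setInMemory(true);
        lookup.addFamily(lk);
        admin.createTable(lookup);

        admin.close();
    }
}
```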

Application Design Challenges
- Pagination: implemented using the intermediate REST layer and Scan.setStartRow
- Translating SQL queries: used Scan/Filter and Java (especially in coprocessors)
- No secondary indexes: used FuzzyRowFilter; maybe something like Phoenix would have helped
- Some issues in mixed mode; want to move to 0.96.0 for better/individual column family flushing, but would need to port coprocessors to protobuf
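A sketch of the Scan.setStartRow pagination pattern mentioned above: the client passes back the last row key it saw, and the next page's scan starts just past it (the table handling and page-size capping are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Keyset pagination over an HBase table: no offsets, just "start after
// the last row the previous page returned".
public class Paginator {

    public static Result[] nextPage(HTable table, byte[] lastRowSeen, int pageSize)
            throws IOException {
        Scan scan = new Scan();
        if (lastRowSeen != null) {
            // Appending a zero byte yields the smallest key strictly
            // greater than lastRowSeen, so the next scan skips it.
            scan.setStartRow(Bytes.add(lastRowSeen, new byte[] { 0 }));
        }
        // PageFilter limits rows per region server; the client-side
        // next(pageSize) call enforces the final page size.
        scan.setFilter(new PageFilter(pageSize));
        ResultScanner scanner = table.getScanner(scan);
        try {
            return scanner.next(pageSize);
        } finally {
            scanner.close();
        }
    }
}
```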

HBase Value Proposition
- Better UI response for CSV file operations: operations within a file (map, import, reject, etc.) do not depend on database size
- Relieves load on the RDBMS: no more CSV data tables
- Scales out batch processing performance on the cheap (vs a vertical RDBMS upgrade)
- Redundant store for CSV files
- Versioning to track data cleansing

Roadmap
- Benchmark with 0.96
- Retire coprocessors in favor of Phoenix (?)
- Lookup data tables are small; need to find a better alternative than HBase
- Design the UI around a model more appropriate to Big Data: a search-oriented paradigm rather than exploratory pagination
- Add REST endpoints to support such a UI

Wrap-Up

Conclusion
- The PoC demonstrated the value of the Hadoop ecosystem
- Co-existence of Big Data technologies with current solutions
- Adoption can significantly improve scale
- New skill requirements

Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com