A Scalable Data Transformation Framework using the Hadoop Ecosystem
Raj Nair, Director, Data Platform
Kiru Pakkirisamy, CTO
Agenda
- About Penton and Serendio Inc.
- Data Processing at Penton
- PoC Use Case
- Functional Aspects of the Use Case
- Big Data Architecture, Design and Implementation
- Lessons Learned
- Conclusion
- Questions
About Penton
- Professional information services company
- Provides actionable information to five core markets: Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing
- Success stories:
  - EquipmentWatch.com: prices, specs, costs, rental
  - Govalytics.com: analytics around gov't capital spending down to the county level
  - SourceESB: vertical directory, electronic parts
  - NextTrend.com: identify new product trends in the natural products industry
About Serendio
Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises.
www.serendio.com
Data Processing at Penton
What got us thinking?
- Business units process data in silos
- Heavy ETL: hours to process, in some cases days
- Not even using all the data we want
- Not logging what we needed to
- Can't scale for future requirements
The Data Processing Pipeline
- Assembly-line processing
[Diagram: data moves through the Data Processing Pipeline to produce business value: new features, new products, new insights]
Penton examples
- Daily inventory data, ingested throughout the day (tens of thousands of parts)
- Auction and survey data gathered daily
- Aviation fleet data, varying frequency
- Pipeline stages: ingest/store, clean/validate, apply business rules, map, analyze, report, distribute
- Various data formats, mostly unstructured
- Slow Extract, Transform and Load = frustration + missed business SLAs
- Won't scale for the future
Current Design
- Survey data loaded as CSV files
- Data needs to be scrubbed/mapped
- All CSV rows loaded into one table
- Once scrubbed/mapped, data is loaded into the main tables
- Not all rows are loaded; some may be used in the future
What were our options?
- Expand RDBMS options (Oracle, SQL Server): expensive, complex
- Adopt the Hadoop ecosystem (HBase, Drools):
  - M/R: ideal for batch processing
  - Flexible for storage
  - NoSQL: scale, usability and flexibility
PoC Use Case
Primary Use Case
- Daily model data upload and map
- Ingest data, build buckets
- Map data (batch and interactive)
- Build aggregates (dynamic)
- Issue: mapping time
Functional Aspects
Data Scrubbing
- Standardized names for fields/columns (see the sketch below)
- Example: Country
  - United States of America -> USA
  - United States -> USA
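To make the scrubbing step concrete, here is a minimal dictionary-based normalizer in Java. It is only a sketch: in the actual solution the rules live in Drools, and the CountryScrubber class, its table contents, and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal dictionary-based scrubber (illustrative only; production rules
// are expressed in Drools rather than hard-coded like this).
public class CountryScrubber {
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();
    static {
        CANONICAL.put("UNITED STATES OF AMERICA", "USA");
        CANONICAL.put("UNITED STATES", "USA");
    }

    /** Returns the standardized value, or the trimmed input if no rule matches. */
    public static String scrub(String raw) {
        String key = raw.trim().toUpperCase();
        String canonical = CANONICAL.get(key);
        return canonical != null ? canonical : raw.trim();
    }
}
```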
Data Mapping
- Converting fields -> IDs
  - Manufacturer: Caterpillar -> 25
  - Model: Caterpillar/Front Loader -> 300
- Requires lookup tables and partial/fuzzy string matching (sketched below)
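A sketch of the mapping step: an exact lookup against an ID table first, then a hand-rolled edit-distance fallback for partial/fuzzy matches. The ModelMapper class, the lookup contents, and the 2-edit threshold are illustrative assumptions, not the production code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative field-to-ID mapper: exact lookup first, simple
// edit-distance fallback second. Table contents and threshold are made up.
public class ModelMapper {
    private final Map<String, Integer> lookup = new HashMap<String, Integer>();

    public ModelMapper() {
        lookup.put("CATERPILLAR", 25);
        lookup.put("CATERPILLAR/FRONT LOADER", 300);
    }

    public Integer mapToId(String name) {
        String key = name.trim().toUpperCase();
        Integer id = lookup.get(key);
        if (id != null) return id;
        // Fuzzy fallback: accept the closest key within 2 edits.
        Integer best = null;
        int bestDist = 3;
        for (Map.Entry<String, Integer> e : lookup.entrySet()) {
            int d = editDistance(key, e.getKey());
            if (d < bestDist) { bestDist = d; best = e.getValue(); }
        }
        return best; // null means "unmapped": hold the row for review
    }

    // Standard Levenshtein distance via dynamic programming.
    private static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = Math.min(
                    dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1),
                    Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
        return dp[a.length()][b.length()];
    }
}
```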
Data Exporting
- Move scrubbed/mapped data to the main RDBMS
Key Pain Points
- The CSV data table continues to grow
- The table's large size impacts operations on the rows of a single file
- CSV data could grow rapidly in the future
Criteria for New Design
- Ability to store an individual file and manipulate it easily
  - No joins/relationships across CSV files
  - A NoSQL model would let us quickly retrieve an individual file and manipulate it
- Solution should integrate well with the RDBMS
- Could possibly host the complete application in the future
- Technology stack should ideally offer advanced analytics capabilities
Big Data Architecture
Solution Architecture
[Architecture diagram]
- CSV files arrive through a Data Upload UI
- Data manipulation APIs exposed through a REST layer (CSV and rule management endpoints)
- Drools for rule-based data scrubbing
- HBase (on Hadoop HDFS) as the store for CSV files
- Operations on all/groups of files run as MR jobs launched through the REST layer
- Operations on individual files in the UI go through HBase Get/Put
- Accepted data is inserted into the master database of products/parts/surveys (current Oracle schema), which pushes updates to existing business applications
HBase Schema Design
- One file per HBase row (rather than one CSV row per HBase row)
- One CSV cell per column qualifier (simple; we started development with this approach)
- One CSV row per column qualifier (the more performant approach; sketched below)
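A minimal sketch of how a file lands in this schema under the "one CSV row per column qualifier" layout, assuming the 0.94-era HBase client API; the table name csv_files and the column family m are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// One CSV file per HBase row; one CSV record per column qualifier.
// Table and family names are illustrative placeholders.
public class CsvFileStore {
    public static void storeFile(byte[] rowKey, String[] csvLines) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "csv_files"); // 0.94-era client API
        try {
            Put put = new Put(rowKey);
            for (int i = 0; i < csvLines.length; i++) {
                // Qualifier = record number; value = the whole CSV record.
                put.add(Bytes.toBytes("m"), Bytes.toBytes(i), Bytes.toBytes(csvLines[i]));
            }
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```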
HBase Rowkey Design
- Composite row key: Created Date (YYYYMMDD) + User + FileType + GUID
- One-byte salt prefix for better region splitting (see the key builder below)
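A hedged sketch of building such a key; the "|" delimiter, the hash-based salt, and the 16-bucket count are our assumptions for illustration, not the actual scheme.

```java
import java.util.UUID;
import org.apache.hadoop.hbase.util.Bytes;

// Composite row key: 1-byte salt | date (YYYYMMDD) | user | file type | GUID.
// The salt spreads sequential daily uploads across regions.
public class RowKeys {
    private static final int SALT_BUCKETS = 16; // illustrative bucket count

    public static byte[] build(String yyyymmdd, String user, String fileType) {
        String guid = UUID.randomUUID().toString();
        String body = yyyymmdd + "|" + user + "|" + fileType + "|" + guid;
        byte salt = (byte) (Math.abs(body.hashCode()) % SALT_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(body));
    }
}
```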
HBase Column Family Design
- Data separated from metadata into two or more column families
- One CF for mapping data (more later)
- One CF for analytics data (used by analytics coprocessors)
M/R Jobs
- Jobs: Scrubbing, Mapping, Export
- Scheduling: manually from the UI, or on a schedule using Oozie (skeleton below)
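As referenced above, here is a skeleton of what one of these jobs might look like as a map-only scan over the HBase CSV table, using the standard TableMapReduceUtil hookup; the job name, table name, and classes are illustrative, and the actual scrub logic is elided.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Skeleton of a scrubbing job scanning the CSV table; names are illustrative.
public class ScrubJob {

    static class ScrubMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            // Apply the scrubbing rules to each stored CSV record and write
            // the cleaned value back with a Put (elided for brevity).
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-scrub");
        job.setJarByClass(ScrubJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner batches for MR throughput
        scan.setCacheBlocks(false);  // don't churn the block cache with a full scan

        TableMapReduceUtil.initTableMapperJob(
                "csv_files", scan, ScrubMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // writes happen via HBase Puts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```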
Sqoop Jobs
- One-time: FileDetailExport (current CSV), RuleImport (all current rules)
- Periodic: lookup table data import (Manufacturer, Model, State, Country, Currency, Condition, Participant) -- sample invocation below
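For flavor, a periodic lookup import could be driven from Java through Sqoop 1's tool entry point; the connection string, source table, and HBase target names below are placeholders, not the actual job definitions.

```java
import org.apache.sqoop.Sqoop;

// Illustrative periodic import of one lookup table into HBase via Sqoop 1.
// All connection details and names are placeholder values.
public class LookupImport {
    public static void main(String[] args) {
        int exit = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL", // placeholder
                "--username", "etl_user",
                "--table", "COUNTRY",
                "--hbase-table", "lookup_country",
                "--column-family", "d",
                "--hbase-row-key", "COUNTRY_ID"
        });
        System.exit(exit);
    }
}
```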
Application Integration - REST
- Hides the HBase/Java APIs from the rest of the application
- Language independence for the PHP front-end
- REST APIs for CSV management and Drools rule management (example endpoint shape below)
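A sketch of what one such endpoint could look like with JAX-RS; the /csv path and the CsvResource class are hypothetical, and the HBase lookup behind it is elided.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS endpoint shape: the PHP front-end only ever sees
// JSON over HTTP, never the HBase Java client.
@Path("/csv")
public class CsvResource {

    @GET
    @Path("/{fileId}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getFile(@PathParam("fileId") String fileId) {
        // Translate fileId to a row key, issue an HBase Get,
        // and serialize the records to JSON (omitted).
        return "{\"fileId\": \"" + fileId + "\"}";
    }
}
```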
Lessons Learned
Performance Benefits
- Mapping: 20,000 CSV files, 20 million records; time taken 1/3 of RDBMS processing
- Metrics: < 10 secs (vs. an Oracle materialized view)
- Upload a file: < 10 secs
- Delete a file: < 10 secs
HBase Tuning
- Heap size for RegionServer
- MapReduce tasks
- Table compression: SNAPPY for the column family holding CSV data
- Table data caching: IN_MEMORY for lookup tables (descriptor sketch below)
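The compression and caching settings translate to column-family descriptors roughly like this (0.96-era admin API; table and family names are illustrative):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.compress.Compression;

// Sketch of the table/CF settings described above.
public class TableSettings {
    public static HTableDescriptor csvTable() {
        HColumnDescriptor csvData = new HColumnDescriptor("m");
        csvData.setCompressionType(Compression.Algorithm.SNAPPY); // bulky CSV cells
        HTableDescriptor table = new HTableDescriptor("csv_files");
        table.addFamily(csvData);
        return table;
    }

    public static HTableDescriptor lookupTable() {
        HColumnDescriptor d = new HColumnDescriptor("d");
        d.setInMemory(true); // keep small lookup tables in the block cache
        HTableDescriptor table = new HTableDescriptor("lookup");
        table.addFamily(d);
        return table;
    }
}
```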
Application Design Challenges
- Pagination implemented using the intermediate REST layer and scan.setStartRow() (sketch below)
- Translating SQL queries: used Scan/Filter and Java (especially in coprocessors)
- No secondary indexes: used FuzzyRowFilter; maybe something like Phoenix would have helped
- Some issues in mixed mode; want to move to 0.96.0 for better/individual column family flushing, but needed to 'port' coprocessors (to protobuf)
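A sketch of the pagination approach, assuming the REST layer carries the last row key of the previous page back as a cursor; the Pager class, the trailing 0x00 "just past this row" trick, and the client-side trim are our illustration, not the actual code. (FuzzyRowFilter plays a similar role for index-less lookups, scanning with fixed fragments of the composite key where a secondary index would otherwise be needed.)

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Pagination sketch: remember the last row key of a page and use it
// (plus a trailing zero byte) as the start row of the next page.
public class Pager {
    public static List<Result> page(HTable table, byte[] afterRow, int pageSize)
            throws Exception {
        Scan scan = new Scan();
        if (afterRow != null) {
            // Start just past the last row already returned.
            scan.setStartRow(Bytes.add(afterRow, new byte[] { 0x00 }));
        }
        scan.setFilter(new PageFilter(pageSize)); // per-region limit; trim client-side
        List<Result> out = new ArrayList<Result>();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                out.add(r);
                if (out.size() >= pageSize) break;
            }
        } finally {
            scanner.close();
        }
        return out;
    }
}
```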
HBase Value Proposition
- Better response in the UI for CSV file operations: operations within a file (map, import, reject, etc.) do not depend on the DB size
- Relieves load on the RDBMS: no more CSV data tables
- Scales out batch processing performance on the cheap (vs. a vertical RDBMS upgrade)
- Redundant store for CSV files
- Versioning to track data cleansing
Roadmap
- Benchmark with 0.96
- Retire coprocessors in favor of Phoenix (?)
- Lookup data tables are small; need to find a better alternative than HBase
- Design the UI around a model more appropriate for Big Data: a search-oriented paradigm rather than an exploratory/paginative one
- Add REST endpoints to support such a UI
Wrap-Up
Conclusion
- PoC demonstrated the value of the Hadoop ecosystem
- Big Data technologies can co-exist with current solutions
- Adoption can significantly improve scale
- New skill requirements
Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com