DMX-h ETL Use Case Accelerator Word Count

© Syncsort Incorporated, 2015. All rights reserved. This document contains proprietary and confidential material, and is only for use by licensees of DMExpress. This publication may not be reproduced in whole or in part, in any form, except with written permission from Syncsort Incorporated. Syncsort is a registered trademark and DMExpress is a trademark of Syncsort, Incorporated. All other company and product names used herein may be the trademarks of their respective owners. The accompanying DMExpress program and the related media, documentation, and materials ("Software") are protected by copyright law and international treaties. Unauthorized reproduction or distribution of the Software, or any portion of it, may result in severe civil and criminal penalties, and will be prosecuted to the maximum extent possible under the law. The Software is a proprietary product of Syncsort Incorporated, but incorporates certain third-party components that are each subject to separate licenses and notice requirements. Note, however, that while these separate licenses cover the respective third-party components, they do not modify or form any part of Syncsort's SLA. Refer to the "Third-party license agreements" topic in the online help for copies of the respective third-party license agreements referenced herein.

Table of Contents

1 Introduction
2 Word Count with DMX-h IX
  2.1 J_SplitWords Subjob
    2.1.1 T_WordCountPivotTokens Task
  2.2 T_WordCountAggregateWords Task
Appendix A Word Count with DMX-h User-defined MapReduce
  A.1 Map Step
    A.1.1 MT_PivotWordTokens.dxt
    A.1.2 MT_MapWordAgg.dxt
  A.2 Reduce Step
    A.2.1 RT_FinalReduceAgg.dxt

1 Introduction

Word Count is a common application used to explain how Hadoop works. The application identifies all uniquely occurring words in a set of source data and, for each word, computes the total number of occurrences of that word. This use case accelerator implements Word Count in DMX-h ETL by efficiently tokenizing lines into words and aggregating on the words to obtain a total count for each.

The use case accelerators are developed as standard DMExpress jobs with just minor accommodations that allow them to be run in or outside of Hadoop:

- When specified to run in the Hadoop cluster, DMX-h Intelligent Execution (IX) automatically converts them to be executed in Hadoop using the optimal Hadoop engine, such as MapReduce. In this case, some parts of the job may be run on the Linux edge node if they are not suitable for Hadoop execution.
- Otherwise, they are run on a Windows workstation or Linux edge node, which is useful for development and testing purposes before running in the cluster.

While the method presented in the next section is the simplest and most efficient way to develop this example, the more complex user-defined MapReduce solution is provided as a reference in Appendix A.

For guidance on setting up and running the examples both outside and within Hadoop, see the Guide to DMExpress Hadoop Use Case Accelerators. For details on the requirements for IX jobs, see "Developing Intelligent Execution Jobs" in the DMExpress Help.
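The accelerator itself is built from DMExpress tasks rather than code, but the underlying logic is small enough to sketch. The following Python fragment is an illustration only, not how DMX-h expresses the job; it performs the same two phases the job implements: tokenize each line into words, then aggregate to a total count per word.

```python
import re
from collections import Counter

def word_count(lines):
    """Tokenize lines into words and count occurrences of each word."""
    counts = Counter()
    for line in lines:
        # Split on runs of whitespace (the "pivot" into one word per record).
        for word in re.split(r"\s+", line.strip()):
            if word:  # skip empty tokens produced by blank lines
                counts[word] += 1
    return counts

sample = ["the quick brown fox", "the lazy dog jumps over the fox", ""]
for word, total in sorted(word_count(sample).items()):
    print(f"{word}\t{total}")  # one record per unique word: word, tab, count
```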

2 Word Count with DMX-h IX

The Word Count solution in DMX-h ETL consists of a job, J_WordCount.dxj, with one subjob followed by an aggregate task.

2.1 J_SplitWords Subjob

This subjob consists of a single copy task responsible for tokenizing the input file and loading it to HDFS. DMX-h IX validation currently requires copy tasks to be placed inside a subjob.

2.1.1 T_WordCountPivotTokens Task

This task uses a copy to read the file and replaces the spaces between words with record terminators.
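A minimal sketch of this pivot in Python, with re.sub standing in for the task's RegExReplace function (the pattern itself is explained in detail below):

```python
import re

line = "It was a dark and stormy night.  "
# '([^ ]+) +' captures a run of non-space characters followed by one or
# more spaces; the replacement keeps the captured group and appends a
# newline, so each previously space-separated token becomes its own record.
pivoted = re.sub(r"([^ ]+) +", r"\1\n", line)
print(pivoted, end="")  # one token per line: It / was / a / dark / ...
```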

The source in this example is an unstructured text file consisting of lines of text broken up by UNIX linefeed terminators. Although we specify a pipe (|) as a delimiter, no predetermined fields exist in the free-form line or record, so we describe a layout that consists of one field that extends to the end of the record. A copy task is sufficient to read the source file, as no ordering requirements exist for the first step.

We accomplish the pivoting by inserting record terminators between the words. Using the regular expression '([^ ]+) +', we search for a group of one or more non-space characters (indicated with parentheses) followed by one or more spaces. When we find this, we replace it with the preserved non-space group of characters followed by a newline ('\n') instead of the spaces. The RegExReplace function finds and replaces all occurrences of the pattern. This introduces line breaks (record terminators) between all previously space-separated tokens and words, as well as between punctuation. The RegExReplace expression is defined in the WordTokens value, which is included in the target via a reformat.

Whether the overall job is run in or outside of Hadoop, this task will run on the edge node.

2.2 T_WordCountAggregateWords Task

This task reads the output of the J_SplitWords subjob and performs two processing steps:

1. Empty records are filtered out.
2. Records are trimmed to remove whitespace and aggregated to obtain a final count.

The HDFS source and target files are defined with a remote file connection to the Hadoop cluster. The source, a pipe-delimited UNIX text file, is the target of the previous subjob. The source record contains only one field, WordTokens, whose values each consist of the single token or word produced by the previous task.

Before any other processing, we use a bulk filter to eliminate empty lines, since they contain no words. We also want to eliminate lines that consist of only spaces. To do this, we first create a value, TrimmedKey, that uses the Trim() function to eliminate leading and trailing spaces, leaving just the word itself, if any. We then use a condition, ContainsWord, to retain only records that contain words (TrimmedKey != '').

A sorted aggregation is performed on the remaining records, grouping by the field TrimmedKey, which is the word being counted. The count() function is used to obtain the final occurrence count for each key. A record is produced as output for each unique word, followed by a tab character and the total count of occurrences of that word in the original source file.
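A Python sketch of this filter-and-aggregate logic, with strip() standing in for Trim() and a Counter standing in for the sorted aggregation (an illustration only; the task itself is defined in the DMExpress GUI):

```python
from collections import Counter

tokens = ["the", "", "  quick ", "the", "   ", "fox"]  # one token per record

counts = Counter()
for record in tokens:
    trimmed_key = record.strip()      # the TrimmedKey value
    if trimmed_key:                   # the ContainsWord condition
        counts[trimmed_key] += 1      # count() grouped by TrimmedKey

for word in sorted(counts):           # one record per unique word
    print(f"{word}\t{counts[word]}")  # word, tab, total occurrences
```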

Appendix A Word Count with DMX-h User-defined MapReduce

This user-defined MapReduce solution is provided as a reference in the event that particular knowledge of your application's data would benefit from manual control of the MapReduce process. The Word Count job in DMX-h contains a map step and a reduce step, separated by a MapReduce data flow connector between the final map task and the initial reduce task, as follows:

A.1 Map Step

The map step first reads an input file containing the full text of the novel Little Women by Louisa May Alcott and pivots the data to tokenize individual words. It then assigns partition IDs to the data such that all identical words will go to the same reducer. Each mapper also aggregates the words it has seen, producing a partial count of those words. The aggregation is partial because each mapper is passed only a subset of the input data. The partial map-side aggregation is nevertheless useful, given the frequency of words such as "a", "the", and "and": it reduces the amount of data that must be processed by the reducers.

The reduce step receives the partial word counts from the mappers. Since, for any given word, all partial counts for that word go to the same reducer, the reducers total all incoming counts of each word into a final count for that word. Finally, the reduce step writes all words and their corresponding counts to the target.

The Word Count map step in DMX-h consists of a task that pivots the data to create one record per word, and another task that assigns a partition ID to each record based on the key and performs a partial count of each word before sending the partitioned data to the reduce step.
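To see why the partial map-side aggregation pays off, consider a hypothetical two-mapper split, sketched here in Python:

```python
from collections import Counter

split_1 = ["the cat and the hat"]   # input split seen by mapper 1
split_2 = ["the cat sat"]           # input split seen by mapper 2

# Each mapper emits one (word, partial_count) pair per distinct word in
# its split, instead of one pair per occurrence of each word.
partial_1 = Counter(w for line in split_1 for w in line.split())
partial_2 = Counter(w for line in split_2 for w in line.split())
# partial_1: {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
# partial_2: {'the': 1, 'cat': 1, 'sat': 1}

# The reducer then only needs to sum the partial counts per word.
final = partial_1 + partial_2
print(final["the"])                 # 3
```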

A.1.1 MT_PivotWordTokens.dxt

This task uses a copy to read the file and replaces the spaces between words with record terminators.

The source in this example is an unstructured text file consisting of lines of text broken up by UNIX line terminators (linefeeds). Although we specify a pipe (|) as a delimiter, no predetermined fields exist in the free-form line or record, so we describe a layout that consists of one field that extends to the end of the record. A copy task is used to read the source file, as no ordering requirements exist for the first step.

We now accomplish the pivoting by inserting record terminators between the words. Using the expression '([^ ]+) +', we search for a group of one or more non-space characters (indicated with parentheses) followed by one or more spaces. When we find this, we replace it with the preserved non-space group of characters followed by a newline ('\n') instead of the spaces. The RegExReplace function finds and replaces all occurrences of the pattern. This introduces line breaks (record terminators) between all previously space-separated tokens and words, as well as between punctuation. The RegExReplace expression is defined in the WordTokens value, which is included in the target via a reformat.

A.1.2 MT_MapWordAgg.dxt

The second task performs an aggregation on the words and partitions the data in preparation for the reduce step.

The source, a pipe-delimited UNIX text file, is the target of the previous task. Only one field, named WordTokens, exists in the source record; its values each consist of the single token or word produced by the previous task.

Before any other processing, we use a bulk filter to eliminate empty lines, since they contain no words. We could also have lines that consist only of spaces. To eliminate leading and trailing spaces, we create a new value, TrimmedKey, which is the incoming word with surrounding spaces removed using the Trim() function. We then use a named condition, ContainsWord, to filter out all records that contain no word once trimmed (TrimmedKey != '').

Since all records with the same primary key value (PartitionKey + TrimmedKey) must go to the same reducer in order for all occurrences of each word to be totaled, the partition ID is determined by creating a CRC32() hash value based on this key. In addition, the environment variable DMX_HADOOP_NUM_REDUCERS (automatically set to the configured number of reducers when running in Hadoop) is passed to the function to limit the range of partition ID values to the number of reducers invoked for your MapReduce job.

For the aggregation, we group by the PartitionKey as well as the new value, TrimmedKey. We count the occurrences of each word by introducing the count() aggregation function. This is only a partial count for each word, since the input to each mapper is a subset of the overall source data. The count format is set to 4-byte unsigned binary integer to improve performance.
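A sketch of the partitioning logic in Python, with zlib.crc32 standing in for the task's CRC32() function and a local variable standing in for DMX_HADOOP_NUM_REDUCERS (illustrative assumptions, not the task definition):

```python
import zlib

def partition_id(trimmed_key: str, num_reducers: int) -> int:
    # CRC32 hash of the key, reduced modulo the number of reducers, so
    # every occurrence of a given word maps to the same reducer.
    return zlib.crc32(trimmed_key.encode("utf-8")) % num_reducers

num_reducers = 4                    # stands in for DMX_HADOOP_NUM_REDUCERS
for word in ["the", "cat", "the"]:
    print(word, partition_id(word, num_reducers))
# "the" always yields the same partition ID, so both of its partial
# counts reach the same reducer.
```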

The option to sort aggregated records by the group-by fields is selected so that the data is sorted by the partition ID, a requirement to ensure the data is routed to the correct reducer. Further, the partition ID is output as the first field in the Reformat, also a requirement for correct MapReduce operation.

A.2 Reduce Step

The Word Count reduce step in DMX-h ETL consists of one task that aggregates the partial word counts calculated during the map step into a final count for each word. As the partition IDs were assigned by the map job based on the individual words, partial counts from different mappers for each word are routed to the same reducer, ensuring that all occurrences of each word are accounted for in the reduce step.

A.2.1 RT_FinalReduceAgg.dxt

The input to this task is made up of pairs of words and partial word counts from the different mappers. These partial word counts are summed into a total word count for each word.

A sorted aggregation is performed, grouping by the field Key, which is the word being counted. The field Value, containing the partial counts of each word, is summed for records with matching word keys. A record is produced as output for each unique word, followed by a tab character and the total count of occurrences of that word in the original source file.

The Hadoop framework will create a directory with the target name specified here, and each reducer's output will be written to its own part-<reducer_id> file within that directory.
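A Python sketch of this final reduce-side aggregation, over illustrative (Key, Value) pairs as they might arrive from two mappers:

```python
from collections import defaultdict

# Hypothetical partial counts: ("the", 2) from one mapper, ("the", 1)
# from another, and so on.
partial_counts = [("cat", 1), ("the", 2), ("the", 1), ("cat", 1)]

totals = defaultdict(int)
for key, value in partial_counts:   # Key = word, Value = partial count
    totals[key] += value            # sum Value for matching word keys

for word in sorted(totals):         # one record per unique word
    print(f"{word}\t{totals[word]}")  # word, tab, final count
```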

About Syncsort

Syncsort provides enterprise software that allows organizations to collect, integrate, sort, and distribute more data in less time, with fewer resources and lower costs. Thousands of customers in more than 85 countries, including 87 of the Fortune 100 companies, use our fast and secure software to optimize and offload data processing workloads. Powering over 50% of the world's mainframes, Syncsort software provides specialized solutions spanning Big Iron to Big Data, including next-gen analytical platforms such as Hadoop, cloud, and Splunk. For more than 40 years, customers have turned to Syncsort's software and expertise to dramatically improve the performance of their data processing environments, while reducing hardware and labor costs. Experience Syncsort at www.syncsort.com.

Syncsort Inc., 50 Tice Boulevard, Suite 250, Woodcliff Lake, NJ 07677, 201.930.8200