DMX-h ETL Use Case Accelerator. Web Log Aggregation



Similar documents
DMX-h ETL Use Case Accelerator. Word Count

Speeding ETL Processing in Data Warehouses White Paper

Big Data and Scripting map/reduce in Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Considerations for Mainframe Application Modernisation

Metalogix Storage Expert Installation Guide

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Connector for CA Unicenter Asset Portfolio Management Product Guide - On Premise. Service Pack

Deltek Costpoint Process Execution Modes

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

CA Nimsoft Monitor. Probe Guide for Apache HTTP Server Monitoring. apache v1.5 series

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

5 Steps To. Free Up Your Data Warehouse and Budget WITH HADOOP A QUICK START GUIDE

CA Nimsoft Monitor. Probe Guide for Active Directory Response. ad_response v1.6 series

Constructing a Data Lake: Hadoop and Oracle Database United!

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

TIBCO ActiveMatrix BusinessWorks Plug-in for Big Data User s Guide

SOLUTION BRIEF BIG DATA MANAGEMENT. How Can You Streamline Big Data Management?

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Example Use Cases. Solving the Need for Speed in Data Ops. Doc Version 2.1

BIG DATA TECHNOLOGY. Hadoop Ecosystem

See the Big Picture. Make Better Decisions. The Armanta Technology Advantage. Technology Whitepaper

Actian Vortex Express 3.0

What's New in SAS Data Management

Big Fast Data Hadoop acceleration with Flash. June 2013

The Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale

End-to-end Big Data Discovery with Platfora Date: May 2015 Author: Mike Leone, ESG Lab Analyst

Introduction to Hadoop

COURSE CONTENT Big Data and Hadoop Training

CA Nimsoft Monitor. Probe Guide for URL Endpoint Response Monitoring. url_response v4.1 series

Step-by-Step Guide for Monitoring in Windows HPC Server 2008 Beta 2

CA Nimsoft Monitor. Probe Guide for Cloud Monitoring Gateway. cuegtw v1.0 series

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Deltek Touch Time & Expense for GovCon. User Guide for Triumph

Ganzheitliches Datenmanagement

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Big Data, SAP HANA. SUSE Linux Enterprise Server for SAP Applications. Kim Aaltonen

Dollar Universe SNMP Monitoring User Guide

SQL Server 2012 Performance White Paper

SharePlex for SQL Server

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

BIG DATA SOLUTION DATA SHEET

Redefining Microsoft SQL Server Data Management

Clickstream Data Warehouse Initiative

Implement Hadoop jobs to extract business value from large and varied data sets

Data Domain Profiling and Data Masking for Hadoop

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Actian Analytics Platform Express Hadoop SQL Edition 2.0

Near-Instant Oracle Cloning with Syncsort AdvancedClient Technologies White Paper

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Best Practices for Hadoop Data Analysis with Tableau

Big Data and Its Impact on the Data Warehousing Architecture

Deltek Vision 7.0 LA. Technical Readiness Guide

Multi-Datacenter Replication

HADOOP AND MAINFRAMES CRAZY OR CRAZY LIKE A FOX? Mike Combs, VP of Marketing mcombs@veristorm.com

HP SiteScope. Hadoop Cluster Monitoring Solution Template Best Practices. For the Windows, Solaris, and Linux operating systems

SQL Server and MicroStrategy: Functional Overview Including Recommendations for Performance Optimization. MicroStrategy World 2016

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

RSA Security Analytics Virtual Appliance Setup Guide

Lesson 7 Pentaho MapReduce

How to Create User-Defined Fields and Tables

BrightStor ARCserve Backup for Windows

Avaya Aura Contact Center Integration with salesforce.com for Access to Knowledge Management

Hadoop and Big Data Research

How To Choose A Data Flow Pipeline From A Data Processing Platform

Big Data With Hadoop

INTRODUCTION THE EVOLUTION OF ETL TOOLS A CHECKLIST FOR HIGH-PERFORMANCE ETL: DEVELOPMENT PRODUCTIVITY DYNAMIC ETL OPTIMIZATION

Testing 3Vs (Volume, Variety and Velocity) of Big Data

CA Clarity PPM. Resource Management User Guide. v

X_TRADER Spread Matrix

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Bentley CONNECT Dynamic Rights Management Service

Customizing Asset Manager for Managed Services Providers (MSP) Software Asset Management

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Connecting Hadoop with Oracle Database

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Data Mining in the Swamp

Customized Report- Big Data

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Important Notice. (c) Cloudera, Inc. All rights reserved.

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Improve your IT Analytics Capabilities through Mainframe Consolidation and Simplification

Chapter 7. Using Hadoop Cluster and MapReduce

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Presenters: Luke Dougherty & Steve Crabb

BigMemory and Hadoop: Powering the Real-time Intelligent Enterprise

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

can you effectively plan for the migration and management of systems and applications on Vblock Platforms?

Actian SQL in Hadoop Buyer s Guide

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Why Big Data in the Cloud?

A Performance Analysis of Distributed Indexing using Terrier

Parallel Processing of cluster by Map Reduce

Transcription:

DMX-h ETL Use Case Accelerator Web Log Aggregation

Syncsort Incorporated, 2015 All rights reserved. This document contains proprietary and confidential material, and is only for use by licensees of DMExpress. This publication may not be reproduced in whole or in part, in any form, except with written permission from Syncsort Incorporated. Syncsort is a registered trademark and DMExpress is a trademark of Syncsort, Incorporated. All other company and product names used herein may be the trademarks of their respective owners. The accompanying DMExpress program and the related media, documentation, and materials ("Software") are protected by copyright law and international treaties. Unauthorized reproduction or distribution of the Software, or any portion of it, may result in severe civil and criminal penalties, and will be prosecuted to the maximum extent possible under the law. The Software is a proprietary product of Syncsort Incorporated, but incorporates certain third-party components that are each subject to separate licenses and notice requirements. Note, however, that while these separate licenses cover the respective third-party components, they do not modify or form any part of Syncsort s SLA. Refer to the Third-party license agreements topic in the online help for copies of respective third-party license agreements referenced herein.

Table of Contents Table of Contents 1 Introduction... 1 2 Web Log Aggregation with DMX-h IX... 2 2.1 T_WebLogAggregation Task... 2 Appendix A Web Log Aggregation with DMX-h User-defined MapReduce... 3 A.1 Map Step... 3 A.1.1 MT_WebLogAggregationMapTask.dxt... 4 A.2 Reduce Step... 5 A.2.1 RT_WebLogAggregationReduceTask.dxt... 5 DMX-h ETL Use Case Accelerator: Web Log Aggregation i

Introduction 1 Introduction Web log aggregation is a process in which web log data is aggregated down into a smaller, more easily analyzed dataset. This use case accelerator implements a web log aggregation in Hadoop that counts the number of occurrences of each unique URL, using a single DMX-h ETL aggregation task. The use case accelerators are developed as standard DMExpress jobs with just minor accommodations that allow them to be run in or outside of Hadoop: When specified to run in the Hadoop cluster, DMX-h Intelligent Execution (IX) automatically converts them to be executed in Hadoop using the optimal Hadoop engine, such as MapReduce. In this case, some parts of the job may be run on the Linux edge node if not suitable for Hadoop execution. Otherwise, they are run on a Windows workstation or Linux edge node, which is useful for development and testing purposes before running in the cluster. While the method presented in the next section is the simplest and most efficient way to develop this example, the more complex user-defined MapReduce solution is provided as a reference in Appendix A. For guidance on setting up and running the examples both outside and within Hadoop, see Guide to DMExpress Hadoop Use Case Accelerators. For details on the requirements for IX jobs, see Developing Intelligent Execution Jobs in the DMExpress Help. DMX-h ETL Use Case Accelerator: Web Log Aggregation 1

Web Log Aggregation with DMX-h IX 2 Web Log Aggregation with DMX-h IX The web log aggregation solution in DMX-h ETL consists of a job, J_WebLogAggregation.dxj, with a single aggregation task. 2.1 T_WebLogAggregation Task In this aggregation task, the web log data in the HDFS source WebLogInput.log is cleansed of comments (rows that start with # ) via a conditional source filter. It is then grouped by the URL field and summarized by the count of unique URL s in the Aggregate dialog. Changing the format to binary integer, unsigned 4 bytes improves performance over default format of decimal unsigned. The HDFS target, AggURL, is reformatted to include only the URL and the count of unique occurrences. The HDFS source and target files are defined with a remote file connection to the Hadoop cluster. DMX-h ETL Use Case Accelerator: Web Log Aggregation 2

Web Log Aggregation with DMX-h User-defined MapReduce Appendix A Web Log Aggregation with DMX-h User-defined MapReduce This user-defined MapReduce solution is provided as a reference in the event that particular knowledge of your application s data would benefit from manual control of the MapReduce process. The Web Log Aggregation job in DMX-h contains a map step and a reduce step, separated by a MapReduce data flow connector between the map task and the reduce task as follows: A.1 Map Step The map step reads the web log input file and counts the number of occurrences of each record s URL after filtering out the comments from the web log data. It then assigns partition IDs to each record such that all records with the same key will go to the same reducer. The reduce step groups by the URL as the key, and totals the counts per URL produced by the mapper. The Web Log Aggregation map step in DMX-h consists of one aggregation task that counts the number of times a URL occurs in a given input file and a filter that further reduces the amount of data that goes to the reducer by removing the comments. A partition ID is also added to the beginning of each record to direct records with the same key to the same reducer. To optimize performance, the first phase of aggregation occurs in the map step as the aggregation reduces the amount of data going over the network to the reducer. As the mappers are passed only a subset of the data, the aggregation here is partial. DMX-h ETL Use Case Accelerator: Web Log Aggregation 3

Web Log Aggregation with DMX-h User-defined MapReduce A.1.1 MT_WebLogAggregationMapTask.dxt This task filters comments from the web log based on the occurrence of #. It then counts the number of occurrences of each URL. The count format is set to 4-byte unsigned binary integer to improve performance. The task further assigns the partition ID (partition) as the first field in each target record and as the first field of the group by fields. All records that share the same key must go to the same reducer. The partition ID determines the reducer to which the data goes. In this case, the records with the same URL key value must have the same partition ID; thus, we create a CRC32() hash value based on this key. DMX-h ETL Use Case Accelerator: Web Log Aggregation 4

Web Log Aggregation with DMX-h User-defined MapReduce In addition to the value to be hashed, the CRC32() function allows you to provide a number to determine a range of hash values. For partition, the range is defined as the environment variable DMX_HADOOP_NUM_REDUCERS, which provides the number of reducers invoked for your MapReduce job. As a result, the CRC32() function returns a number from 0 to the value of DMX_HADOOP_NUM_REDUCERS. The data must be ordered based on this partition ID so that the framework can properly distribute the data to the correct reducer. We include the partition ID as the first group by field and sort records based on the group by fields to ensure the data is in the correct sorted order. A.2 Reduce Step The Web Log Aggregation reduce step in DMX-h ETL consists of one aggregate task that groups by the URL as the key and totals the counts per URL produced by the mapper. A.2.1 RT_WebLogAggregationReduceTask.dxt This task groups each record by the URL as the aggregate key and then totals the counts per URL provided by the mapper. The resulting output produced by this task is the number of times each URL occurred in the set of web logs. DMX-h ETL Use Case Accelerator: Web Log Aggregation 5

Web Log Aggregation with DMX-h User-defined MapReduce DMX-h ETL Use Case Accelerator: Web Log Aggregation 6

About Syncsort Syncsort provides enterprise software that allows organizations to collect, integrate, sort, and distribute more data in less time, with fewer resources and lower costs. Thousands of customers in more than 85 countries, including 87 of the Fortune 100 companies, use our fast and secure software to optimize and offload data processing workloads. Powering over 50% of the world s mainframes, Syncsort software provides specialized solutions spanning Big Iron to Big Data, including next gen analytical platforms such as Hadoop, cloud, and Splunk. For more than 40 years, customers have turned to Syncsort s software and expertise to dramatically improve performance of their data processing environments, while reducing hardware and labor costs. Experience Syncsort at www.syncsort.com. Syncsort Inc. 50 Tice Boulevard, Suite 250, Woodcliff Lake, NJ 07677 201.930.8200