Offline Image Viewer Guide




Table of contents

1 Overview
2 Usage
  2.1 Basic
  2.2 Example
3 Options
  3.1 Option Index
4 Analyzing Results
  4.1 Total Number of Files for Each User
  4.2 Files That Have Never Been Accessed
  4.3 Probable Duplicated Files Based on File Size

1 Overview

The Offline Image Viewer is a tool to dump the contents of HDFS fsimage files to human-readable formats in order to allow offline analysis and examination of a Hadoop cluster's namespace. The tool is able to process very large image files relatively quickly, converting them to one of several output formats. The tool handles the layout formats that were included with Hadoop versions 16 and up. If the tool is not able to process an image file, it will exit cleanly. The Offline Image Viewer does not require a Hadoop cluster to be running; it is entirely offline in its operation.

The Offline Image Viewer provides several output processors:

1. Ls is the default output processor. It closely mimics the format of the lsr command. It includes the same fields, in the same order, as lsr: directory or file flag, permissions, replication, owner, group, file size, modification date, and full path. Unlike the lsr command, the root path is included. One important difference between the output of the lsr command and this processor is that this output is not sorted by directory name and contents. Rather, the files are listed in the order in which they are stored in the fsimage file. Therefore, it is not possible to directly compare the output of the lsr command with the output of this tool. The Ls processor uses information contained within the INode blocks to calculate file sizes and ignores the -skipBlocks option.

2. Indented provides a more complete view of the fsimage's contents, including all of the information included in the image, such as image version, generation stamp, and inode- and block-specific listings. This processor uses indentation to organize the output in a hierarchical manner. The lsr format is suitable for easy human comprehension.

3. Delimited provides one file per line consisting of the path, replication, modification time, access time, block size, number of blocks, file size, namespace quota, diskspace quota, permissions, username and group name.
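As an informal illustration of the Delimited layout just described, a record can be split back into named fields. This is a minimal Python sketch, not part of the tool itself; the sample record below is fabricated for illustration:

```python
# Split one Delimited-processor record into named fields.
# The field order follows the Delimited processor's documented output.
FIELDS = [
    "path", "replication", "modTime", "accessTime", "blockSize",
    "numBlocks", "fileSize", "NamespaceQuota", "DiskspaceQuota",
    "perms", "username", "groupname",
]

def parse_record(line, delimiter="\t"):
    """Return one Delimited record as a dict keyed by field name."""
    return dict(zip(FIELDS, line.rstrip("\n").split(delimiter)))

# Fabricated sample record in the Delimited layout:
sample = "\t".join([
    "/anotherdir/smallfile", "3", "2009-03-16 21:17", "2009-03-16 21:17",
    "67108864", "1", "8754", "-1", "-1", "rw-r--r--", "theuser", "supergroup",
])
record = parse_record(sample)
print(record["path"], record["fileSize"])
```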
If run against an fsimage that does not contain any of these fields, the field's column will be included, but no data recorded. The default record delimiter is a tab, but this may be changed via the -delimiter command-line argument. This processor is designed to create output that is easily analyzed by other tools, such as Apache Pig. See the Analyzing Results section for further information on using this processor to analyze the contents of fsimage files.

4. XML creates an XML document of the fsimage and includes all of the information within the fsimage, similar to the Indented processor. The output of this processor is amenable to automated processing and analysis with XML tools. Due to the verbosity of the XML syntax, this processor will also generate the largest amount of output.

5. FileDistribution is the tool for analyzing file sizes in the namespace image. In order to run the tool one should define a range of integers [0, maxSize] by specifying maxSize and a step. The range of integers is divided into segments of size step: [0, s_1, ..., s_(n-1), maxSize], and the processor calculates how many files in the system fall into each segment [s_(i-1), s_i). Note that files larger than maxSize always fall into the very last segment. The output file is formatted as a tab-separated two-column table: Size and NumFiles, where Size represents the start of the segment and NumFiles is the number of files from the image whose size falls into this segment.

2 Usage

2.1 Basic

The simplest usage of the Offline Image Viewer is to provide just an input and output file, via the -i and -o command-line switches:

  bash$ bin/hadoop oiv -i fsimage -o fsimage.txt

This will create a file named fsimage.txt in the current directory using the Ls output processor. For very large image files, this process may take several minutes.

One can specify which output processor to use via the command-line switch -p. For instance:

  bash$ bin/hadoop oiv -i fsimage -o fsimage.xml -p XML

or

  bash$ bin/hadoop oiv -i fsimage -o fsimage.txt -p Indented

This will run the tool using either the XML or Indented output processor, respectively.

One command-line option worth considering is -skipBlocks, which prevents the tool from explicitly enumerating all of the blocks that make up a file in the namespace. This is useful for file systems that have very large files. Enabling this option can significantly decrease the size of the resulting output, as individual blocks are not included. Note, however, that the Ls processor needs to enumerate the blocks and so overrides this option.

2.2 Example

Consider the following contrived namespace:

  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:17 /anotherdir
  -rw-r--r--   3 theuser supergroup  286631664 2009-03-16 21:15 /anotherdir/biggerfile
  -rw-r--r--   3 theuser supergroup       8754 2009-03-16 21:17 /anotherdir/smallfile
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
  drwx-wx-wx   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one/two
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:16 /user
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:19 /user/theuser

Applying the Offline Image Viewer against this file with the default options would result in the following output:

  machine:hadoop-0.21.0-dev theuser$ bin/hadoop oiv -i fsimagedemo -o fsimage.txt
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:16 /
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:17 /anotherdir
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:12 /one
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:16 /user
  -rw-r--r--   3 theuser supergroup  286631664 2009-03-16 14:15 /anotherdir/biggerfile
  -rw-r--r--   3 theuser supergroup       8754 2009-03-16 14:17 /anotherdir/smallfile
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
  drwx-wx-wx   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:12 /one/two
  drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:19 /user/theuser

Similarly, applying the Indented processor would generate output that begins with:

  machine:hadoop-0.21.0-dev theuser$ bin/hadoop oiv -i fsimagedemo -p Indented -o fsimage.txt
  FSImage
    ImageVersion = -19
    NamespaceID = 2109123098
    GenerationStamp = 1003
    INodes [NumInodes = 12]
      Inode
        INodePath =
        Replication = 0
        ModificationTime = 2009-03-16 14:16
        AccessTime = 1969-12-31 16:00
        BlockSize = 0
        Blocks [NumBlocks = -1]
        NSQuota = 2147483647
        DSQuota = -1
        Permissions
          Username = theuser
          GroupName = supergroup
          PermString = rwxr-xr-x
  ...remaining output omitted...

3 Options

3.1 Option Index

  Flag                              Description
  ----                              -----------
  [-i|--inputFile] <input file>     Specify the input fsimage file to process.
                                    Required.
  [-o|--outputFile] <output file>   Specify the output filename, if the specified
                                    output processor generates one. If the
                                    specified file already exists, it is silently
                                    overwritten. Required.
  [-p|--processor] <processor>      Specify the image processor to apply against
                                    the image file. Currently valid options are
                                    Ls (default), XML and Indented.
  -skipBlocks                       Do not enumerate individual blocks within
                                    files. This may save processing time and
                                    output file space on namespaces with very
                                    large files. The Ls processor reads the blocks
                                    to correctly determine file sizes and ignores
                                    this option.
  -printToScreen                    Pipe the output of the processor to the
                                    console as well as the specified file. On
                                    extremely large namespaces, this may increase
                                    processing time by an order of magnitude.
  -delimiter <arg>                  When used in conjunction with the Delimited
                                    processor, replaces the default tab delimiter
                                    with the string specified by arg.
  [-h|--help]                       Display the tool usage and help information
                                    and exit.

4 Analyzing Results

The Offline Image Viewer makes it easy to gather large amounts of data about the HDFS namespace. This information can then be used to explore file system usage patterns or find specific files that match arbitrary criteria, along with other types of namespace analysis. The Delimited image processor in particular creates output that is amenable to further processing by tools such as Apache Pig. Pig is a particularly good choice for analyzing these data, as it is able to deal with the output generated from a small fsimage but also scales up to consume data from extremely large file systems.

The Delimited image processor generates lines of text separated, by default, by tabs, and includes all of the fields that are common between constructed files and files that were still under construction when the fsimage was generated. Example scripts are provided demonstrating how to use this output to accomplish three tasks: determine the number of files each user has created on the file system, find files that were created but have never been accessed, and find probable duplicates of large files by comparing the size of each file.

Each of the following scripts assumes you have generated an output file using the Delimited processor named foo and will be storing the results of the Pig analysis in a file named results.

4.1 Total Number of Files for Each User

This script processes each path within the namespace, groups them by the file owner and determines the total number of files each user owns.

numFilesOfEachUser.pig:

  -- This script determines the total number of files each user has in
  -- the namespace. Its output is of the form:
  --   username, totalNumFiles

  -- Load all of the fields from the file
  A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                   replication:int,
                                                   modTime:chararray,
                                                   accessTime:chararray,
                                                   blockSize:long,
                                                   numBlocks:int,
                                                   fileSize:long,
                                                   NamespaceQuota:int,
                                                   DiskspaceQuota:int,
                                                   perms:chararray,
                                                   username:chararray,
                                                   groupname:chararray);

  -- Grab just the path and username
  B = FOREACH A GENERATE path, username;

  -- Generate the sum of the number of paths for each user
  C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);

  -- Save results
  STORE C INTO '$outputFile';

This script can be run against Pig with the following command:

  bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig

The output file's content will be similar to that below:

  bart 1
  lisa 16
  homer 28
  marge 2456

4.2 Files That Have Never Been Accessed

This script finds files that were created but whose access times were never changed, meaning they were never opened or viewed.

neverAccessed.pig:

  -- This script generates a list of files that were created but never
  -- accessed, based on their AccessTime

  -- Load all of the fields from the file
  A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                   replication:int,
                                                   modTime:chararray,
                                                   accessTime:chararray,
                                                   blockSize:long,
                                                   numBlocks:int,
                                                   fileSize:long,
                                                   NamespaceQuota:int,
                                                   DiskspaceQuota:int,
                                                   perms:chararray,
                                                   username:chararray,
                                                   groupname:chararray);

  -- Grab just the path and last time the file was accessed
  B = FOREACH A GENERATE path, accessTime;

  -- Drop all the paths that don't have the default assigned last-access time
  C = FILTER B BY accessTime == '1969-12-31 16:00';

  -- Drop the access times, since they're all the same
  D = FOREACH C GENERATE path;

  -- Save results
  STORE D INTO '$outputFile';

This script can be run against Pig with the following command, and its output file's content will be a list of files that were created but never viewed afterwards:

  bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig

4.3 Probable Duplicated Files Based on File Size

This script groups files together based on their size, drops any that are less than 100 MB and returns a list of the file size, the number of files found and a tuple of the file paths. This can be used to find likely duplicates within the filesystem namespace.

probableDuplicates.pig:

  -- This script finds probable duplicate files greater than 100 MB by
  -- grouping together files based on their byte size. Files of this size
  -- with exactly the same number of bytes can be considered probable
  -- duplicates, but should be checked further, either by comparing the
  -- contents directly or by another proxy, such as a hash of the contents.
  -- The script's output is of the type:
  --   fileSize numProbableDuplicates {(probableDup1), (probableDup2)}

  -- Load all of the fields from the file
  A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                   replication:int,
                                                   modTime:chararray,
                                                   accessTime:chararray,
                                                   blockSize:long,
                                                   numBlocks:int,
                                                   fileSize:long,
                                                   NamespaceQuota:int,
                                                   DiskspaceQuota:int,
                                                   perms:chararray,
                                                   username:chararray,
                                                   groupname:chararray);

  -- Grab the pathname and filesize
  B = FOREACH A GENERATE path, fileSize;

  -- Drop files smaller than 100 MB
  C = FILTER B BY fileSize > 100L * 1024L * 1024L;

  -- Gather all the files of the same byte size
  D = GROUP C BY fileSize;

  -- Generate path, number of duplicates, list of duplicates
  E = FOREACH D GENERATE group AS fileSize, COUNT(C) AS numDupes, C.path AS files;

  -- Drop all the files where there is only one of them
  F = FILTER E BY numDupes > 1L;

  -- Sort by the size of the files
  G = ORDER F BY fileSize;

  -- Save results
  STORE G INTO '$outputFile';

This script can be run against Pig with the following command:

  bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig

The output file's content will be similar to that below:

  1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
  1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
  1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
  1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
  1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}

Each line includes the file size in bytes that was found to be duplicated, the number of duplicates found, and a list of the duplicated paths. Files less than 100 MB are ignored, providing a reasonable likelihood that files of these exact sizes may be duplicates.
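Where Pig is not available, the same kind of analysis can be sketched in plain Python over the Delimited output. This example reproduces the per-user file count from section 4.1; it assumes a tab-delimited file in the field order described above, and the three sample records are fabricated for illustration:

```python
from collections import Counter

# Count files per owner in Delimited-processor output (tab-separated).
# Field order per the Delimited processor: path, replication, modTime,
# accessTime, blockSize, numBlocks, fileSize, NamespaceQuota,
# DiskspaceQuota, perms, username, groupname.
USERNAME_COL = 10  # zero-based index of the username field

def files_per_user(lines, delimiter="\t"):
    """Return a Counter mapping each username to its number of paths."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        counts[fields[USERNAME_COL]] += 1
    return counts

# Fabricated three-record sample in the Delimited layout; in practice,
# pass an open file object over the Delimited output instead.
sample = [
    "\t".join(["/a", "3", "m", "a", "0", "1", "10", "-1", "-1", "p", "bart", "g"]),
    "\t".join(["/b", "3", "m", "a", "0", "1", "20", "-1", "-1", "p", "lisa", "g"]),
    "\t".join(["/c", "3", "m", "a", "0", "1", "30", "-1", "-1", "p", "lisa", "g"]),
]
print(files_per_user(sample))
```

For fsimage dumps from very large namespaces, however, the Pig scripts above remain the better fit, since Pig parallelizes the same grouping across a cluster.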