A Modified Key Partitioning for BigData Using MapReduce in Hadoop




Journal of Computer Science
Original Research Paper

A Modified Key Partitioning for BigData Using MapReduce in Hadoop

Gothai Ekambaram and Balasubramanie Palanisamy
Department of CSE, Kongu Engineering College, Erode-638052, Tamilnadu, India

Article history: Received: 07-03-2014; Revised: 23-03-2014; Accepted: 21-03-2015

Corresponding Author: Gothai Ekambaram, Department of CSE, Kongu Engineering College, Erode-638052, Tamilnadu, India. Email: kothaie@yahoo.co.in

Abstract: In the era of BigData, massive amounts of structured and unstructured data are being created every day by a multitude of ever-present sources. BigData is complicated to work with and needs massively parallel software executing on a huge number of computers. MapReduce is a current programming model that simplifies writing distributed applications that manipulate BigData. For MapReduce to work, it has to divide the workload between the computers in the network. As a result, the performance of MapReduce depends strongly on how evenly it distributes this workload. This can be a challenge, particularly in the presence of data skew. In MapReduce, workload allocation depends on the algorithm that partitions the data. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning method. This study recommends an enhanced partitioning algorithm using modified key partitioning that improves load balancing and memory utilization. This is accomplished via an enhanced sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against the state-of-the-art partitioning mechanism employed by TeraSort. Experiments demonstrate that the proposed algorithm is faster, more memory efficient and more accurate than the existing implementation.

Keywords: Hadoop, Hash Code, Partitioning, MapReduce

Introduction

Over the past decades, computer technology has become increasingly ubiquitous.
Computing devices have numerous uses and are essential for businesses, scientists, governments, engineers and the everyday consumer. What all these devices have in common is the potential to produce data. In essence, data can arrive from everywhere. Most types of data tend to have their own distinctive set of characteristics, over and above how that data is distributed. Data that is not analyzed or utilized has little significance and can be a waste of space and resources. On the contrary, data that is acted on or analyzed can be of immeasurable value. The data itself may be too huge to store on a single computer. As a result, in order to decrease the time it takes to process the data and to have the storage space to store it, software engineers have to write programs that can run on two or more computers and distribute the workload among them. While conceptually the computation to perform may be straightforward, traditionally the implementation has been complicated.

In response to these very same issues, engineers at Google built the Google File System (GFS) as stated by (Ghemawat et al., 2003), a distributed file system design for major data processing, and created the MapReduce programming model (Dean and Ghemawat, 2008). Hadoop is an open source implementation of MapReduce, written in Java, initially developed by Yahoo. Tan et al. (2009) stated that Hadoop was built in response to the need for a MapReduce framework unfettered by proprietary licenses, in addition to the increasing need for the technology in Cloud computing. Hive, Pig, ZooKeeper and HBase are all examples of regularly used extensions to the Hadoop framework. Likewise, this study also concentrates on Hadoop and examines the load balancing mechanism in Hadoop's MapReduce framework for small-sized to medium-sized clusters. In summary, this study presents a technique for improving the workload distribution among nodes in the MapReduce framework, a technique to decrease the

© 2015 Gothai Ekambaram and Balasubramanie Palanisamy.
This open access article is distributed under a Creative Commons Attribution (CC-BY) 3.0 license.

necessary memory footprint and improved execution time for MapReduce when these techniques are performed on a small or medium sized cluster of computers.

The remaining part of this study is planned as follows. Section 2 discusses some basic information on MapReduce and its internal workings. Section 3 presents the related work and existing methods applied for TeraSort in Hadoop. Section 4 contains a proposed idea for an improved load balancing methodology and a way to better utilize memory. Section 5 introduces investigational results and a discussion of this study's findings. Section 6 concludes this study with a brief idea of future work.

Background

MapReduce

Dean and Ghemawat (2008) mentioned that MapReduce is a programming model created as a method for programs to handle huge amounts of data. It attains this objective by distributing the workload among several computers and then working on the data in parallel. Hsu et al. (2007) stated that programs that run on a MapReduce framework need to separate the work into 2 phases known as Map and Reduce. Each phase has key-value pairs for both input and output. To put these phases into practice, a programmer needs to state 2 functions: a map function called a Mapper and its equivalent reduce function called a Reducer. When a MapReduce program is executed on Hadoop, it is expected to run on several computers or nodes. For that reason, a master node is necessary to run all the essential services needed to organize the communication between Mappers and Reducers. An instance of MapReduce dataflow is shown in Fig. 1. Kavulya et al. (2010) reported that in the MapReduce framework, the workload has to be balanced in order for resources to be utilized effectively.

HashCode

Hadoop utilizes a hash code as its standard method to partition key-value pairs. The hash code itself can be depicted mathematically and is represented by (Ke et al., 2013) as the following equation:

    HashCode = W_1*31^(n-1) + W_2*31^(n-2) + ... + W_n*31^0
             = sum_{i=1}^{n} W_i*31^(n-i)                          (1)

where W_i is the code of the i-th character of the key and n (TotalWord) is the number of characters. The hash code given in Equation 1 is the default hash code used by a string object in Java, the programming language on which Hadoop is based. A partition function normally uses the hash code of the key modulo the number of reducers to decide which reducer to send the key-value pair to. It is essential, then, that the partition function uniformly distributes key-value pairs among reducers for appropriate workload distribution.

TeraSort

O'Malley (2008) stated that Hadoop broke the world record in sorting a Terabyte of data by using its TeraSort technique. Winning first place, it managed to sort 1 TB of data in 209 sec (3.48 min). This was the first occasion either a Java program or an open source program had won the contest. TeraSort was able to speed up the sorting process by distributing the workload uniformly within the MapReduce framework. This was done via data sampling and the use of a Trie as stated by (Panda et al., 2010). Even though the main goal of TeraSort was to sort 1 TB of data as quickly as possible, it has since been incorporated into Hadoop as a standard.

Fig. 1. MapReduce dataflow
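As an illustration of the default scheme described above, the following sketch (ours, not Hadoop's source code) reproduces Java's base-31 string hash of Equation 1 together with the modulo step a typical partition function applies:

```python
def java_string_hash(key: str) -> int:
    """Java String.hashCode(): Equation 1 evaluated in Horner form,
    kept in an unsigned 32-bit range for illustration."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def default_partition(key: str, num_reducers: int) -> int:
    """Hash partitioning: reducer index = hash(key) mod number of reducers."""
    return java_string_hash(key) % num_reducers
```

With 3 reducers, the key "ate" hashes to 96914 and is sent to reducer index 96914 mod 3 = 2; a skewed key distribution therefore translates directly into a skewed reducer load.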

On the whole, the TeraSort algorithm is extremely similar to the standard MapReduce sort. Its efficiency relies on how it distributes its data between the Mappers and Reducers. To attain an excellent load balance, TeraSort uses a custom partitioner. Since the original goal of TeraSort was to sort data as quickly as possible, its implementation adopted a space-for-time approach. For this reason, TeraSort utilizes a 2-level trie to partition the data. Ke et al. (2013) have shown that a trie which confines the strings stored in it to 2 characters is known as a 2-level Trie. This 2-level Trie is built using cut points extracted from the sampled data. Once the trie is constructed using the cut points, the partitioner can begin its job of partitioning strings based on where in the trie each string would go if it were to be included in the trie.

Related Works

Sorting is a primary concept and a mandatory step in countless algorithms. Heinz et al. (2002) stated that Burst Sort is a sorting algorithm developed for sorting strings in huge data collections. The TeraSort algorithm also utilizes these burst trie techniques as a method to sort data, but does so under the perspective of the Hadoop architecture and the MapReduce framework.

An essential problem for the MapReduce framework is the idea of load balancing. Over time, several studies have been done in the area of load balancing. Where data is situated (Hsu and Chen, 2012), how it is communicated (Hsu and Chen, 2010), what environment it is located on (Hsu and Tsai, 2009; Hsu et al., 2008; Zaharia et al., 2008) and the statistical distribution of the data can all have an effect on a system's efficiency. Most of these algorithms can be found in a variety of papers and were utilized by structures and systems prior to the existence of the MapReduce framework, as stated by (Krishna, 2005; Stockinger et al., 2006). As stated by (Candan et al., 2010), RanKloud makes use of its own uSplit method for partitioning huge media data sets.
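The sampling-and-cut-point idea behind TeraSort's partitioner can be sketched as follows. This is a simplified stand-in, not the Hadoop implementation: it replaces the 2-level trie lookup with a binary search over the sampled cut points, and the function names are ours.

```python
import bisect

def select_cut_points(sample: list[str], num_partitions: int) -> list[str]:
    """Pick num_partitions-1 evenly spaced keys from a sorted sample
    to serve as range boundaries between reducers."""
    s = sorted(sample)
    step = len(s) // num_partitions
    return [s[(i + 1) * step] for i in range(num_partitions - 1)]

def range_partition(key: str, cut_points: list[str]) -> int:
    """Partition index = number of cut points <= key; the binary search
    stands in for walking the 2-level trie built from the cut points."""
    return bisect.bisect_right(cut_points, key)
```

The quality of the resulting load balance depends entirely on how representative the sample is, which is exactly the weakness the rest of the paper addresses.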
The uSplit method is intended to decrease data duplication costs and wasted resources that are particular to its media based algorithms. In order to work around perceived boundaries of the MapReduce model, various extensions or changes to the MapReduce model have been offered. BigTable was launched by Google to handle structured data, as reported by (Chang et al., 2008). BigTable looks like a database, but does not support a complete relational database model. It utilizes rows with successive keys grouped into tables that form the entity of allocation and load balancing, and it experiences the same load and memory balancing troubles faced by shared-nothing databases. HBase of Hadoop is the open source version of BigTable, which imitates the same functionality of BigTable.

Because of its simplicity of use, the MapReduce model is quite popular and has numerous implementations, as reported by (Liu and Orban, 2011; Miceli et al., 2009). For that reason, there has been a diversity of research on MapReduce so as to improve the performance of the framework or of particular applications that execute on it, such as graph mining as mentioned by (Jiang and Agrawal, 2011), data mining reported by (Papadimitriou and Sun, 2008; Xu et al., 2009), genetic algorithms by (Jin et al., 2008; Verma et al., 2009), or text analysis by (Vashishtha et al., 2010). Occasionally, researchers find the MapReduce structure too strict or rigid in its existing implementation. Fadika and Govindaraju (2011) stated that DELMA is one such framework which imitates the MapReduce model, identical to Hadoop MapReduce. Such a system is likely to have interesting load balancing problems, which is beyond the scope of our paper. Another framework differing from MapReduce is Jumbo, as reported by (Groot and Kitsuregawa, 2010). The Jumbo framework may be a helpful tool for researching load balancing, but it is not compatible with existing MapReduce technologies.
To work around load balancing problems resulting from joining tables in Hadoop, (Lynden et al., 2011) introduced an adaptive MapReduce algorithm for multiple joins using Hadoop that works without changing its setting. This study also attempts to do workload balancing in Hadoop without changing the original structure, but concentrates on sorting text.

Ke et al. (2013) stated that the XTrie algorithm presented a method to advance the cut point algorithm derived from TeraSort. The important issue with the TeraSort algorithm is that it uses the Quick Sort algorithm to deal with the cut points. By using quicksort, TeraSort needs to store all the keys it samples in memory, which limits the feasible sample size; this decreases the correctness of the chosen cut points and in turn affects load balancing, as mentioned by (O'Malley, 2008). One more difficulty TeraSort has is that it only considers the first 2 characters of a string during partitioning. This also decreases the efficiency of the TeraSort load balancing algorithm:

    HashCode = W_1*256^(n-1) + W_2*256^(n-2) + ... + W_n*256^0
             = sum_{i=1}^{n} W_i*256^(n-i)                         (2)

The main issue shared by TeraSort and XTrie is that they use an array to represent the trie. The major concern with this method is that it tends to hold a lot of wasted space. Ke et al. (2013) also stated an algorithm, the ReMap algorithm, which decreases the memory requirements of the original trie by decreasing the number of elements it considers. The ReMap chart maps each one of the 256 characters of an ASCII chart to the reduced set of elements expected by the ETrie. Since the purpose of the ETrie is to imitate words found in English text, ReMap relocates the ASCII characters to 64 elements. By dropping the number of elements to consider from 256 to 64 per level, the total memory necessary is reduced to 1/16th of its original footprint for a 2-level Trie. In order to use the ETrie, the TrieCode given in Equation 2 has to be customized. The ETrieCode shown in Equation 3 is similar to the TrieCode in Equation 2, but has been changed to reflect the smaller memory footprint. Even if it is superior to XTrie, the difficulty with this method is that it still tends to have a lot of wasted space. The ETrieCode equation is as follows:

    HashCode = W_1*64^(n-1) + W_2*64^(n-2) + ... + W_n*64^0
             = sum_{i=1}^{n} W_i*64^(n-i)                          (3)

The Proposed Method

This section describes key partitioning as an alternative to hash code partitioning, using Horner's Rule, which will be incorporated in TeraSort of Hadoop. Besides, this section discusses how memory can be saved by means of a ReMap technique. According to the investigational outcomes of XTrie and ETrie, the irregularity rate is lower (lower being better) when a trie has more levels. This is because the deeper a trie is, the longer the prefix each key represents. So, in this study, the full length key is considered as the prefix instead of 2 or 3 characters, and the hash value is also calculated over the full key.

A trie has 2 advantages when compared with the quick sort algorithm. First, the time complexity for insert and search using the trie algorithm is O(k), where k is the length of the key. Meanwhile, the quick sort algorithm's best and average case is O(n log n) and its worst case is O(n^2), where n is the number of keys in its sample. Second, a trie has a predetermined memory footprint. This means the number of samples moved into the trie can be enormous if so preferred. In the proposed HTrie algorithm, the HTrie is an array accessed via a HTrie code.
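The ReMap memory-saving step described above can be sketched as follows. The paper only states that the 256 ASCII codes are folded onto 64 elements, so the concrete character classes below (folded letters, digits, space, one catch-all) are an illustrative assumption, not the published table:

```python
def build_remap() -> list[int]:
    """Illustrative ReMap table: fold the 256 ASCII codes onto 64 trie elements.
    The exact class layout is an assumption; the paper only states 256 -> 64."""
    table = [63] * 256                  # element 63: catch-all for all other bytes
    for i in range(26):
        table[ord('a') + i] = i         # 'a'..'z' -> 0..25
        table[ord('A') + i] = i + 26    # 'A'..'Z' -> 26..51
    for i in range(10):
        table[ord('0') + i] = i + 52    # '0'..'9' -> 52..61
    table[ord(' ')] = 62                # space -> 62
    return table

def etrie_code(key: str, table: list[int], levels: int = 2) -> int:
    """Equation 3 on the remapped alphabet: base-64 code over the first `levels` chars."""
    code = 0
    for ch in key[:levels]:
        code = code * 64 + table[ord(ch) & 0xFF]
    return code
```

For a 2-level array trie this shrinks the index space from 256^2 = 65536 slots to 64^2 = 4096, i.e. the 1/16th footprint stated in the text.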
A HTrie code is similar to a hash code, but the codes it generates occur in sequential ASCII order using Horner's Hash Key Rule. The equation for the HTrie code is also a hash code, which uses the next prime number as specified by Horner's Rule, since the whole key is considered instead of a trie structure. Equations 2 and 3 used 256 and 64 respectively to compute the hash code, and provided the best values since only 2 or 3 character prefixes were considered. So, to get a different as well as good result, the next prime number, 37 instead of 31, is used. The equation is as follows:

    HashCode = W_1*37^(n-1) + W_2*37^(n-2) + ... + W_n*37^0
             = sum_{i=1}^{n} W_i*37^(n-i)                          (4)

Figure 2 illustrates how the hash code works for the proposed partitioner. In this illustration, there are 3 reducers and 3 strings. Each string comes from a key in a (key, value) pair. The first string, "ate", consists of the 3 characters a, t and e, which have the corresponding ASCII values. These ASCII values are then supplied to Equation 4 to obtain the hash value 137186. Because there are 3 reducers, a modulo 3 is applied, which gives the value 2. The value is then increased by one in the illustration, since there is no reducer 0, which changes the value to 3. This moves the key-value pair to reducer 3. Using the same technique, the 2 other strings, "bad" and "can", are allocated to reducers 2 and 1, respectively.

Fig. 2. Proposed Hashcode Partitioner
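The worked example of Fig. 2 can be checked with a short sketch of the proposed base-37 key hash of Equation 4 (the function names are ours, not the paper's):

```python
def htrie_hash(key: str) -> int:
    """Equation 4: Horner's rule over the full key with prime base 37."""
    h = 0
    for ch in key:
        h = h * 37 + ord(ch)
    return h

def assign_reducer(key: str, num_reducers: int) -> int:
    """Reducers are numbered from 1 in the paper's illustration, hence the +1."""
    return htrie_hash(key) % num_reducers + 1
```

For the three keys in the illustration this reproduces the stated assignments: "ate" hashes to 137186 and lands on reducer 3, while "bad" and "can" land on reducers 2 and 1.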

Results

To evaluate the performance of the proposed method, this study examines how well the algorithms distribute the workload and looks at how well the memory is used. Tests performed in this study were completed using the LastFm Dataset, with each record containing a user profile with fields like country, gender, age and date. Using these records as our input, we simulated computer networks using VMware for the Hadoop file system. The tests are carried out with a range of dataset sizes: 1 Lakh, 3 Lakhs, 5 Lakhs, 10 Lakhs, 50 Lakhs and 1 Crore records (0.1, 0.3, 0.5, 1, 5 and 10 million, respectively).

During the first experiment, an input file containing 1 Lakh records is considered. As mentioned in the MapReduce Framework, the input set is divided into various splits and forwarded to the Map Phase. For this input file, only one mapper is considered, since the number of mappers depends on the size of the input file. After mapping, the partition algorithm is used to reduce the number of output records by grouping records based on the HTrie value of the country attribute, which is taken as the key here. After grouping, 4 partitions are created using the procedure Gender-Group-by-Country. All the corresponding log files and counters are analyzed to view the performance. In the other 5 experiments, input files with 3 Lakhs, 5 Lakhs, 10 Lakhs, 50 Lakhs and 1 Crore records are considered. As per the above method, all the input files are partitioned into 4 partitions.

In order to compare the different methodologies presented in this study and determine how balanced the workload distributions are, this study uses the metrics Effective CPU, Rate and Skew, chosen from among clock time, CPU, Bytes, Memory, Effective CPU, Rate and Skew, since only these 3 parameters show a significant difference in outcomes. Rate displays the number of bytes from the Bytes column divided by the number of seconds elapsed since the previous report, rounded to the nearest kilobyte. No number appears for values less than one KB per second.
Effective CPU displays the CPU-seconds consumed by the job between reports, divided by the number of seconds elapsed since the previous report. The result is expressed in units of CPU-seconds per second, a measure of how processor intensive the job is from each report to the next. The skew of a data or flow partition is the amount by which its size deviates from the average partition size:

    skew of a partition = (partition size - average partition size) / (size of largest partition) * 100

Discussion

Tables 1-3 show the results when using various sized input files for the comparison of the performance of ETrie, XTrie and HTrie on the parameters Skew, Effective CPU and Rate, respectively. Similarly, Fig. 3-5 show comparison charts of the same results. From the tables and figures, it is shown that the proposed method (HTrie) performs better than XTrie and ETrie on all 3 parameters.

Fig. 3. Comparison chart of skew
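Under the definitions above, the three reported metrics can be sketched as follows; the function names and the `None` convention for sub-KB rates are ours:

```python
from typing import Optional

def rate_kb(bytes_delta: int, seconds_elapsed: float) -> Optional[int]:
    """Bytes since the previous report per elapsed second, rounded to the
    nearest KB; None when below one KB/s (no number is reported)."""
    kb_per_sec = bytes_delta / seconds_elapsed / 1024
    return round(kb_per_sec) if kb_per_sec >= 1 else None

def effective_cpu(cpu_seconds_delta: float, seconds_elapsed: float) -> float:
    """CPU-seconds consumed between reports divided by elapsed wall-clock seconds."""
    return cpu_seconds_delta / seconds_elapsed

def partition_skew(sizes: list[int]) -> list[float]:
    """Skew of each partition, per the text:
    (size - average size) / largest size * 100."""
    avg = sum(sizes) / len(sizes)
    largest = max(sizes)
    return [(s - avg) / largest * 100 for s in sizes]
```

A perfectly balanced run has a skew of 0 for every partition; for example, partitions of sizes 150, 50, 100 and 100 give a skew of about 33.3% for the largest partition.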

Fig. 4. Comparison chart of effective CPU

Fig. 5. Comparison chart of rate

Table 1. Comparison of skew
No. of records    XTrie (%)    ETrie (%)    HTrie (%)
100000            14.24        15.27        12.96
300000            13.79        12.34        11.63
500000            12.18        14.44        12.50
1000000           12.43        12.11        11.93
5000000           13.48        14.29        11.96
10000000          14.52        14.78        11.96

Table 2. Comparison of effective CPU
No. of records    XTrie    ETrie    HTrie
100000            0.054    0.061    0.047
300000            0.068    0.076    0.061
500000            0.078    0.087    0.070
1000000           0.079    0.088    0.073
5000000           0.075    0.084    0.071
10000000          0.077    0.086    0.074

Table 3. Comparison of rate
No. of records    XTrie    ETrie    HTrie
100000            9653     8995     8218
300000            13032    11694    11147
500000            16551    14033    13099
1000000           18206    15436    14127
5000000           18388    15899    14439
10000000          18204    15422    14200

Conclusion

This study presented HTrie, a comprehensive partitioning technique, to improve load balancing for distributed applications. By improving load balancing, MapReduce programs can become more proficient at managing tasks by reducing the overall computation time spent processing data on each node. TeraSort was developed based on randomly generated input data on an extremely large cluster of 910 nodes. In that specific computing setting and for that data configuration, every partition created by MapReduce appeared on only one or 2 nodes. In contrast, our work concentrates on small-sized to medium-sized clusters. This study changes their model and adapts it for a smaller environment. A sequence of experiments has shown that, given a skewed data sample, the HTrie architecture was able to save more memory, was able to distribute more computing resources on average, and did so with a lower time complexity. Following this, additional research can be undertaken to introduce new partitioning mechanisms that can be incorporated with Hadoop for applications using different input samples, since the Hadoop file system does not have any partitioning mechanism other than key partitioning.

Acknowledgement

The authors acknowledge Last.fm for providing access to this data via their web services.

Funding Information

The authors have not approached any funding agencies for funding this work, though various funding agencies were ready to fund this work.

Author's Contributions

Gothai Ekambaram: Planned and designed all the experiments, collected all the necessary data sets, organized the study, implemented all the experiments and contributed to writing this manuscript.

Balasubramanie Palanisamy: Planned and designed all the experiments, collected all the necessary data sets, organized the study, implemented all the experiments and contributed to writing this manuscript along with Gothai Ekambaram as research supervisor.

Ethics

The authors have confirmed that there will not be any ethical issues after publication of this work.

References

Candan, K.S., J.W. Kim, P. Nagarkar, M. Nagendra and R. Yu, 2010. RanKloud: Scalable multimedia data processing in server clusters. IEEE MultiMedia, 18: 64-77. DOI: 10.1109/MMUL.2010.70

Chang, F., J. Dean, S. Ghemawat, W.C. Hsieh and D.A. Wallach et al., 2008. BigTable: A distributed storage system for structured data. ACM Trans. Comput. Syst. DOI: 10.1145/1365815.1365816

Dean, J. and S. Ghemawat, 2008. MapReduce: Simplified data processing on large clusters. ACM Commun., 51: 107-113. DOI: 10.1145/1327452.1327492

Fadika, Z. and M. Govindaraju, 2011. DELMA: Dynamically ELastic MapReduce framework for CPU-intensive applications. Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 23-26, IEEE Xplore Press, Newport Beach, CA, pp: 454-463. DOI: 10.1109/CCGrid.2011.71

Ghemawat, S., H. Gobioff and S.T. Leung, 2003. The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), New York, USA, pp: 29-43. DOI: 10.1145/945445.945450

Groot, S. and M. Kitsuregawa, 2010. Jumbo: Beyond MapReduce for workload balancing. Proceedings of the VLDB PhD Workshop, Singapore, pp: 7-12.

Heinz, S., J. Zobel and H.E. Williams, 2002. Burst tries: A fast, efficient data structure for string keys. ACM Trans. Inform. Syst., 20: 192-223. DOI: 10.1145/506309.506312

Hsu, C.H. and B.R. Tsai, 2009. Scheduling for atomic broadcast operation in heterogeneous networks with one port model. J. Supercomput., 50: 269-288. DOI: 10.1007/s11227-008-0261-6

Hsu, C.H. and S.C. Chen, 2010. A two-level scheduling strategy for optimising communications of data parallel programs in clusters. Int. J. Ad Hoc Ubiq. Comput., 6: 263-269. DOI: 10.1504/IJAHUC.2010.035537

Hsu, C.H. and S.C. Chen, 2012. Efficient selection strategies towards processor reordering techniques for improving data locality in heterogeneous clusters. J. Supercomput., 60: 284-300. DOI: 10.1007/s11227-010-0463-6

Hsu, C.H., S.C. Chen and C.Y. Lan, 2007. Scheduling contention-free irregular redistributions in parallelizing compilers. J. Supercomput., 40: 229-247. DOI: 10.1007/s11227-006-0024-1

Hsu, C.H., T.L. Chen and J.H. Park, 2008. On improving resource utilization and system throughput of master slave job scheduling in heterogeneous systems. J. Supercomput., 45: 129-150. DOI: 10.1007/s11227-008-0211-3

Jiang, W. and G. Agrawal, 2011. Ex-MATE: Data intensive computing with large reduction objects and its application to graph mining. Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 23-26, IEEE Xplore Press, Newport Beach, CA, pp: 475-484. DOI: 10.1109/CCGrid.2011.18

Jin, C., C. Vecchiola and R. Buyya, 2008. MRPGA: An extension of MapReduce for parallelizing genetic algorithms. Proceedings of the IEEE 4th International Conference on e-Science, Dec. 7-12, IEEE Xplore Press, Indianapolis, IN, pp: 214-221. DOI: 10.1109/eScience.2008.78

Kavulya, S., J. Tan, R. Gandhi and P. Narasimhan, 2010. An analysis of traces from a production MapReduce cluster. Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, May 17-20, IEEE Xplore Press, Melbourne, VIC, pp: 94-103. DOI: 10.1109/CCGRID.2010.112

Ke, S., C.H. Hsu, Y.C. Chung and D. Zhang, 2013. An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J. Supercomput., 66: 539-555. DOI: 10.1007/s11227-013-0924-9

Krishna, A., 2005. GridBLAST: A Globus-based high-throughput implementation of BLAST in a Grid computing framework. Concurr. Comput., 17: 1607-1623. DOI: 10.1002/cpe.906

Liu, H. and D. Orban, 2011. Cloud MapReduce: A MapReduce implementation on top of a cloud operating system. Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 23-26, IEEE Xplore Press, Newport Beach, CA, pp: 464-474. DOI: 10.1109/CCGrid.2011.25

Lynden, S., Y. Tanimura, I. Kojima and A. Matono, 2011. Dynamic data redistribution for MapReduce joins. Proceedings of the IEEE 3rd International Conference on Cloud Computing Technology and Science, Nov. 29-Dec. 1, IEEE Xplore Press, Athens, pp: 717-723. DOI: 10.1109/CloudCom.2011.111

Miceli, C., M. Miceli, S. Jha, H. Kaiser and A. Merzky, 2009. Programming abstractions for data intensive computing on clouds and grids. Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, May 18-21, IEEE Xplore Press, Shanghai, pp: 478-483. DOI: 10.1109/CCGRID.2009.87

O'Malley, O., 2008. TeraByte sort on Apache Hadoop.

Panda, B., M. Riedewald and D. Fink, 2010. The model-summary problem and a solution for trees. Proceedings of the IEEE 26th International Conference on Data Engineering, Mar. 1-6, IEEE Xplore Press, Long Beach, CA, pp: 449-460. DOI: 10.1109/ICDE.2010.5447912

Papadimitriou, S. and J. Sun, 2008. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. Proceedings of the 8th IEEE International Conference on Data Mining, Dec. 15-19, IEEE Xplore Press, Pisa, pp: 512-521. DOI: 10.1109/ICDM.2008.142

Stockinger, H., M. Pagni, L. Cerutti and L. Falquet, 2006. Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, Dec. 4-6, IEEE Xplore Press, Amsterdam, Netherlands, pp: 58-58. DOI: 10.1109/E-SCIENCE.2006.261142

Tan, J., X. Pan, S. Kavulya, R. Gandhi and P. Narasimhan, 2009. Mochi: Visual log-analysis based tools for debugging Hadoop. Proceedings of the USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '09), USENIX, San Diego, CA.

Vashishtha, H., M. Smit and E. Stroulia, 2010. Moving text analysis tools to the cloud. Proceedings of the 6th World Congress on Services, Jul. 5-10, IEEE Xplore Press, Miami, FL, pp: 107-144. DOI: 10.1109/SERVICES.2010.91

Verma, A., X. Llora, D.E. Goldberg and R.H. Campbell, 2009. Scaling genetic algorithms using MapReduce. Proceedings of the 9th International Conference on Intelligent Systems Design and Applications, Nov. 30-Dec. 2, IEEE Xplore Press, Pisa, pp: 13-18. DOI: 10.1109/ISDA.2009.181

Xu, W., L. Huang, A. Fox, D. Patterson and M.I. Jordan, 2009. Detecting large-scale system problems by mining console logs. Proceedings of the 22nd Symposium on Operating Systems Principles, Oct. 11-14, New York, pp: 117-132. DOI: 10.1145/1629575.1629587

Zaharia, M., A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, 2008. Improving MapReduce performance in heterogeneous environments. Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, USENIX, San Diego, California, USA, pp: 29-42.