MapReduce Based Implementation of Aggregate Functions on Cassandra

International Journal of Electronics Communication and Computer Technology (IJECCT)
ISSN: 2249-7838, www.ijecct.org

Anseh Danesh Arasteh
Department of Computer and IT, Islamic Azad University of Zanjan, Zanjan, Iran

Davud Mohammadpur
Department of Computer Engineering, University of Zanjan, Zanjan, Iran

Majid Meghdadi
Faculty of Computer Engineering, University of Zanjan, Zanjan, Iran

Abstract. MapReduce is a simple and powerful processing model that allows parallel, scalable programs to run on large volumes of data across massive clusters of computers. Cassandra, in turn, is a popular NoSQL database. To the best of our knowledge, there are still no general, suitable procedures for performing arbitrary calculations on this database with the MapReduce model. In this paper we therefore propose several MapReduce-based procedures that are generally needed to perform a variety of aggregate operations on Cassandra. Our evaluation shows a significant improvement in performance over the most common methods on multicore computers and on a set of peer machines.

Keywords: Cassandra; MapReduce; NoSQL; Processing model; Large volume of data

I. INTRODUCTION

In the past decades, the development and increased use of social technologies, new web technologies and mobile applications have created the need to manage large volumes of data, which has led to a new generation of databases known as NoSQL. The main target of NoSQL systems is managing unstructured data that is naturally dynamic and huge in volume [1]. Currently there is a variety of NoSQL databases for different purposes. Among these, Cassandra is one of the popular databases of the NoSQL movement, designed to work on large volumes of data across a set of peer machines.

On the other hand, the growth of computational technologies has created a massive flow of data that is changing computing architecture and the mechanisms of massive data processing. MapReduce is a simple and powerful processing model that enables parallel, scalable programs to process large volumes of data on large clusters of machines. The MapReduce framework separates programs from details such as data distribution, scheduling and fault tolerance [2].

Unlike RDBMSs, the processes defined for NoSQL databases do not follow any general model; depending on the variety of processes and their use cases, there are various types of processing models and methods. Hadoop is one of these processing models; it provides distributed, reliable and scalable calculations based on MapReduce. The Apache Hadoop library allows distributed processing of large data sets and scales from one machine to several. Each machine provides its own local computation and storage [3].

To the best of our knowledge, there are still no general, suitable procedures for performing arbitrary calculations in NoSQL databases with the MapReduce model. In this paper we therefore propose procedures, based on the MapReduce model, that are generally needed to perform a variety of aggregate operations on Cassandra. Our proposed method is written in Java and is provided for both cases: using Cassandra in the usual way (without MapReduce) and using Cassandra with the MapReduce model. Our evaluations and simulations were performed on an 8-core processing system with 4 GB of RAM.

II. CASSANDRA

The Apache Cassandra project offers a second-generation distributed, unstructured, scalable and compatible database. Cassandra is a key-value store that combines Amazon Dynamo's infrastructure with BigTable's data model [4]. Cassandra was first defined at Facebook and developed to store large volumes of unstructured data on large commodity servers [1]. Cassandra is among the most popular choices of NoSQL database. Some of its advantages are [5]:

- High scalability and high availability without any single point of failure
- Column family
- High write performance and good read performance
- Tunable consistency and a flexible replication strategy
- Flexible schema

These properties make Cassandra a good choice for storing and processing large volumes of data [5]. Cassandra stores data in something like a multi-level hash table. In this database, data is retrieved based on keys and can be considered a key-value element. In addition, the multidimensional structure of Cassandra allows a Super Column to store multiple items of the

same type and, at the lowest level, provides a key-value store to ensure high flexibility [6]. Cassandra has a peer-to-peer design that makes it easy to add new servers to its cluster. This feature is especially suitable for rapidly growing datasets and for the MapReduce model, since more computational nodes can be introduced to the cluster to increase its power [1].

A. Cassandra Data Model

Cassandra uses a Chord-based Distributed Hash Table to look up data. However, its data model is significantly different from that of conventional relational databases. The bottom-up data model of Cassandra, shown in Figure 1, is as follows [7]:

- Column: includes a name, a value and a timestamp.
- Column Family: a container for columns that holds a sorted list of columns. A Column Family is similar to a table in RDBMSs.
- Row: in Cassandra, each Column Family is stored in a separate file, and these files are sorted by row. A row does not hold related Column Families, but related columns that are accessed together should be stored in the same Column Family. The row key determines which data is stored on which machine: it is used in a hash algorithm to decide which node stores each row.
- Super Column: a column that contains other columns; Cassandra supports this.
- KeySpace: the highest storage level in Cassandra, similar to a database in RDBMSs.

Figure 1. Cassandra Data Model (labels recovered from the figure: Cluster, KeySpace, Column Family, Row 1 to Row N, Key 1 to Key N, TimeStamp)

III. HBASE VS CASSANDRA

Among NoSQL databases, HBase is a distributed, scalable, Hadoop-based database that can store large volumes of data. HBase targets hosting huge tables containing millions of rows and columns on top of clusters of commodity servers. Apache HBase is a distributed, open source column family store modeled after Google BigTable [8]. HBase inherently supports MapReduce and automatically runs on top of a Hadoop cluster. In other words, HBase was built on top of HDFS and Hadoop, which makes using MapReduce with it easy. The basic HBase classes allow Hadoop MapReduce jobs to work properly with HBase tables [8].

HBase and Cassandra have significantly different operational structures, and this influences their performance in every area. HBase and its required supporting systems are derived from the Google BigTable and Google File System designs. Cassandra, on the other hand, is a free open source database that was first introduced at Facebook and implements the BigTable data model with Amazon Dynamo's storage system [9]. These differences make HBase a good data warehouse, suitable for analyzing and processing large volumes of data, and make Cassandra suitable for real-time and interactive data operations. Another reason Cassandra has attracted attention is that it is highly influential today [11]. Figure 2 shows NoSQL database performance as the number of nodes increases.

The main difference between the two databases that partly affects Cassandra's performance in processing large volumes of data is their use in the MapReduce process. Cassandra supports MapReduce but not MapReduce Streaming. This means that MapReduce jobs that use data stored in Cassandra make less use of MapReduce concepts. Hadoop-based databases such as HBase are therefore stronger in this respect.

Figure 2. NoSQL Database Performance [10]

IV. MAPREDUCE

MapReduce was first introduced by Google as a distributed programming model that uses huge server clusters to process massive (multi-terabyte) data sets. The MapReduce model can be applied to many large-volume computations and has attractive properties such as automatic parallelism, load balancing and good handling of machine failure. MapReduce divides a large program into smaller splits, processes each split in parallel and then combines the intermediate results to obtain the final result [12]. MapReduce runs on top of a special file system such as the Google File System (GFS) or the Hadoop Distributed File System (HDFS). Data is first loaded and split (generally into 64 MB splits) so that each split can be replicated. A key property of MapReduce is that the data storage

order is proportional to the data processing, and as a distributed processing model it has some advantages over traditional approaches [12]:

- Scalability
- Reliability
- High fault tolerance
- Simplicity
- Performance

All of these benefits are associated with cloud computing, where processing huge amounts of data using distributed resources is the main task. This is no wonder, because MapReduce is a programming model widely used in cloud environments for processing large amounts of data in parallel [12].

MapReduce is inspired by the Map and Reduce functions of functional programming languages such as LISP. In MapReduce, the functions are not defined explicitly by the framework; programmers define their calculations in the form of two functions, Map and Reduce, which can run on subsets of the data in parallel. The Map function produces intermediate key/value pairs, and the Reduce function combines all the values that correspond to a particular key [12].

Figure 3. MapReduce Structure [11] (labels recovered from the figure: Input Data; Splitting Data and generating Key/Value Pairs; Map(); partitioning; Merging and grouping Keys; Reduce(); Output Data)

It may be better to think of MapReduce as a data flow, as shown in Figure 3. The user sends tasks to a central node, which selects idle machines and assigns each of them a MapReduce task to run in parallel. When all the tasks are done, the final result is sent back to the user. In MapReduce, the input data is distributed over worker machines that run in parallel. The intermediate values are the output of the Map worker machines and are sent as input to a set of Reduce worker machines (there are also other steps, such as sorting). The transfer of data between the Map and Reduce functions is known as shuffling [12].

The five steps of MapReduce are:

1) Preparing the Map() input: the MapReduce system determines the Map tasks, assigns a K1 input key value to each processor and provides the processor with all the data it needs [12].
2) Running the user-provided Map() function: the Map function runs exactly once for each K1 key/value pair and creates output organized by K2 key/value [12].
3) Sending the Map() output to the Reduce processors: the MapReduce system determines the Reduce() processors, assigns a K2 intermediate key/value to each processor and provides the processor with all the data it needs [12].
4) Running the user-provided Reduce() function: the Reduce function runs exactly once for each K2 key/value produced by the Map() function [12].
5) Producing the final output: the MapReduce system collects all the Reduce() outputs and sorts them by key to produce the final result [12].

V. RELATED WORKS

There is a lot of related work on the MapReduce processing model and on Cassandra processing operations. Here we discuss some of it.

Lee et al. proposed a correlation-aware SQL-to-MapReduce translator. Their system detects three kinds of correlation between queries: Input Correlation, Transit Correlation and Job Flow Correlation. After these correlations are determined, a set of rules is applied to produce optimized MapReduce code. Their implementation shows that, on large volumes of data, the proposed system performs up to four times better than current translators [13].

Tartara et al. proposed a MapReduce-based method to improve machine learning approaches. In their method, a candidate code is assigned to a Map function. Map compiles the candidate code into a binary, runs it multiple times and assigns it a score. The Map function then sends the binary and its score to the Reduce function, and Reduce returns the best code. Their implementation shows that performance improves linearly as the data volume increases [14].

Dede et al. evaluated Cassandra under different conditions. Their results show that using the Random Partitioner in Cassandra improves Hadoop's performance in retrieving data from Cassandra by up to three times. They then added MapReduce, and their evaluations show that, for CPU and memory loads, Hadoop-native performs two to three times better. For large-volume writes, Hadoop-Cassandra-FS performs better, while Hadoop-Cassandra-Cassandra is comparable with Hadoop-native. For processing large volumes of data, however, Hadoop-Cassandra-Cassandra has been more successful [1].

VI. PROPOSED METHOD

As we know, Cassandra is a Big Data database, and extraction is the most important operation in the Big Data world. We must therefore keep the number of retrievals from the database to a minimum. In this case the processing overhead increases; otherwise, the overhead of reading from the database may increase, which may cause program failure in a NoSQL environment. Under this condition, we want to reach the best performance in general database operations. To begin with, we have two kinds of extraction:

- Extracting all fields (SELECT * FROM ...)
- Conditional extraction (SELECT ... FROM ... WHERE ...)
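The two kinds of extraction listed above can be illustrated with a small in-memory sketch in plain Java. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the class ExtractionSketch, its methods and the row layout (a map from column name to value) are hypothetical names introduced only for this example, and a real job would read rows from Cassandra rather than from a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for a table: each row maps column names to values. */
final class ExtractionSketch {

    /** Full extraction: every row is passed through (analogous to SELECT * FROM ...). */
    static List<Map<String, String>> selectAll(List<Map<String, String>> rows) {
        return select(rows, row -> true);
    }

    /** Conditional extraction: keep only the rows that satisfy the WHERE-like condition. */
    static List<Map<String, String>> select(List<Map<String, String>> rows,
                                             Predicate<Map<String, String>> where) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (where.test(row)) {
                // Copy the row so callers cannot alter the source data.
                result.add(new LinkedHashMap<>(row));
            }
        }
        return result;
    }

    /** Builds a row from alternating column names and values. */
    static Map<String, String> row(String... namesAndValues) {
        Map<String, String> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            r.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return r;
    }
}
```

Expressing the condition as a predicate keeps both kinds of extraction on one code path: full extraction is simply the condition that accepts every row.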

After extraction, there are aggregate functions whose MapReduce-based implementations are not supported in Cassandra. Functions such as SUM, AVG and DISTINCT do not have a standard MapReduce-based definition, and running them in a non-MapReduce way is inefficient for Big Data. The main point in MapReduce-based processes is the definition of the Map and Reduce functions and of their input and output data. Throughout this paper, all of these functions work as follows:

- All inputs come from Cassandra tables.
- The Map phase reads all the records of the Cassandra tables and groups values according to the defined key.
- In the Reduce phase, the desired operation is performed on the values corresponding to each key.
- Depending on the desired operation, we can use one Map and one Reduce, or multiple Maps and Reduces.
- Depending on the desired output, a particular operation may need no Reduce at all.
- The final output is stored in a Cassandra table or in the file system.

Various types of processing methods and patterns have been defined for Cassandra to simplify queries and processing operations for users. In this paper, we propose some general procedures and algorithms, using the Java programming language and the MapReduce processing model, that are needed in most Cassandra data processing and Big Data computing. Our main goals are to make Big Data easier to use and to improve the power of applying queries. As stated above, the main problem in using Cassandra is that its operations do not use the MapReduce processing model. We therefore propose MapReduce implementations for Cassandra of some of the most commonly used functions.

A. SUM()

SUM() is one of the most commonly used functions in applications and programs, but Cassandra does not support it in the MapReduce model, which is a shortcoming of Cassandra. Moreover, running it correctly and quickly on large volumes of data is an important factor to consider. In our proposed approach, the Map and Reduce functions perform the main tasks: the Map function reads the column values of Cassandra tables, and the Reduce function performs the sum operation.

    // Type, TypeWritable and ByteBufferUtil.toType are placeholders for the
    // concrete column type, its Hadoop Writable and the matching conversion.
    public void map(Map<Type, ByteBuffer> keys, Map<Type, ByteBuffer> columns, Context context)
            throws IOException, InterruptedException {
        for (Entry<Type, ByteBuffer> column : columns.entrySet()) {
            // Skip every column except the one to be summed.
            if (!"columnname".equals(column.getKey()))
                continue;
            val2 = ByteBufferUtil.toType(column.getValue());
            // val1 is the output key under which the values are grouped.
            context.write(val1, new TypeWritable(val2));
        }
    }

    public void reduce(Type key, Iterable<TypeWritable> values, Context context)
            throws IOException, InterruptedException {
        Type sum = 0.0;
        // Add up all values that share the same key.
        for (TypeWritable val : values)
            sum += val.get();
        context.write(key, new TypeWritable(sum));
    }

B. SELECT DISTINCT()

SELECT DISTINCT() is another commonly used function that Cassandra does not support, so we propose a MapReduce-based implementation. Our implementation takes three parameters: the column name, the table name and, if present, a condition. The output values are generated without repetition. In MapReduce terms, the Map function reads Cassandra records and groups them as key/value pairs. In the Map phase, the field whose distinct values we want is used as the key, and NULL is used as the value. In the Reduce phase, the key/value pairs output by Map are taken and each key is written to the output once. Here is the code:

    public void map(Map<Type, ByteBuffer> keys, Map<Type, ByteBuffer> columns, Context context)
            throws IOException, InterruptedException {
        for (Entry<Type, ByteBuffer> key : keys.entrySet()) {
            // Skip every key column except the requested one.
            if (!"column".equals(key.getKey()))
                continue;
            Type value = ByteBufferUtil.type(key.getValue());
            StringTokenizer itr = new StringTokenizer(value);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                // NW is the null value paired with every key.
                context.write(word, NW);
            }
        }
    }

    public void reduce(Type key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Each distinct key reaches reduce once, so writing it once removes duplicates.
        context.write(key, NullWritable.get());
    }

C. AVG()

AVG() is another commonly used function that Cassandra does not support, so we implement it for Cassandra using MapReduce. In our proposed approach, the Map and Reduce functions perform the main tasks: the Map function reads the column values of Cassandra tables, and the Reduce function performs the average operation.

    public void map(Map<Type, ByteBuffer> keys, Map<Type, ByteBuffer> columns, Context context)
            throws IOException, InterruptedException {
        for (Entry<Type, ByteBuffer> column : columns.entrySet()) {
            // Skip every column except the one to be averaged.
            if (!"columnname".equals(column.getKey()))
                continue;
            val2 = ByteBufferUtil.toType(column.getValue());
            context.write(val1, new TypeWritable(val2));
        }
    }

    public void reduce(Type key, Iterable<TypeWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        Type sum = 0.0;
        double avg = 0;
        for (TypeWritable val : values) {
            sum += val.get();
            count = count + 1;
        }
        // Divide the total by the number of values; an empty group keeps avg at 0.
        if (count > 0)
            avg = sum / count;
        context.write(key, new TypeWritable(avg));
    }

VII. CREATING AN OPERATIONAL LIBRARY BASED ON MAPREDUCE FOR CASSANDRA

Having prepared the required MapReduce-based procedures for processing data stored in Cassandra, and to make these processing structures readily available and easy to use, we provide the proposed methods as a Java library that can be used in any MapReduce program. This library contains the Map and Reduce functions and the processing operations for each of the SUM(), SELECT DISTINCT() and AVG() functions. Users only need to pass in function arguments such as the table name and column name and then run the main method. This is explained in the following steps:

1) Creating a Jar file from the functions. To do this, we perform the following steps in Eclipse:
   - Select Export from the File menu.
   - Select Jar file from the list.
   - Select the class or method.
2) Adding to the path: after creating the Jar file, add it to the Hadoop class path.
3) Importing into the project: import the created Jar file as a library in the project.
4) Creating a new object: create an object of the added library's type with new.
5) Calling the method: call the main method of the program on that object.

Implementation and Evaluation

The results of our implementations on different data volumes, for Cassandra with its common approach and for Cassandra with our proposed method, are shown in Figures 4, 5 and 6. As these evaluations show, when faced with large volumes of data the common approaches fail in each implementation, while our MapReduce-based approaches show better performance. This follows from the nature of MapReduce, which is designed for Big Data processing (terabytes and petabytes). Of course, MapReduce performance depends greatly on machine hardware: the more numerous and powerful the processors and other hardware, the better the results. In some cases, with weak hardware, increased data volume will cause poor performance. Our implementation results thus show that, for large volumes of data, the common approaches in Cassandra (using JDBC) cannot handle the volume and fail to run each of these functions. This is due to the problems of non-MapReduce processing (JDBC) at massive scale. Our proposed MapReduce-based method, in contrast, shows better results in processing large volumes of data.

VIII. COMPARISON OF SUM OPERATION IMPLEMENTATION

Sections A, B and C show the results of our implementations on different data volumes.

A. SUM()

As Figure 4 shows, the common approaches for the SUM() operation give better results on low volumes of data, but on large volumes they fail, whereas the response time of our MapReduce-based method remains reasonable.
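The map, group and reduce data flow behind SUM() and AVG() can be tried without a cluster. The following minimal sketch imitates the three phases in memory; it is an illustration under stated assumptions, not the Hadoop job measured in this evaluation. The class LocalAggregateSketch and the row layout (row[0] as the grouping key, row[1] as the numeric column) are hypothetical and introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map, shuffle and reduce phases (hypothetical, not Hadoop code). */
final class LocalAggregateSketch {

    /** Map and shuffle: emit (row[0], row[1]) for each row and group the values by key. */
    static Map<String, List<Double>> mapAndShuffle(List<String[]> rows) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(Double.parseDouble(row[1]));
        }
        return grouped;
    }

    /** Reduce for SUM(): one total per key. */
    static Map<String, Double> sum(List<String[]> rows) {
        Map<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : mapAndShuffle(rows).entrySet()) {
            double total = 0.0;
            for (double value : group.getValue()) {
                total += value;
            }
            totals.put(group.getKey(), total);
        }
        return totals;
    }

    /** Reduce for AVG(): the total of each group divided by its number of values. */
    static Map<String, Double> avg(List<String[]> rows) {
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : mapAndShuffle(rows).entrySet()) {
            double total = 0.0;
            for (double value : group.getValue()) {
                total += value;
            }
            // Every group holds at least one value, so the division is safe.
            averages.put(group.getKey(), total / group.getValue().size());
        }
        return averages;
    }
}
```

For example, the rows (a, 1), (b, 4) and (a, 3) give the sums a = 4 and b = 4 and the averages a = 2 and b = 4. On a cluster the groups would be spread over Reduce workers, which is where the advantage on large volumes is expected to come from.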

Figure 4. Running SUM() on different data volumes with the common approaches and the proposed approach (vertical axis: Execution Time in seconds)

B. SELECT DISTINCT()

For SELECT DISTINCT(), as Figure 5 shows, the common approaches give better results on low volumes of data, but on large volumes they become unresponsive and fail with a TTransportException error, whereas our MapReduce-based method succeeds.

Figure 5. Running SELECT DISTINCT() on different data volumes with the common approaches and the proposed approach (vertical axis: Execution Time in seconds)

C. AVG()

As with the previous functions, the common approaches again give better results on low volumes of data, but on large volumes they fail, while the response time of our MapReduce-based method is reasonable.

Figure 6. Running AVG() on different data volumes with the common approaches and the proposed approach (vertical axis: Execution Time in seconds)

IX. CONCLUSION

In this paper we studied the new generation of storage systems, NoSQL. Among these we selected Cassandra, one of the most widely used databases, because of its popularity. Cassandra, however, has a problem in processing large volumes of data: there are no standard, general MapReduce-based procedures for performing arbitrary calculations on it. One of the main weaknesses of Cassandra compared with other NoSQL column family databases is therefore that it lacks full MapReduce support, which is a big loss in Big Data processing. The best way forward is to use MapReduce to process the large volumes of data stored in Cassandra. Our goal is to speed up the execution of complex queries using the proposed approaches. Our implementation results show that non-MapReduce methods (using JDBC) cannot handle this volume of data and fail without any result in all three cases. This is due to the weakness of the JDBC driver and, as a result, of non-MapReduce structures in massive processing. Our proposed MapReduce-based procedures, in contrast, show better results on large volumes of data.

REFERENCES

[1] Dede, E., Sendir, B., Kuzlu, P., Hartog, J., & Govindaraju, M., An Evaluation of Cassandra for Hadoop, In Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on (pp. 494-501), IEEE, 2013, June.
[2] White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
[3] Maitrey J Soparia, Apache Cassandra (Distributed Hash Table).
[4] Perera, S., Consider the Apache Cassandra database, Jul. 2012.
[5] Lerner, Reuven M. "At the forge, Cassandra views." Linux Journal 2010.196 (2010): 7.
[6] Dhairya Gala, Cassandra (DHT), In Department of Computer Science, Indiana University, Bloomington.
[7] Hewitt, E., Cassandra: the definitive guide. O'Reilly Media, 2010.
[8] George, Lars. HBase: the definitive guide. O'Reilly Media, Inc., 2011.
[9] Williams, Dominic. "HBase vs Cassandra: why we moved" (2010).
[10] Jonathan Ellis, 2012 in review: Performance, Available from: http://www.datastax.com/dev/blog/2012-in-review-performance
[11] Hill, R., Hirsch, L., Lake, P., & Moshiri, S., Guide to cloud computing: principles and practice, Springer, 2012.
[12] http://en.wikipedia.org/wiki/mapreduce
[13] Lee, R., Luo, T., Huai, Y., Wang, F., He, Y., & Zhang, X. (2011, June), Ysmart: Yet another sql-to-mapreduce translator, In Distributed Computing Systems (ICDCS), 2011 31st International Conference on (pp. 25-36), IEEE.
[14] Tartara, M., & Crespi Reghizzi, S., Parallel iterative compilation: using MapReduce to speedup machine learning in compilers, In Proceedings of third international workshop on MapReduce and its Applications Date (pp. 33-40), 2012, ACM.