Acta Polytechica Hugarica Vol. 13, No. 2, 2016 Coceptualizatio with Icremetal Bro- Kerbosch Algorithm i Big Data Architecture László Kovács 1, Gábor Szabó 2 1 Uiversity of Miskolc, Istitute of Iformatio Techology, H-3515 Miskolc- Egyetemváros, Hugary, e-mail: kovacs@iit.ui-miskolc.hu 2 Uiversity of Miskolc, Hatvay József Doctoral School of Iformatio Scieces, H-3515 Miskolc-Egyetemváros, Hugary, e-mail: szabo84@iit.ui-miskolc.hu Abstract: The paper itroduces a ovel coceptualizatio algorithm optimized for a distributed, Big Data eviromet. The proposed method uses a cocept geeratio module based o clique detectio i the cotext graph. The preseted work proposes a ovel icremetal versio of the Bro-Kerbosch maximal clique detectio method. The efficiecy of the method is evaluated with radom cotext tests. The preseted icremetal model is eve comparable with the usual batch methods. The aalysis of the clique detectio algorithm i MapReduce architecture provides efficiecy compariso for large scale cotexts. Keywords: otologizatio; clique detectio; icremetal clique geeratio; mapreduce architecture 1 Itroductio Oe of the big challeges of curret iformatio techology is the efficiet iformatio maagemet ad kowledge egieerig i Big Data eviromet. With the spread of ew techologies like the Iteret of Thigs, the amout of gathered data steadily icreases ad ew data repository techiques are eeded to provide a efficiet data maagemet. Aother tred to be witessed is the icreased demad o itelliget smart applicatios. The adaptio of kowledge egieerig methods o Big Data collectio is a real challege for the IT commuity. The term otology i iformatio sciece is defied as a formal, explicit specificatio of a shared coceptualizatio [10]. The term coceptualizatio refers to determiatio of the cocept classes ad cocept relatioships at a abstract model level for the pheomeo i the domai world. The otology models cover ot oly cocepts, but they also iclude the correspodig costraits o their usage, too. The otologies emphasize aspects, such as, iter-aget commuicatio ad iteroperability [25]. From the viewpoit 139
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture of egieerig, the term 'cocept' is used as a idetifier or a descriptor for a cluster of objects. I this sese, the cocept describes besides the amig also the properties of the cluster. I the history of kowledge egieerig, a great variety of data structures was developed to represet the meaig of cocepts. Nowadays, these models exist parallel ad are used for differet purposes. Accordig to [19], otology models ca be classified ito very differet model categories, based o the purpose, specificity ad expressiveess. Applicatio otology is used to cotrol some computatioal applicatios ad has a methodological emphasis o fidelity. Referece otology has a theoretical focus o represetatio ad it is used primarily to reduce termiology ambiguity amog members of a commuity. Based o the abstractio level, geeric (upper level) ad core ad domai otologies ca be distiguished. Accordig to [22], the mai represetatio levels of otologies are ƒ - Taxoomy: Objects are hierarchically classified, e.g. A is child of B. - Thesaurus: Objects are related (e.g. A is a B; A is related with B). ƒ - Logic-mathematical represetatio: Object relatios are preseted i formal otatios (e.g. syoym(a, b):=syoym(b, a);). Aalogously to a database, wherei structure ad data form the whole, a otology cosists of rules ad cocepts. Laguages for the descriptio of otologies are RDF-S, DAML+OIL, F-Logic, OWL, WSML or XTM. Usig rulebased represetatios i deductive databases esures further facts ca be deduced from stored relatios. Curret otology laguages, like OWL, provide efficiet tools to perform complex operatios o the otology database, like cosistecy verificatio, rule iductio or reasoig process. The mai applicatio area of otology frameworks is the area of kowledge egieerig [11], where the otology egie is itegrated ito iteral module of expert systems. The efficiecy of the otology egies is based primarily o the correctess, completeess ad itegrity of the otology database. From this poit of view, the key elemet i otology maagemet is applicatio of efficiet ad proper otology database costructio methods. There are may difficulties i otology costructio, for istace, huge amouts of iformatio to be collected, huge diversity of iformatio sources ad icosistecy withi the differet iformatio sources. The otology database costructio usually cotais the followig steps [23, 42]: 1. Otology scope 2. Otology capture 3. Otology ecodig 4. Otology itegratio 140
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 5. Otology evaluatio 6. Otology documetatio. The curret otology costructio methods are maily based o automated otology costructio methods. I the automated otology costructio methods, the iformatio is usually extracted from documets i atural laguage. This step requires a complex kowledge extractio egie icludig amog others a atural laguage processig (NLP) module ad a cocept idetificatio module. The mai goal of the paper is to itroduce a ovel coceptualizatio algorithm optimized for a distributed, Big Data eviromet. The ext sectio provides a survey o the developmet of the text to otology methods. The mai goal of text to otology module is to assig the best matchig cocept cluster to the ew words foud i the source text. The third sectio itroduces some approaches for cocept assigmet. The most sophisticated ad most complex oe is based o clique detectio mechaism i the cotext graph. The paper presets a ovel icremetal versio of Bro-Kerbosch maximal clique detectio method. The efficiecy of the method is evaluated with radom cotext test. The fourth sectio cotais the aalysis of the distributed Big Data architecture. This architecture provides a efficiet implemetatio framework to process a large amout of heterogeeous text documet sources. The proposed architecture ad map-reduce processig models are implemeted i a test eviromet where the efficiecy of the prototype system could be evaluated. 2 Process of Word to Cocept Mappig The Word Sese Disambiguatio (WSD) [18] is a method to select the appropriate meaig for a give cotext. The Word Category Map method [13] clusters the words based o their sematic similarity. The words havig similar cotexts belog to the same cluster or they appear close to each other o the map. The cotext of a word ca be costructed i may differet ways. I the simplest case, the cotext is equal to the set of eighborig words withi the seteces [21]. I this model, the cotext is coverted ito a vector represetatio calculatig the average eighborhood vector. The mai beefit of this method is the cost efficiecy ad the simplicity. O the other had, this model caot maage the ambiguity of the words, i.e. a word ca carry may differet meaigs. I this model, the sematic similarity is measured usig the stadard vector distace methods. Beside the simple vector similarity methods, there are approaches to use more sophisticated clusterig methods, like the SOM method [21]. I the other group of approaches, the cotext is give with a graph istead of a simple vector. I otology maagemet, the thesaurus graphs are used to deote the higher level sematics. 141
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture The WSD otologizatio process ca work i oe the followig modes [4]: dictioary based, supervised, semi-supervised ad usupervised. I the first mode, a dictioary cotaiig the defiitios of the differet cocepts is used as backgroud kowledge. The similarity of two words is measured with the overlap of their dictioary defiitios [16]. I the case of supervised mode, a backgroud otology is refereced to get iformatio o existig sematic clusters. I the otology database, a word ca be assiged to several meaigs. I order to distiguish the differet cocepts related to a word, a specific compoet is itroduced which represets the meaig. The most widely used implemetatio of this compoet is the syset compoet defied i a wordet otology database. Wordets are usually based o architecture preseted first i [8] ad they orgaize sematic-lexical kowledge ito a graph kowledge base. Nowadays, Wordet kowledge bases are available for may differet laguages. The syset ca be cosidered as the set of words carryig a commo meaig. The elemets of a syset are syoyms ad a word ca be a elemet of differet sysets. Usig the wordet kowledge base, the otologizatio process performs a patter matchig task, where the matchig method locates the syset most similar to the give word. Withi this method, the key parameters refer to the similarity measure applied to compare a word ad a syset. I most approaches, the similarity measure betwee the word ad the syset is aggregated from the similarity values betwee the target word ad the words i the syset. I the literature there are differet approaches to measure similarity betwee a word ad a syset. The extesio of the related proportio approach [20] yields the eighborhood similarity measured with: Where w: the target word C: the syset v: word i the kowledge base r: relatio i the sematic graph d Crw ( w, C). r r v Nwr 1 log( C ) Crv : the umber of edges from elemets of C to v alog with a edge type r N wr : the set of words coected to w alog with a edge of type r r : the weight factor of the relatioship r. The average cosie method (see for example [3]) calculates distace as the average cosie value betwee the descriptio vectors: cos( N c, N w) d( w, C) c C. C 142
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 Where N v : the adjacecy vector for word v (describig the words coected to v). Havig a otology graph, a importat step i text to sematic coversio is to select the proper meaig of the word. The supervised approach requires the developmet of a backgroud kowledge base i the form of a dictioary or of a aotated text corpus. The quality of the WSD process depeds o the quality of the backgroud kowledge base. The mai problem of this approach is the high cost i the geeratio of a comprehesive ad valid kowledge base. The usupervised methods provide automatic approaches for costructio of the kowledge base. The first importat result o the field of usupervised WSD was the semisupervised proposal i [27]. The proposed method is based o two mai properties of huma laguages: - Nearby words provide strog ad cosistet clues as to the sese of a target word - The sese of a target word is highly cosistet withi ay give documet. The algorithm first geerates a small set of seed represetative of the differet seses of a word. This step is a supervised phase to assig the sematic label to some of the word occurreces. I the ext step, a supervised classificatio algorithm is used to lear the differeces betwee the cotexts of the differet word seses. I the last step, all word occurreces are classified with the geerated classifier to oe of the sese labels. The usupervised method of [4] requires oly the WordNet sese ivetory ad uaotated text to determie the meaig category of a target word. The algorithm icludes the followig processig steps. First, a pool of applicatio cotext is collected from the web. I the ext step, the seteces cotaiig the target word are parsed with the help of a depedecy parser i a parser tree. The third phase is used to merge these trees ito a depedecy graph. I the last step a graph matchig algorithm is applied to fid the appropriate meaig. This phase is based o the idea that if a word is sematically coheret with its cotext, the at least oe sese of this word is sematically coheret with its cotext. I the case of fully usupervised methods, o backgroud kowledge base is available, oly a uaotated source text ca be used. I this case, oly the usupervised clusterig method ca be used to create sysets for the words i the documet pools. Clusterig methods are used to partitio objects ito groups where the objects withi the same group are similar to each other while objects from differet groups are dissimilar. I geeral, the metioed clusterig methods do ot usually meet this basic costrait of clusterig, as they allow buildig of large clusters cotaiig object pairs with low similarity. Oe way to provide 143
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture better clusterig is the applicatio of Quality Threshold Clusterig [12]. The QTC method esures that the distace betwee ay two elemets withi a cluster should be below a give threshold. All of the metioed clusterig algorithms geerate o-overlappig clusters. There are some applicatio areas where the costrait that every elemet must belog to oly oe cluster that is ot met. I the sematic clusterig, for example, a word may belog to differet clusters, i.e. to differet meaigs at the same time. Clusterig i social etworks or i distributed etworkig are other applicatio areas where o-overlappig clusterig yields i sigificat loss of iformatio. It is show i [15] that overlappig improves the approximatio algorithms sigificatly for miimizig graph coductace. Overlappig clusterig ca be cosidered as a geeralizatio of the stadard clusterig methods. The first importat approach for overlappig clusterig is give i [14]. I this approach, the iput structure is a graph where the edges deote object pairs havig a similarity value above a give threshold. A cluster is cosidered as a maximal complete subgraph, i.e. all of the members are coected to each other. The proposed cluster detectio method is based o the k- ultarmetrics. Aother group of approaches is based o the extesio of the classical clusterig methods. I [5], the k-meas method is adjusted for overlappig clusterig. Iitially, radom cluster ceters are specified. I the ext phase, the elemets are assiged to a subset of clusters. For a give elemet x, the set of cotaier clusters is geerated i the followig way. First, sort the cluster ceters based o the icreasig distaces from the give elemet. Calculate the set A of the first k cotaier clusters for which: is met, where x x ( x) 2 mi m c A c ( x), A m c : the cluster prototype for cluster c. I the third step, the ew positios of the cluster ceters are calculated. The performed tests show that this method is a good alterative of the more sophisticated complex techiques. The third importat category of overlapped clusterig methods is based o the mixture model. I geeral, the key iput source for a mixture model is the observatio matrix X. The ukow parameter set is deoted with Θ. The basic assumptio is that each data X i poit belogs to the followig probability desity [1]: i k p( X ) p ( X ) h 1 h h i h 144
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 where k: the umber of mixture compoets Θ h : the parameters of the h-th mixture compoet h : the probability of the h-th mixture compoet. The parameter values are usually calculated with the EM method where the goal is to maximize likelihood of the give observatio set: i 1 p( ). X i 3 Icremetal Bro-Kerbosch Algorithm Our ivestigatio focuses o the clusterig with the clique detectio approach origiated from the work of [14]. The observatio graph ca be cosidered as a similarity graph. The odes are objects of the problem domai ad there is a udirected edge betwee two vertices (v i,v j ) if ad oly if the d(v i,v j ) < ε for a give threshold ε. A clique, i.e. maximal complete subgraph correspods to a cluster. A cluster symbolizes a cocept i the target otology kowledge base. The goal of the ivestigated algorithm is to detect all maximal cliques i the observatio graph. The Bro-Kerbosch algorithm is oe of the most widely kow ad most efficiet algorithms for maximal clique detectio. The algorithm was preseted first i [2]. The Bro Kerbosch algorithm uses a recursive backtrackig method to search for all maximal cliques i a give graph G(V,E). The pseudocode of the algorithm: BroKerbosch(R, P, X): if P = X = 0 report R as a maximal clique v P: BroKerbosch(R {v}, P N(v), X N(v)) P = P \ {v} X = X {v} I the pseudocode, the followig otatios are used: P: the subset of V that ca have some commo elemets with the ivestigated clique R: the subset of V that share all of its elemets with the ivestigated clique 145
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture X: the subset of V that is disjoit with the ivestigated clique N(v): the set of vertices coected to v. Iitially, the set P cotais all of vertices of G ad both R ad X are empty sets. I every iteratio call, a elemet of P is processed ad the sets P, R, X are refied based o the curret eighborhood set. For the recursive call, the set of possible clique members are restricted to the elemets i the eighborhood set. Similarly, the set of possible excluded elemets are reduced to a subset of the eighborhood set. I the mai loop, the elemets of P are tested i a predefied order. After testig a elemet v from P, it will be removed from the set of cadidate vertices. The basic versio of the Bro-Kerbosch algorithm uses a large umber of recursive calls, resultig i a executio complexity of worst-case ruig time O(3 /3 ) [7] [24]. The updated method uses a specific pivotig strategy to cut computatioal braches. The pivot elemet is selected as the elemet with highest umber of eighbors. TomitaBroKerbosch(R, P, X): if P = X = 0 report R as a maximal clique let u P X: N(u) P max v P\N(u) TomitaBroKerbosch(R {v}, P N(v), X N(v)) P = P \ {v} X = X {v} The preseted methods geerate the output structure for a fixed give iput cotext. This approach is used for static ivestigatio whe there is o chage i the iput cotext. The icremetal costructio method is used for such problems where the iitial cotext is exteded icremetally. The icremetal methods are used for applicatios where cotext chages from time to time. The modelig of cogitive learig processes is a good example of such dyamic problem domais. Our ivestigatio focuses o the icremetal clique geeratio method. For the aalysis of the proposed method, the followig iitial otatios are used: G(V,E) : cotext graph V : set of vertices i G E : set of edges i G N : umber of vertices i V. 146
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 It is assumed that there exists a total orderig of the vertices,.i.e. V { vi i [1.. N]}. Takig oly the first elemets of V, we get a reduced cotext graph G (V,E ), where V E { v v V, i } i i {( vi, v j ) ( vi, v j ) E, vi, v j V}. Let C(G ) deote the set of maximal cliques related to G. The cliques i C(G ) ca be separated ito two disjoit groups, depedig o the property whether they are ew cliques or old cliques related to C(G -1 ). u C( G ) C ( G ) C ( G ) o C ( G ) { c c C( G ) C( G 1)} u C ( G ) C( G o o ) \ C ( G ) It ca be easily verified that for every c C u (G ), v c. Regardig C u (G ), the followig propositio ca be used i the icremetal clique geeratio method. Propositio 1: The set of ew cliques of G is equal to the clique set geerated for the iclusive eighborhood of the ew vertex. where u C ( G ) C( S ) S (x) = (V', E' ) V' = {v i v i V, i <, (v i, v ) E} {v } E' = {(x,y) (x,y) E, x V', y V' }. The iclusive eighborhood set S is defied as the eighborhood of v exteded with the elemet v. Proof. Let us take a ew clique i G : c C u ( G ). 147
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture It ca be see that v c, i.e. v is coected to all other elemets of c. Thus, all elemets of c are i the iclusive eighborhood of v ad all the edges i G are also edges i S. This meas that c is a complete subgraph of S, too. As the assumptio that c is maximal i G ad ot maximal i S, imply a cotradictio, the clique c is a maximal complete subgraph i S. Thus C u ( G ) C( S ). O the other had, takig a c clique from C(S ), c will be complete i G, because all the edges i S are also edges i G. If c is ot maximal i G the there exists a vertex v i such that v c, v V,( v, v ) E. i Thus, i < ad i v V. i This meas that c is ot maximal i S, ad this is a cotradictio, i.e. u ( C G ) C( S ). Based o the cosideratios show i the proof, we get: u ( C G ) C( S ). Propositio 1 ca be used to geerate the ew cliques i a effective way, but ot oly the isertio of the ew cliques is the required update operatio o C(G -1 ). Namely, some of the cliques i C(G -1 ) may become ivalid as they are ot maximal aymore. For example, i Figure 1, the clique set of G 4 is {(1,3),(2,3),(3,4)}. After addig v 5, S 5 is equal to ({2,3,4,5}, {(2,3), (2,5), (3,4), (3,5), (4,5)}). The clique set for S 5 is {(2,3,5),(3,4,5)}. The resultig clique set for G 5 is {(1,3), (2,3,5), (3,4,5)}. Thus, the clique (2,3) is covered by (2,3,5) ad (3,4) is covered by (3,4,5). Thus, the update of C(G -1 ) after geeratio of C(S ) icludes the removal of some existig cliques ad the isertio of some ew cliques. As the covered clique is the same as the ew clique except the v vertex, it is worth to reduce S to the elemets of the eighborhood ad to exclude v. This ew graph is deoted with S -.The icremetal clique geeratio algorithm ca be summarized i the followig listig: IcremetalBroKerbosch(G -1, C(G -1 ), S ): C(S - ) = BroKerbosch(S - ) C(G ) = C(G -1 ) 148
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 c C(S - ): C(G ) = C(G ) \ {c} c = c {v } C(G ) = C(G ) {c } retur C(G ) The lookup operatio i the set C(G ) has a cetral role i optimizatio of the algorithm. Istead of a aive sequetial lookup operatio havig a cost O(D) where D deotes the umber of elemets i the set, a prefix tree structure is used. I the prefix tree structure, the clique sets are represeted with a ordered list of vertex idetifiers. The tree stores these ordered lists where the lists havig the same prefix part share the same prefix segmets i the tree. Figure 1 Extesio of the similarity graph The results of the performed direct tests show that the proposed method provides a efficiet solutio for icremetal clique detectio tasks. I Figure 2, the compariso of the aive icremetal method ad the proposed icremetal method is preseted. The time shows the executio time i logarithmic scale. As the result shows, the proposed algorithm is about 100 times faster tha the aive method. I the aive method, for every ew icomig object, the whole clique set is recalculated from scratch. I Figure 2, the data set NI deotes the aive method ad symbol I is for the proposed icremetal method. Figure 2 The compariso of the aive ad the proposed methods 149
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture The method outperforms ot oly the other icremetal approaches but it has better cost characteristics tha the stadard batch method. For larger iput graphs, the proposed method is sigificatly faster tha the basic Bro-Kerbosch method (see Table 1, Table 2). The rutime is displayed i secods. Table 1 Compariso of the basic Bro-Kerbosch ad the proposed methods for edge probability 0.5 Vertices Cliques BK Rutime Icremetal Rutime 100 5407 0.41 0.32 150 23440 2.47 1.91 200 76838 10.42 8.36 250 214457 38.1 28.6 300 496754 103.5 82.4 Table 2 Compariso of the basic Bro-Kerbosch ad the proposed methods for edge probability 0.3 Vertices Cliques BK Rutime Icremetal Rutime 100 749 0.04 0.04 150 2277 0.12 0.08 200 5126 0.18 0.12 250 9566 0.41 0.24 300 16858 0.80 0.51 The efficiecy of the proposed icremetal algorithm is based o the divide ad coquer paradigm. The algorithm geerates the clique set oly for the eighborhood graph durig each iteratio. The size of the eighborhood graph depeds o the edge probability i the iput graph. If the cost fuctio of the basic Bro-Kerbosch algorithm is deoted with O b ( C ( )) the the complexity of the proposed icremetal method belogs to where O( ( C ( ') D ( '))) b b : the average size of the eighborhood graph, D b (): the umber of cliques i a graph cotaiig vertices. 150
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 The first term i the formula correspods to the geeratio of the cliques for the eighborhood graph while the secod term relates to the elimiatio of the covered cliques from the result set. The preseted formula provides a acceptable approximatio of the measured cost values but further ivestigatio is eeded to work out a more accurate cost fuctio ivolvig the additioal parameters, too. 4 Implemetatio i the MapReduce Architecture Although the speed of computers is icreasig rapidly, the fact that usig several machies i a itercoected system ca be more effective i certai cases was recogized may years ago. There are may solutios i the literature that ca be used i Big Data aalysis. MPP (Massively Parallel Processig) systems [29] store data after splittig it accordig to its features, for example we could store regioal data split by the coutry or state. I-memory database systems [30] are very similar, except that they store data i memory, thus speedig up the retrieval of the records. This area is the topic of active research by Big Database vedors such as Oracle. BSP (Bulk Sychroous Parallel) systems [31] use multiple trasformatio processes that ru i parallel o differet odes. Each process gets data from a master ode ad the seds the result back. After this barrier-like sychroizatio the ext iteratio ca be executed. The big breakthrough came whe Google published their scietific article ivolvig a ew architecture for processig huge amouts of data. This architecture is called MapReduce. The article called MapReduce: Simplified Data Processig o Large Clusters [28] created a whole ew cocept that became the basis for the most popular framework of Big Data aalysis to date. The cocept itself is very simple. We all kow that parallel tasks are most effective whe we ca ru i parallel for a log time without ay sychroizatio barriers. This meas that if we ca divide our algorithms i such a maer, the result ca outperform the sigle-threaded versio. Coversely, if we eed to commuicate betwee the threads, these sychroizatio poits ca slow dow the algorithm. MapReduce got its ame from its two most importat phases: map ad reduce. The map phase produces key-value pairs which become the iput of the reducer phase that yields the fial results. Although Google's implemetatio is ot ope source, there are multiple ope source implemetatios amog which the most popular oe is Apache Hadoop. This framework is actively developed, costatly improvig ad has may techologies built o top of it like Apache Pig (SQL-like iterface), Apache Hive (data warehouse ifrastructure), Apache HBase (distributed NoSQL database), etc. that specializes the Hadoop platform to more specific problems. 151
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture The base cocept of MapReduce ad Apache Hadoop [32] is to split the icomig data to multiple odes ad process them locally. This way after the iitial data copies the mapper tasks ca work o their local data istead of retrievig them durig processig. To make sure that the data is ot lost o ode failures, every data chuk is stored o multiple odes, but i case of ormal work without failures every ode works o its local hard disk. These low-level iteractios are abstracted by the Hadoop Distributed File System (HDFS), ispired by the Google File System (GFS) [9]. This is the most importat compoet of Hadoop as it provides a distributed file system for the MapReduce tasks. The iput data must be copied to this file system which makes sure that every data chuk is persisted o multiple odes, the the mappers get data from this file system ad the reducers write the results back to HDFS. Figure 3 displays a overall view of a very simplified MapReduce task. O the left side of the image we ca see the iput records that come from HDFS. Next to them there are some mapper tasks that receive oe iput record at a time ad produce key-value pairs from them. For simplicity the figure oly has two types of keys (gree ad blue). The reducers receive objects with commo keys ad create the fial output of the applicatio that is persisted back to HDFS. Figure 3 The MapReduce architecture There are may scietific research areas that are directly or idirectly related to MapReduce ad Apache Hadoop. Durig literature research we ca fid scietific articles from the area of agriculture [33] through telecommuicatio [34] to AIbased recommedatio systems [35] that use Hadoop as the applicatio platform. It is ot surprisig that graph algorithms ca be ported to this parallel architecture ad thus speeds up differet search methods. The mai research areas of Hadoop based graph processig are the sematic web related problems ad processig of social etwork data. This work [36] tries to solve the problem of maitaiig ad queryig huge RDF graphs by usig Hadoop, ad provides a iterface o top of it to aswer SPARQL queries. (SPARQL [37] is the W3C stadard laguage for queryig otologies based o 152
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 RDF triplets.) This is a classic otology related problem that ca be itegrated with MapReduce to make the processig distributed. There are also proposals [38] o a Hadoop based solutio to the problem of processig huge graphs i parallel. Ulike most solutios that try to separate the odes i the form of classic MapReduce jobs, this article presets a system that has a highly flexible self-defied message passig iterface that makes it easier to port graph algorithms o top of the MapReduce platform. GPS (Graph Processig System) [39] is a complete ope-source system similar to Google's Pregel [40], extedig it to provide a iterface for processig huge graphs i parallel with global commuicatio amog the odes. As we ca see, these two research results combie Big Data aalysis with graph processig, thus coectig to our research area ad results preseted i this article. The Spider system [41], tries to solve the problem of processig data o the sematic web. The system has two modules: oe that loads the graph ad oe that ca query the previously loaded graph leveragig the Hadoop framework. Cosiderig the implemetatio of the clique detectio algorithm, two differet approaches were tested. I the first approach, the task uit is the processig of the eighborhood graph of a ode. Here, for every ode a separate task is geerated. The mapper calculates the clique set i the eighborhood. The drawback of this approach is that it geerates the same clique several times. The mai beefit of the method is that every ode ca work separately with a smaller amout of local data. I the secod approach, all of the odes work with the global data set ad, share the global graph. The work is separated i such way that every clique is geerated oly oce. Here, the mergig process ca be executed with low cost. Alterately, every ode requires the whole dataset. Regardig the first approach (A), the full graph is divided by its odes. After the GraphRecordReader reads the iput file from HDFS ad recostructs the graph object, it takes each ode from the graph ad all of its eighbors ad yields a ew subgraph from these odes. This way if we fid a maximal clique i the subgraph, it will be a maximal clique of the full graph as well. The Hadoop framework distributes these subgraphs based o the system cofiguratio ad it makes sure that every mapper ode gets approximately the same umber of subgraphs. Each mapper process gets oe subgraph at a time with a ull key, ad executes oe of the Bro-Kerbosch variatios o it. After gettig the maximal cliques, it calculates the hash code of the resultig cliques ad yields key-value pairs cosistig of the hash code as the key ad the clique as the value. Sice the mapper ode might produce the same clique multiple times, a combier is used o each ode that drops every repeated result to optimize speed so that these do t have to be set over the etwork to the reducer odes. 153
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture The reducer the gets oe hash code at a time ad the list of cliques that have that hash code. Ideally this list oly cotais oe clique, but if multiple mappers produced the same clique, the size of the list ca be more tha oe. Therefore the reducer yields oly the first elemet of the list. The Hadoop framework the writes the resultig cliques from the reducer s output to HDFS. I the secod approach (B), the set of cliques is divided by the smallest ode idex value withi the clique. This method yields i a disjoit partitioig, thus the merge phase ca be implemeted with a miimal cost. I the implemetatio of the method a shared HDFS storage is used to store the iput similarity graph. The map compoet performs a Bro-Kerbosch clique detectio algorithm where the mai loop is restricted to the odes assiged to this mapper process. The key value i the MapReduce framework is equal to the set of miimum idex values assiged to a ode for processig. Every mapper ode works with the commo shared iput graph to geerate the correspodig clique set. We tested the two approaches o radomly geerated graphs with fixed ode couts ad edge probabilities to verify their correctess i practice. Although we did t have a full Hadoop cluster up ad ruig, we simulated such eviromets with the help of Docker, a lightweight virtualizatio software. We set up two docker images ad iitiated oe master ad related slave cotaiers, both executig map ad reduce tasks. I the future we would like to set up a physical Hadoop cluster ad verify our assumptios i a real distributed eviromet. The cost models of the proposed methods ca be approximated with the followig formulas: where cos t cos A t B N C( N) N C( N) N L C( N) L N L N N: the umber of odes i the iput graph N : the umber of odes i a eighborhood graph L: the umber of mapper odes C() : the umber of cliques for a graph havig vertices. The formula is based o the fact that the cost of clique detectio algorithm is a liear fuctio of the umber of geerated cliques. The experimetal fuctio of C(N) is show i Figure 4, while Figure 5 presets the correspodig executio costs. These formulas ad the test results show that method A is suitable for those architectures where the eighborhood graph is relatively small (sparse iput graph) ad the L value is large. The method B is optimal for the cases where L is small ad the graph is relatively dese. 154
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 Figure 4 The umber of geerated cliques Coclusios Figure 5 The cost fuctio of the proposed iterative method Oe optio to geerate cocepts i a otology framework, is to build clusters of similar objects. The cliques i a similarity graph ca be cosidered as overlappig quality threshold clusters. The proposed icremetal clique detectio algorithm provides a efficiet clusterig method for dyamic cotexts chagig from time to time. The preseted icremetal model is comparable eve with the usual batch methods. The cost aalysis for large cotexts shows that the optimal implemetatio i the MapReduce architecture depeds o the basic parameters of the iput similarity graph. 155
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture Refereces [1] Baerjee A., Krumpelma C., Basu S., Mooey R.: Model-based Overlappig Clusterig, Proc. of Eleveth ACM SIGKDD, pp. 532-537 [2] Bro C., Kerbosch J., "Algorithm 457: Fidig All Cliques of a Udirected Graph", Commuicatios of ACM 16 (9): 1973, pp. 575-577 [3] Caraballo S.: Automatic Costructio of Hyperym-Label Nou Hierarchy from Text, Proc. of Aual Meetig ACL, 1999, pp. 120-126 [4] Che P, Dig W, Bowes C, Brow D: A Fully Usupervised Word Sese Disambiguatio Method Usig Depedecy Kowledge, The 2009 Aual Coferece of the North America Chapter of the ACL, pp. 28-36 [5] Cleuziou G.: A Exteded Versio of the k-meas Method for Overlappig Clusterig, Patter Recogitio, ICPR 2008, pp. 1-4 [6] Dea J., Ghemawat S.: MapReduce: Simplified Data Processig o Large Clusters, Proc. of OSDI 2004, pp. 10-20 [7] Eppstei D., Löffler M., Strash D.: Listig all Maximal Cliques i Sparse Graphs i Near-Optimal Time, Proc. of ISAAC 2010, pp. 403-414 [8] Fellbaum C.: WordNet: A Electroic Lexical Database (Laguage, Speech ad Commuicatio), MIT Press, 1998 [9] Ghemawat S., Gobioff H., Leug S.: The Google File System. Proc. of the 19 th SOSP 2003, pp. 29-43 [10] Gruber T.: A Traslatio Approach to Portable Otology Specificatios, Kowledge Acquisitio, 5(2): 1993, pp. 199-220 [11] Happel H., Seedorf S.: Applicatios of Otologies i Software Egieerig. Proc. of SWESE 2006, Athes, GA, USA [12] Heyer L., Ramakrishaa R., Livy M.: BIRCH: A Efficiet Data Clusterig Method for Very Large Databases, Geome Research, Vol. 9, 1999, pp. 1106-1115 [13] Hokela T: Comparisos of Self-Orgaized Word Category Maps, I Proceedigs of WSOM'97, pp. 298-303 [14] Jardie N., Sibso R.: Mathematical Taxoomy, Joh Wiley ad Sos Publ. [15] Khadekar R, Kortsarz G., Mirroki V.: O the Advatage of Overlappig Clusters for Miimizig Coductace. Algorithmica, 69(4), 2014, pp. 844-863 [16] Lesk M.: Automatic Sese Disambiguatio usig Machie Readable Dictioaries: How to Tell a Pie Coe from a Ice Cream Coe. I Proc. of SIGDOC 86 156
Acta Polytechica Hugarica Vol. 13, No. 2, 2016 [17] Moo, J. W., Moser, L., "O Cliques i Graphs", Israel J. Math. 3:, 1965, pp. 23-28 [18] Navigli R.: Word Sese Disambiguatio: A Survey, ACM Computig Surveys, 41(2), pp. 1-69 [19] Oberle D.: Sematic Maagemet of Middleware, Volume I of The Sematic Web ad Beyod Spriger, New York (2006) [20] Peacchiotti M., Patel P.: Otologizig Sematic Relatios, Proc. of 21 st COLING ACL, 2006, pp. 793-800 [21] Ritter H., Kohoe T.: Learig 'Sematotopic Maps' from Cotext. I Proc. IJCNN- 90-WASH-DC, It. Cof. o Neural Networks, Vol. I, pp. 23-26 [22] Sieber T., Kovács L.: Mappig XML-Documets ito Object-Model, Proc. of CINTI 2005, pp. 343-354 [23] Subhashii R., Akiladeswari J.: A Survey o Otology Costructio Methodologies, Iteratioal Joural of Eterprise Computig ad Busiess Systems, Vol. 1, Issue 1, 2011 [24] Tomita E., Taaka A., Takahashi H., The Worst-Case Time Complexity for Geeratig all Maximal Cliques ad Computatioal Experimets, Theoretical Computer Sciece 363 (1): 2006, pp. 28-42 [25]. Uschold M., Gruiger M.: Otologies: Priciples, Methods, ad Applicatios. Kowledge Egieerig Review 11 (1996) pp. 93-155 [26] Wood D.: O the Maximum Number of Cliques i a Graph, Graphs Combi. 23 (2007) pp. 337-352 [27] Yarowsky D.: Usupervised Word Sese Disambiguatio Rivalig Supervised Methods, Proc. of ACL 95, 1995, pp. 189-196 [28] Dea J., Ghemawat S.: Mapreduce: Simplified Data Processig o Large Clusters. Commu. ACM 2008, 51(1), pp. 107-113 [29] Simo, H. D.: Partitioig of Ustructured Problems for Parallel Processig. Computig Systems i Egieerig, 2(2), 1991, pp. 135-148 [30] Berkowitz, B. T., Simhadri, S., Christofferso, P. A., ad Mei, G. I- Memory Database System. US Patet 6, 2002, 457,021 [31] Gerbessiotis, A. V. ad Valiat, L. G.: Direct Bulk-Sychroous Parallel Algorithms. Joural of Parallel ad Distributed Computig, 22(2):1994, pp. 251-267 [32] Wadkar, S., Siddaligaiah, M., ad Veer, J.: Pro Apache Hadoop. Apress, 2014 157
L. Kovács et al. Coceptualizatio with Icremetal Bro-Kerbosch Algorithm i Big Data Architecture [33] Yag, F., Wu, H.-R., Zhu, H.-J., Zhag, H.-H., ad Su, X.: Massive Agricultural Data Resource Maagemet Platform Based o Hadoop [J]. Computer Egieerig, 2011, 12:083 [34] Yag, G., Shu, M., Wag, X., ad Xiog, A.: Applicatio Practice of Hadoop Platform for Telecommuicatio Eterprise. Digital Commuicatio, 4:019, 2014 [35] Wag, C., Zheg, Z., ad Yag, Z.: The Research of Recommedatio System Based o Hadoop Cloud Platform. I Computer Sciece & Educatio (ICCSE), 2014 9 th It. Coferece o, pages 193-196. IEEE [36] Husai, M. F., Doshi, P., Kha, L., ad Thuraisigham, B.: Storage ad Retrieval of Large RDF Graph Usig Hadoop ad MapReduce. 2009, pp. 680-686 [37] Quilitz, B. ad Leser, U.: Queryig Distributed RDF Data Sources with SPARQL. I The Sematic Web: Research ad Applicatios, Vol. 5021 of Lecture Notes i Computer Sciece, 2008, pp. 524-538 [38] Pa, W.; Li, Z.-H.; Wu, S.; Che, Q.:Evaluatig Large Graph Processig i MapReduce Based o Message Passig. Chiese Joural of Computers., 2011, pp. 1768-1784 [39] Salihoglu, S. ad Widom, J.: GPS: A Graph Processig System. I Proceedigs of the 25 th Iteratioal Coferece o Scietific ad Statistical Database Maagemet, SSDBM, 2013, pp. 22:1-22:12 [40] Malewicz, G., Auster, M. H., Bik, A. J., Dehert, J. C., Hor, I., Leiser, N., ad Czajkowski, G. (2010) Pregel: A System for Large-Scale Graph Processig. I Proc. of ACM SIGMOD 2010, pp. 135-146 [41] Choi, H., So, J., Cho, Y., Sug, M. K., ad Chug, Y. D.: Spider: A System for Scalable, Parallel/Distributed Evaluatio of Large-Scale RDF Data. I Proceedigs of ACM Coferece o Iformatio ad Kowledge Maagemet, 2009, pp. 2087-2088 [42] Furdik K., Tomasek M., Hreo J.: A WSMO-based Framework Eablig Sematic Iteroperability i e-govermet Solutios, Acta Polytechica Hugarica, Vol. 8, No. 2, 2011, pp. 61-79 158