A Parallel Spatial Co-location Mining Algorithm Based on MapReduce
|
|
|
- Bruno Harvey
- 10 years ago
- Views:
Transcription
1 214 IEEE International Congress on Big Data A Parallel Spatial Co-location Mining Algorithm Based on MapReduce Jin Soung Yoo, Douglas Boulware and David Kimmey Department of Computer Science Indiana University-Purdue University Fort Wayne, Indiana yooj,[email protected] Air Force Research Laboratory Rome Research Site, Rome, New York [email protected] Abstract Spatial association rule mining is a useful tool for discovering correlations and interesting relationships among spatial objects. Co-locations, or sets of spatial events which are frequently observed together in close proximity, are particularly useful for discovering their spatial dependencies. Although a number of spatial co-location mining algorithms have been developed, the computation of co-location pattern discovery remains prohibitively expensive with large data size and dense neighborhoods. We propose to leverage the power of parallel processing, in particular, the MapReduce framework to achieve higher spatial mining processing efficiency. MapReduce-like systems have been proven to be an efficient framework for large-scale data processing on clusters of commodity machines, and for big data analysis for many applications. The proposed parallel co-location mining algorithm was developed on MapReduce. The experimental result of the developed algorithm shows scalability in computational performance. Keywords spatial data mining; co-location pattern; spatial association analysis; cloud computing; MapReduce I. INTRODUCTION So-called Big data is a fact of today s world and brings not only large amounts of data but also various data types that previously would not have been considered together. Richer data with geolocation information and date and time stamps is collected from numerous sources including mobile phones, personal digital assistants, social media, vehicles with navigation, GPS tracking systems, wireless sensors, and outbreaks of disease, disaster and crime. The spatial and spatiotemporal data are considered nuggets of valuable information [1]. Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial data and spatiotemporal data [2]. As one of spatial data mining tasks, spatial association mining has been widely studied for discovering certain association relationships among a set of spatial attributes and possibly some nonspatial attributes [3]. Spatial co-location represents a set of events (or features) which are frequently observed together in a nearby area [4]. Although many spatial association mining techniques [4] [12] have been developed, the computation involved in searching spatially associated objects from large data is inherently too demanding of both processing time and memory requirements. Furthermore, explosive growth in the spatial and spatiotemporal data emphasizes the need for developing new and computationally efficient methods tailored for analyzing big data. As a solution, parallel/distributed data processing techniques are becoming a necessity to deal with the massive amounts of data. We propose to leverage the power of parallel processing to achieve higher spatial data mining processing efficiency. Modern frameworks facilitating the distributed execution of massive tasks are becoming increasingly popular since the introduction of MapReduce programming model, and Hadoop s run-time environment and distributed file systems [13]. For large-scale data processing on clusters of commodity machines, MapReduce-like systems have been proven to be an efficient framework for big data analysis for many applications, e.g., machine learning [14] and graph processing [15]. In this paper, we present a MapReduce-based co-location mining algorithm. Applying the MapReduce framework to spatial co-location mining presents several challenges. Because of limited communication among nodes in a cluster, all steps of co-location mining should be paralleled inter-dependently. However, for the parallel processing, it is non-trivial to divide the search space of co-location instances, i.e., a set of spatial objects which have neighbor relationships with each other. Explicit search space partitioning may lose co-location instances across different partitions and generate incomplete and incorrect mining results. When a client submits a job task, MapReduce automatically partitions the input data into physical blocks and distributes the blocks to the distributed file systems. However, this explicit data partition and distribution may lose some spatial relationships among objects because spatial objects are embedded on continuous space, and make various relationships with each other. Moreover, assigning balanced loads among nodes is a difficult problem. In this paper, we address these problems and propose a parallel/distributed algorithm for spatial co-location mining on MapReduce. The remainder of this paper is organized as follows. Section II presents the basic concept of spatial co-location mining and the MapReduce paradigm, and describes the related work. Section III shows our search space partition strategy and Section IV presents the proposed MapReduce based co-location mining algorithm. Experimental results are reported in Section V and the paper will conclude in Section VI. II. PRELIMINARIES This section presents the basic concept of spatial colocation mining and the MapReduce programming model, and describes the related work. A. Spatial Co-location Mining Boolean spatial events (features) describe the presence of spatial events at different locations in geographic space. Exam /14 $ IEEE DOI 1.119/BigData.Congress
2 B.5 B.2 A.1 B.4 A B A.2 A.1 A.2 B.4 A.4 A.3 A.3 3/4 3/5 E.i : instance i of event type E Figure 1. Spatial co-location mining Computation of prevalence measures A 4/5 co location instances C A.1 A.2 A.3 A.4 2/3 prevalence measures B C B.4 2/5 3/3 A B C A.2 B.4 A.3 2/3 2/5 2/3 Participation Ratio (PR)s Participation Index (PI) =min(pr, PR, PR) =min(2/3, 2/5, 2/3)=2/5 If PI > min prevalence threshold, {A, B, C} is a co location pattern ples of such data include disease outbreaks, crime incidents, traffic accidents, mobile service types, climate events, plants and species in ecology, and so on. Let E = {e 1,...,e m } be a set of different spatial events. A spatial data set S = S 1... S m, where S i (1 i m) is a set of spatial objects of event e i, where an object o S i can be represented with a data record <event type e i, instance id j, location x, y >, where 1 j S i. Figure 1 shows an example where each data object is represented by its event type and unique instance id, e.g., A.1. Spatial neighbor relationship R can be defined with metric relationship (e.g., Euclidean distance), topology relationship (e.g., within, nearest), and direction relationship (e.g, North, South). This work uses a distance-based neighbor relationship. Two spatial objects, o i S and o j S, are neighbors if the distance between them is not greater than a given distance threshold d; that is, R(o i,o j ) distance(o i,o j ) d. In Figure 1, identified neighbor objects are connected by a solid line. A co-location C = {e 1,...,e k } is a subset of spatial events C E whose instance objects are frequently observed in a nearby area. A co-location instance I S of a co-location C is defined as a set of objects which includes all event types in C and forms a clique under the neighbor relationship R. In Figure 1, {A.2, B.4, } is a co-location instance of {A, B, C}. The prevalence strength of co-locations is often measured by participation index [4]. Definition 1: The participation index PI(C) of a colocation C = {e 1,...,e k } is defined as PI(C) = min ei C{PR(C,e i )}, where 1 i k, and PR(C,e i ) is the participation ratio of event type e i in the colocation C that is the fraction of objects of event e i in the neighborhood of instances of co-location C {e i }, i.e., PR(C,e i )= Number of distinct objects of ei in instances of C Number of objects of e i. The prevalence measure indicates wherever an event in C is observed, with a probability of at least PI(C), all other events in C can be observed in its neighborhood. If the participation index of an event set is greater than a userspecified minimum prevalence threshold min prev, the event set is called a co-location or co-located event set. In the example of Figure 1, the participation index of an event set C = {A, B, C} is PI(C)=min{PR(C, A), PR(C, B),PR(C, C)} = min{ 2 3, 2 5, 2 3 }= 2 5. Figure 2. B. MapReduce MapReduce Program and Execution Models MapReduce [16] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. MapReduce simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software. The MapReduce model abstraction is presented in Figure 2. A MapReduce job is executed in two main phases of user defined data transformation functions, namely, map and reduce. When a job is launched, the input data is split into physical blocks and distributed among nodes in the cluster. Such division and distribution of data is called sharding, and each part is called a shard. Each block in the shard is viewed as a list of key value pairs. In the first phase, the key value pairs are processed by a mapper, and are provided individually to the map function. The output of the map function is another set of intermediate key value pairs. The values across all nodes, that are associated with the same key, are grouped together and provided as input to the reduce function in the second phase. The intermediate process of moving the key value pairs from the map tasks to the assigned reduce tasks is called a shuffle phase. At the completion of the shuffle, all values associated with the same key are located at a single reducer, and processed according to the reduce function. Each reducer generates a third set of key value pairs considered as the output of the job. MapReduce can also refer to the execution framework that coordinates the execution of programs written in this particular style. MapReduce decomposes a job submitted by a client into small parallelized workers. A significant feature of MapReduce is its built-in fault tolerance. When a failure at a particular map or reduce task is detected, the keys assigned to that task are reassigned to an available node in the cluster and the crashed task is re-executed without the re-execution of the other tasks. MapReduce was originally developed by Google and has since enjoyed widespread adoption via an open-source implementation called Hadoop [13]. Currently, MapReduce is considered the actual solution for many data analytics tasks over large distributed clusters [17], [18]. 26
3 C. Related Work Since Koperski et al. [3] introduced the problem of mining association rules based on spatial relationships (e.g., proximity, adjacency), spatial association mining has been popularly studied in data mining literature [4] [7], [9], [12], [19], [2]. Shekhar et al. [4] defines the spatial co-location pattern and proposes a join-based co-location mining algorithm. The instance join operation for generating co-location instances is similar to apriori gen [21]. Morimoto [5] studies the same problem to discover frequent neighboring service class sets but a space partitioning and non-overlap grouping scheme is used for finding neighboring objects. Xiao et al. [7] proposes a density based approach for identifying clique neighbor instances. Eick et al. [6] works on the problem of finding regional co-location patterns for sets of continuous variables in spatial data sets. Mohan et al. [2] proposes a graph based approach to regional co-location pattern discovery. Zhang et al. [11] presents a problem of finding a reference feature centric longest neighboring feature sets. Yoo et al. [22] [25] proposes variant co-location mining problems to discover compact sets of co-locations. Wang et al. [1] presents techniques to find star-like and clique topological patterns from spatiotemporal data. General data mining literature has formerly employed parallel methods since its very early days [26]. Zaki [27] presents the survey work of parallel and/or distributed algorithms for association rules mining. There are many novel parallel mining methods as well as proposals that parallelize existing frequent itemset mining techniques such as APriori [21] and FP-growth [28]. However, the number of algorithms that are adapted to the MapReduce framework is rather limited. Some works [29], [3] suggest counting methods to compute the support of every itemset in a single MapReduce round. Different adaptations of APriori to MapReduce are also shown in [31], [32]. Lin et al. [31] presents three different pass counting strategies that are adaptations of Apriori on multiple MapReduce rounds. An adaptation of FP-Growth to MapReduce is presented in [28], [33], [34]. The PARMA algorithm by Riondato et al. [35] finds approximate collections of frequent itemsets. Unfortunately, none of these frameworks have been directly adopted to spatial co-location mining. They are mainly developed for finding frequent itemsets and association rules from transaction databases like market-basket transaction data. However, spatial data does not have the natural transaction concept. Most of these algorithms feed input transaction records to mappers and execute the frequency (e.g., support) counting step in parallel. Whereas, our prevalence measures based on spatial neighborhoods cannot be simply computed with the frequency counting. To the best of our knowledge, our work is the first parallel algorithm of mining spatial co-location patterns on MapReduce. III. SEARCH SPACE PARTITION FOR PARALLEL PROCESSING Spatial data objects and their neighbor relationships can be represented in various ways. Figure 3 (a) represents them with a neighbor graph G =(V,E) where a node, v V, represents a spatial object and an edge e E represents a neighbor relationship between two spatial objects of different event types. Note that we do not consider relationships among same type events. Hence for every edge (u, v) E where u, v V, v.type u.type. A subset C V is a clique in graph G if for every pair of vertexes u, v C, the edge (u, v) E. For the discovery of co-location patterns, we need to find all co-location instances forming cliques from the neighbor graph. However, this process is a computationally expensive operation. Even though we enumerate all maximal cliques in a graph, the number of maximal cliques is exponential in the number of vertexes [36]. In addition, the maximal clique enumeration problem has been proven to be NP-hard [37]. Furthermore, the neighbor graph of dense data may not fit in the memory of a single machine. To take advantage of parallelism, the search space of colocation patterns should be divided into independent partitions so workers can process a given task with each partitioned data synchronously. However, splitting all neighbor relations into disjoint sets is not easy because spatial relationships are continuous in space. Furthermore, we do not want to miss any neighbor relationship during the partitioning process, and the partition cost needs to be inexpensive. In fact, partitions with minimal duplicate information are preferable. We use a strategy to partition edges in the neighbor graph to divide the search space of co-location patterns. The edges of the neighbor graph are divided according to the following simple rule: each vertex v (a spatial object) keeps the relationship edge with other vertex u when v.type < u.type. Here, we assume there is a total ordering of the event type (i.e., lexicographic). This partition strategy divides neighbor relations without duplicating or missing any relationships needed for co-location mining as shown in our previous work [12]. Figure 3 (b) shows the status after the edge partitioning. For neighborhoods divided by the edge-based partitioning method, we define the following term. Definition 2: Let G(V,E) be a neighbor graph. For v V, the conditional neighborhood of v, N (v), is defined as the set of all the vertexes u adjacent to v including v itself which satisfies a certain condition {v, u V : (u, v) E and v.type <u.type } Figure 3. A.1 B.5 A.1 B.4 B.2 A.4 A.4 (b) Edge based partitioning A.3 (a) A neighbor graph B.4 A.2 A.3 Spatial neighborhood partitioning A.2 B.4 A.1,, A.2, B.4, A.3,, A.4, B.2,, B.4, B.5 (c) Conditional neighborhood records 27
4 The conditional neighborhood of a spatial object v is a set of its neighbor objects whose event types are greater than the event type of v including the object itself, v. All objects in N (v) are called star neighbors of v. Figure 3 (c) represents the spatial objects and their neighbor relations with conditional neighborhood records. The conditional neighborhood of A.3, N (A.3), is {A.3,, }, N () is {,, }, and so on. When the neighborhood records are provided to the MapReduce job of co-location mining, MapReduce partitions the records into equal sized blocks and distributes the blocks evenly to the distributed file systems. Each mapper collects colocation instances from assigned neighborhood records, and the outputs are merged for the reducer which determines frequent co-located event sets. IV. A MAPREDUCE BASED CO-LOCATION ALGORITHM Given a spatial data set, a neighbor relationship, and a minimum prevalence threshold, the proposed algorithm discovers prevalent co-located event sets through two main tasks: Spatial neighborhood partition which divides the colocation search space. Parallel co-located event set search, where co-location instances are searched by each map worker synchronously, and then merged so that reducers find prevalent co-located event sets based on the merged instance sets. Figure 4 shows the overall algorithmic framework of colocation mining on MapReduce. A. Spatial Neighborhood Partition For the spatial neighborhood partition task, we use two MapReduce jobs. The first job searches all neighboring pairs, and the second job generates the conditional neighborhood records from the neighboring pairs. Figure 5 and 6 show the pseudo codes. The MapReduce framework splits the input data records into physical blocks without considering the geolocation of a data point. However, through the map function, we rearrange the data points for parallel neighbor search in the partitioned space. Space partitioning is the process of dividing a space into non-overlapping/overlapping regions. According to the geographic location of data point and the space partitioning strategy used, a grid (partition) number is assigned to each data point. The mapper then outputs a key-value pair key = gridno, value = o where o S is a data point object. After all mapper instances have finished, for each key assigned by the mappers, the MapReduce infrastructure collects corresponding values, [value ] (here, a set of o), and feeds the reducers with key-value pairs key = gridno, [value = o]. The reduce function loads a given neighbor distance threshold (dist), finds all neighbor object pairs in the [value ], and outputs key = o i,value = o j, where o i,o j [value ] and distance(o i, o j ) dist. The reduce uses a computational geometry algorithm, e.g., a plane-sweep algorithm [38], for the neighbor search. The Reduce function of Figure 5 shows the plane sweep pseudo code of the neighbor search. The next job is to generate the list of neighbors of each object according to the definition of the conditional neighborhood Figure 4. A MapReduce algorithmic framework for co-location mining (Definition 2). As shown in Figure 4, the map function uses the output from the previous job. When each mapper instance starts, the instance is fed with the shards of neighbor pairs, key = o i,value = o j.ifo j and o i are the same object, or o j s event type is greater than o i s event type, the mapper outputs a key-value pair key = o i,value = o j. A reducer sorts the objects in the value list [o j ] by their event type. The 1: procedure MAPPER(key, value=o) 2: gridno findregion(o) 3: emit(gridno, o) 4: end procedure 5: procedure REDUCER(key=gridno, value=[o]) 6: dist load( neighbor distance threshold ) 7: objectset [o] 8: sort(objectset, x ) 9: activeset 1: j 1 11: n length(objectset) 12: for i 1, n do 13: while (objectset[i].x-objectset[j].x) > dist do 14: remove(activeset, objectset[j]) 15: j j+1 16: end while 17: range subset(activeset, objectset[i].y-dist, object- Set[i].y+dist) 18: for all obj range do 19: if distance(objectset[i], obj) dist then 2: emit(objectset[i], obj) 21: end if 22: end for 23: add(activeset, objectset[i]) 24: emit(objectset[i], objectset[i]) 25: end for 26: end procedure Figure 5. Neighbor search 28
5 1: procedure MAPPER(key=o i, value=o j) 2: if o i==o j or o i.type <o j.type then 3: emit(o i, o j) 4: end if 5: end procedure 6: procedure REDUCER(key=o i, value=[o j]) 7: nrecord o i sort([o j]) 8: emit(o i, nrecord) 9: end procedure Figure 6. Neighbor grouping sorted list becomes the conditional neighborhood of o i, N (o i ). The reducer finally outputs key = o i,value = N (o i ). B. Parallel Co-located Event Set Search After the preprocess task finishes, the parallel co-location mining task starts. Before this work, we have a job that counts and saves the number of object instances per event type for future prevalence calculation (Figure 7). The prevalent co-located event set search is conducted in a level-wise manner. It begins with size k=2 co-location discovery. The proposed approach finds co-location patterns without the candidate set generation. When each mapper instance starts, the mapper is fed with the shards of neighborhood records, key, value = N (o i ). Per each N (o i )= {o i,o 1,...,o m }, the mapper performs the following steps: 1) For each event object o j N(o i ) where o i o j and 1 j m, remove o j from N (o i ) if the o j s type is not included in size k-1 co-located patterns, 2) collect all size k candidate instances CI which start with o i, i.e., {o i,o 1...,o k 1 } from N (o i ), and 3) filter true co-location instances from the candidate instances, and outputs each instance with a key-value pair key = eventset, value = instance. Note that the first object o i of the candidate instance CI={o i,o 1...,o k 1 } is the same as the first object o i in N (o i ), and has a neighbor relationship with the other objects in CI. However, we cannot guarantee CI-o i ={o 1...,o k 1 } has neighbors with each other. We check the cliqueness of the subset with the information from instances of size k-1 co-location patterns. If its sub-instance {o 1,...,o k 1 } is a co-location instance, the candidate instance {o i,o 1,...,o k 1 } becomes a true colocation instance. After all mapper instances have finished, the MapReduce collects the set of corresponding values per each key, and feeds the reducers with key-value pairs key = eventset, [value = instance], where [value ] is a list of co-location instances of the event set, key. A reducer computes the participation ratio of each event type with the instance set, and then computes the participation index of the event set. If the participation index satisfies a given prevalence threshold, the reducer outputs the frequent co-located event set key = eventset, value = participation index. The reducer also saves the co-location instances of the co-located event set for the next size of pattern mining. Figure 8 represents the pseudo code of this MapReduce job. This job is repeated with the increase of pattern size k=k+1. V. EXPERIMENTAL EVALUATION The proposed algorithm was implemented in Java, MapReduce library functions and HBase library functions. HBase [39] is a column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase was used to store the intermediate result such as prevalent colocated event types and co-location instances. The performance evaluation of the developed algorithm was conducted on Amazon Web Services (AWS) Elastic MapReduce platform [4], which provides resizable computing capacity in the cloud. For this experiment, we used instance type m1.large in the AWS. The version of Hadoop used was 1..3, and the version of HBase was We used real world data as well as synthetic data for the experiment. The synthetic data was generated using a spatial data generator used in [12]. In the first experiment, we compared the execution times of two main tasks: neighborhood partition and co-location search using synthetic datasets that differ in the number of data points. The number of distinct feature types in the data was 5. The neighbor distance was fixed to 1. We used a single node cluster for this experiment. Figure 9 (a) shows the result. There was no significant difference in the execution time of preprocessing spatial neighborhoods. However, the co-location pattern search time was increased with almost the same ratio of increased data points when the minimum prevalence threshold was.3. Although all the data points were included in a 1: procedure MAPPER(key=o i, value=n ) 2: emit(o i.type, 1) 3: end procedure 4: procedure REDUCER(key=event, value=[1]) 5: count sum(value) 6: save(event, count) 7: emit(event, count) 8: end procedure Figure 7. Count of instance objects of event 1: procedure MAPPER(key=o i, value=n ) 2: for all o N do 3: if IsPrevalentType(o s type)!=true then 4: N = N - o 5: end if 6: end for 7: k load( current pattern size ) 8: candiinstsets scanntransaction(n, k) 9: for all instance candiinstsets do 1: if checkcliqueness(instance)==true then 11: eventset eventtypesof(instance) 12: emit(eventset, instance) 13: end if 14: end for 15: end procedure 16: procedure REDUCER(key=eventset, value=[instance]) 17: θ Load( min prevalence threshold ) 18: PI compparticipationindex([instance]) 19: if PI θ then 2: emit(eventset, PI) 21: save(eventset, [instance]) 22: end if 23: end procedure Figure 8. Co-location pattern search 29
6 Execution time (sec) Execution time (sec) Co-location search Neighborhood partition 25K 5K 75K Data size(# of points) (a) neighbor distance=9 neighbor distance=1 neighbor distance=11 neighbor distance= Number of cluster nodes (c) Execution time (sec) Execution time (sec) K 5K 75K Number of cluster nodes 5 (b) neighbor distance=.1 mile neighbor distance=.15 mile neighbor distance=.2 mile Number of cluster nodes Figure 9. Experiment result: (a) Comparison of subtasks, (b) By number of data points, (c) By neighbor distance, and (d) Real-world data single partition during the data preprocess, the plane sweeping method adopted to search all neighbor pairs showed a good performance. Next, we changed the number of nodes in the cluster to evaluate the speed increase of our algorithm. As shown in Figure 9 (b), the execution time was decreased with the increased number of nodes. The performance gain due to the parallel processing was high when the number of nodes was increased 1 to 5. After that, the gain was significantly decreased in this experimental setting; although, the overall execution time was decreased with the increase of nodes. In the next experiment, we examined the effect of neighborhood size in the parallel processing performance. The neighborhood size was controlled with different neighbor distances. Figure 9 (c) shows the experimental result on 5K synthetic datasets with distances of 9, 1, 11, and 12. With an increase of the neighbor distance, the execution time of co-location mining is dramatically increased when the cluster has a single node. However, with an increase in the number of cluster nodes, the run time performance was significantly improved in all the neighbor distance settings. In the last experiment, we used real-world data which was a set of 17K points of interest in the Washington D.C. area. The number of feature types was 87. Figure 9 (d) shows the results of the experiments with distances of.1,.15, and.2 miles, respectively. The minimum prevalence threshold was fixed to.4. When the neighborhood size is large, the experiment shows the computational performance of co-location mining is greatly improved with the parallel processing. VI. CONCLUSION In this work, we have proposed to parallelize co-location pattern mining to deal with large-scale spatial data. We have developed a parallel/distributed co-location mining algorithm on Hadoop s MapReduce infrastructure. The proposed framework partitions the spatial neighborhood without any missing (d) and duplicate neighbor relationships for co-location discovery. Each worker independently conducts the co-location mining process with a shard of neighborhood records. The co-location patterns are searched in a level-wise manner by re-using previously processed information and without the generation of candidate sets. The experimental results show that our algorithmic design approach is overall parallelizable and follows a significant increase in speed, with respect to an increase in nodes, when data size is large and the neighborhood is dense. ACKNOWLEDGMENT This work is partially supported by Air Force Research Laboratory, Griffiss Business and Technology Park, and SUNY Research Foundation. REFERENCES [1] R. R. Vatsavai, A. Ganguly, V. Chandola, A. Stefanidis, S. Klasky, and S. Shekhar, Spatiotemporal Data Mining in the Era of Big Spatial Data: Algorithms and Applications, in Proceedings of ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, 212, pp [2] S. Shekhar and S. Chawla, Spatial Databases: A Tour. Prentice Hall, ISBN , 23. [3] K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic Information Databases, in Proceedings of the International Symposium on Large Spatial Data bases, 1995, pp [4] S. Shekhar and Y. Huang, Co-location Rules Mining: A Summary of Results, in Proceedings of International Symposium on Spatio and Temporal Database, 21, pp [5] Y. Morimoto, Mining Frequent Neighboring Class Sets in Spatial Databases, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 21, pp [6] C. F. Eick, R. Parmar, W. Ding, T. F. Stepinski, and J. Nicot, Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets, in Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 28, pp [7] X. Xiao, X. Xie, Q. Luo, and W. Ma, Density based Co-location Pattern Discovery, in Proceedings of ACM SIGSPATIAL international Conference on Advances in Geographic Information Systems, 28, pp [8] Y. Huang, J. Pei, and H. Xiong, Mining Co-Location Patterns with Rare Events from Spatial Data Sets, Geoinformatica, 26, pp [9] J. S. Yoo and S. Shekhar, A Partial Join Approach for Mining Colocation Patterns, in Proceedings of the ACM International Symposium on Advances in Geographic Information Systems, 24, pp [1] J. Wang, W. Hsu, and M. L. Lee, A Framework for Mining Topological Patterns in Spatio-temporal Databases, in Proceedings of ACM International Conference on Information and Knowledge Management, 25, pp [11] X. Zhang, N. Mamoulis, D. Cheung, and Y. Shou, Fast Mining of Spatial Collocations, in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 24, pp [12] J. S. Yoo and S. Shekhar, A Join-less Approach for Mining Spatial Co-location Patterns, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, 26, pp [13] Apache Hadoop, [14] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan, SystemML: Declarative Machine Learning on MapReduce, in Proceedings of International Conference on Data Engineering, 211, pp [15] Giraph, [16] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, 28, pp
7 [17] T. Elsayed, J. Lin, and D. W. Oard, Pairwise Document Similarity in Large Collections with MapReduce, in Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, 28, pp [18] A. Ene, S. Im, and B. Moseley, Fast Clustering using MapReduce, in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 211, pp [19] W.Ding, R. Jiamthapthaksin1, R. Parmar, D. Jiang, T. F. Stepinski, and C. F. Eick, Towards Region Discovery in Spatial Datasets, in Proceedings of International Pacific-Asia Conference on Knowledge Discovery and Data Mining, 28, pp [2] P. Mohan, S. Shekhar, J. Shine, J. ROgers, Z. Jiang, and N. Wayant, A Neighborhood Graph based Approach to Regional Co-location Pattern Discovery: A Summary of Results, in Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 211, pp [21] R. Agarwal and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of International Conference on Very Large Data Bases, 1994, pp [22] J. S. Yoo and M. Bow, Mining Top-k Closed Co-location Patterns, in Proceedings of IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services, 211, pp [23] J. S. Yoo and M. Bow, Finding N-Most Prevalent Colocated Event Sets, in Proceedings of International Conference on Data Warehousing and Knowledge Discovery, 29, pp [24] J. S. Yoo and M. Bow, Mining Spatial Colocation Patterns: A Different Framework, Data Mining and Knowledge Discovery, 212, pp [25] J. S. Yoo and M. Bow, Mining Maximal Co-located Event Sets, in Proceedings of Pacific-Asia International Conference on Knowledge Discovery and Data Mining, 211, pp [26] R. Agrawal and J. Shafer, Parallel Mining of Association Rules, IEEE Transactions on Knowledge and Data Engineering, 1996, pp [27] M. Zaki, Parallel and Distributed Association Mining: A Survey, Concurrency, IEEE, vol. 7, no. 4, 1999, pp [28] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns Without Candidate Generation, SIGMOD Record, vol. 29, 2, pp [29] L. Li and M. Zhang, The Strategy of Mining Association Rule based on Cloud Computing, in Proceedings of International Conference on Business Computing and Global Information, 211, pp [3] X. Y. Yang, Z. Liu, and Y. Fu, MapReduce as a Programming Model for Association Rules Algorithm on Hadoop, in Proceedings of International Conference on Information Sciences and Interaction Sciences (ICIS), 21, pp [31] M. Y. Lin, P. Y. Lee, and S. C. Hsueh, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, in Proceedings of the International Conference on Ubiquitous Information Management and Communication, 212, pp [32] N. Li, L. Zeng, W. He, and Z. Shi, Parallel Implementation of Apriori Algorithm based on MapReduce, in Proceedings of ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, 212, pp [33] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, PFP: Parallel FP-Growth for Query Recommendation, in Proceedings of ACM Conference on Recommender systems, 28, pp [34] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng, Balanced Parallel FP-Growth with MapReduce, in Proceedings of International Conference on Information Computing and Telecommunications, 21, pp [35] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal, PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce, in Proceedings of ACM international Conference on Information and knowledge management, 212, pp [36] J. Moon and L. Moser, On Cliques in Graphs, Israel Journal of Mathematics, vol. 3, 1965, pp [37] J. L. E. Lawler and A. R. Kan, Generating All Maximal Independent Sets: NP-Hardness and Polynomial-time Algorithms, Israel Journal of Mathematics, vol. 3, 1965, pp [38] L. Arge, O. Proceedingspiuc, S. Ramaswamy, T. Suel, and J. Vitter, Scalable Sweeping-Based Spatial Join, in Proceedings of International Conference on Very Large Data Bases, 1998, pp [39] Apache HBase, [4] Amazon Elastic MapReduce (Amazon EMR), 31
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster
Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster Sudhakar Singh Dept. of Computer Science Faculty of Science Banaras Hindu University Rakhi Garg Dept. of Computer
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Comparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Novel Framework for Distributed Data Stream Mining in Big data Analytics Using Time Sensitive Sliding Window
ISSN(Print): 2377-0430 ISSN(Online): 2377-0449 JOURNAL OF COMPUTER SCIENCE AND SOFTWARE APPLICATION In Press Novel Framework for Distributed Data Stream Mining in Big data Analytics Using Time Sensitive
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Mining Interesting Medical Knowledge from Big Data
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from
Improving Apriori Algorithm to get better performance with Cloud Computing
Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become
Implementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya [email protected]
A Partial Join Approach for Mining Co-location Patterns
A Partial Join Approach for Mining Co-location Patterns Jin Soung Yoo Department of Computer Science and Engineering University of Minnesota 200 Union ST SE 4-192 Minneapolis, MN 55414 [email protected]
Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
A Survey on Co-location and Segregation Patterns Discovery from Spatial Data
A Survey on Co-location and Segregation Patterns Discovery from Spatial Data Miss Priya S. Shejwal 1, Prof.Mrs. J. R. Mankar 2 1 Savitribai Phule Pune University,Department of Computer Engineering, K..K..Wagh
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Survey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce
, pp.231-242 http://dx.doi.org/10.14257/ijsia.2014.8.2.24 A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce Wang Jin-Song, Zhang Long, Shi Kai and Zhang Hong-hao School
Searching frequent itemsets by clustering data
Towards a parallel approach using MapReduce Maria Malek Hubert Kadima LARIS-EISTI Ave du Parc, 95011 Cergy-Pontoise, FRANCE [email protected], [email protected] 1 Introduction and Related Work
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Distributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
Frequent Itemset Mining for Big Data
Frequent Itemset Mining for Big Data Sandy Moens, Emin Aksehirli and Bart Goethals Universiteit Antwerpen, Belgium Email: [email protected] Abstract Frequent Itemset Mining (FIM) is one
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
Selection of Optimal Discount of Retail Assortments with Data Mining Approach
Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
Task Scheduling in Hadoop
Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains Dr. Kanak Saxena Professor & Head, Computer Application SATI, Vidisha, [email protected] D.S. Rajpoot Registrar,
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
Cloud Storage Solution for WSN Based on Internet Innovation Union
Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Regional Co-locations of Arbitrary Shapes
Regional Co-locations of Arbitrary Shapes Song Wang 1, Yan Huang 2, and X. Sean Wang 3 1 Department of Computer Science, University of Vermont, Burlington, USA, [email protected], 2 Department of Computer
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING
Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila
Advances in Natural and Applied Sciences
AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and
How To Analyze Log Files In A Web Application On A Hadoop Mapreduce System
Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
How To Use Hadoop For Gis
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.
The WAMS Power Data Processing based on Hadoop
Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore The WAMS Power Data Processing based on Hadoop Zhaoyang Qu 1, Shilin
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: [email protected] November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering
ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:
ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,
UPS battery remote monitoring system in cloud computing
, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology
Evaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Graph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Building A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 [email protected] Qutaibah Althebyan +962796536277 [email protected] Baraq Ghalib & Mohammed
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Comparison of Data Mining Techniques for Money Laundering Detection System
Comparison of Data Mining Techniques for Money Laundering Detection System Rafał Dreżewski, Grzegorz Dziuban, Łukasz Hernik, Michał Pączek AGH University of Science and Technology, Department of Computer
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Hadoop Scheduler w i t h Deadline Constraint
Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Binary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: [email protected] M. Moorthi
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Cloud Storage Solution for WSN in Internet Innovation Union
Cloud Storage Solution for WSN in Internet Innovation Union Tongrang Fan, Xuan Zhang and Feng Gao School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, 050043, China
Cloud Computing based on the Hadoop Platform
Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.
Cloud computing doesn t yet have a
The Case for Cloud Computing Robert L. Grossman University of Illinois at Chicago and Open Data Group To understand clouds and cloud computing, we must first understand the two different types of clouds.
International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients
