Incremental Map Reduce & Keyword-Aware for Mining Evolving Big Data

Rahul More 1, Sheshdhari Shete 2, Snehal Jagtap 3, Prajakta Shete 4

1 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. rahulsmore909@gmail.com
2 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. shete.sheshadri@gmail.com
3 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. snehaljagtap46@yahoo.com
4 Assistant Professor, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. prajakta_shete@yahoo.com

ABSTRACT

Big data is large-volume, heterogeneous, distributed data. In big data applications, where data collection grows continuously, it is expensive to capture, manage, extract, and process data using existing software tools. Examples include weather forecasting, electricity demand and supply, and social media. As the size of the data in a data warehouse increases, data analysis becomes expensive. A data cube is commonly used for abstracting and summarizing databases. It is a way of structuring data in n dimensions for analysis over some measure of interest. Big data processing frameworks rely on clusters of computers and on the parallel execution framework provided by MapReduce. This work extends cube computation techniques to this paradigm. MR-Cube is a MapReduce-based framework for cube materialization and mining over massive datasets using holistic measures. MR-Cube efficiently computes cubes with holistic measures over billion-tuple datasets.
Keywords: Recommender System, Preferences, Keyword, Big Data, MapReduce, Hadoop

1. INTRODUCTION

In big data, information comes from multiple, heterogeneous, autonomous sources with complex relationships, and it keeps growing. Up to 2.5 quintillion bytes of data are created daily, and 90 percent of the data in the world today was produced within the past two years. For example, Flickr, a public picture-sharing site, received an average of 1.8 million photos per day from February to March 2012. This shows that it is very difficult for big data applications to manage, process, and retrieve data from such volumes using existing software tools. Extracting useful knowledge for future use has become a challenge. Data mining with big data poses several challenges, which we review in the next section.

Currently, big data processing depends on parallel programming models such as MapReduce, as well as on computing platforms that provide big data services. Data mining algorithms need to scan through the training data to obtain the statistics used for solving or optimizing model parameters. Because of the large size of the data, analyzing a data cube is becoming expensive. The MapReduce-based approach is used for data cube materialization and mining over massive datasets using holistic (non-algebraic) measures such as TOP-k for the top-k most frequent queries. The MR-Cube approach is used for efficient cube computation. This paper is organized as follows: we first present the key challenges of big data mining and then review methods such as cube materialization, MapReduce, and the MR-Cube approach.

MapReduce is a distributed parallel programming model introduced by Google to support massive data processing. The first version of the MapReduce library was written in February. The programming model is inspired by the map and reduce primitives found in Lisp and other functional languages. This model processes large amounts of data faster than a relational database management system (RDBMS).
Big data also brings new opportunities and critical challenges to industry and academia. As with most big data applications, the big data trend has a heavy impact on service recommender systems. With the growing number of alternative services, effectively recommending the services that users prefer has become an important research issue. Service recommender systems have proven to be valuable tools that help users deal with service overload by providing appropriate recommendations. Practical applications that now use recommender systems include CDs, books, web pages, and various other products. Over the last decade, much research has been done in both industry and academia on developing new approaches for service recommender systems.

2. PROBLEM STATEMENT

The problem is to determine a personalized service recommendation list and to recommend the most appropriate services to users effectively. Specifically, keywords are used to indicate users' preferences, and a user-based Collaborative Filtering algorithm is adopted to generate appropriate recommendations.

The challenges at Tier I focus on low-level data access and arithmetic computing procedures, along with challenges of information sharing and privacy. Big data is often stored at different locations and grows continuously, which is why an effective computing platform must take distributed large-scale data storage into consideration for computing. Tier II concentrates on high-level semantics, application domain knowledge for different big data applications, and user privacy issues. This information benefits big data access but also adds technical barriers to big data access (Tier I) and mining algorithms (Tier III). The outermost tier is Tier III, which concerns the actual mining algorithms. At this tier, the mining challenges concentrate on algorithm designs for tackling the difficulties raised by big data volumes, distributed data, and complex and dynamic data characteristics.

3. LITERATURE SURVEY

3.1 MapReduce: Simplified Data Processing on Large Clusters

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
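The map and reduce roles described above can be illustrated with a minimal, single-process sketch in Python. This is not Google's or Hadoop's implementation: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, the word-count task is chosen only for illustration, and the in-memory grouping step stands in for the distributed shuffle that a real runtime performs across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver (assumption): runs map, groups by key, then runs reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)  # stand-in for the shuffle phase
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


docs = [("d1", "big data needs big clusters"), ("d2", "data cube")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Because map_fn sees one record at a time and reduce_fn sees one key at a time, a runtime is free to spread both phases over many machines; the sketch keeps everything in one process purely to show the data flow.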
The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. The authors' implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

3.2 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

This work presents Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, the authors show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. RDDs are implemented in a system called Spark, which the authors evaluate through a variety of user applications and benchmarks.

4. RELATED WORK

Related work shows that Bayesian-inference-based recommendation performs better than existing trust-based recommendation and is comparable to Collaborative Filtering recommendation.

5. EXISTING SYSTEM

Some existing works focus on retrieving individual objects by specifying a query consisting of a query location and a set of query keywords (also known as a document in some contexts). Each retrieved object is associated with keywords relevant to the query keywords and is close to the query location.

When the number of query keywords increases, performance drops dramatically because of the massive number of candidate keyword covers generated. The inverted index at each node refers to a pseudo-document that represents the keywords under the node. Therefore, to verify whether a node is relevant to a set of query keywords, the inverted index is accessed at each node to evaluate the match between the query keywords and the pseudo-document associated with the node.

Do all users actually need their social network relationships in order to be recommended items? Do these relationships submerge a user's personality, especially for experienced users? Embodying a user's personality in a recommender system (RS) is still a great challenge, and how to integrate social factors effectively into the recommendation model to improve the accuracy of the RS remains an open issue.

6. PROPOSED SYSTEM

We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task-level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce the I/O overhead of accessing preserved fine-grain computation states.

This paper also investigates a generic version of query, called the keyword query, which considers inter-object distance as well as keyword rating. It is motivated by the increasing availability and importance of keyword ratings: businesses, services, and features around the world have been rated by customers through online business review sites.
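The idea of key-value pair level incremental processing can be sketched as follows. This is only an illustration under stated assumptions, not the i2MapReduce engine: the class IncrementalWordCount and its method apply_delta are hypothetical names, the preserved state is kept in memory rather than on disk, and word count is used because its reduce function (a sum) can be updated from differences alone.

```python
from collections import Counter


class IncrementalWordCount:
    """Keeps per-record map outputs so that a delta touches only affected keys."""

    def __init__(self):
        self.map_outputs = {}    # record id -> Counter of intermediate (word, count) pairs
        self.result = Counter()  # preserved reduce output: word -> total count

    @staticmethod
    def _map(text):
        return Counter(text.lower().split())

    def apply_delta(self, delta):
        """delta maps record id -> new text, or None to delete the record.

        Returns the set of keys whose reduce output actually changed.
        """
        touched = set()
        for rid, text in delta.items():
            old = self.map_outputs.pop(rid, Counter())
            new = self._map(text) if text is not None else Counter()
            if text is not None:
                self.map_outputs[rid] = new
            # Only keys emitted by the old or new version of this record can change.
            for word in old.keys() | new.keys():
                diff = new[word] - old[word]
                if diff:
                    self.result[word] += diff
                    touched.add(word)
                    if self.result[word] == 0:
                        del self.result[word]
        return touched


wc = IncrementalWordCount()
wc.apply_delta({"d1": "big data", "d2": "data cube"})  # initial run
touched = wc.apply_delta({"d2": "data mining"})        # one record changes
```

In this example only the keys "cube" and "mining" are re-reduced after the update; "data" and "big" keep their preserved results, which is the saving that fine-grain incremental processing aims for compared with re-running a whole task.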
In this paper, three social factors (personal interest, interpersonal interest similarity, and interpersonal influence) are fused into a unified personalized recommendation model based on probabilistic matrix factorization. A user's personality is denoted by the user-item relevance of the user's interest to the topic of the item. To embody the effect of the user's personality, we mine the topic of each item from the natural item category tags of the rating datasets. Thus, each item is denoted by a category distribution, or topic distribution vector, which reflects the characteristics of the rating datasets. Moreover, we derive user interest from the user's rating behavior. We then weight the effect of a user's personality in our personalized recommendation model in proportion to the user's expertise level.

On the other hand, the user-user relationship of a social network contains two factors: interpersonal influence and interpersonal interest similarity. We apply the inferred trust circle of the Circle-based Recommendation model to enforce the factor of interpersonal influence. Similarly, for interpersonal interest similarity, we infer an interest circle to enhance the intrinsic link between user latent features. The factor of user personal interest makes direct connections between user and item latent feature vectors, and the two other social factors make connections between previous-user and active-user latent feature vectors. Personal unique interest is modeled to obtain an accurate model for cold-start users and for users with very few friends and rated items.

The impacts of the three factors on recommendation are systematically compared, and the effect of the proposed model on the user cold-start and sparsity problems is examined. The preference keyword set of the active user and the preference keyword sets of previous users are also considered.
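One way to combine the preference keyword sets of the active user and of previous users with the user-based Collaborative Filtering algorithm named in the problem statement is sketched below. The source does not specify a similarity measure or a prediction formula, so both are assumptions here: Jaccard similarity over keyword sets and a similarity-weighted average of previous users' ratings. The function names, the top_k and min_sim parameters, and the sample data are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets (assumed measure, not from the paper)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(active_keywords, previous_users, top_k=2, min_sim=0.0):
    """Rank services by the similarity-weighted average rating of similar users.

    previous_users is a list of (keyword_set, {service: rating}) pairs.
    """
    weighted, weights = {}, {}
    for keywords, ratings in previous_users:
        sim = jaccard(active_keywords, keywords)
        if sim <= min_sim:
            continue  # ignore previous users with no shared preference keywords
        for service, rating in ratings.items():
            weighted[service] = weighted.get(service, 0.0) + sim * rating
            weights[service] = weights.get(service, 0.0) + sim
    scores = {s: weighted[s] / weights[s] for s in weighted}
    # Highest predicted rating first; ties are broken by service name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


active = {"wifi", "parking", "breakfast"}
previous = [
    ({"wifi", "parking"}, {"hotel_a": 5, "hotel_b": 3}),
    ({"breakfast", "pool"}, {"hotel_b": 4, "hotel_c": 2}),
    ({"spa"}, {"hotel_c": 5}),
]
recs = recommend(active, previous)
```

With this sample data the third previous user shares no keywords with the active user and is skipped, so hotel_a ranks first and hotel_b second. Other similarity measures or rating normalizations could be substituted without changing the overall structure.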

7. SYSTEM ARCHITECTURE

Fig-1 System Architecture

This application is used to guide the user in a personalized way to interesting or useful services in a large space of possible options. Current recommendation methods can usually be classified into three main categories: content-based, collaborative, and hybrid recommendation approaches. Content-based approaches recommend services similar to those the user preferred in the past. Collaborative filtering (CF) approaches recommend to the user services that users with similar tastes preferred in the past. Hybrid approaches combine content-based and CF methods in several different ways. CF is a personalized recommendation algorithm that is widely used in many commercial recommender systems. In CF-based systems, users receive recommendations based on people who have similar tastes and preferences; these systems can be further classified into item-based CF and user-based CF. In item-based systems, the predicted rating depends on the ratings of other similar items by the same user, while in user-based systems, the predicted rating of an item for a user depends on the ratings of the same item by similar users. In this work, we use a user-based CF algorithm to deal with our problem.

8. FUTURE SCOPE

We will do further research on how to deal with the case where a term appears in different categories of a domain thesaurus depending on context, and on how to distinguish users' positive and negative preferences from their reviews to make the predictions more accurate.

9. CONCLUSION

We conclude that i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation. This work pairs it with a keyword-aware service recommendation method.

10. REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proc. of OSDI '04.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. of NSDI '12.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. of OSDI '10, 2010.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proc. of SIGMOD '10.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "REX: recursive, delta-based data-centric computation," PVLDB, vol. 5, no. 11.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," PVLDB, vol. 5, no. 8.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," PVLDB, vol. 5, no. 11.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: efficient iterative data processing on large clusters," PVLDB, vol. 3, no. 1-2.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proc. of MAPREDUCE '10.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapReduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1, 2012.