Incremental Map Reduce & Keyword-Aware for Mining Evolving Big Data



Similar documents
From GWS to MapReduce: Google's Cloud Technology in the Early Days

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Hadoop MapReduce using Cache for Big Data Processing

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

SECURED BIG DATA MINING SCHEME FOR CLOUD ENVIRONMENT

Big Data and Scripting Systems beyond Hadoop

Big Data Analytics Hadoop and Spark

Comparison of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Introduction to Big Data with Apache Spark, UC Berkeley

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

Spark. Fast, Interactive, Language-Integrated Cluster Computing

Survey on Load Rebalancing for Distributed File System in Cloud

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

CSE-E5430 Scalable Cloud Computing Lecture 11

BSPCloud: A Hybrid Programming Library for Cloud Computing *

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Introduction to Parallel Programming and MapReduce

Map-Based Graph Analysis on MapReduce

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

CloudClustering: Toward an iterative data processing pattern on the cloud

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

Big Data Processing with Google's MapReduce. Alexandru Costan

Big Data and Scripting Systems build on top of Hadoop

Log Mining Based on Hadoop's Map and Reduce Technique

Twister4Azure: Data Analytics in the Cloud

Task Scheduling in Hadoop

Big Data With Hadoop

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Query and Analysis of Data on Electric Consumption Based on Hadoop

Efficient Analysis of Big Data Using Map Reduce Framework

Scaling Out With Apache Spark. DTL Meeting Slides based on

Big Data with Rough Set Using Map- Reduce

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Survey on Scheduling Algorithm in MapReduce Framework

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

MapReduce (in the cloud)

Big Data Analytics. Lucas Rego Drumond

Lecture Data Warehouse Systems

A Study on Data Analysis Process Management System in MapReduce using BPM

Large data computing using Clustering algorithms based on Hadoop

RECOMMENDATION METHOD ON HADOOP AND MAPREDUCE FOR BIG DATA APPLICATIONS

International Journal of Innovative Research in Computer and Communication Engineering

In-memory Distributed Processing Method for Traffic Big Data to Analyze and Share Traffic Events in Real Time among Social Groups

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

Big Data and Apache Hadoop's MapReduce

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

CSE-E5430 Scalable Cloud Computing Lecture 2

BIG DATA TECHNOLOGY. Hadoop Ecosystem

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Evaluating partitioning of big graphs

How To Balance In Cloud Computing

Approaches for parallel data loading and data querying

Mobile Storage and Search Engine of Information Oriented to Food Cloud

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

A Survey of Cloud Computing Guanfeng Octides

Comparison of Different Implementation of Inverted Indexes in Hadoop

Brave New World: Hadoop vs. Spark

IMPROVED MASK ALGORITHM FOR MINING PRIVACY PRESERVING ASSOCIATION RULES IN BIG DATA

Architectures for Big Data Analytics A database perspective

Rakam: Distributed Analytics API

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Challenges in Bioinformatics

Journal of Chemical and Pharmaceutical Research, 2015, 7(3): Research Article. E-commerce recommendation system on cloud computing

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Apache Flink Next-gen data analysis. Kostas

VariantSpark: Applying Spark-based machine learning methods to genomic information

A Survey on Classification of Data Mining Using Big Data

Big Data Frameworks Course. Prof. Sasu Tarkoma

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Advanced In-Database Analytics

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Big Data Begets Big Database Theory

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Data-Intensive Computing with Map-Reduce and Hadoop

Presto/Blockus: Towards Scalable R Data Analysis

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

ANALYSIS OF BILL OF MATERIAL DATA USING KAFKA AND SPARK

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

An Advanced Bottom up Generalization Approach for Big Data on Cloud

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

UPS battery remote monitoring system in cloud computing

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Distributed file system in cloud based on load rebalancing algorithm

Transcription:

Incremental Map Reduce & Keyword-Aware for Mining Evolving Big Data

Rahul More 1, Sheshdhari Shete 2, Snehal Jagtap 3, Prajakta Shete 4
1 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. rahulsmore909@gmail.com
2 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. shete.sheshadri@gmail.com
3 Student, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. snehaljagtap46@yahoo.com
4 Assistant Professor, Department of Computer Engineering, Genba Sopanrao Moze College of Engineering, Balewadi, Savitribai Phule Pune University, Pune, India. prajakta_shete@yahoo.com

ABSTRACT
Big data is large-volume, heterogeneous, distributed data. In big data applications, where data collection grows continuously, it is expensive to manage, capture, extract, and process the data using existing software tools; examples include weather forecasting, electricity demand and supply, and social media. As the size of data in a data warehouse increases, data analysis becomes expensive. The data cube is a common way of abstracting and summarizing databases: it structures data along n dimensions for analysis over some measure of interest. For data processing, big data frameworks rely on clusters of computers and the parallel execution model provided by MapReduce, so cube computation techniques must be extended to this paradigm. MR-Cube is a MapReduce-based framework for cube materialization and mining over massive datasets using holistic measures; it efficiently computes cubes with holistic measures over billion-tuple datasets.
Keywords: Recommender System, Preferences, Keyword, Big Data, Map-Reduce, Hadoop.
---------------------------------------------------------------------------------------------------------------------------
1. INTRODUCTION
In big data, information comes from multiple, heterogeneous, autonomous sources with complex relationships and grows continuously. Up to 2.5 quintillion bytes of data are created daily, and 90 percent of the data in the world today was produced within the past two years. For example, Flickr, a public picture-sharing site, received an average of 1.8 million photos per day between February and March 2012. This shows how difficult it is for big data applications to manage, process, and retrieve information from such large volumes of data using existing software tools; extracting useful knowledge for future use has become a challenge. Data mining with big data raises several distinct challenges, which we survey in the next section. Currently, big data processing depends on parallel programming models such as MapReduce, together with computing platforms that provide big data services. Data mining algorithms must scan the training data to obtain the statistics needed to solve or optimize model parameters, and at large data sizes computing and analyzing a data cube becomes expensive. A MapReduce-based approach is used for data cube materialization and mining over massive datasets with holistic (non-algebraic) measures, such as TOP-k for the top-k most frequent queries; the MR-Cube approach enables efficient cube computation. Our paper is organized as follows: first we present the key challenges of big data mining, and then we review methods such as cube materialization, MapReduce, and the MR-Cube approach.

MapReduce is a distributed parallel programming model introduced by Google to support massive data processing. The first version of the MapReduce library was written in February 2003. The programming model is inspired by the map and reduce primitives found in Lisp and other functional languages.
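As a concrete illustration of the map and reduce primitives mentioned above, here is a minimal, single-machine sketch of the programming model (a toy word count; the real runtime distributes the map, shuffle, and reduce phases across a cluster, and all function names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: merge all values that share the same intermediate key.
    yield (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: collect and group intermediate pairs by key,
    # as the MapReduce runtime would do between the two phases.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))
    intermediate.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            results[k] = v
    return results

docs = {"d1": "big data big cluster", "d2": "big data mining"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'cluster': 1, 'data': 2, 'mining': 1}
```

Because the per-key reduce calls are independent, the runtime can execute them on different machines without changing the result.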
This model processes large amounts of data faster than a relational database management system (RDBMS). Big data also brings new opportunities and critical challenges to industry and academia. As with most big data applications, the big data trend has a heavy impact on service recommender systems. With the growing number of alternative services, effectively recommending the services that users prefer has become an important research issue. Service recommender systems have proven to be valuable tools for helping users cope with service overload and for providing appropriate recommendations. Practical applications that now use recommender systems include CDs, books, web pages, and various other products. Over the last decade, much research has been done in both industry and academia on new approaches to service recommender systems.

2. PROBLEM STATEMENT
The problem is to determine a personalized service recommendation list and to recommend the most appropriate services to users effectively. Specifically, keywords are used to indicate users' preferences, and a user-based Collaborative Filtering algorithm is adopted to generate appropriate recommendations. The challenges at Tier I concern low-level data access and arithmetic computing procedures, as well as information sharing and privacy. Big data is often stored in different locations and grows continuously, so an effective computing platform must take distributed, large-scale data storage into consideration. Tier II concentrates on high-level semantics, application domain knowledge for different big data applications, and user privacy issues; this information benefits big data access but also adds technical barriers to big data access (Tier I) and mining algorithms (Tier II). The outermost tier, Tier III, concerns the actual mining algorithms. At this tier, the challenges center on designing algorithms that tackle the difficulties raised by big data volumes, distributed data, and complex, dynamic data characteristics.

3. LITERATURE SURVEY
3.1 MapReduce: Simplified Data Processing on Large Clusters
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. The authors' implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

3.2 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory based on coarse-grained transformations rather than fine-grained updates to shared state. Nevertheless, RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs such as Pregel, as well as new applications that these models do not capture. RDDs are implemented in a system called Spark, which the authors evaluate through a variety of user applications and benchmarks.

4. RELATED WORK
In related work, Bayesian-inference-based recommendation has been shown to outperform existing trust-based recommendations and to be comparable to Collaborative Filtering recommendation.

5. EXISTING SYSTEM
Some existing works focus on retrieving individual objects by specifying a query consisting of a query location and a set of query keywords (also known as a document in some contexts). Each retrieved object is associated with keywords relevant to the query keywords and is close to the query location.
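To illustrate the keyword-matching side of such queries, here is a minimal sketch of an inverted index over object keywords (a deliberate simplification: the systems discussed index keywords per tree node and also account for spatial distance, and all names here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(objects):
    # Map each keyword to the set of object ids whose keyword set contains it.
    index = defaultdict(set)
    for obj_id, keywords in objects.items():
        for kw in keywords:
            index[kw].add(obj_id)
    return index

def query(index, query_keywords):
    # An object is a candidate only if it covers every query keyword,
    # so we intersect the posting sets of all query keywords.
    result = None
    for kw in query_keywords:
        postings = index.get(kw, set())
        result = postings if result is None else result & postings
    return result or set()

objects = {
    "hotel_a": {"wifi", "pool", "parking"},
    "hotel_b": {"wifi", "parking"},
    "hotel_c": {"pool"},
}
idx = build_inverted_index(objects)
print(sorted(query(idx, {"wifi", "parking"})))  # ['hotel_a', 'hotel_b']
```

The performance issue described next arises because, with many query keywords, the number of candidate combinations of objects that jointly cover the keywords grows rapidly.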

When the number of query keywords increases, performance drops dramatically because a massive number of candidate keyword covers is generated. The inverted index at each node refers to a pseudo-document that represents the keywords under that node; therefore, to verify whether a node is relevant to a set of query keywords, the inverted index is accessed at each node to evaluate the match between the query keywords and the node's pseudo-document. Do all users actually need relationships on social networks to receive item recommendations? Does the relationship submerge a user's personality, especially for experienced users? It remains a great challenge to embody a user's personality in a recommender system (RS), and how to integrate social factors effectively into the recommendation model to improve its accuracy is still an open issue.

6. PROPOSED SYSTEM
We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task-level re-computation, (ii) supports not only one-step computation but also the more sophisticated iterative computation widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce the I/O overhead of accessing preserved fine-grain computation states. This paper also investigates a generic version of the query, called the keyword query, which considers inter-object distance as well as keyword rating. It is motivated by the increasing availability and importance of keyword ratings: businesses, services, and features around the world have been rated by customers through online business review sites.
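The key-value pair level incremental processing idea can be illustrated with a toy sketch (this is not the actual i2MapReduce engine; the class and method names are invented for illustration). Instead of re-running the whole job when one input changes, the preserved per-input intermediate pairs let us apply only the delta to the reduce state:

```python
from collections import Counter

class IncrementalWordCount:
    """Toy key-value-pair-level incremental word count: preserve each
    input's intermediate pairs so an update touches only the changed keys,
    rather than recomputing the entire job from scratch."""

    def __init__(self):
        self.per_doc = {}        # preserved fine-grain state: doc -> Counter
        self.totals = Counter()  # current reduce output

    def update(self, doc_id, text):
        new = Counter(text.lower().split())
        old = self.per_doc.get(doc_id, Counter())
        # Apply only the delta between old and new pairs to the reduce state.
        self.totals.update(new)
        self.totals.subtract(old)
        self.totals += Counter()  # drop keys whose count fell to zero
        self.per_doc[doc_id] = new
        return self.totals

wc = IncrementalWordCount()
wc.update("d1", "big data mining")
wc.update("d2", "big data")
print(dict(wc.update("d1", "big data cube")))
# {'big': 2, 'data': 2, 'cube': 1}
```

Note that updating "d1" does not reprocess "d2" at all; only the pairs that actually changed (here, "mining" removed and "cube" added) are applied to the output.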
In this paper, three social factors (personal interest, interpersonal interest similarity, and interpersonal influence) are fused into a unified personalized recommendation model based on probabilistic matrix factorization. Personality is denoted by the user-item relevance of a user's interest to the topic of an item. To capture the effect of a user's personality, we mine the topic of each item from the natural item category tags of the rating datasets; each item is thus denoted by a category (topic) distribution vector, which reflects the characteristics of the rating datasets. Moreover, we derive user interest from rating behavior, and we weight the effect of each user's personality in the personalized recommendation model in proportion to their expertise level. On the other hand, the user-user relationships of the social network contain two factors: interpersonal influence and interpersonal interest similarity. We apply the inferred trust circle of the circle-based recommendation model to enforce the factor of interpersonal influence. Similarly, for interpersonal interest similarity, we infer an interest circle to enhance the intrinsic links among user latent features. The factor of personal interest makes direct connections between user and item latent feature vectors, while the two social factors make connections between the latent feature vectors of previous users and the active user. Personal unique interest is modeled to obtain an accurate model for cold-start users and for users with very few friends and rated items. The impacts of the three factors on recommendation are systematically compared, as is the effect of the proposed model in solving the user cold-start and sparsity problems. The model also uses the preference keyword set of the active user as well as the preference keyword sets of previous users.
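The matrix factorization core of such a model can be sketched in heavily simplified form: plain factorization trained with stochastic gradient descent, omitting the social-circle and topic-distribution terms of the full model (the dataset, function name, and hyperparameter values below are illustrative):

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=1000):
    """Learn user/item latent vectors so dot(U[u], V[i]) approximates each
    observed rating r, via stochastic gradient descent with L2 regularization."""
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                u_f = U[u][f]  # snapshot before updating, used for V's step
                U[u][f] += lr * (err * V[i][f] - reg * u_f)
                V[i][f] += lr * (err * u_f - reg * V[i][f])
    return U, V

# (user, item, rating) triples; a tiny illustrative dataset.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2)]
U, V = factorize(ratings, n_users=3, n_items=3)
# Predict an unseen pair, e.g. (user 0, item 2):
prediction = sum(U[0][f] * V[2][f] for f in range(2))
```

The full model would add further terms to the objective that pull a user's latent vector toward those of their trust and interest circles.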

7. SYSTEM ARCHITECTURE
Fig. 1: System Architecture
This application introduces the user, in a personalized way, to interesting or useful services within a large space of possible options. Current recommendation methods can usually be classified into three main categories: content-based, collaborative, and hybrid recommendation approaches. Content-based approaches recommend services similar to those the user preferred in the past. Collaborative filtering (CF) schemes recommend to the user services that users with similar tastes preferred in the past. Hybrid approaches combine content-based and CF methods in several different ways. Personalized recommendation algorithms are widely used in many commercial recommender systems. In CF-based systems, users receive recommendations based on people with similar tastes and preferences; such systems can be further classified into item-based CF and user-based CF. In item-based systems, the predicted rating depends on the ratings the same user gave to other, similar items, while in user-based systems, the predicted rating of an item for a user depends on the ratings given to the same item by similar users. In this work, we take advantage of a user-based CF algorithm to address our problem.

8. FUTURE SCOPE
We will further research how to deal with cases where a term appears in different categories of a domain thesaurus depending on context, and how to distinguish users' positive and negative preferences from their reviews to make predictions more accurate.

9. CONCLUSION
We conclude that i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation with a keyword-aware service recommendation method.
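The user-based CF prediction described in Section 7 can be sketched minimally as follows (the ratings data and function names are illustrative; similarity is computed over co-rated items only):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two users, restricted to co-rated items.
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(a[i] ** 2 for i in common)) * sqrt(sum(b[i] ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    # User-based CF: the predicted rating is the similarity-weighted
    # average of other users' ratings of the same item.
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        sim = cosine(ratings[user], r)
        num += sim * r[item]
        den += abs(sim)
    return num / den if den else None

ratings = {
    "alice": {"svc1": 5, "svc2": 3},
    "bob":   {"svc1": 4, "svc2": 3, "svc3": 4},
    "carol": {"svc1": 1, "svc2": 5, "svc3": 2},
}
print(round(predict(ratings, "alice", "svc3"), 2))  # 3.19
```

Because alice's ratings agree far more with bob's than with carol's, the prediction for svc3 lands close to bob's rating of 4 rather than carol's 2.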

10. REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. of OSDI '04, 2004.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. of NSDI '12, 2012.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. of OSDI '10, 2010, pp. 1-14.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. of SIGMOD '10, 2010.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "REX: Recursive, delta-based data-centric computation," PVLDB, vol. 5, no. 11, pp. 1280-1291, 2012.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," PVLDB, vol. 5, no. 8, pp. 716-727, 2012.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," PVLDB, vol. 5, no. 11, pp. 1268-1279, 2012.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data processing on large clusters," PVLDB, vol. 3, no. 1-2, pp. 285-296, 2010.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative MapReduce," in Proc. of MAPREDUCE '10, 2010.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapReduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1, 2012.