RECOMMENDATION METHOD ON HADOOP AND MAPREDUCE FOR BIG DATA APPLICATIONS




T.M.S. MEKALARANI #1, M. KALAIVANI *2
# ME, Computer Science and Engineering, Dhanalakshmi College of Engineering, Tambaram, India.
* ME, Computer Science and Engineering, Asst. Prof., Dhanalakshmi College of Engineering, Tambaram, India.

Abstract
In recent years, the amount of data in our world has been increasing explosively, and analyzing large data sets, the so-called Big Data, has become a key basis of competition underpinning new waves of productivity growth, innovation, and consumer surplus. What, then, is Big Data? Big Data refers to datasets whose size is beyond the ability of current technology, method and theory to capture, manage, and process within a tolerable elapsed time. With the growing number of alternative services, effectively recommending the services a user prefers has become an important research issue. Service recommender systems have been shown to be valuable tools that help users deal with service overload and provide them with appropriate recommendations. In the last decade, the number of customers, services and online information has grown rapidly, creating a big data analysis problem for service recommender systems. Consequently, traditional service recommender systems often suffer from scalability and inefficiency problems when processing or analyzing such large-scale data. Moreover, most existing service recommender systems present the same ratings and rankings of services to different users without considering the users' diverse preferences, and therefore fail to meet users' personalized requirements.

Key words: Recommendation, Scalability, Inefficiency, Big Data, Hadoop, MapReduce.

INTRODUCTION
Current recommendation methods can usually be classified into three main categories: content-based, collaborative, and hybrid approaches. Content-based approaches recommend services similar to those the user preferred in the past. Collaborative filtering (CF) approaches recommend services that users with similar tastes preferred in the past. Hybrid approaches combine content-based and CF methods in several different ways. CF-based systems, in which users receive recommendations based on people with similar tastes and preferences, can be further classified into item-based CF and user-based CF. In item-based systems, the predicted rating depends on the ratings the same user gave to similar items. In user-based systems, the predicted rating of an item for a user depends on the ratings that similar users gave to the same item. In this work, we use a user-based CF algorithm.

As an example of the big data analysis problem, suppose Alice and Tom are each browsing a hotel reservation website to reserve a hotel in Kowloon, Hong Kong, and the website shows both of them the same ratings and the same recommendation list. Assume there are three hotels in Kowloon: A, B and C. Hotel A is convenient to the airport and has a shopping mall nearby; B has convenient transportation, with an underground station nearby, and comfortable accommodation; C has delicious breakfast and food and a very good view. According to the overall ratings provided by the website, B is better than A and A is better than C. On this trip, however, Alice prefers a good location with a shopping mall near the hotel, while Tom cares about food and wants a good view. Hotel B may therefore not be the best choice for either of them; hotels A and C may be more appropriate for Alice and Tom respectively.
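The hotel example can be made concrete with a small sketch. It is only an illustration of why a single overall ranking cannot serve both users, not the method used in this paper: the feature keywords assigned to each hotel, the preference sets for Alice and Tom, and the function name `personalized_ranking` are all assumptions chosen to mirror the description above.

```python
# Hypothetical feature keywords for the three hotels (assumed for illustration).
HOTEL_FEATURES = {
    "A": {"airport", "shopping"},
    "B": {"transportation", "accommodation"},
    "C": {"food", "view"},
}

# Overall, non-personalized ranking stated in the text: B > A > C.
OVERALL_RANKING = ["B", "A", "C"]


def personalized_ranking(preferences, hotel_features, overall_ranking):
    """Rank hotels by the number of preferred keywords they match.

    Ties are broken by the hotel's position in the overall ranking.
    """
    def sort_key(hotel):
        matches = len(preferences & hotel_features[hotel])
        return (-matches, overall_ranking.index(hotel))

    return sorted(hotel_features, key=sort_key)


# Hypothetical preference keywords for the two users in the example.
alice = {"shopping", "location"}
tom = {"food", "view"}

alice_list = personalized_ranking(alice, HOTEL_FEATURES, OVERALL_RANKING)
tom_list = personalized_ranking(tom, HOTEL_FEATURES, OVERALL_RANKING)
print("Alice:", alice_list)
print("Tom:", tom_list)
```

Under these assumed keywords, hotel A moves to the top of Alice's list and hotel C to the top of Tom's, while the overall ranking would have placed B first for both.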

Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to
thousands of servers, and store up to several hundred terabytes of data. In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas.

RELATED WORK
Most recommendation algorithms start by finding a set of customers whose purchased and rated items overlap the user's purchased and rated items. The algorithm aggregates items from these similar customers, eliminates items the user has already purchased or rated, and recommends the remaining items to the user. Two popular versions of these algorithms are collaborative filtering and cluster models. Other algorithms, including search-based methods and our own item-to-item collaborative filtering, focus on finding similar items, not similar customers. For each of the user's purchased and rated items, the algorithm attempts to find similar items. It then aggregates the similar items and recommends them.

Traditional Collaborative Filtering
A traditional collaborative filtering algorithm represents a customer as an N-dimensional vector of items, where N is the number of distinct catalog items. The components of the vector are positive for purchased or positively rated items and negative for negatively rated items.
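As a rough illustration of this representation, the sketch below stores each customer as a sparse map from item to signed component and compares two customers with the cosine of the angle between their vectors, one common similarity choice. The component values of +1.0 and -1.0, the item identifiers, and the helper names `customer_vector` and `cosine_similarity` are assumptions made for the example; the text only requires positive and negative components.

```python
import math


def customer_vector(purchased, liked, disliked):
    """Build a sparse customer vector as {item_id: component}.

    Purchased or positively rated items get +1.0 and negatively rated
    items get -1.0 (a negative rating overrides a purchase). Items the
    customer never touched are absent, which stands for a component of 0.
    """
    vector = {item: 1.0 for item in purchased | liked}
    for item in disliked:
        vector[item] = -1.0
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b[item] for item, value in a.items() if item in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical customers over a catalog with N distinct items.
customer_a = customer_vector({"item1", "item2"}, set(), {"item3"})
customer_b = customer_vector({"item1"}, {"item2"}, set())
similarity = cosine_similarity(customer_a, customer_b)
print(similarity)
```

Only the items a customer has interacted with are stored, so the cost of comparing two customers depends on how many items they touched rather than on N.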

To compensate for best-selling items, the algorithm typically multiplies the vector components by the inverse frequency (the inverse of the number of customers who have purchased or rated the item), making less well-known items much more relevant. For almost all customers, this vector is extremely sparse. The algorithm generates recommendations based on a few customers who are most similar to the user. It can measure the similarity of two customers, A and B, in various ways; a common method is to measure the cosine of the angle between the two vectors. The algorithm can select recommendations from the similar customers' items using various methods as well; a common technique is to rank each item according to how many similar
customers purchased it.

Using collaborative filtering to generate recommendations is computationally expensive. It is O(MN) in the worst case, where M is the number of customers and N is the number of product catalog items, since it examines M customers and up to N items for each customer. However, because the average customer vector is extremely sparse, the algorithm's performance tends to be closer to O(M + N). Scanning every customer is approximately O(M), not O(MN), because almost all customer vectors contain a small number of items, regardless of the size of the catalog. But there are a few customers who have purchased or rated a significant percentage of the catalog, requiring O(N) processing time. Thus, the final performance of the algorithm is approximately O(M + N). Even so, for very large data sets, such as 10 million or more customers and 1 million or more catalog items, the algorithm encounters severe performance and scaling issues.

It is possible to partially address these scaling issues by reducing the data size. We can reduce M by randomly sampling the customers or discarding customers with few purchases, and reduce N by discarding very popular or unpopular items. It is also possible to reduce the number of items examined by a small, constant factor by partitioning the item space based on product category or subject classification. Dimensionality reduction techniques such as clustering and principal component analysis can reduce M or N by a large factor.

[Figure: system architecture. Recoverable labels: Login Details; Requirements of Administrator; Data Base; Hotels; Cost, Travelling, Facilities, Entertainments, etc.; Choice of Comfortability; List of Hotels.]

CONCLUSION
A collaborative filtering algorithm is adopted to generate appropriate recommendations. More specifically, a keyword-candidate list and domain thesaurus are provided to help obtain users' preferences. The active user gives his/her preferences by selecting the keywords from
the keyword-candidate list, and the preferences of previous users can be extracted from their reviews of services according to the keyword-candidate list and domain thesaurus. Our method aims to present a personalized service recommendation list and to recommend the most appropriate service(s) to the users. Moreover, to improve the scalability and efficiency of the keyword-aware service recommendation method (KASR) in a Big Data environment, we have implemented it on the MapReduce framework on the Hadoop platform. Finally, the experimental results demonstrate that KASR significantly improves the accuracy and scalability of service recommender systems over existing approaches. In future work, we will investigate how to use context to handle a term that appears in several categories of the domain thesaurus, and how to distinguish users' positive and negative preferences in their reviews, to make the predictions more accurate.