E6893 Big Data Analytics: Yelp Fake Review Detection


Mo Zhou, Chen Wen, Dhruv Kuchhal, Duo Chen
Columbia University in the City of New York
December 11th, 2014

Overview
1. Problem Summary
2. Technical Approach
3. Result and Discussion
4. Future Work and Conclusion

Introduction
- The goal of this project is to flag opinion spammers on rating websites such as Yelp.com. These users typically receive incentives to post positive reviews for certain businesses.
- Opinion spam creates unfair competition, gives users deceptive information, and damages the credibility of rating websites.
- We use the dataset from the Yelp Dataset Challenge, which provides more than 1 million ratings and reviews of more than 40,000 businesses.

Background
Content-Based Detection
- Analyze the reviews and detect spam using computational linguistics.
- Characteristics such as a reviewer describing a place without having been there distinguish a fake review from a genuine one, with an accuracy of approximately 68%.
Behavioral Detection
- Maximum number of reviews submitted by a user per day, based on the observation that an average user does not submit more than a few reviews in a day.
- Percentage of positive reviews, used to identify false accounts that submit only false reviews.
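The two behavioral signals above can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any cited detector: the `(user_id, date, stars)` tuple layout, the function name `behavioral_signals`, and the choice to count 4 stars or more as a positive review are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def behavioral_signals(reviews, positive_threshold=4):
    """Compute two per-user behavioral signals.

    reviews: iterable of (user_id, date_string, stars) tuples (assumed layout).
    positive_threshold: stars at or above this count as positive (assumption).
    """
    reviews_per_day = defaultdict(Counter)  # user -> {date: review count}
    stars_by_user = defaultdict(list)       # user -> list of star ratings
    for user_id, day, stars in reviews:
        reviews_per_day[user_id][day] += 1
        stars_by_user[user_id].append(stars)

    signals = {}
    for user_id, ratings in stars_by_user.items():
        positive = sum(1 for s in ratings if s >= positive_threshold)
        signals[user_id] = {
            # Busiest single day for this user.
            "max_reviews_per_day": max(reviews_per_day[user_id].values()),
            # Share of this user's reviews that are positive.
            "positive_ratio": positive / len(ratings),
        }
    return signals


sample_reviews = [
    ("u1", "2014-11-01", 5),
    ("u1", "2014-11-01", 5),
    ("u1", "2014-11-01", 4),
    ("u2", "2014-11-01", 3),
    ("u2", "2014-11-07", 5),
]
signals = behavioral_signals(sample_reviews)
```

A user with an unusually high `max_reviews_per_day` together with a `positive_ratio` close to 1 would be the kind of account these signals are meant to surface. Where to draw the line is left open here.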

Related Work
Lin et al. (2014), Finding Valuable Yelp Comments by Personality, Content, Geo, and Anomaly Analysis
- Used to reorder the reviews to improve visibility to the user
- Utilized both the content and the behavior of the reviewers for the detection
- Used natural language processing and a Personality Analyser tool, with an accuracy of approximately 80%
Mukherjee et al. (2012), Spotting Fake Reviewer Groups in Consumer Reviews
- Used a behavioral model among fake reviewers and a relational model between different groups, individual reviewers, and the products they reviewed
- Utilized the GSRank (Group Spam Rank) algorithm, a relation-based algorithm, to rank the spamming groups

Strategy Overview
Opinion spam signals
- Large deviation in ratings
- Large deviation in text reviews
- Interdependent feature extraction
Our approach to detect opinion spam
- Compute the average rating similarity score between one user and all others
- Compute the average review similarity score between one user and all others
- Cluster users with both a low rating similarity score and a low review similarity score
- Generate a list of opinion spam candidates

Algorithm - Compute Rating Similarity
Let $r_1, r_2, \ldots, r_n$ be the rating of user $n$ of the same business, with a maximum rating of 5, and let $N$ be the number of users. Then

$$\mathrm{rsim}_n = 1 - \frac{1}{N} \sum_{\substack{i=0,\,k=0 \\ i \neq k}}^{N} \frac{(r_i - r_k)}{5} \in [0, 1]$$

denotes the average rating similarity score between user $n$ and all other users, limited to the same business.
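As a rough sketch of the per-business computation, the Python below reads the score for a user as one minus the mean rating gap to every other reviewer of the same business, scaled by the maximum rating of 5. Taking the gap as an absolute difference, averaging over the other reviewers, the function name `rating_similarity`, and the dictionary input are assumptions made for illustration, not the project's code.

```python
def rating_similarity(ratings, max_rating=5):
    """Average rating similarity of each user to the others.

    ratings: dict mapping user_id -> star rating for ONE business (assumed input).
    Returns a dict mapping user_id -> score, where 1.0 means the user's rating
    matches every other reviewer's rating of that business.
    """
    scores = {}
    for user, r_n in ratings.items():
        others = [r for u, r in ratings.items() if u != user]
        if not others:
            continue  # a lone reviewer has nobody to be compared with
        # Mean absolute gap to the other reviewers (absolute value is an assumption).
        mean_gap = sum(abs(r_n - r_k) for r_k in others) / len(others)
        scores[user] = 1 - mean_gap / max_rating
    return scores


rating_scores = rating_similarity({"a": 5, "b": 5, "c": 1})
```

In this toy input, user `c` rates the business far from the other two and so receives the lowest score, which is the kind of deviation the rating signal is meant to capture.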

Algorithm - Compute Review Similarity
Let $c_1, c_2, \ldots, c_n$ be the cosine similarity between the review posted by user $n$ and all other users of the same business, and let $N$ be the number of users. Then

$$\mathrm{csim}_n = \frac{\sum_{\substack{i=0,\,k=0 \\ i \neq k}}^{N} c_i c_k}{\sqrt{\sum_{i=0}^{N} c_i^2}\,\sqrt{\sum_{k=0}^{N} c_k^2}} \in [0, 1]$$

denotes the average review similarity score between user $n$ and all other users of the same business.
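One way to illustrate the review score is to compute the cosine similarity between bag-of-words vectors and average it over the other reviewers of the same business. The sketch below makes several assumptions that the slides do not specify: lowercase whitespace tokenization, raw term counts as vector weights, and the hypothetical names `cosine` and `review_similarity`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty review is treated as similar to nothing
    return dot / (norm_a * norm_b)


def review_similarity(reviews):
    """Average cosine similarity of each user's review to the others.

    reviews: dict mapping user_id -> review text for ONE business (assumed input).
    """
    # Naive tokenization: lowercase and split on whitespace (assumption).
    vectors = {u: Counter(text.lower().split()) for u, text in reviews.items()}
    scores = {}
    for user, vec in vectors.items():
        sims = [cosine(vec, other) for u, other in vectors.items() if u != user]
        if sims:
            scores[user] = sum(sims) / len(sims)
    return scores


review_scores = review_similarity({
    "a": "great food great service",
    "b": "great food great service",
    "c": "terrible parking",
})
```

Here `a` and `b` wrote identical text while `c` shares no terms with either, so `c` ends up with the lowest average. A real pipeline would likely add stop-word removal or TF-IDF weighting, which this sketch omits.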

Result Overview
- The majority of users have a high rating similarity score and a moderate review similarity score.
- The outliers in the bottom-left corner are the users of interest.

Opinion Spam Candidates
Cut-off scores:
- rsim = 0.2
- csim = 0.3
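Applying the two cut-offs can be sketched as a simple filter over per-user scores. The slides do not say whether the comparison is inclusive, so the `<=` below is an assumption, as are the function name `spam_candidates` and the made-up scores in the usage example.

```python
def spam_candidates(rsim, csim, rsim_cutoff=0.2, csim_cutoff=0.3):
    """Return users whose scores fall at or below BOTH cut-offs.

    rsim, csim: dicts mapping user_id -> similarity score (assumed inputs).
    Only users present in both dicts are considered.
    """
    return sorted(
        user
        for user in rsim.keys() & csim.keys()
        if rsim[user] <= rsim_cutoff and csim[user] <= csim_cutoff
    )


# Illustrative scores only, not values from the Yelp data.
example_rsim = {"u1": 0.2, "u2": 0.9, "u3": 0.15}
example_csim = {"u1": 0.1, "u2": 0.2, "u3": 0.6}
candidates = spam_candidates(example_rsim, example_csim)
```

Requiring both conditions keeps the candidate list short: `u2` passes the review cut-off but rates like everyone else, and `u3` rates unusually but writes reviews close to the others, so only `u1` is flagged.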

Discussion
- The location of the data points substantiates our hypothesis that most ratings and reviews are genuine.
- Yelp.com has its own filter, which may result in a lack of true negatives.
- Nevertheless, we still obtain several hundred potential fake review candidates.
- The threshold settings for rsim and csim are worth further study.
- csim seems to have a larger impact on determining opinion spam, while rsim serves simply as a filtering step.

Future Work
- Gather ground truth for review data
  - Domain expert
  - MTurk
- Incorporate more features into spam signals
  - Time window
  - User relationships
- Compare our approach with other learning algorithms
  - GSRank
  - Supervised classification

References
- Lin et al. (2014), Finding Valuable Yelp Comments by Personality, Content, Geo, and Anomaly Analysis
- Mukherjee et al. (2012), Spotting Fake Reviewer Groups in Consumer Reviews
- Dot products - the Stanford NLP, http://nlp.stanford.edu/ir-book/html/htmledition/dot-products-1.html
- Yelp Dataset Challenge, http://www.yelp.com/dataset_challenge