Approximate Object Location and Spam Filtering on Peer-to-Peer Systems



Similar documents
Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

Approximate Object Location and Spam Filtering on Peer-to-peer Systems

Object Request Reduction in Home Nodes and Load Balancing of Object Request in Hybrid Decentralized Web Caching

GISP: Global Information Sharing Protocol a distributed index for peer-to-peer systems

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Middleware and Distributed Systems. Peer-to-Peer Systems. Martin v. Löwis. Montag, 30. Januar 12

Distributed Hash Tables in P2P Systems - A literary survey

Varalakshmi.T #1, Arul Murugan.R #2 # Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam

Bandwidth consumption: Adaptive Defense and Adaptive Defense 360

A PROXIMITY-AWARE INTEREST-CLUSTERED P2P FILE SHARING SYSTEM

How To Create A P2P Network

An Introduction to Peer-to-Peer Networks

query enabled P2P networks Park, Byunggyu

Chord - A Distributed Hash Table

Adapting Distributed Hash Tables for Mobile Ad Hoc Networks

Department of Computer Science Institute for System Architecture, Chair for Computer Networks. File Sharing

P2P Storage Systems. Prof. Chun-Hsin Wu Dept. Computer Science & Info. Eng. National University of Kaohsiung

DUP: Dynamic-tree Based Update Propagation in Peer-to-Peer Networks

Identity Theft Protection in Structured Overlays

Join and Leave in Peer-to-Peer Systems: The DASIS Approach

Calto: A Self Sufficient Presence System for Autonomous Networks

Acknowledgements. Peer to Peer File Storage Systems. Target Uses. P2P File Systems CS 699. Serving data with inexpensive hosts:

Scalable Source Routing

Denial of Service Resilience in Peer to Peer. D. Dumitriu, E. Knightly, A. Kuzmanovic, I. Stoica, W. Zwaenepoel Presented by: Ahmet Canik

Data Center Content Delivery Network

A Survey and Comparison of Peer-to-Peer Overlay Network Schemes

A P2P SERVICE DISCOVERY STRATEGY BASED ON CONTENT

8 Conclusion and Future Work

Peer-to-Peer and Grid Computing. Chapter 4: Peer-to-Peer Storage

Tornado: A Capability-Aware Peer-to-Peer Storage Network

Load Balancing in Structured Overlay Networks. Tallat M. Shafaat

An Efficient Load Balancing Technology in CDN

Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure

Tools for Peer-to-Peer Network Simulation

Locality-Aware Randomized Load Balancing Algorithms for DHT Networks

Discovery and Routing in the HEN Heterogeneous Peer-to-Peer Network

New Structured P2P Network with Dynamic Load Balancing Scheme

Optimizing and Balancing Load in Fully Distributed P2P File Sharing Systems

RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT

Peer-VM: A Peer-to-Peer Network of Virtual Machines for Grid Computing

Information Searching Methods In P2P file-sharing systems

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Network Architecture and Topology

2. Research and Development on the Autonomic Operation. Control Infrastructure Technologies in the Cloud Computing Environment

Design and Implementation of a P2P Cloud System

Architectures and protocols in Peer-to-Peer networks

Bigtable is a proven design Underpins 100+ Google services:

Improving Availability with Adaptive Roaming Replicas in Presence of Determined DoS Attacks

Plaxton routing. Systems. (Pastry, Tapestry and Kademlia) Pastry: Routing Basics. Pastry: Topology. Pastry: Routing Basics /3

RELOAD Usages for P2P Data Storage and Discovery

Bayesian Spam Filtering

DFSgc. Distributed File System for Multipurpose Grid Applications and Cloud Computing

T he Electronic Magazine of O riginal Peer-Reviewed Survey Articles ABSTRACT

International Journal of Applied Science and Technology Vol. 2 No. 3; March Green WSUS

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Identity Theft Protection in Structured Overlays

Interoperability of Peer-To-Peer File Sharing Protocols

How to Design and Create Your Own Custom Ext Rep

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Peer-to-Peer Networks Organization and Introduction 1st Week

The Case for a Hybrid P2P Search Infrastructure

A Content based Spam Filtering Using Optical Back Propagation Technique

Storage Systems Autumn Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

McAfee Agent Handler

How To Understand The Power Of Icdn

RVS-Seminar Overlay Multicast Quality of Service and Content Addressable Network (CAN)

Analysis of QoS Routing Approach and the starvation`s evaluation in LAN

Argonne National Laboratory, Argonne, IL USA 60439

Ashok Kumar Gonela MTech Department of CSE Miracle Educational Group Of Institutions Bhogapuram.

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Methods & Tools Peer-to-Peer Jakob Jenkov

Scalable Internet Services and Load Balancing

A Middleware Strategy to Survive Compute Peak Loads in Cloud

Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities

Efficient Data Retrieving in Distributed Datastreaming

Definition. A Historical Example

A NEW FULLY DECENTRALIZED SCALABLE PEER-TO-PEER GIS ARCHITECTURE

Ant-based Load Balancing Algorithm in Structured P2P Systems

PSON: A Scalable Peer-to-Peer File Sharing System Supporting Complex Queries

Advanced Farm Administration with XenApp Worker Groups

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Cassandra A Decentralized, Structured Storage System

Application-Specific Biometric Templates

Lessons learned: Sinkholing the Zeroaccess botnet. Ross Gibb. Attack Investigations Team Symantec Security Response.

Edge Configuration Series Reporting Overview

Spam Detection Using Customized SimHash Function

Technical Overview Simple, Scalable, Object Storage Software

Spam Detection A Machine Learning Approach

Transcription:

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz University of California, Berkeley

The Problem of Spam Spam Unsolicited, automated emails Radicati Group: $20B cost in 2003, $198B in 2007 Proposed solutions Economic model for spam prevention Attach cost to mass email distribution Weakness: needs wide-spread deployment, prevent but not filter Bayesian network / machine learning (independent) Train mailer with spam, rely on recognizing words / patterns Weakness: key words can be masked (images, invis. characters) Collaborative filtering Store / query for spam signatures on central repository Other users query signatures to filter out incoming spam Weakness: central repository limited in bandwidth, computation ACM Middleware 2003 ravenben@eecs.berkeley.edu 2

Our Contribution Can signatures effectively detect modified spam? Goals: Minimize false positives (marking good email as spam) Recognize modified/customized spam as same as original Present signature scheme based on approx. fingerprints Evaluate against random text and real email messages Can we build a scalable, resilient signature repository Leverage structured peer-to-peer networks Constrain query latency and limit bandwidth usage Orthogonal issues we do not address: Preprocessing emails to extract content Interpreting collective votes via reputation systems ACM Middleware 2003 ravenben@eecs.berkeley.edu 3

Outline Introduction An Approximate Signature Scheme Evaluation using random text and real emails Approximate object location Similarity search on P2P systems Constraining latency and bandwidth usage Conclusion ACM Middleware 2003 ravenben@eecs.berkeley.edu 4

Collaborative Spam Filtering 1. Spam sent to user A 2. Signatures stored 3. Spam sent to user B 4. B queries repository 5. B filters out spam store signature query response ACM Middleware 2003 ravenben@eecs.berkeley.edu 5

An Approximate Signature Scheme Sort by value Choose N Randomize A B C D E F Vector checksum Email Message How to match documents with very similar content Calculate checksums of all substrings of length L Select deterministic set of N checksums A matches B iff sig(a) sig(b) > Threshold Computation tput: 13MByte/s on P-III 1Ghz ACM Middleware 2003 ravenben@eecs.berkeley.edu 6

Accuracy of Signature Vectors Probability of Match 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Matching Accuracy vs Changes 10/5K 50/5K 5x5/5K 0 1 2 3 4 5 6 7 8 9 10 Minimum Matches out of 10 10000 random text documents, size = 5KB, calculate 10 signatures Compare signatures of before and after modifications Analytical results match experimental results ACM Middleware 2003 ravenben@eecs.berkeley.edu 7

Eliminating False positives False Positive Rate 1/10 Signatures Match 2/10 Signatures Match 1.00E-03 False Positive Rate 1.00E-04 1.00E-05 1.00E-06 1.00E-07 1.00E-08 1.00E-09 1000 10000 100000 Size of Documents (Bytes) Compare pair-wise signatures between 10000 random docs None matched 3 of 10 signatures (100,000,000 pairs) ACM Middleware 2003 ravenben@eecs.berkeley.edu 8

Evaluation on Real Messages 29631 Spam Emails from www.spamarchive.org Processed visually by project members 14925 (unique), 86% of spam = 5K Robustness to modification test Most popular 39 msgs have 3440 modified copies Examine recognition between copies and originals THRES Detected Failed % 3/10 3356 84 97.56 4/10 3172 268 92.21 5/10 2967 473 86.25 ACM Middleware 2003 ravenben@eecs.berkeley.edu 9

False Positive Test Non-spam emails 9589 messages: 50% newsgroup posts + 50% personal emails Compare against 14925 unique spam messages THRES 1/10 2/10 3/10 # of pairs 270 Sweet spot, using threshold of 3/10 signatures Recognition rate > 97.5% False positive rate < 1 in 140 million pairs 4 0 Probability 1.89e-6 2.79e-8 0 ACM Middleware 2003 ravenben@eecs.berkeley.edu 10

A Distributed Signature Repository? store signature query response How do we build a scalable distributed repository? How do we limit bandwidth consumption and latency? ACM Middleware 2003 ravenben@eecs.berkeley.edu 11

Structured Peer-to-Peer Overlays Storage / query via structured P2P overlay networks Large sparse ID space N (160 bits: 0 2 160 ) Nodes in overlay network have nodeids N Given k N, overlay deterministically maps k to its root node (a live node in the network) E.g. Chord, Pastry, Tapestry, Kademlia, Skipnet, etc Decentralized Object Location and Routing (DOLR) Objects identified by Globally Unique IDs (GUIDs) N Decentralized directory service for endpoints/objects Route messages to nearest available endpoint Object location with locality: routing stretch (overlay location / shortest distance) O(1) ACM Middleware 2003 ravenben@eecs.berkeley.edu 12

DOLR on Tapestry Routing Mesh Client Root(O) Client Object O Server ACM Middleware 2003 ravenben@eecs.berkeley.edu 13

More Than Just Unique Identifiers Objects named by Globally Unique ID (GUID) Application maps secondary characteristics to ID: versioning, modified replicas, app-specific info Simpler Queries DHT/DOLRs Search on Features More Powerful QLs Databases Simplify the search problem out of m search fields, or features, find objects matching at least n exactly ACM Middleware 2003 ravenben@eecs.berkeley.edu 14

A Layered Perspective approxpublish(fv,guid) approxsearch(fv) route (FV, msg) Publish (GUID) route (GUID, msg) route (IPAddr, msg) ADOLR layer Approximate DOLR/DHT Structured P2P Overlay Network (DOLR) IP Network Layer Introduce naming mapping from feature vector to GUIDs Rely on overlay infrastructure for storage Abstraction of feature vectors as approximate names for object(s) ACM Middleware 2003 ravenben@eecs.berkeley.edu 15

Marking a New Spam Message S 1 E 3, E 2 Overlay X X node A E 3 put (S 1, E 2 ) put (S 2, E 2 ) put (S 3, E 2 ) node C E 2 S 2 E 2 S 3 E 2 Signatures stored as inverted index (feature object) inside overlay User on C gets spam E 2, calculates signatures S: {S 1, S 2, S 3 } For each feature in S, if feature object exists, add E 2 If no feature object exists, create one locally and publish ACM Middleware 2003 ravenben@eecs.berkeley.edu 16

Filtering New Emails for Spam S 1 E 3, E 2 S 4 X S 1 node A E 3 node D Overlay S 2 E 2 E 2 node C E 2 S 2 E 2 S 3 E 2 User at node D receives new email E 2 with signatures {S 1, S 2, S 4 } Queries overlay for signatures, retrieve matching GUIDs for each Threshold = 2/3, contact GUIDs that occur in 2 of 3 result sets Contact E 2 via overlay for any additional info (votes etc) ACM Middleware 2003 ravenben@eecs.berkeley.edu 17

Constraining Bandwidth and Latency Need to constrain bandwidth and latency Limit signature query to h overlay hops Return null set if h hops reached without result Simulation on transit-stub topologies 5K nodes, 4K overlay nodes, diameter = 400ms Each spam message only reaches small group For each message: % of users seen and marked = mark rate Measure tradeoff between latency and success rate of locating known spam, for different mark rates ACM Middleware 2003 ravenben@eecs.berkeley.edu 18

Simulation Result Prob. of Finding Match 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Trading Accuracy for Latency and Bandwidth network diameter 10% Mark Rate 4% Mark Rate 2% Mark Rate 1% Mark Rate 0.5% Mark Rate 0 100 200 300 400 500 Latency to Locate Object ACM Middleware 2003 ravenben@eecs.berkeley.edu 19

Feature-based Queries Approximate Text Addressing Text objects: features hashed signatures Applications: plagiarism detection, replica management Other feature-based searching Music similarity search Extract musical characteristics Signatures: {hash(field1=value1), hash(field2=value2) } E.g. Fourier transform values, # of wavetable entries Image similarity search Locate files with similar geometric properties, etc. ACM Middleware 2003 ravenben@eecs.berkeley.edu 20

Finally Status ADOLR infrastructure implemented on Tapestry SpamWatch: P2P spam filter implemented, including Microsoft Outlook plug-in Available for download: http://www.cs.berkeley.edu/~zf/spamwatch Tapestry http://www.cs.berkeley.edu/~ravenben/tapestry Thank you Questions? ACM Middleware 2003 ravenben@eecs.berkeley.edu 21