A Spam Message Filtering Method: focus on run time




pp. 29-33
http://dx.doi.org/10.14257/atl.2014.76.08

Sin-Eon Kim 1, Jung-Tae Jo 2, Sang-Hyun Choi 3

1 Department of Information Security Management
2 Department of Business Data Conversion
3 Professor, Department of Management Information Systems, BK21+ BSO Team
Chungbuk National University, 52 Naesudong-ro, Heungdeok-gu, Chungbuk 361-763, Korea
trebrone@gmail.com, hlla007@gmail.com, choi@cbnu.ac.kr

Abstract. In this paper, we propose a light and quick algorithm with which SMS filtering can be performed independently on a mobile device. Such an algorithm can address the memory and resource limitations that have so far remained unsolved.

Keywords: Mobile phone, SMS spam, spam filtering, Data Mining

1 Introduction

As SMS spam messages have drastically increased, typical filtering methods are no longer efficient enough to be run within mobile phones. For efficient spam filtering, techniques to remove unnecessary data are needed. These data reducing techniques include data filtering, feature selection, data clustering, etc. The main idea is to select important features using the relative magnitude of feature values. In this paper, we propose a new feature selection method based on the average ratio of each class relative to the total data. We compare the performance of the proposed method with that of standard methods: Naive Bayes, J-48 Decision Tree, and Logistic.

2 Related Work

Existing research includes statistics-based methods, such as Bayesian classifiers, logistic regression, and decision trees. There are still few studies on SMS spam filtering methods in research journals, while research on email spam classifiers is continuously increasing. We present the most relevant work related to this topic. Gómez Hidalgo et al. (2006) evaluated several Bayesian classifiers to detect mobile phone spam.
In this work, the authors proposed the first two well-known SMS spam datasets: the Spanish (199 spam and 1,157 ham) and English (82 spam and 1,119 ham) test databases. They tested a number of message representation techniques and machine learning algorithms on them in terms of effectiveness. The results indicate that Bayesian filtering techniques can be effectively employed to classify SMS spam[1].

ISSN: 2287-1233 ASTL. Copyright 2014 SERSC.

2.1 SMS Spam Collection v.1 Data Set

The SMS Spam Collection v.1 is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham or spam. The data contains one message per line. Each line consists of two columns: one with the label (ham or spam) and the other with the raw text.

Table 1. Types of features.

Message    Amount    %
Ham        4,827     86.60
Spam         747     13.40
Total      5,574     100

As shown in Table 1, the data set has 86.6% ham messages and 13.4% spam messages. Table 2 shows some examples of ham and spam messages[8].

2.2 Data Mining algorithms

Typical methods to detect spam messages include Bayesian classifiers, logistic regression, decision trees, and so on. Bayesian classification provides a useful perspective for understanding and evaluating many learning algorithms[7]. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules[2]. Logistic regression is a type of probabilistic statistical classification model. It is used to predict the outcome of a categorical dependent variable, such as a binary response, from one or more predictor variables (features)[3].

3 Experimental Study

We explained above that SMS spam data is rapidly increasing. In order to detect spam messages, filtering algorithms or feature selection methods have to run more efficiently. The above three methods use complex calculations to do this. For this reason, these methods are inefficient for dealing with large scale data. In this paper, we propose a simple and efficient feature selection method.

3.1 Proposed Method

This study proposes a VR (Value Ratio) measure aimed at making filtering light and quick, so that SMS filtering can be performed independently within mobile devices. First, the messages are divided by class (Spam and Ham), and the appearance frequencies of words in the SMS messages are evaluated. Then the appearance frequencies of each word are aggregated and divided by the number of messages to calculate an average. The formulas are as below.

$$W^{s}_{j} = \frac{1}{k} \sum_{i \in \mathrm{spam}} w_{ij} \qquad (1)$$

$$W^{h}_{j} = \frac{1}{k} \sum_{i \in \mathrm{ham}} w_{ij} \qquad (2)$$

Here, i and j represent row (message) and column (word) respectively, w_{ij} is the frequency of word j in message i, and k is the number of messages. The VR value calculated from W^{s}_{j} and W^{h}_{j} is as below.

$$VR(j) = \frac{W^{s}_{j}}{W^{h}_{j}} \qquad (3)$$

VR(j) represents the ratio of the average frequency of the jth keyword in spam messages to that in ham messages. The larger the value of VR(j), the more frequently the word appears in spam messages relative to ham messages.

As shown in the figure, when the algorithms were executed with the VR attribute selection technique, run time varied considerably. Thus, the technique is expected to be suitable for executing algorithms independently in the mobile environment, which has many limitations in terms of storage space, memory, and processing capability.
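The VR computation of Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `parse_lines`, `value_ratio`, and `select_top` are hypothetical, as are the tab separator between the label and text columns, whitespace tokenization, the use of each class's own message count as the divisor, the small constant `eps` that avoids division by zero for words absent from ham, and the top-k selection rule.

```python
from collections import Counter


def parse_lines(lines, sep="\t"):
    """Yield (label, text) pairs from 'label<sep>text' lines.

    The tab separator is an assumption; the paper only says each line
    has a label column and a raw-text column.
    """
    for line in lines:
        label, text = line.rstrip("\n").split(sep, 1)
        yield label, text


def value_ratio(messages, eps=1e-9):
    """Return {word: VR(word)} for (label, text) pairs labeled 'spam' or 'ham'.

    Assumption: each class average divides by that class's message count.
    eps is a hypothetical guard for words that never occur in ham.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    n_msgs = {"spam": 0, "ham": 0}
    for label, text in messages:
        n_msgs[label] += 1
        counts[label].update(text.lower().split())  # naive whitespace tokens

    vocab = set(counts["spam"]) | set(counts["ham"])
    vr = {}
    for word in vocab:
        w_spam = counts["spam"][word] / max(n_msgs["spam"], 1)  # Eq. (1)
        w_ham = counts["ham"][word] / max(n_msgs["ham"], 1)     # Eq. (2)
        vr[word] = w_spam / (w_ham + eps)                       # Eq. (3)
    return vr


def select_top(vr, top_k):
    """Keep the top_k words with the largest VR (assumed selection rule)."""
    return sorted(vr, key=vr.get, reverse=True)[:top_k]


SAMPLE = [
    "spam\tfree prize now",
    "spam\tfree entry",
    "ham\tsee you now",
    "ham\tcall me now please",
]

if __name__ == "__main__":
    scores = value_ratio(parse_lines(SAMPLE))
    print(select_top(scores, 3))
```

Only word counts and one division per word are needed, which is in line with the paper's goal of a light computation. If a single shared k were used in Eqs. (1) and (2) instead of per-class counts, k would cancel in Eq. (3) and VR would reduce to a ratio of raw counts.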

Figure 2. The results of the algorithms.

4 Future work

In the future, researchers should implement a program with the method proposed in this study and demonstrate that it is an efficient technique by conducting a comparative analysis of the time taken when it runs independently on actual mobile phones. Because spam messages continuously increase, data should be added constantly for a precise analysis. Additionally, the proposed method should not be limited to spam filtering but should be applied to various fields to extract useful information, so that research on data reducing techniques for efficient analysis in massive data environments can be conducted.

Acknowledgements. This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1022) supervised by the NIPA (National IT Industry Promotion Agency). This research was supported by the MSIP (The Ministry of Science, ICT and Future Planning), Korea, under the "SW master's course of a hiring contract" support program (NIPA-2013-HB301-13-1008) supervised by the NIPA (National IT Industry Promotion Agency).

References

1. Gómez Hidalgo, J. M., Bringas, G. C., Sánz, E. P., & García, F. C. (2006). Content based SMS spam filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering, 107-114.
2. http://en.wikipedia.org/wiki/decision_tree
3. http://en.wikipedia.org/wiki/logistic_regression
4. http://www.cs.waikato.ac.nz/~ml/weka/
5. Liu, H., Setiono, R., Motoda, H., Zhao, Z. (2010). Feature Selection: An Ever Evolving Frontier in Data Mining. JMLR: Workshop and Conference Proceedings, 4-13.
7. Mukherjee, S., Sharma, N. (2012). Intrusion Detection using Naive Bayes Classifier with Feature Reduction. Procedia Technology, 119-128.
8. SMS Spam Collection v.1 (2012). http://archive.ics.uci.edu/ml/index.html