A Novel Method of Spam Mail Detection using Text Based Clustering Approach
International Journal of Computer Applications, Volume 5, No. 4, August 2010

M. Basavaraju
Research Scholar, Dept. of CSE, CIT, Anna University, Coimbatore, Tamilnadu, INDIA
Professor & Head, Dept. of CSE, Atria Institute of Technology, Bengaluru, Karnataka, INDIA

Dr. R. Prabhakar
Professor-Emeritus, Dept. of CSE, Coimbatore Institute of Tech., Coimbatore, Tamilnadu, INDIA

ABSTRACT
A novel method of efficient spam mail classification using clustering techniques is presented in this research paper. Email spam is one of the major problems of today's internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. A new spam detection technique using text clustering based on the vector space model is proposed in this research paper. By using this method, one can separate spam from non-spam email and detect spam efficiently. Representation of data is done using a vector space model. Clustering is the technique used for data reduction. It divides the data into groups based on pattern similarities such that each group is abstracted by one or more representatives. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns, which is called data mining. Clustering models data by its clusters and is a type of classification imposed on a finite set of objects. If the objects are characterized as patterns, or points in an n-dimensional metric space, the proximity measure can be the Euclidean distance between a pair of points or the similarity in the form of the cosine of the angle between the vectors corresponding to the documents. In the work considered in this paper, an efficient clustering algorithm incorporating the features of the K-means algorithm and the BIRCH algorithm is presented. Nearest neighbour distances and K-nearest neighbour distances can serve as the basis of classification of test data based on supervised learning.
Predictive accuracy of the classifier is calculated for the clustering algorithm. Additionally, different evaluation measures are used to analyze the performance of the clustering algorithm developed in combination with the various classifiers. The results presented in the results section at the end of the paper show the effectiveness of the proposed method.

General Terms
Classification, Data reduction, Vector space model, Preprocessing

1. INTRODUCTION
In this digital age, which is the era of electronics & computers, one of the most efficient & powerful modes of communication is email. Undesired, unsolicited email is a nuisance for its recipients; however, it also often presents a security threat. For example, it may contain a link to a phony website intending to capture the user's login credentials (identity theft, phishing), or a link to a website that installs malicious software (malware) on the user's computer. Installed malware can be used to capture user information, send spam, host malware, host phish, or conduct denial of service attacks as part of a bot net. While prevention of spam transmission would be ideal, detection allows users & email providers to address the problem today [1].

Spam filtering has become a very important issue in the last few years, as unsolicited bulk email imposes large problems in terms of both the amount of time spent on and the resources needed to automatically filter those messages [2]. Email communication has come up as the most effective and popular way of communication today. People are sending and receiving many messages per day, communicating with partners and friends, or exchanging files and information. Email data are now becoming the dominant form of inter- and intra-organizational written communication for many companies and government departments. Emails are an essential part of life now, just like mobile phones & i-pods [2]. Emails can be of spam type or non-spam type as shown in Fig. 1.
Spam mail is also called junk mail or unwanted mail, whereas non-spam mails are genuine in nature and meant for a specific person and purpose. Information retrieval offers the tools and algorithms to handle text documents in their data vector form [3]. The volume of spam is increasing. At the end of 2002, as much as 40 % of all email traffic consisted of spam. In 2003, the percentage was estimated to be about 50 % of all emails. In 2006, BBC news reported 96 % of all emails to be spam. The statistics are shown in Table I.

Spam can be defined as unsolicited (unwanted, junk) email for a recipient, or any email that the user does not want to have in his inbox. It is also defined as follows: Internet spam is one or more unsolicited messages, sent or posted as a part of a larger collection of messages, all having substantially identical content. Spam mails cause severe problems, viz., wastage of network resources (bandwidth), wastage of time, damage to PCs & laptops due to viruses, & ethical issues such as spam emails advertising pornographic sites which are harmful to the young generations [5]. Some of the existing approaches to solve the problem of spam mails could be listed as below.
Fig. 1 : Email types: spam (junk or unwanted email for a recipient) and non-spam (legitimate or genuine mail)

Email is the most widely used medium for communication worldwide because it is cheap, reliable, fast and easily accessible. Email is also prone to spam because of its wide usage and cheapness, & because with a single click one can communicate with anyone anywhere around the globe. It costs spammers hardly more to send out 1 million emails than to send 10 emails. Hence, email spam is one of the major problems of today's internet, bringing financial damage to companies and annoying individual users [4].

TABLE I. STATISTICS OF THE SPAM MAILS
Daily spam emails sent: [...] billion
Daily spam received per person: [...]
Annual spam received per person: 2,200
Spam cost to all non-corporate Internet users: $255 million
Spam cost to all U.S. corporations in [...]: $8.9 billion
Email address changes due to spam: 16%
Annual spam in 1,000 employee company: 2.1 million
Users who reply to spam email: 28%

Rule based: Hand-made rules for detection of spam made by experts (needs domain experts & constant updating of rules).
Customer revolt: Forcing companies not to publicize personal email ids given to them (hard to implement).
Domain filters: Allowing mails from specific domains only (hard job of keeping track of domains that are valid for a user).
Blacklisting: Blacklist filters use databases of known abusers, and also filter unknown addresses (constant updating of the databases would be required).
White list filters: Mailer programs learn all contacts of a user and let mail from those contacts through directly (everyone needs to first communicate his email id to the user and only then can he send email).
Hiding address: Hiding one's original address from the spammers by allowing all emails to be received at a temporary email id, from which they are forwarded to the original email id if found valid by the user (hard job of maintaining a couple of email ids).
Checks on the number of recipients by the email agent programs.
Government actions: Laws implemented by government against spammers (hard to implement laws).
Automated recognition of spam: Uses machine learning algorithms by first learning from the past data available (seems to be the best at present).

Here follows a brief overview of email spam filtering [6]. Among the approaches developed to stop spam, filtering is an important and popular one. It can be defined as automatic classification of messages into spam and legitimate mail. It is possible to apply the spam filtering algorithms at different phases of email transmission: at routers, at the destination mail server, or in the destination mailbox. Filtering at the destination point solves the problems caused by spam only partially, i.e., it prevents end users from wasting their time on junk messages, but it does not prevent misuse of resources, because all the messages are delivered nevertheless. In general, a spam filter is an application which implements a function:

\[ f(m, \theta) = \begin{cases} c_{spam}, & \text{if the decision is spam} \\ c_{leg}, & \text{otherwise} \end{cases} \tag{1} \]

where m is a message or email to be classified, $\theta$ is a vector of parameters, and $c_{spam}$ and $c_{leg}$ are labels assigned to the messages [7]. Most spam filters are based on machine learning classification techniques. In a learning-based technique the vector of parameters $\theta$ is the result of training the classifier on a pre-collected dataset:

\[ \theta = \Theta(M), \qquad M = \{(m_1, y_1), \ldots, (m_n, y_n)\}, \qquad y_i \in \{c_{spam}, c_{leg}\} \tag{2} \]

where $m_1, m_2, \ldots, m_n$ are previously collected messages, $y_1, y_2, \ldots, y_n$ are the corresponding labels, and $\Theta$ is the training function. In order to classify new messages, a spam filter can analyze them either separately (by just checking the presence of certain words) or in groups (for example, treating the arrival of a dozen messages with the same content within five minutes as more suspicious than the arrival of a single message with that content). In addition, a learning-based filter analyzes a collection of labeled training data (pre-collected messages with reliable judgment). A brief survey of the results and the limitations of the various methods proposed by researchers across the world was performed.
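The filter function of Eq. (1) and the training function of Eq. (2) can be illustrated with a deliberately small sketch. The decision rule used here (a message is spam when at least half of its distinct words were seen more often in spam than in legitimate training mail) is an assumption chosen only for illustration and is not the classifier developed in this paper; the names `train` and `spam_filter` are hypothetical.

```python
from collections import Counter

C_SPAM, C_LEG = "spam", "legitimate"

def train(messages):
    """Training function Theta(M): returns theta.

    Assumption for this sketch: theta is the set of words that occur in
    more spam messages than legitimate messages of the training set M.
    """
    spam_counts, leg_counts = Counter(), Counter()
    for text, label in messages:
        words = set(text.lower().split())
        if label == C_SPAM:
            spam_counts.update(words)
        else:
            leg_counts.update(words)
    return {w for w in spam_counts if spam_counts[w] > leg_counts[w]}

def spam_filter(message, theta):
    """Filter function f(m, theta): returns c_spam or c_leg."""
    words = set(message.lower().split())
    if not words:
        return C_LEG
    hits = len(words & theta)
    # Illustrative rule: spam if at least half of the distinct words are in theta.
    return C_SPAM if 2 * hits >= len(words) else C_LEG

# M = {(m_1, y_1), ..., (m_n, y_n)}: a toy pre-collected dataset.
M = [
    ("win cash now", C_SPAM),
    ("cheap cash offer", C_SPAM),
    ("meeting agenda for monday", C_LEG),
    ("lecture notes for monday", C_LEG),
]
theta = train(M)
print(spam_filter("win cheap cash", theta))
print(spam_filter("agenda for monday meeting", theta))
```

Separating `train` from `spam_filter` mirrors the split between $\Theta$ and $f$: the parameters are learned once from labeled data and then reused for every new message.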
Any email can be represented in terms of features with discrete values based on some statistics of the presence or absence of words, using a vector space model. Thus email data can be represented in vector form using the vector space model. Before implementing the vector space model for representing the data, it is important that the data is pre-processed [8].

The paper is organized in the following sequence. A brief literature survey of email spam, etc. was presented in the previous paragraphs. The aim of the research work is presented in section 2. Data pre-processing is dealt with in section 3. Section 4 describes how to build a vocabulary using the vector space model. Section 5 discusses briefly the data reduction, including data clustering & its types. Classification of the types of classifiers used is depicted in section 6. Porter's algorithm used in the work is explained in section 7. The evaluation measures are presented next in section 8. The proposed model along with the data flow diagram is presented in section 9. Section 10 depicts the coding part. Testing of the designed & developed software module with test case specifications is explained in greater detail in section 11. The results & discussions are presented in section 12. This is followed by the conclusions in section 13, the references & the author biographies.

2. AIM OF THE RESEARCH WORK
The scope of the work considered in this paper is to develop an algorithm that efficiently classifies a document into spam or non-spam, and to analyze how accurately documents are classified into their original categories. An email can be represented in terms of its features. The features in the domain considered in our work are words. Words are represented with discrete values based on statistics of the presence or absence of words. A classifier is used to classify a document as either spam or non-spam accurately. Data clustering is done, where each cluster is abstracted using one or more representatives. These representative points are the results of efficient clustering algorithms like K-means and BIRCH. The nearest neighbor classifier and the K-nearest neighbor classifier are the two classifiers used; they assign each unlabeled document to its nearest labeled neighbor cluster. Based on the modules, the functions of the developed algorithm are as follows:

Removal of words with length < 3.
Removal of stop words.
Replacement of occurrences of multiple words of the same form by a single word.
Conversion of the categorical document into vector form.
Clustering similar training documents for data reduction.
Classification of the test pattern.

This paper deals with the possible improvements gained from differing classifiers used for a specific task. Basic classification algorithms as well as clustering are introduced. Common evaluation measures are used.
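The first two functions in the list above (removal of short words and removal of stop words) can be sketched as follows. This is a minimal illustration, not the module developed in this work: the stop list is a tiny hypothetical sample, the function name `stop` is an assumption, and numbers and special symbols are dropped simply by keeping alphabetic tokens only.

```python
import re

# Tiny illustrative stop list (an assumption; a real stop list is much longer).
STOP_WORDS = {"the", "and", "for", "are", "this", "that", "with", "you"}

def stop(text, min_length=3, stop_words=STOP_WORDS):
    """Return the tokens of `text` that survive stopping."""
    # Keep runs of letters only, which drops numbers and special symbols.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop words shorter than min_length and words in the stop list.
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]

print(stop("You WIN $1000 in the lottery, reply to us!"))
```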
The methodology used for spam filtering in the research work considered in this paper is summarized in the following 4 steps shown in Fig. 2, viz., pre-processing of data, vector space model, data reduction, and classification.

Fig. 2 : Methodology used for spam filtering

3. DATA PRE-PROCESSING
The basic steps in data pre-processing are stopping and stemming. Stopping is the process of removing words that are short (i.e., words with length less than a specified value like 2 or 3), frequently occurring words, and special symbols. For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing [9]. Stemming reduces derivationally related forms of a word to a common base form. Porter's stemming and stopping algorithm can be used for this purpose [5]. Stopping and stemming are done to reduce the vocabulary size, which helps information retrieval and classification.

4. BUILDING A VOCABULARY USING A VECTOR SPACE MODEL
This section explains how to build a vocabulary using a vector space model. To start with, assign to each term in the document a weight for that term [2]. The simplest approach is to assign the weight to be equal to the number of occurrences of the term t in document d. This weighting scheme is referred to as term frequency and is denoted $tf_{t,d}$, with the subscripts denoting the term and the document in order [10]. For the document d, the set of weights (determined by the tf weighting function above, or indeed any weighting function that maps the number of occurrences of t in d to a positive real value) may be viewed as a vector, with one component for each distinct term. In this view of a document, known in the literature as the bag of words model, the exact ordering of the terms in a document is ignored. The vector view only retains information on the number of occurrences [11]. Raw term frequency suffers from a critical problem, i.e., all terms are considered equally important when it comes to assessing relevancy on a query. Certain terms have little or no discriminating power in determining relevance [2].
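The term frequency weighting and the bag of words view described above can be sketched in a few lines. The helper names `term_frequencies` and `to_vector` are hypothetical, and the two toy documents are invented for illustration; they contain the same words in a different order and therefore map to the same vector.

```python
from collections import Counter

def term_frequencies(tokens):
    """tf_{t,d}: number of occurrences of each term t in document d."""
    return Counter(tokens)

def to_vector(tf, vocabulary):
    """One component per vocabulary term; terms absent from d get weight 0."""
    return [tf.get(term, 0) for term in vocabulary]

doc_a = ["free", "offer", "free", "prize"]
doc_b = ["prize", "free", "offer", "free"]  # same words, different order

vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = to_vector(term_frequencies(doc_a), vocabulary)
vec_b = to_vector(term_frequencies(doc_b), vocabulary)
print(vocabulary, vec_a, vec_b)
```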
An immediate idea is to scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection. The idea would be to reduce the tf weight of a term by a factor that grows with its collection frequency. Instead, it is more common to use the document frequency $df_t$, defined to be the number of documents in the collection that contain a term t. Denoting the total number of documents in a collection by N, the inverse document frequency (idf) of a term t is given by Eq. (3) as [12]

\[ idf_t = \log \frac{N}{df_t} \tag{3} \]

Now, combine the above expressions for term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by Eq. (4) as

\[ \text{tf-idf}_{t,d} = tf_{t,d} \times idf_t \tag{4} \]

The model thus built shows the representation of each document described by its attributes [13]. Each tuple is assumed to belong to a pre-defined class, as determined by one of the attributes called the class label attribute. Thus, the documents with their weighted terms can be represented in the form of a table given in Table II.

TABLE II. DOCUMENTS WITH THEIR WEIGHTED TERMS
(rows: documents 1, 2, ..., n; columns: Term 1, ..., Term n; each entry is the weight W of that term in that document)

5. DATA REDUCTION & CLUSTERING
Data reduction includes data clustering, which concerns how to group a set of objects based on the similarity of their attributes and / or their proximity in the vector space. Clustering is applied to 2 different training sets, one belonging to spam and the other belonging to non-spam. The cluster representatives will then belong to 2 different classes. The class that each tuple (email) belongs to is given by one of the attributes of the tuple, often called the class label attribute [14].

5.1 Data Clustering
Clustering is the process of partitioning or dividing a set of patterns (data) into groups. Each cluster is abstracted using one or more representatives. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Clustering is a type of classification imposed on a finite set of objects. The relationship between objects is represented in a proximity matrix in which the rows represent n emails and the columns correspond to the terms given as dimensions. If objects are categorized as patterns, or points in a d-dimensional metric space, the proximity measure can be the Euclidean distance between a pair of points. Unless a meaningful measure of distance or proximity between a pair of objects is established, no meaningful cluster analysis is possible. Clustering is useful in many applications like decision making, data mining, text mining, machine learning, grouping, pattern classification and intrusion detection. Clustering also helps in detecting outliers & in examining small clusters [15]. The proximity matrix is used in this context & thus serves as a useful input to the clustering algorithm. Clustering represents a set of n patterns by m points. Typically m < n, leading to data compression; centroids can be used as the representative points.
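As a small illustration of the two ideas in this subsection, the sketch below computes pairwise Euclidean distances between patterns and abstracts a group of patterns by its centroid. The function names and the toy 2-dimensional patterns are assumptions made for this example; real document vectors have one dimension per vocabulary term.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two patterns of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def proximity_matrix(patterns):
    """Pairwise distances; entry [i][j] is the distance between patterns i and j."""
    return [[euclidean(p, q) for q in patterns] for p in patterns]

def centroid(patterns):
    """Component-wise mean: one representative point for n patterns."""
    n = len(patterns)
    return [sum(column) / n for column in zip(*patterns)]

patterns = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = proximity_matrix(patterns)
representative = centroid(patterns)
print(matrix[0][1], matrix[0][2], representative)
```

Here three patterns (n = 3) are represented by a single point (m = 1), which is the data compression referred to above.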
This would help in prototype selection for efficient classification. The clustering algorithms are applied separately to the training sets belonging to the 2 different classes to obtain their corresponding cluster representatives.

There are different stages in clustering. A typical pattern clustering activity involves the following steps, viz.,

Pattern representation (optionally including feature extraction and/or selection),
Definition of a pattern proximity measure appropriate to the data domain,
Clustering or grouping,
Data abstraction (if needed), and
Assessment of output (if needed).

Fig. 3 depicts a typical sequencing of the first three of the above mentioned 5 steps, including a feedback path where the output of the grouping process could affect subsequent feature extraction and similarity computations [16]. Pattern representation refers to the number of classes, the number of available patterns, and the number, type and scale of the features available to the clustering algorithm, where a pattern x is a single data item used by the clustering algorithm. Some of this information may not be controllable by the practitioner.

Fig. 3 : First 3 steps of clustering process (Patterns -> Feature selection / extraction -> Pattern representations -> Inter-pattern similarity -> Grouping -> Clusters, with a feedback loop)

Feature selection is the process of identifying the most effective subset of the original features to use in clustering, where the individual scalar components $x_i$ of a pattern x are called features. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering. Pattern proximity is usually measured by a distance function defined on pairs of patterns, such as the Euclidean distance between patterns. The grouping step can be performed in a number of ways.
The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion [17].

Fig. 4 : Approaches to clustering the data (Hierarchical: single link, complete link; Partitional: square error, graph theoretic, mixture resolving, expectation maximization, mode seeking)

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid. Different approaches to clustering data can be described with the help of the hierarchy shown in Fig. 4. In the work considered in this paper, the hierarchical type of clustering has been used [18]. Some of the popularly used clustering algorithms are described next.

The K-means algorithm takes the input parameter k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. The K-means algorithm proceeds as follows:
Arbitrarily choose k objects as the initial cluster centers.
Repeat
    (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
    update the cluster means, i.e., calculate the mean value of the objects for each cluster;
until no change.

Generally, the K-means algorithm has the following important properties, viz.,

It is efficient in processing large data sets.
It often terminates at a local optimum.
The clusters have spherical shapes.
It is sensitive to noise.

The algorithm is classified as a batch method, because it requires that all the data be available in advance. However, there are variants of the K-means clustering process which get around this limitation. Choosing proper initial centroids is the key step of the basic procedure. The time complexity of the K-means algorithm is O(nkl), where n is the number of objects, k is the number of clusters, and l is the number of iterations [19].

Another variant of the K-means algorithm is the single pass K-means algorithm. Generally the K-means algorithm takes a large number of iterations to converge. For handling large data sets, the K-means algorithm needs a buffering strategy which takes a single pass over the data set. It works as follows. B is the size of the buffer and k1 denotes the cluster representatives, which represent the means of the clusters. Initially, data of size B is taken into the buffer. On this data, the K-means algorithm is applied. The cluster representatives are stored in memory. The remaining data is discarded from memory. Again, data is loaded into memory from disk, but the K-means algorithm is now performed on this new data together with the previous cluster representatives. This process repeats until the whole data set is clustered. It takes less computational effort than the normal K-means algorithm.
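The basic (batch) K-means procedure listed above can be sketched as follows. This is an illustrative sketch, not the implementation used in this work: to keep the run deterministic it takes the first k objects as the initial centres (the text says "arbitrarily"), it uses Euclidean distance as the similarity, and the names `k_means` and `max_iter` are assumptions.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, max_iter=100):
    # Assumption: the first k objects serve as the initial cluster centres.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # "until no change"
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old centre.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return centers, assignment

points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [8.0, 9.0], [2.0, 1.0], [9.0, 8.0]]
centers, labels = k_means(points, k=2)
print(labels, centers)
```

Each pass over the data costs O(nk) distance computations, so l iterations give the O(nkl) complexity stated above. The single pass variant would call the same update on one buffer of size B at a time, carrying the previous representatives forward.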
However, the K-means algorithm suffers from its dependence on the initial guess of centroids and on the value of K, a lack of scalability, the ability to handle only numerical attributes, and the fact that the resulting clusters can be unbalanced.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is an integrated hierarchical clustering method. It uses a clustering feature and a clustering feature tree (CF-Tree), which is used to summarize cluster representations. A clustering feature (CF) is a triplet summarizing information about sub-clusters of the objects. Given N d-dimensional points or objects $\{x_i\}$ in a sub-cluster, the CF of the sub-cluster is defined as:

\[ CF = (N, LS, SS) \]

where N is the number of points in the sub-cluster, LS is the linear sum of the N points, and SS is the square sum of the data points. A CF-Tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. The non-leaf nodes store sums of the CFs of their children, and thus summarize clustering information about their children. A CF tree has 2 parameters, viz., branching factor B and threshold T. The branching factor specifies the maximum number of children a non-leaf node can have, and the threshold specifies the maximum diameter of the sub-clusters stored at the leaf nodes. BIRCH consists of 4 phases:

Phase 1 : Load into memory by building a CF Tree.
Phase 2 : Condense into desirable range by building a smaller CF Tree (Phase 2 is optional).
Phase 3 : Global clustering.
Phase 4 : Cluster refining (Phase 4 is optional).

Phase 1 : The CF tree is built as the objects are inserted. An object is inserted into the closest leaf entry. If the diameter of the sub-cluster stored in the leaf node becomes larger than the threshold, then the leaf node is split. After the insertion of a new object, information about it is passed towards the root of the tree.
Phase 2 : Phase 2 is optional. It condenses the tree so that its size suits the existing global or semi-global clustering methods applied in Phase 3.
Phase 3 : A global clustering algorithm is used to cluster the leaf nodes of the CF tree.
Phase 4 : Phase 4 is optional and entails the cost of additional passes over the data to correct inaccuracies and refine the clusters further. Up to this point, the original data has only been scanned once, although the tree and outlier information may have been scanned multiple times. Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters [20].

6. CLASSIFICATION OF CLASSIFIERS
Classifiers are used to predict the class label of a new document which is unlabelled [7]. For classification, we use classifiers like NNC (Nearest Neighbour Classifier) and its variant K-NNC (K-Nearest Neighbour Classifier).

6.1 NNC
The nearest neighbour classifier assigns to a test pattern the class label of its closest neighbour. If there are n patterns $X_1, X_2, \ldots, X_n$, each of dimension d, and each pattern is associated with a class c, and if we have a test pattern P, then if

\[ d(P, X_k) = \min_{i = 1, 2, \ldots, n} \{ d(P, X_i) \} \tag{5} \]

pattern P is assigned to the class associated with $X_k$. To compare the distances of a given test pattern to the other patterns, the nearest neighbour classifier uses the Euclidean distance, which for two points $(x_1, y_1)$ and $(x_2, y_2)$ is given by [1]

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{6} \]

6.2 K-NNC
The K-nearest neighbour classifier is a variant of the nearest neighbour classifier where, instead of finding just one nearest neighbour as in the case of the nearest neighbour classifier, k nearest neighbours are found. The nearest neighbours are found using the Euclidean distance. The majority class of these k nearest neighbours is the class label assigned to the new test pattern. The value chosen for k is crucial & with the right value of k, the classification accuracy will be better than that of the nearest neighbour classifier. For large data sets, k can be larger to reduce the error. Choosing k can be done experimentally: a number of patterns taken out of the training set are classified using the remaining training patterns for different values of k, & the value of k which gives the least classification error is chosen [2]. This method will reduce the classification error when the training patterns are noisy. The closest pattern to the test pattern may belong to another class, but when a number of neighbours are obtained & the majority class label is considered, the pattern is more likely to be classified correctly.

7. PORTER'S ALGORITHM
In this section, a small overview of Porter's algorithm is presented. The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge. The stemmer is based on the idea that the suffixes in the English language (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes. This stemmer is a linear step stemmer. The Porter stemming algorithm (or Porter Stemmer) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalization process that is usually done when setting up information retrieval systems. In Porter's algorithm, the number of vowel characters that are followed by a consonant character in the stem (the measure) must be greater than one for the rule to be applied [3]. Using Porter's algorithm, a code has been developed & used to classify emails into spam & non-spam emails more efficiently.

8. EVALUATION MEASURES
This is done in 2 steps, viz., classifier accuracy & alternatives to the accuracy measure.

8.1 Classifier Accuracy
Estimating classifier accuracy is important in that it allows one to evaluate how accurately a given classifier will label the test data. It can be calculated using the formula discussed below. The data set used for training and testing is the Ling-Spam corpus. Each of the 10 sub-directories contains spam and legitimate messages, one message in each file. The total number of spam messages is 481; the remaining messages are legitimate.

8.2 Alternatives to accuracy measure
A classifier is trained to classify emails as non-spam and spam mails [6]. An accuracy of 85 % may make the classifier seem accurate, but what if only a small percentage of the training samples are actually spam? Clearly an accuracy of 85 % may not be acceptable, since the classifier could be correctly labelling only the non-spam samples. Instead, we would like to be able to assess how well the classifier can recognize spam samples (referred to as positive samples) and how well it can recognize non-spam samples (referred to as negative samples). The sensitivity (recall) and specificity measures can be used, respectively, for this purpose. In addition, we may use precision to assess the percentage of samples labeled as spam that actually are spam samples. The evaluation measures used for the testing process in our research work are defined as follows [4]:

True Positive (TP) : the number of spam documents correctly classified as spam.
True Negative (TN) : the number of non-spam documents correctly classified as non-spam.
False Positive (FP) : the number of non-spam documents classified as spam.
False Negative (FN) : the number of spam documents classified as non-spam.
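A short sketch of how these four counts and the usual ratios built from them can be computed is given here. It takes spam as the positive class, so that a false positive is a legitimate message labelled as spam; the function names, the zero-denominator convention, and the toy label lists are assumptions made for illustration and are not results from this work.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP, FN with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # spam correctly classified as spam
            else:
                fp += 1  # non-spam classified as spam
        else:
            if a == positive:
                fn += 1  # spam classified as non-spam
            else:
                tn += 1  # non-spam correctly classified as non-spam
    return tp, tn, fp, fn

def ratio(numerator, denominator):
    # Convention for this sketch: report 0.0 when the denominator is zero.
    return numerator / denominator if denominator else 0.0

def measures(tp, tn, fp, fn):
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Toy labels, invented for the example.
actual = ["spam", "spam", "spam", "legit", "legit", "legit", "legit", "legit"]
predicted = ["spam", "spam", "legit", "spam", "legit", "legit", "legit", "legit"]
counts = confusion_counts(actual, predicted)
scores = measures(*counts)
print(counts, scores)
```

In this toy run the accuracy (0.75) is higher than the recall (2/3), which is the kind of gap between overall accuracy and spam recognition that section 8.2 warns about.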
TABLE III. THE DIFFERENT MEASURES USED FOR CLASSIFICATION OF SPAM & NON-SPAM SAMPLES
MEASURE = FORMULA. MEANING
Precision = TP / (TP + FP). The percentage of positive predictions that are correct.
Recall / Sensitivity = TP / (TP + FN). The percentage of positive labelled instances that were predicted as positive.
Specificity = TN / (TN + FP). The percentage of negative labelled instances that were predicted as negative.
Accuracy = (TP + TN) / (TP + TN + FP + FN). The percentage of predictions that are correct.

Note that the evaluation is done on the above 4 parameters. The different evaluation measures used in the research work considered are summarized in Table III.

9. DETAILED DESIGN
The design involves three parts, viz., the vector space model, the CF tree & the development of the DFD.

Design Constraints : The design constraints are divided into software & hardware constraints, which are listed below.
Software Constraints : Linux operating system. Gcc compiler to compile C programs.
Hardware Constraints : Intel Pentium processor. 1 GB RAM. 80 GB hard disc space.
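The CF tree named in the design above is built from clustering features CF = (N, LS, SS). What makes them convenient for a tree is that they are additive: the CF of two merged sub-clusters is the component-wise sum of their CFs, so a non-leaf node can summarize its children without revisiting the data. The sketch below illustrates only this property; it is not the tree implementation of this work, the function names are hypothetical, and SS is taken here as the scalar sum of squared coordinates.

```python
def make_cf(points):
    """Clustering feature (N, LS, SS) of a non-empty list of d-dimensional points."""
    n = len(points)
    ls = [sum(column) for column in zip(*points)]   # linear sum, one value per dimension
    ss = sum(x * x for p in points for x in p)      # square sum over all coordinates
    return n, ls, ss

def merge_cf(a, b):
    """CF of the union of two disjoint sub-clusters (additivity)."""
    return a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2]

def cf_centroid(cf):
    """The centroid is recoverable from the CF alone as LS / N."""
    n, ls, _ = cf
    return [x / n for x in ls]

left = make_cf([[1, 2], [3, 4]])
right = make_cf([[5, 6]])
merged = merge_cf(left, right)
print(left, right, merged, cf_centroid(merged))
```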
9.1 Vector space model
Due to the large number of features (terms) in the training set, the memory requirements are high. Arrays cannot be used to store the features as this leads to memory problems, so we use a linked list to implement the storage of features and the Tf-idf calculation [5]. As the training set contains a large number of documents, the documents are also implemented in the linked list format as shown in Fig. 5.

Fig. 5 : Linked list format in the model (term nodes Term 1, Term 2, ..., Term d, each linked to document nodes Doc 1, Doc 2, ..., Doc n holding Tf and Tf-idf)

9.2 CF Tree
BIRCH is used to cluster large amounts of data. It inserts the data into the nodes of a CF tree one by one for efficient memory usage. The insertion of a data item into the CF tree is carried out by traversing the CF tree top-down from the root according to an instance-cluster distance function, i.e., the Euclidean distance [6]. The data item is then inserted into the closest sub-cluster under a leaf node as shown in Fig. 6. Note that in Fig. 6, B = 7 & L = 6.

Fig. 6 : BIRCH insertion tree (non-leaf nodes hold CF entries with child pointers; leaf nodes hold CF entries with Prev and Next pointers)

The data flow diagram used for the design of the algorithm for efficient spam mail classification is shown in Fig. 7 along with the inputs & outputs. The inputs & outputs shown in Fig. 7 can be further explained as the following 5 step procedure [7].

1) In pre-processing of data, there are two main modules, i.e., Stopping and Stemming.
Stopping
Input : Training & test document.
Output : Document with stopped words.
Stemming
Input : Output of Stopping module, i.e., document with words that are stopped.
Output : Document with stemmed words.
Fig. 7 : Data flow diagram (DFD) of the designed system or the proposed model (Training & test documents (input) -> Stopping & Stemming -> documents with stopped & stemmed words -> Vector space model -> file converted to vector format with frequency and Tf-idf -> training pattern -> Clustering -> centroids; centroids and test pattern -> Classification (spam or non-spam))

2) In Vector Space Model,
Input : Output of stemming module.
Output : The Tf-idf of each document.

3) Data Reduction has two main modules, i.e., K-means and BIRCH; both have identical Input and Output forms.
Input : Vector representation of the training data.
Output : Two sets of data, one belonging to spam & the other to non-spam, represented by centroids.

4) Classification also has two main modules, i.e., NNC and K-NNC, where both have identical Input and Output forms.
Input : Test pattern from the user & the centroids.
Output : The classified result, i.e., whether the pattern belongs to the Spam or Non-Spam category.

5) The main module is the integration of all the above four stages.
Input : Training pattern and test pattern, where only the training patterns are clustered.
Output : The classified result of the test pattern and the accuracy.

Sequence diagrams are also drawn for stopping, stemming, vector space model, K-means, BIRCH, NNC & K-NNC, which are not shown in the paper for the sake of convenience [8].
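Stages 2 to 4 of the procedure above (vector space model, data reduction to centroids, classification of a test pattern) can be tied together in a compact sketch. It makes several simplifying assumptions that differ from the work described here: each class is reduced to a single centroid instead of running K-means or BIRCH, the documents are assumed to be already stopped and stemmed, test terms outside the training vocabulary are ignored, and all names and toy documents are invented for illustration.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf_t = log(N / df_t), computed over the training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(doc, idf, vocabulary):
    """tf-idf_{t,d} = tf_{t,d} * idf_t, one component per vocabulary term."""
    tf = Counter(doc)
    return [tf[term] * idf[term] for term in vocabulary]

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(vector, centroids):
    """Nearest-neighbour rule applied to the class centroids."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))

# Toy training documents, already stopped and stemmed (invented for the example).
training = {
    "spam": [["free", "prize", "claim"], ["free", "cash", "prize"]],
    "non-spam": [["meeting", "agenda", "notes"], ["project", "meeting", "notes"]],
}
all_docs = [doc for docs in training.values() for doc in docs]
idf = idf_table(all_docs)
vocabulary = sorted(idf)

# Data reduction (simplified): one centroid per class.
centroids = {
    label: centroid([tfidf_vector(doc, idf, vocabulary) for doc in docs])
    for label, docs in training.items()
}

result = classify(tfidf_vector(["claim", "free", "prize"], idf, vocabulary), centroids)
print(result)
```

Because classification only compares the test pattern with the centroids, its cost depends on the number of representatives rather than on the number of training documents, which is the purpose of the data reduction stage.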
10. Coding

Top-level pseudo code was developed for the implementation in C language. The code developed in this research work consisted of 4 modules, viz., the main module, stopping, stemming & vector space modeling. Each module is explained as follows [9]:

Main Module
Step 1 : Read each word from each document.
Step 2 : If the scanned word is a stop word, then remove the stop word.
Step 3 : Perform stemming based on the rules of the stemming algorithm.
Step 4 : Build the vocabulary and calculate the Tf-idf.
Step 5 : Cluster the documents by either of the two clustering algorithms, K-means or BIRCH.
Step 6 : Classify the test document by using either the NN or the K-NN classifier.

Stopping
Step 1 : Check if the word in the main module is present in the stop list of words.
Step 2 : If present, then remove the word.
Step 3 : Else do not remove.
Step 4 : Check if the data is a number or any special symbol.
Step 5 : If so, remove that word.

Stemming
Step 1 : If the word is not stopped, then check if a root word exists for that word by the various rules provided by the algorithm.
Step 2 : If a root word exists, then replace all the occurrences of that word with the root word.

Vector space model
Step 1 : Check if the word is already present in the vocabulary list.
Step 2 : If not, insert this word into a new node and update the document number and frequency in the corresponding node.
Step 3 : If the word is already present, and if it is appearing for the first time in the document, then create a new node with the document number and its corresponding frequency.
Step 4 : Else, if the word is appearing again in the same document, then increment the frequency.
Step 5 : Calculate the inverse document frequency (idf) for each term (word) by the formula idf = log(N / df_t), where N is the total number of documents and df_t is the number of documents in which the term has occurred.
Step 6 : Calculate the Tf-idf of each word in each document by the formula Tf-idf = Frequency * idf.
11. TESTING OF THE DESIGNED & DEVELOPED SOFTWARE MODULE WITH TEST CASE SPECS

Testing is a very important process in the design & development of any software. It uncovers bugs in the software so that the application can become a successful product. It can be done in four different stages: unit testing, module testing, integration testing and system testing. A very important criterion for testing is the data set used, i.e., the corpus. The corpus used for training and testing is the Ling Spam corpus [10]. In LingSpam, there are four subdirectories, corresponding to 4 versions of the corpus, viz.,
bare: Lemmatiser disabled, stop-list disabled,
lemm: Lemmatiser enabled, stop-list disabled,
lemm_stop: Lemmatiser enabled, stop-list enabled,
stop: Lemmatiser disabled, stop-list enabled,
where lemmatizing is similar to stemming and stop-list tells whether stopping is done on the content of the parts or not. Our analysis is done with the lemm_stop subdirectory.

Each one of these 4 directories contains 10 subdirectories (part 1, ..., part 10). These correspond to the 10 partitions of the corpus that were used in the 10-fold experiments. In every part, two-thirds of the content is taken as training data and one-third as the test data. Each one of the 10 subdirectories contains both spam and legitimate messages, one message in each file. Files whose names have the form spmsg*.txt are spam messages. All other files are legitimate messages. The total number of spam messages is 481 and that of legitimate messages is [...].

Rationale for the chosen data set: it is easy to preprocess, relatively small in terms of features, and simple, with only two categories. It is thus not very demanding computationally and not very time consuming, yet still illustrative and of high practical importance.

K-means
Step 1 : Select k initial centres.
Step 2 : repeat {
             assign every data instance to the closest cluster based on the distance between the data instance and the center of the cluster
             compute the new centers of the k clusters
         } until (the convergence criterion is met)
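The two K-means steps above can be sketched in C as follows. This is a minimal sketch rather than the authors' implementation: the vectors are assumed to be dense and stored row by row in flat arrays, the initial centres are supplied by the caller, and the convergence criterion is assumed to be "no assignment changed in a pass" with an upper limit on the number of passes. The names kmeans, sq_dist and DIM are hypothetical.

```c
#include <stddef.h>

#define DIM 2   /* assumed vector length, for illustration only */

/* Squared Euclidean distance; enough for finding the closest centre. */
static double sq_dist(const double *a, const double *b)
{
    double s = 0.0;
    for (size_t d = 0; d < DIM; d++) {
        double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

/*
 * pts     : n points, stored row by row (n * DIM values)
 * centres : k centres (k * DIM values); holds the initial centres on
 *           entry (Step 1) and the final centres on return; k >= 1
 * assign  : receives the cluster index of every point
 * Returns the number of passes performed.
 */
int kmeans(const double *pts, size_t n, double *centres, size_t k,
           int *assign, int max_iter)
{
    for (size_t i = 0; i < n; i++)
        assign[i] = -1;                      /* nothing assigned yet */

    int passes = 0;
    while (passes < max_iter) {
        int changed = 0;
        passes++;

        /* Step 2a: assign every instance to the closest centre. */
        for (size_t i = 0; i < n; i++) {
            size_t best = 0;
            for (size_t c = 1; c < k; c++)
                if (sq_dist(&pts[i * DIM], &centres[c * DIM]) <
                    sq_dist(&pts[i * DIM], &centres[best * DIM]))
                    best = c;
            if (assign[i] != (int)best) {
                assign[i] = (int)best;
                changed = 1;
            }
        }
        if (!changed)
            break;                           /* convergence criterion met */

        /* Step 2b: recompute each centre as the mean of its members. */
        for (size_t c = 0; c < k; c++) {
            double sum[DIM] = {0.0};
            size_t count = 0;
            for (size_t i = 0; i < n; i++) {
                if (assign[i] != (int)c)
                    continue;
                for (size_t d = 0; d < DIM; d++)
                    sum[d] += pts[i * DIM + d];
                count++;
            }
            if (count == 0)
                continue;                    /* empty cluster keeps its centre */
            for (size_t d = 0; d < DIM; d++)
                centres[c * DIM + d] = sum[d] / (double)count;
        }
    }
    return passes;
}
```

The result depends on the initial centres, so in practice the run is often repeated with different starting points.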
TABLE III. TESTING SCHEDULE
Sequence No. | Test case | Conditions being checked | Expected output
1 | [...] | Value K | The larger the data size & the higher the value of K, the better the clustering
2 | [...] | Value clustern | The larger the data size, the higher the value of K & the larger the value of clustern, the better the clustering
3 | [...] | Branching Factor | The smaller the branching factor, the better the cluster quality & hence the more centroids are obtained
4 | [...] | Threshold | The higher the threshold, the better the cluster quality & the more centroids
5 | [...] | K | The larger the data & the higher the value of K, the better the classification results

TABLE IV. EVALUATION MEASURES
Measure | Test document | Classified to
True Positive | Spam | Spam
True Negative | Non-Spam | Non-Spam
False Positive | Spam | Non-Spam
False Negative | Non-Spam | Spam

BIRCH
Phase 1 : Scan all data and build an initial CF tree.
Phase 2 : Condense into desirable length by building a smaller CF tree.
Phase 3 : Global clustering.
Phase 4 : Cluster refining (optional); requires more passes over the data to refine the results.

NN
Step 1 : Get the centroids from the clustering module.
Step 2 : Calculate the distance between the test data and each centroid.
Step 3 : Assign the test data to the class associated with the least of the distances calculated.

K-NN
Step 1 : Calculate the distance of the test data with respect to each centroid.
Step 2 : Find the K nearest neighbors from the distances calculated above.
Step 3 : Assign the test data to the class label that holds the majority among the K minimum distances.

12. RESULTS AND INFERENCE

The coding was done in C. After the code was run, various performance measures such as precision, recall, specificity & accuracy were observed. The results are shown in Figs. 8 to 11.

12.1 Precision

Fig. 8 : Plot of measure of precision vs. data size

Inference : The percentage of positive predictions that are correct is high for nearest neighbour classifiers. Table V and the graph in Fig. 8 show that for large data sets, [...] with [...] and [...] with [...] have an optimal value.

TABLE V. QUANTITATIVE RESULTS OF PRECISION
Data size | [...] | [...] | [...] | [...]
[...] | [...]% | 72.9% | 75% | 61.9%
[...] | [...]% | 74.4% | 88.2% | 91.6%
[...] | [...]% | 56.6% | 93.7% | 69.8%
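The precision, recall, specificity and accuracy figures reported in this section can all be derived from four counts on the test set. A minimal sketch in C follows, using the standard definitions, which agree with the wording of the inferences (for example, precision as the share of positive predictions that are correct). The type Confusion and the function names are hypothetical, and a zero denominator is mapped to 0.0 as a convention of this sketch.

```c
/* Counts obtained by comparing predicted and true labels on the test set. */
typedef struct {
    unsigned tp;   /* true positives  */
    unsigned tn;   /* true negatives  */
    unsigned fp;   /* false positives */
    unsigned fn;   /* false negatives */
} Confusion;

/* num / den as a double, or 0.0 when the denominator is zero. */
static double ratio(unsigned num, unsigned den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Share of positive predictions that are correct. */
double precision(Confusion c)
{
    return ratio(c.tp, c.tp + c.fp);
}

/* Share of positive-labelled instances that are predicted positive. */
double recall(Confusion c)
{
    return ratio(c.tp, c.tp + c.fn);
}

/* Share of negative-labelled instances that are predicted negative. */
double specificity(Confusion c)
{
    return ratio(c.tn, c.tn + c.fp);
}

/* Share of all instances that are classified correctly. */
double accuracy(Confusion c)
{
    return ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn);
}
```

For example, with the made-up counts tp = 30, tn = 50, fp = 10 and fn = 10, precision and recall are both 0.75, specificity is 50/60 and accuracy is 0.8.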
12.2 Recall

Fig. 9 : Plot of measure of recall vs. data size

Inference : The percentage of positive labelled instances that are predicted positive is high for the combination of the [...] algorithm with [...] as the classifier, and the percentage increases as the data set size increases. [...] does not work well for smaller data sets. The recall values in Table VI indicate that for large data sets, [...] with [...] has a high value, which can also be observed from Fig. 9.

TABLE VI. QUANTITATIVE RESULTS OF RECALL
Data size | [...] | [...] | [...] | [...]
[...] | [...]% | 84.3% | 63.1% | 49.6%
[...] | [...]% | 97.6% | 71.4% | 52.3%
[...] | [...]% | 96.2% | 58.1% | 63.7%

12.3 Specificity

Fig. 10 : Plot of measure of specificity vs. data size

TABLE VII. QUANTITATIVE RESULTS OF SPECIFICITY
Data size | [...] | [...] | [...] | [...]
[...] | [...]% | 68.7% | 69.2% | 75%
[...] | [...]% | 66.6% | 90.4% | 95.2%
[...] | [...]% | 53.9% | 93.6% | 82.8%

Inference : The percentage of negative labelled instances that are predicted as negative is high for the combination using [...] as the classifier. The specificity values for large data sets, as seen from Table VII, are optimal for the [...] with [...] combination, which can also be observed from Fig. 10.

12.4 Accuracy

Fig. 11 : Plot of measure of accuracy vs. data size

TABLE VIII. QUANTITATIVE RESULTS OF ACCURACY
Data size | [...] | [...] | [...] | [...]
[...] | [...]% | 76.56% | 65.62% | 57.81%
[...] | [...]% | 82.14% | 80.9% | 73.8%
[...] | [...]% | 70.19% | 71.6% | 75.48%

Inference : Accuracy for [...] with [...] has an optimal value as the data set increases; [...] also works well for smaller data sets. It can be seen from the graph in Fig. 11 that the conditions being checked hold good for large data and that [...] with [...] is the best combination as the data set increases. It can also be seen that [...] with [...] is more accurate for large data, which can be observed from the quantitative results shown in Table VIII.

TABLE IX. COMPARISONS OF BIRCH & K-MEANS WITH DATASETS
Criterion | BIRCH | K-means
Time | Faster | Slower
Sensitivity to input pattern of dataset | Yes | No
Cluster quality (center location, number of data points in a cluster, radii of clusters) | More accurate | Less accurate
Demand for memory | Less | More

Finally, comparisons are made between BIRCH & K-means, and the advantages / disadvantages are shown in Table IX. It is concluded that BIRCH is the best when data sets are taken into consideration.

13. Conclusions

In this paper, an email clustering method is proposed and implemented to efficiently detect spam mails. The proposed technique includes the distance between all of the attributes of an email. The proposed technique is implemented using open source technology in C language; the Ling Spam corpus data set was selected for the experiment. Different performance measures such as precision, recall, specificity & accuracy were observed. The [...] clustering algorithm works well for smaller data sets.
[...] with [...] is the best combination as it works better with large data sets. In BIRCH clustering, decisions are made without scanning the whole data, using local information (each clustering decision is made without scanning all data points). BIRCH is a better clustering algorithm, requiring a single scan of the entire data set and thus saving time. The work presented in this paper can be further extended & tested with different algorithms and varying sizes of large data sets.

REFERENCES
[1] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani and Liadan O'Callaghan, Clustering Data Streams, IEEE Trans. on Knowledge & Data Engg.
[2] Enrico Blanzieri and Anton Bryl, A Survey of Learning-Based Techniques of Email Spam Filtering, Conference on Email and Anti-Spam.
[3] Jain A.K., M.N. Murthy and P.J. Flynn, Data Clustering : A Review, ACM Computing Surveys.
[4] Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: An Efficient Data Clustering Method For Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison.
[5] Porter M., An algorithm for suffix stripping, Proc. Automated Library Information Systems.
[6] Manning C.D., P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press.
[7] Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, Wiley-Interscience Pubs., 2nd Edn.
[8] [...]
[9] [...]
[10] [...]
[11] [...]
[12] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, Second Edn.
[13] Ajay Gupta and R. Sekar, An Approach for Detecting Self-Propagating Email Using Anomaly Detection, Springer Berlin / Heidelberg, Vol. 2820/2003.
[14] Anagha Kulkarni and Ted Pedersen, Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts, 2nd Indian International Conference on Artificial Intelligence (IICAI-05).
[15] Bryan Klimt and Yiming Yang, The Enron Corpus: A New Dataset for Email Classification Research, European Conference on Machine Learning, Pisa, Italy.
[16] Sahami M., S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, AAAI 98 Workshop on Learning for Text Categorization.
[17] Sculley D., Gordon V. Cormack, Filtering Email Spam in the Presence of Noisy User Feedback, CEAS 2008: Proc. of the Fifth Conference on Email and Anti-Spam, Aug.
[18] Dave DeBarr, Harry Wechsler, Spam Detection using Clustering, Random Forests, and Active Learning, CEAS 2009 Sixth Conference on Email and Anti-Spam, Mountain View, California, USA, July 16-17.
[19] Manning, C.D., Raghavan, P., and Schutze, H., Scoring, Term Weighting, and the Vector Space Model, Introduction to Information Retrieval, Cambridge University Press, Cambridge, England.
[20] Naresh Kumar Nagwani and Ashok Bhansali, An Object Oriented Email Clustering Model Using Weighted Similarities between Emails Attributes, International Journal of Research and Reviews in Computer Science (IJRRCS), Vol. 1, No. 2, Jun.

Mr. M. Basavaraju completed his Masters in Engineering in Electronics and Communication Engg. from the University Visvesvaraya College of Engg. (Bangalore), Bangalore University in 1990, & his B.E. from Siddaganga Institute of Technology (Tumkur), Bangalore University. He has a vast teaching experience of 23 years & an industrial experience of 7 years. Currently, he is working as Professor and Head of the Computer Science & Engg. Dept., Atria Institute of Technology, Bangalore, Karnataka, India. He is also a research scholar at Coimbatore Inst. of Tech., Coimbatore, doing his research work & progressing towards his Ph.D. in the computer science field from Anna University Coimbatore, India. He has conducted a number of seminars, workshops, conferences and summer courses in various fields of computer science & engineering. His research interests are Data Mining, Computer Networks and Parallel Computing.

Dr. R. Prabhakar obtained his B. Tech. degree from IIT Madras in 1969, his M.S. from Oklahoma State University, USA and his Ph.D. from Purdue University, USA. Currently he is Professor of Computer Science and Engineering and Secretary of Coimbatore Institute of Technology, Coimbatore, India.
His areas of specialization include Control Systems, CNC Control, Robotics, Computer Graphics, Data Structures, Compilers and Optimization. He has published a number of papers in various national & international journals and conferences of high repute. He has done a number of projects at the national & international level. At the same time, he has guided a number of students at the UG, PG & doctoral levels.
How To Install Fcus Service Management Software On A Pc Or Macbook
FOCUS Service Management Sftware Versin 8.4 fr Passprt Business Slutins Installatin Instructins Thank yu fr purchasing Fcus Service Management Sftware frm RTM Cmputer Slutins. This bklet f installatin
System Business Continuity Classification
Business Cntinuity Prcedures Business Impact Analysis (BIA) System Recvery Prcedures (SRP) System Business Cntinuity Classificatin Cre Infrastructure Criticality Levels Critical High Medium Lw Required
Research Report. Abstract: Advanced Malware Detection and Protection Trends. September 2013
Research Reprt Abstract: Advanced Malware Detectin and Prtectin Trends By Jn Oltsik, Senir Principal Analyst With Jennifer Gahm, Senir Prject Manager September 2013 2013 by The Enterprise Strategy Grup,
IMT Standards. Standard number A000014. GoA IMT Standards. Effective Date: 2010-09-30 Scheduled Review: 2011-03-30 Last Reviewed: Type: Technical
IMT Standards IMT Standards Oversight Cmmittee Gvernment f Alberta Effective Date: 2010-09-30 Scheduled Review: 2011-03-30 Last Reviewed: Type: Technical Standard number A000014 Electrnic Signature Metadata
NASDAQ BookViewer 2.0 User Guide
NASDAQ BkViewer 2.0 User Guide NASDAQ BkViewer 2.0 ffers a real-time view f the rder depth using the NASDAQ Ttalview prduct fr NASDAQ and ther exchange-listed securities including: The tp buy and sell
Durango Merchant Services QuickBooks SyncPay
Durang Merchant Services QuickBks SyncPay Gateway Plug-In Dcumentatin April 2011 Durang-Direct.cm 866-415-2636-1 - QuickBks Gateway Plug-In Dcumentatin... - 3 - Installatin... - 3 - Initial Setup... -
Traffic monitoring on ProCurve switches with sflow and InMon Traffic Sentinel
An HP PrCurve Netwrking Applicatin Nte Traffic mnitring n PrCurve switches with sflw and InMn Traffic Sentinel Cntents 1. Intrductin... 3 2. Prerequisites... 3 3. Netwrk diagram... 3 4. sflw cnfiguratin
McAfee Enterprise Security Manager. Data Source Configuration Guide. Infoblox NIOS. Data Source: September 2, 2014. Infoblox NIOS Page 1 of 8
McAfee Enterprise Security Manager Data Surce Cnfiguratin Guide Data Surce: Infblx NIOS September 2, 2014 Infblx NIOS Page 1 f 8 Imprtant Nte: The infrmatin cntained in this dcument is cnfidential and
