Dimensionality Reduction and Model Selection for Click Prediction

Size: px
Start display at page:

Download "Dimensionality Reduction and Model Selection for Click Prediction"

Transcription

1 Dmesoalty Reducto ad Model Selecto for Clc Predcto Author: Aet Mtra Supervsor: dr. Aue Pot Supervsor: dr. Evert Haasd Reader: dr. Fetse Bma Faculty of Sceces Bassweg 52D 08 HV Amsterdam 034 AP Amsterdam Abstract: As part of ths research, a Isghts System has bee bult, usg a esemble of Bg Data Techologes ad Mache Learg Algorthms. The Isghts System ca be used to process large data sets (mpressos, bd requests), create vewer hstory ad compute the values of varous derved attrbutes. The system further facltates feature ad model selecto usg dfferet statstcal techques ad mache learg algorthms, thereby eablg us to detfy, a set of relevat features ad a model, that estmates the lelhood of whether a mpresso wll result a clc or ot. Keywords: Hadoop, MogoDB, HIVE, MapReduce, Feature Selecto, Naïve Bayes, Logstc Regresso, Support Vector Maches

2 Dmesoalty Reducto & Model Selecto for Clc Predcto 2 Executve Summary The advet of aucto-based meda buyg s fast trasformg the ladscape of dgtal meda, thereby allowg advertsers ad publshers to obta cotextual relevace advertsemets, whch a user sees o a webste. Such customer maretg effectveess, combed wth process effcecy, promses better accoutablty for advertsers o ther ad speds. As per the Iteret Advertsg Board report publshed 203, the Dutch ole advertsg maret was 69m H 203, a H/H growth of 5.8%.Moreover, the report expected the Dutch dsplay advertsg maret to grow at 6% 204 mostly drve by hgh growth rates the aucto-based meda buyg. Aucto-based meda buyg has led to the creato of a ecosystem comprsg of Ageces, Tradg Dess, Data Provders, Buyg ad Sellg Solutos Provders as show the Dsplay Advertsg Ecosystem of Netherlads. Adscece, whch s a buy sde solutos provder, coects to multple sell sde solutos provders (that ru the auctos) ad partcpates aucto-based meda buyg o behalf of meda ageces /advertsers. Therefore, Adscece strves to combe meda buyg effcecy wth dsplay advertsemet effectveess order to delver greater accoutablty to advertsers o ther ad sped. Moreover, a Wterberry Group Whte Paper publshed 203, revealed that the 260 executve-level mareters, techologsts ad meda dustry leaders that t tervewed, lay a hgh emphass o applyg programmatc approached to audece segmetato (9% of respodets) ad actoable sght developmet (88% of respodets).ths thess wll provde detals of a Isghts System that has bee bult for Adscece ad usg the system, dfferet approaches for audece segmetato wll be explored. Moreover, usg the sghts system, dfferet attrbute selecto methods wll be explored, that ehaces the effectveess of a dsplay advertsemet. Ths system wll further streamle the meda buyg effcecy of Adscece by allowg them to detfy the optmum bd prce of a aucto.

3 Dmesoalty Reducto & Model Selecto for Clc Predcto 3 Preface Ths thess s part of my Master s degree program at VU Uversty, where each studet s requred to perform a research, as part of the fal semester of ther study. I would le to exted my grattude to dr. Aue Pot for hs supervso, dr. Evert Haasd for hs gudace ad dr Fetse Bma for her advce throughout. I am grateful to all my colleagues at Adscece for beg supportve durg the perod of the tershp. Aet Mtra Aetmtra00@gmal.com Jauary, 204

4 Dmesoalty Reducto & Model Selecto for Clc Predcto 4 Cotets Executve Summary... 2 Preface Itroducto The Isghts System Overvew The ETL Process Extracto Trasformato Load Setup of the Isghts System Data Storage Data Processg Data Aalyss Exteral API s Iterfaces Dmesoalty Reducto & Model Selecto The Data Mg Process Notatos & Deftos Clc Predcto Process DRMS Flow Dagram Naïve Bayes Algorthm The Model Estmato of Lelhood Estmato of Clc Probablty for New Attrbute Values Logstc Regresso The Model Estmato of Clc Probablty Support Vector Mache The Model RBF Kerel Parameter Estmato... 27

5 Dmesoalty Reducto & Model Selecto for Clc Predcto Dmesoalty Reducto for Logstc Regresso ad SVMs Trasformato of Features Recever Operatg Characterstcs (ROC) Aalyss Estmato of Model Performace Feature Selecto Feature Selecto Methods Results Coclusos ad Recommedatos Coclusos Recommedatos Appedx Appedx A-Data Flow Dagram (Extracto Process) Appedx B-Data Flow Dagram (Trasform Process) Appedx C-Etty Relatoshp Dagram Appedx D-Class Dagram Refereces... 45

6 Dmesoalty Reducto & Model Selecto for Clc Predcto 6. Itroducto Realtme Bddg (RTB) allows advertsers to bd o the opportuty to show a advertsemet to a specfc user o a specfc ste wth a dsplay advertsemet slot [5]. To help advertsers partcpate realtme bddg, termedares exst called demad sde platforms (DSP), that coect to multple ad exchages (whch ru the auctos), ad partcpate realtme bddg o behalf of the advertser. The demad sde platform ca access vetory from these ad exchages ad decde whether or ot they wat to bd ad at what prce. A vetory s defed as the place where the advertsemet wll be show. Each wo bd s termed as a mpresso. The goal s to detfy f showg a partcular advertsemet to a vewer wll result a clc ad to determe the bd prce accordgly. The lelhood of a vewer clcg o a advertsemet depeds o a umber of dfferet factors (attrbutes), ad some factors have a greater fluece tha others the fal outcome (clc/o-clc). The goal ca therefore be restated as detfcato of a set of factors that crease the lelhood of a vewer clcg o a partcular advertsemet Adscece provdes a demad sde platform that eables meda ageces ad advertsers to partcpate realtme bddg for dsplay ads. The core busess of Adscece s Maretg Accoutablty: demostratg the relatoshp betwee maretg efforts o the oe had, ad umstaable addtoal returs o the other. Adscece s also specalzed real-tme campag maagemet. NOAX, Adscece s optmzato platform, explores ad dscovers such relatoshps ad maes real-tme decsos, or hghlghts these relatoshps to maretg ad sales specalsts. Adscece s afflated wth Ortec, a global maret leader software ad cosultacy the feld of plag ad optmzato []. NOAX, the real-tme bddg algorthm of Adscece uses a Naïve Bayes algorthm to determe the bd prce of a baer to be dsplayed for a vetory. The algorthm uses a set of base attrbutes (receved as part of the bd requests) as well as derved attrbutes to determe the clc probablty of each vewer. The detals regardg each wo-bd (mpresso) ad the outcome of t (o acto, clc, coverso) s regstered ad stored the database. Ths formato s processed frequetly order to update the statstcs regardg each attrbute. A set of delvery rules ca be set that restrcts the algorthm from explorg uterestg audeces [4]. Moreover, NOAX allows campag maagers to defe costrats o the budget spet, bd prce, umber of mpressos etc. NOAX also allows campag maagers to do audece retargetg maly o two ds of attrbutes: the curret url ad the browsg patter of the user [4]. As show Fgure [], Adscece provdes a terface that allows campag maagers to create ad maage campags whle a collecto of bddg ad database servers, process ad store bds formato. Adscece receves approxmately 00M bd requests ad serves more tha M mpressos per day. Ths umber s set to grow rapdly as Adscece coects to more Ad Exchages ad therefore, gas access to global vetory the ear future. Ths allows Adscece access to large volumes of data ad therefore opes up the possblty of mg these massve datasets [6]. Therefore, Adscece requres a system that ca store ad process the bd requests ad mpressos. Moreover, the data stored ths ew system eeds to be processed to compute the values of varous attrbutes. Furthermore, the system should eable attrbute selecto usg varous mache learg techques ad also serve as a test bed for estmatg the clc probablty of a mpresso usg alterate algorthms le Logstc Regresso ad Support Vector Maches.

7 Dmesoalty Reducto & Model Selecto for Clc Predcto 7 Mareters Meda Ageces Campags Meda Ageces Campags Baers Baers CTR Coverso Rate Reveues ad Cost per Clc Reveues ad Cost per Coverso Reveues ad Cost per thousad Impressos Bddg Servers Bddg Servers NOAX Strateges Database Servers Bddg Servers Database Servers Bddg Servers Ad Exchages Supply Sde Platforms Publshers Database Servers Cosumers Fgure : NOAX- A Overvew Bddg Servers Bddg Servers Impressos Clcs Coversos The am of ths research s to mplemet a Isghts System usg a esemble of Bg Data techologes ad Mache Learg pacages for feature selecto ad classfer comparso. The Isghts System extracts the bds requests ad mpressos from the operatoal platform ad stores t a dstrbuted fle system (bd requests) ad a NoSQL data store (mpressos) [4][5]. Ths data s the trasformed usg a set of map-reduce obs to compute the values of varous attrbutes [6].The trasformed data set (attrbute values per mpresso) s the loaded to a evromet sutable for statstcal computg ad graphcs, where dfferet mache learg pacages are used to detfy the attrbutes that best predct the lelhood of a vewer clcg o a dsplay advertsemet [7] [8]. A umber of feature selecto methods le Wrappers, Flters ad Embedded methods have bee evaluated for ths purpose. Due to the hgh dmesoalty ad large umber of mpressos, t has bee foud that Wrappers have a hgh tme complexty. Therefore, a uque method has bee proposed, that combes the flter ad wrapper methods where a flter method s used to reduce the dmesoalty of the dataset by selectg attrbutes that have a hgh correlato wth the outcome whle a low correlato amog themselves. The reduced feature set s the passed to a wrapper method that performs a teratve ROC aalyss to detfy the set of features that best predct the lelhood of a vewer clcg o a dsplay advertsemet. Moreover, the exstece of class sew the data set has led us to formulate a ovel approach for ROC aalyss by rag mpressos based o ther estmated clc probabltes. Fally, the clc probablty of each mpresso has bee estmated wth alterate algorthms le Logstc Regresso ad Support Vector Maches ad performace of these algorthms have bee compared agast the Naïve Bayes algorthm [8][9][20]. The predctos of the models have bee valdated agast the actual outcome of dstct valdato sets (to assess robustess) ad methods le ROC aalyss ad statstcal sgfcace tests have bee used to draw coclusos o the performace of the models thereby allowg us to draw coclusos o the comparatve aalyss of the models. Moreover, the system s beg used for other research purposes le vewer proflg ad coverso attrbuto whch are out of scope for ths report.

8 Dmesoalty Reducto & Model Selecto for Clc Predcto 8 2. The Isghts System 2. Overvew The Isghts System s a esemble of Bg Data Techologes ad a Statstcal Aalyss Evromet for processg ad aalyzg mpressos data. As show Fgure [2], the system processes approxmately 00M+ bd requests, M+ mpressos, os, ther correspodg outcomes (o clcs, clcs, coversos) ad computes the attrbutes values for each mpresso daly. Moreover, the system rus varous mache learg algorthms o large volumes of mpressos data to estmate the clc probablty for each mpresso. Fgure 2: Isghts System A Overvew

9 Dmesoalty Reducto & Model Selecto for Clc Predcto The ETL Process 2.2. Extracto The extracto process, as show Appedx A, sources bd requests, mpressos ad evets data (clcs, coversos ad retargetg) from the operatoal system. Moreover, the extracto process also extracts doma categores ad average tme spet by a vewer o a webpage. Bd requests are extracted from each of the bddg server ad cota the followg formato Vewer Id Impresso Id Baer Id Posto of the Baer Sze of the Baer Browser Doma Name IP Address Laguage Operatg System SSP Tmestamp The value of Baer Id dcates whether we bd for the requested mpresso or ot. A Baer Id of - dcates that we dd ot bd for the mpresso. The Impresso Id s a cocateato of the Tmestamp ad Vewer Id. Smlarly, mpressos data s extracted from the operatoal databases ad cotas the followg formato Impresso Id Vewer Id Tmestamp Doma Name Estmated CTR (Clc Through Rate) IP Address Operatg System Baer Id Campag Id Browser Moreover, evets data (clcs, coversos & retargetg) s extracted whch cotas the followg formato Vewer Id Evet Tme The extracto process extracts these data sets ad stores them the Isghts System for further processg.

10 Dmesoalty Reducto & Model Selecto for Clc Predcto Trasformato The trasform process, as show Appedx B, creates ad stores vewer profle.e mpressos ad ts outcomes (clc, o clc, coverso, o coverso, retargetg) are aggregated per vewer. Creato of vewer profle trasforms the attrbute computato problem to a embarrassgly parallel problem such that the attrbutes for each vewer ca be computed parallel. Oce the vewer profle (hstory) has bee created, a large umber of attrbute ca be computed for each vewer. Some of these attrbutes are metoed below Base Attrbutes: These attrbutes are the default attrbutes preset the data extracted from the operatoal system. Examples of Base Attrbutes are: BROWSER OS DOMAIN IP BANNER CAMPAIGN SSP DATE_TIME CTR DOMAIN_CATEGORY DateTme Attrbutes: These attrbutes are derved from the tmestamp at whch the mpresso was served. Examples of such attrbutes are: DATE DAY_OF_WEEK HOUR_OF_DAY GeoLocato Attrbutes: These attrbutes are derved from the IP address of a vewer ad provde formato about the geographc locato of a user. Examples of such attrbutes are : COUNTRY LATITUDE LONGITUDE REGION CITY WEATHER: the geographc coordates (Lattude, Logtude) are used to derve the weather formato of the locato whch the mpresso s served. Cout Attrbutes: These attrbutes defe the umber of occurreces of a partcular evet per vewer, tll the gve mpresso was served. Examples of such attrbutes are: NOF_IM_THIS_AD : Number of tmes the vewer has see ths baer

11 Dmesoalty Reducto & Model Selecto for Clc Predcto NOF_IM_THIS_CAMPAIGN : Number of tmes the vewer has see ay baer belogg to ths campag NOF_IMPRESSIONS : Number of Impressos already served to the vewer NOF_CONVERSIONS : Number of coversos already regstered for the vewer NOF_RETARGETING_REQUESTS : Number of tmes the vewer has bee retargeted NOF_CLICKS : Number of clcs already regstered for the vewer NOF_CONVERSIONS_THIS_AD : Number of coversos already regstered for the vewer for ths baer NOF_VISITS_THIS_DOMAIN : Number of tmes the vewer has vsted ths doma NOF_BID_REQUESTS : Number of bd requests receved for the vewer NOF_VISITS_THIS_DOMAIN_CATEGORY : Number of tmes the vewer has vsted ths doma category NOF_CONVERSIONS_THIS_CAMPAIGN : Number of coversos already regstered for the vewer for ths campag NOF_CL_THIS_CAMPAIGN : Number of clcs already regstered for the vewer for ths campag NOF_CL_THIS_AD : Number of clcs already regstered for the vewer for ths baer NOF_BID_RESPONSES : Number of bd requests for the vewer that have bee respoded to Flag Attrbutes: These attrbutes dcate the occurrece of a evet the past for a vewer. Examples of such attrbutes are: HAS_CLICKED : A yes/o flags dcatg whether a vewer has already clced o a baer the past HAS_CONVERTED : A yes/o flags dcatg whether a coverso has bee regstered for the vewer the past Load Sce the ma obectve s to compute the clc probablty per mpresso, the load process maps the attrbute values computed per vewer ( the trasform step) to each mpresso served to the vewer thereby geeratg attrbute values per mpresso. The attrbute values per mpresso are the loaded to the statstcal aalyss evromet where varous mache learg algorthms are used to detfy the attrbutes/algorthm par that best predcts the lelhood of a vewer clcg o a dsplay advertsemet.

12 Dmesoalty Reducto & Model Selecto for Clc Predcto Setup of the Isghts System As descrbed Secto 2., due to the large volume of data that has to be stored ad processed by the Isghts System, a set of parallel ad dstrbuted computg techologes s requred. Ths secto descrbes s the dfferet techologes that have bee used settg up the Isghts System Fgure 5: Archtecture of the Isghts System

13 Dmesoalty Reducto & Model Selecto for Clc Predcto Data Storage The data model of the Isghts System (see Appedx C) s mplemeted usg MogoDB ad HIVE whch are descrbed ths secto. MogoDB s a NoSQL database whch stores data ey-value format ad offers a wde rage of features [2]. Some of the features of MogoDB whch mae t appealg as the data store of the Isghts System are Documet Storage: Sce most of the bg data sets geerated by the operatoal platform are ey-value format, the feature of MogoDB to store data as JSON documets maes t a good ft. The ey-value structure of the data model s show Fgure [8]. Idexg Support: The attrbute computato tas requres accessg documets (records) from large collectos of data. The dexg MogoDB allows low latecy access to documets ad therefore s expected to speed up the data trasformato process. Idexg the data model s show Fgure [8]. Horzotal Scalg: Sce the Isghts System has to store large collectos of data, ths data eeds to be splt ad stored as multple smaller sets, order to facltate quc retreval of data. Moreover, as the data sze eeps growg, t should be feasble to scale horzotally by dstrbutg the data over addtoal processg uts. MogoDB supports ths feature of horzotal scalg by allowg creato of shards ad addg them to a cluster as show Fgure [6] [22]. Oce a shard has bee added, the automatc shardg feature of MogoDB eables redstrbuto of data over ths ew (ad larger) processg cluster. The use of shard ey the data model s demostrated Fgure [8]. Queryg: MogoDB supports a large collecto of data retreval, aggregato ad data sert/update queres whch allows us to perform as lot of the ETL (extract, trasform, load) tass as descrbed Secto 2.2.2, va Mogo queres. Fgure 6: Archtecture of MogoDB Sharded Cluster for the Isghts System

14 Dmesoalty Reducto & Model Selecto for Clc Predcto 4 Whle MogoDB provdes a umber of features whch mae t a de-facto choce for data store o the Isghts System, t suffers from hgh latecy whe the umber of serts s hgh as t creates wrte locs o the collecto each shard. The performace of MogoDB worses for updates as the collecto sze gets larger [25].Therefore, MogoDB,though a good choce for storg ad processg mpressos data, t does ot wor for bd requests where the umber of bd requests receved daly s approxmately 00 tmes more tha the umber mpressos served daly. O the cotrary, Apache Hve s a powerful data warehousg applcato bult o top of Hadoop, whch eables access to data usg Hve QL, a laguage whch s very smlar to SQL[23][30]. As show Fgure [5], Hve oly requres the bd request fles ( so format) to be processed (va map-reduce obs as show Fgure [5]) ad coped to HDFS (Hadoop Dstrbuted Fle System). Hve provdes a database layer abstracto o ths fle (see Fgure [7]) by allowg the user to create tables ad map the colums of the table to each feld the fle. Data retreval/aggregato s doe va Hve QL, whch tur voes map-reduce obs (see Fgure [7]) to process the data, thereby allowg the computato of varous attrbutes from bd requests [30]. Fgure 7: Archtecture of Hve (Source: Hve- A Warehousg Soluto Over a Map-Reduce Framewor)

15 Dmesoalty Reducto & Model Selecto for Clc Predcto Data Processg The attrbute computato tas as outled Secto has a hgh computatoal complexty ad therefore requres parallel ad dstrbuted applcatos for the same. Hadoop s a ope source framewor for wrtg ad rug dstrbuted applcatos that process large amouts of data [4]. Of all the great features of Hadoop that has made t oe of the most wdely used Bg Data techologes, we ted to leverage ts salablty ad smplcty the Isghts System. Hadoop scales learly to hadle larger data by addg more processg uts to the cluster [4]. Moreover, Hadoop provdes a very smple abstracto for wrtg effcet parallel code, whle the booeepg of splttg/maagg put data ad the assgmet of computato to a processg ut s doe uder the hood [4].As show Fgure [0], a Hadoop setup cossts of the followg compoets Name Node: Hadoop employs a master/slave archtecture whch Name Node acts as the master of HDFS (Hadoop Dstrbuted Fle System) that drects the slave Data Node to perform low level I/O tass. The Name Node s the booeeper of HDFS; t eeps trac of how fles are broe dow to fle blocs, whch odes store whch blocs ad the overall health of the dstrbuted fle system. The Name Node s memory ad I/O tesve [4]. Data Node: The Data Node s resposble for readg or wrtg HDFS blocs to actual fles o the local fle system. Whe a user program wats to read or wrte a HDFS fle, the fle s broe to blocs ad the Name Node tells the user program whch Data Node does each bloc resde. The user program the drectly commucates wth the Data Node daemos to process the local fles correspodg to the blocs [4].The teracto of the Name Node ad Data Node s demostrated Fgure [0]. Secodary Name Node: The Secodary Name Node s a assstat daemo for motorg the state of the cluster HDFS. The SSN dffers from the Name Node that ths process does t receve or record ay realtme chages to HDFS. Istead, t commucates wth Name Node to tae sapshots of the HDFS metadata at tervals defed by the cluster cofgurato [4]. Job Tracer: As show Fgure [9], the Job Tracer acts as a laso betwee the user s program ad the Tas Tracer. Oce a code s submtted to the Hadoop cluster, the Job Tracer determes the executo pla by determg whch fles to process, assg Tas Tracers to dfferet tass, ad motors all tass that are rug o the Tas Tracers. Should a tas fal, the Job Tracer wll automatcally relauch the tas, possbly o a dfferet Tas Tracer, upto a predefed lmt of retres. There s oe Job Tracer daemo per cluster [4]. Tas Tracer: The Tas Tracer s resposble for maagg the executo of each mapreduce ob that the Job Tracer assgs to t, as show Fgure [9]. The Tas Tracer costatly commucates wth the Job Tracer. If the Job Tracer fals to receve a heartbeat from a Tas Tracer wth a specfed amout of tme, t wll assume that the

16 Dmesoalty Reducto & Model Selecto for Clc Predcto 6 Tas Tracer has crashed ad wll resubmt the correspodg tas to aother Tas Tracer the cluster [4]. Fgure 9: Iteracto betwee Job Tracer ad Tas Tracer (Source: Hadoop Acto by Chuc Lam) Fgure 0: Archtecture of Hadoop Cluster for the Isghts System (Source: Hadoop Acto by Chuc Lam) The structure of a typcal map-reduce ob s show Fgure [] where the master s the Job Tracer whch assgs map tas to each Tas Tracer the cluster. The Tas Tracers, the execute the map tas o the correspodg put splt ad geerate termedate fles owhch reduce tass are ru to create the fal output fle. Sce we are usg both MogoDB ad Hadoop o the Isghts System, ths allows us to leave the data aggregato (reducto) to MogoDB. Therefore, for the executo of a map-reduce program (data trasformato / attrbute computato) o the Isghts Platform (see Fgure [2] ), each map tas seds the output to be aggregate/stored to the Mogo router, whch tur dstrbutes the data across multple shards. Ths structure therefore allows us to further smplfy the map-reduce user program, where the user oly has to wrte map-obs ad embedd the Mogo sert/update query the

17 Dmesoalty Reducto & Model Selecto for Clc Predcto 7 map ob ad therefore elmates the eed to wrte reduce obs. Moreover, the output of a map-reduce tas s stored MogoDB (ad therfore avalable for further procesg by leveragg o MogoDB s rch queryg features) rather tha flat fles. Hadoop ca be cofgured to ru fully dstrbuted mode where the Data Node ad Tas Tracers ru o dfferet feret maches o a cluster or pseudo-dstrbuted mode where all Hadoop compoets ru o a sgle mache. Sce the Isghts System s setup o a sgle server wth multple CPU cores, we are rug Hadoop pseudo-dstrbuted dstrbuted mode. Fgure : Overvew of the executo of a map-reduce program (Source: Mg of Massve Datasets by Raarama,Lesovec ad Ullma) Fgure 2: Overvew of the executo of a map-reduce program o the Isghts System

18 Dmesoalty Reducto & Model Selecto for Clc Predcto 8 The data processg ad attrbute computato classes for the Isghts System have bee wrtte Pytho. The map-reduce obs that have bee wrtte pytho are executed o Hadoop by usg the Hadoop Streamg API, whch s a utlty that allows map-reduce obs to be wrtte laguages other tha Java[6].Appedx D gves a outle of all the classes that have bee wrtte for the Isghts System Data Aalyss Apart from beg the statstcal computg ad graphcs evromet, wth a large lbrary of pacages that support statstcal computg ad data vsualzato, R has a umber of mache learg pacages le glm (for Logstc Regresso), lar (for Naïve Bayes), e07 (for Support Vector Maches) ad pracma (for ROC aalyss) amog others whch are requred for the aalyss doe ths research [7][8][26].Moreover, R provdes pacages le rmogodb ad Rhve for coectg to ad retrevg data from Mogo DB ad Hve respectvely [27][28] Exteral API s As show Fgure [5], the Isghts system uses two exteral API. The frst API s the GeoLte Cty database of MaxMd whch eables us to determe the coutry, rego, cty, lattude ad logtude assocated wth IP addresses worldwde [3]. The secod API s the Local Weather API of World Weather Ole whch returs the curret weather for a gve lattude/logtude coordates [32]. Sce the Isghts System processes the mpressos data of the prevous day, the weather data has to be stored beforehad. Therefore a lst of lattude/logtude coordates s mataed alog wth the frequecy of mpressos from each coordate whch s updated whe the Isghts System processes mpressos. Therefore the weather formato s obtaed daly from the Local Weather API based o the geographcal coordates sorted by frequecy of mpressos descedg order Iterfaces A set of terfaces have bee setup for the Isghts System, whch allows for ease of access, motorg ad aalyss of data. The frst terface s a ob scheduler called Ctre, whch allows us to create, schedule ad motor all data processg tass Secto Please refer to [34] for detals. Smlarly, we have setup a terface for MogoDB called RocMogo whch allows us to access, update ad motor the state of data the Mogo cluster. Please refer to [35] for detals. Rstudo IDE servers as a users terface for R o the Isghts platform, thereby allowg data aalyss to be doe easly. Please refer to [36] for detals. Moreover, the status of map-reduce obs rug o the Tas Tracers ca be motored through [37] ad the health of the Name Node ad therefore the Hadoop Cluster ca be motored through [38].

19 Dmesoalty Reducto & Model Selecto for Clc Predcto 9 3. Dmesoalty Reducto & Model Selecto 3. The Data Mg Process Fgure 4: The DataMg Process A stadard Data Mg process cossts of steps as show Fgure [4].We brefly descrbe each step ths secto Data Collecto/Storage : The data collecto/storage step cludes extracto ad storage of bd requests, mpressos ad evets data as descrbed secto [2.2.] Preprocessg: The preprocessg step cludes computato of varous attrbutes as descrbed secto [2.2.2]. Dmesoalty Reducto: I the dmesoalty reducto stage, we detfy rrelevat ad redudat features that ca be excluded from the model. The varous feature selecto methods have bee explaed secto [3.2] Predcto: Durg the predcto stage of the data mg process, we tra dfferet learg models (Naïve Bayes, Logstc Regresso, Support Vector Maches) o the trag set ad the use the model to predct outcomes for a valdato set. Trag Set ad Valdato Set have bee descrbed secto [3.2]. The varous learg models have bee explaed secto [3.5],[3.6],[3.7] Model Selecto & Aalyss : I the model selecto & aalyss stage, we evaluate the accuracy of each model by ROC Aalyss ad compare the performace of models agast each other usg methods descrbed secto [3.0],[3.]

20 Dmesoalty Reducto & Model Selecto for Clc Predcto Notatos & Deftos We defe a vewer hstory, V V : V = V, V = φ, Z + ad + Z as the umber of mpressos show to vewer V. The set of attrbutes for vewer V s deoted by A, =,..., J, A A. Let the class label C {0,} deote o-clc/clc.therefore, each mpresso for V s deoted by x : =,...,, = I where x = ( x, x 2,..., xj ), =,..., I; =,..., J s a vector of dmeso J ad represets the attrbute values for a mpresso ad s therefore called a stace of the data set. Therefore, x s the attrbute value of attrbute A for mpresso. Moreover c deotes the oucome for mpresso. Let X I J be a I J matrx deotg the dataset where = I s the umber of mpressos ad J s the umber of attrbutes the dataset. Let ℵ = X C deote the space of labeled staces. Therefore the vector ℵ = ( x, x 2,..., xj, c ), =,..., I, x. A represets the attrbute values for mpresso ad ts outcome (clc/o-clc). ℵ ℵ deotes the trag set whle ℵ \ ℵ represets the valdato set. A classfer C : x c maps a mpresso to ts outcome. Trag the classfer C o the trag set ℵ mples detfyg a scorg fucto Whle Cross Valdato mples hypotheszg the class c such f : X C R () C ( x ) = argmax f ( x, c ) x X \ X, c C (2) c ad the comparg the predcted class of x wth the actual class. I a bary classfcato problem, the followg scearos ca occur Table []: Cofuso Matrx for a Bary Classfer Actual Class = Actual Class = 0 Predcted Class = True Postves False Postves Predcted Class =0 False Negatves True Negatves Whe the umber of o-clcs (egatve staces) >> umber of clcs (postve staces).e Clc Through Rate CTR = ( ) = 0 I c, the data set suffers from Class Sew. I such cases, the I classfer C does ot fd suffcet umber of postve examples to tra o ad as a result the scorg fucto f s accurate. Therefore the cross valdato step results large umber of false egatves whle the umber true postves are close to zero. O the other had, -order to esure that the classfer fds suffcet umber of postve staces to tra o wthout dsturbg the class proportos (CTR), the umber of mpressos, I the trag set ad the valdato set have to be large whch tur creases the cost of trag a classfer. Moreover, f the umber of attrbutes J s large, the classfer further suffers from the Curse of Dmesoalty.

21 Dmesoalty Reducto & Model Selecto for Clc Predcto Clc Predcto Process The clc predcto process volves creatg K dstct radom samples from the data set, splttg the mpressos ordered by date-tme each of these K samples to a trag set ad test set such that both the trag set ad the test set have equal umber of clcs. We the tra the classfer o each of the K trag sets ad cross valdate o the K valdato sets. Fally, we coduct statstcal sgfcace tests o the result of the K cross valdato steps to hypothesze o the performace of the classfer [44]. A formal represetato of the algorthm s gve below. Step : Dvde the space of labeled staces ℵ = X C to K dstct subsets ℵ = X C, ℵ 2 = X 2 C2,..., ℵ = X C such that ℵ s a I ( J + ) matrx. Step 2: Sort ℵ K Step 3: Fd l such that by DATE_TIME I l = : c = ℵ ℵ ℵ = 2 c K ℵ = ℵ ad = K ℵ = φ where each = Step 4: Therefore ℵ whch s a l ( J + ) matrx forms the trag set whle ℵ \ ℵ whch s a ( I l ) ( J + ) matrx forms the valdato set. Step 5: Tra classfer C o ℵ ad obta scorg fucto f as show equato [] Step 6: Cross-valdate the classfer C o ℵ \ ℵ ad perform ROC aalyss. Step 7: Costruct cofdece tervals o the performace of the classfer C For classfers le Naïve Bayes where the trag costs are low ad frequecy of clcs the trag set s a parameter of the model we tra the model o ℵ ad cross valdate o ℵ \ ℵ, K.However, for models that cur a hgh cost of trag le Logstc Regresso ad Support Vector Maches, we select a trag set from oe of the K dstct trag sets ad add mpressos that resulted clcs from the rest of the K trag sets to the selected trag set. Ths results the classfer fdg suffcet umber of postve staces the trag set order to detfy a correct scorg fucto. A formal represetato of the algorthm s gve ext. Step : Choose a ℵ ad add ℵ : c =, m m K m Step 2: Tra classfer C o the trag set created step Step 3: Cross-valdate the classfer C o ℵ \ ℵ ad perform ROC aalyss Step 4: Costruct cofdece tervals o the performace of the classfer C

22 Dmesoalty Reducto & Model Selecto for Clc Predcto DRMS Flow Dagram 3. 5 Naïve Bayes Algorthm The Naïve Bayes algorthm estmates the probablty that a mpresso wll result a clc, by assumg depedece of the attrbutes codtoed o the outcome (clc/o-clc). clc). Therefore the codtoal probablty ty of the attrbutes alog wth the pror s estmated from the trag set, whch s the used to estmate the posteror (clc probablty) for a mpresso the valdato set The Model Adscece ce uses a Naïve Bayes algorthm to compute the clc probablty of a mpresso. Naïve Bayes ca be defed as J p ( c = x ) = p ( c = ) p ( x c = ) Z = J Z = p ( c = 0) p ( x c = 0) + p ( c = ) p ( x c = ) = = J (3) (4)

23 Dmesoalty Reducto & Model Selecto for Clc Predcto 23 Where p( c = x ) : x X \ X s the lelhood of a clc, gve a mpresso havg attrbutes x = ( x x,..., x ), =,..., I; =,..., J [2]. The pror p( c = ), whch s estmated from the, 2 J trag setℵ, s defed as [5], #Clcs p( c = ) = ectr = ; p( c = 0) = p( c = ) (5) # Impressos Furthermore, the Wea Law of Large Numbers mples I c lm P = = I E > ε = 0 ε > 0, c {0,} I I = = I c (6) Whch suggests that as the umber of mpressos the trag set ℵ, gets large, the CTR.e p( c ) coverges probablty to the expected clc through rate.e ectr.moreover, the lelhood, p( x c ) s defed as the codtoal probablty that mpresso wll have attrbute value x gve that t s a clc or a o-clc Estmato of Lelhood The lelhood, as defed secto [3.5.] ca be estmated two ways. The frst method as demostrated equato [7] computes the proporto of occurrece of a attrbute value for a gve evet the trag set. I p x c x A c x c = ( ) =, {0,} I = c (7) The secod method, as show equato [8] s the erel desty estmato of attrbute A where h > 0 s a smoothg parameter called badwdth ad K(.) s the erel, a symmetrc but ot ecessarly postve fucto that tegrates to oe. The erel fucto K (.) determes the shape whle the badwdth h determes the wdth of ˆf [44]. I ˆ x x fh ( A ) = K : x ℵ, A A Ih h (8)

24 Dmesoalty Reducto & Model Selecto for Clc Predcto Estmato of Clc Probablty for New Attrbute Values The computato of the posteror probablty p( c = x ) x X \ X requres the computato of p( x c ) x X. A Naïve Bayes Classfer whe traed o the trag set ℵ results J matrces each of dmesos C x where each elemet represets the probablty p( x c ). Therefore, cases where there ca be may possble values for a attrbute A, there ca be cases where x X \ X but x X ad as a result, the classfer does ot retur the correspodg p( x c ) x X.Thus, for the computato of the posteror probablty, the followg algorthm has bee proposed Step : x X \ X, chec whether x C x. If TRUE, the Goto Step 2 else Goto Step 3 Step 2 : Retur p( x c ) ; STOP Step 3 : Retur x ; STOP 3. 6 Logstc Regresso The Logstc Regresso model estmates the clc probablty of a mpresso by frst fdg a separatg hyper plae betwee the clcs ad o-clcs ad the by mappg the hyper plae to the terval (0, ) by a logt fucto The Model The Logstc Regresso model s a bary classfer C that detfes a scorg fucto f by Formulatg a lear model from the trag set, ℵ ad Map the lear model to the terval (0,) by a logt fucto whch tur becomes the probablty of a postve outcome (clc). Let us deote by p = P( c = ), the probablty that a mpresso results a clc out of I mpressos. Therefore we defe a Beroull radom varable M f c = =. Therefore, the radom varable M 0 o.w deotes the umber of clcs I mpressos ad has a bomal dstrbuto as show equato [9]. I M = M ~ B( I, p) (9) =

25 Dmesoalty Reducto & Model Selecto for Clc Predcto 25 Now, we formulate the lear model as show equato [0] where X s a I ( J + ) matrx of I mpressos, J attrbutes ad a vector of s. Moreover, the weghts β = ( β0, β2,..., β J ) s a ( J + ) vector where β0 represets the tercept. The model results a I vectorη. + We map η R to p (0,) by the logt fucto gve equato [] η = X β (0) p η = g( p) = log ( p ) () Therefore, the computatoal problem s estmato of the weghts ˆ β from the trag setℵ whch s obtaed by maxmzg the log-lelhood of the caocal dstrbuto fucto of M wth respect to β.ths s doe by the Fsher Scorg Method. A alterate method for obtag ˆ β s by the Batch Gradet Descet Method though; the Fsher scorg method s faster. We do ot elaborate o these methods as they are bult to most data mg pacages [9] [20] Estmato of Clc Probablty The clc probablty for a mpresso x s estmated from equato [2] where x X \ X ad ˆ β s estmated usg the methods descrbed secto [3.6.2.]. p = p( c ; ˆ = x β ) = T (2) x ˆ + e β 3. 7 Support Vector Mache A support vector mache s a bary classfer that bulds a model by fdg a separatg hyper plae wth maxmal margs, that separates clcs from o clcs ad the uses the model to classfy whether a mpresso wll result a clc or ot The Model Support Vector Mache s a bary classfer C that detfes a scorg fucto f by Mappg a stace x ℵ to a hgher (maybe fte) dmesoal space [50] Fd a separatg hyper plae wth the maxmal marg the hgher dmesoal space [50] Ths follows from redefg the classfcato problem as the problem of fdg a hyper plae (as show equato [3]) wth sutable weghts, w such that the hyper plae ca separate staces of the two classes [44]. Therefore a stace x ca be mapped to a target class depedg o whch sde of the hyper plae t les o.

26 Dmesoalty Reducto & Model Selecto for Clc Predcto 26 T ( w x + w ) 0 for c = 0 T ( w x + w ) 0 for c = 0 0 (3) Separablty of stace to two classes results from the followg theorem Theorem 3. (Cover s Theorem): A complex patter-classfcato problem, cast a hghdmesoal space olearly, s more lely to be learly separable tha a low dmesoal space, provded that the space s ot desely populated. Support Vector Maches, further mprovse o ths cocept by fdg a hyper plae that has the largest dstace to the earby trag stace of ay class (the marg), sce larger the marg, the better the classfer wll geeralze to usee data. Therefore, we rewrte equato [3] as follows to corporate margs to the classfcato problem T ( w x + w ) + for c = 0 T ( w x + w ) for c = 0 0 (4) Therefore, f we defe y f c = 0 =, equato [4] ca be re-wrtte as f c = T y ( w x + w ) + (5) 0 The dstace from a separatg hyper plae s defed as T t w x + w0 w (6) T y ( w x + w0 ) Usg equato [4], equato [6] ca be rewrtte as. However, sce the clc w predcto problem suffers from class-sew we defe a soft-marg hyper plae by re-wrtg equato (5) as y ( w x + w ) ξ, ξ > 0 (7) T 0 Where ξ > 0 s a slac varable whch stores the devato from the marg. A devato from the marg ca be of two types (a) x les o the wrog sde of the hyper plae ad s therefore msclassfed or (b) x les o the rght sde of the hyper plae but wth the marg. If trag a SVM oℵ, results

27 Dmesoalty Reducto & Model Selecto for Clc Predcto 27 ξ = 0, the x has bee correctly classfed, f0 < ξ <, x has bee msclassfed whle ξ mples that x has bee msclassfed. Moreover, we troduce a pealty parameter, C > 0 whch pealzes x : ξ > 0 thereby leadg to lower geeralzato error. The staces x : ξ > 0 are called support vectors ad the umber of support vectors s a upper boud estmate for the expected umber of errors [44]. Fally the stace x s mapped to a hgher dmesoal space by a fuctoφ. Equato [8] represets the optmzato problem solved by support vector maches where w s the ormal vector to the hyper plae. A umber of popular methods le Sequetal Mmzato Optmzato exst for solvg equato [8]. We wll however ot elaborate o these methods sce they are readly avalable wth the LIBSVM lbrary ad other data mg pacages. The tme complexty of a SVM s m w, b, ξ 2 I T w w + C ξ = 3 O( I ) [50]. T subect to y ( w φ( x ) + b) ξ ; ξ 0 (8) RBF Kerel T Equato [8], whe represeted ts dual form, requres the computato of φ( x ) φ ( x ). However, the T er product φ( x ) φ ( x ) ca be replaced by a erel fucto K ( x, x ). Therefore, stead of havg to map two staces x ad x to a hgher dmesoal space ad the havg to do a dot product, the T erel fucto ca be drectly appled the orgal space. The substtuto of φ( x ) φ ( x ) wth K ( x, x ) s called erel trc. The symmetrc ad postve sem defte I I matrx of erel values s called a Gram matrx [44]. The radal bass fucto (RBF) erel, gve equato [9] s oe of the most wdely used erels that o-learly maps staces to a hgher dmesoal space, so that, ule the lear erel, t ca hadle the case whe the relato betwee class labels ad attrbutes s o lear. A mportat property of the RBF erel s 0 < K [50]. 2 ( γ ) K ( x, x ) = exp x x, γ > 0 (9) Parameter Estmato Support Vector Maches requre the estmato of the pealty parameter C ad the erel parameter γ. Sce these parameters are ot ow beforehad, the parameters eed to be estmated from the trag set,ℵ. The goal s to detfy ( C, γ ) such that the classfer ca predct class labels from the valdato set ℵ \ ℵ wth mmum msclassfcato errors. A umber of techques have bee suggested lterature to

28 Dmesoalty Reducto & Model Selecto for Clc Predcto 28 estmate the parameters. Oe such techque s combg a grd-search method wth cross valdato whch ca be easly parallelzed [50]. However, ths techque s oly feasble whe the umber of staces the data set s order of magtude of thousads [50]. For very large data sets, we have to rely o heurstc techques. Oe such method s to cosder expoetally growg sequeces of C adγ , for example C = 2,2,...,2 ; γ = 2, 2,...,2 [50]. Therefore, the heurstc method detfes a startg posto ad the detfes the drecto of the ext sequece by cross valdatg wth values o ether sde of the chose parameter values. Ths eables us to specfy a lower boud for the parameter. Sce the clc predcto problem suffers from class sew, we eed to have a hgh pealty for msclassfcato. Therefore the lower boud o C ca be gve asc >. It s mportat to ote that for very large value ofc, the trag tme wll also go up accordgly. Therefore, startg wthc = 2, C ca be creased expoetally tll the msclassfcato errors o the valdato set ℵ \ ℵ have falle to a desred level. The startg value of γ s geerally chose to be J where J s umber of features the trag set [8]. 3.8 Dmesoalty Reducto for Logstc Regresso ad SVMs Logstc Regresso ad SVMs requre each mpresso, x to be represeted as a vector of real umbers. Attrbutes that are defed o the omal scale, have to be coverted to umerc data. Ths requres ecodg the attrbute values to a x bary vector, where for a mpresso the attrbute s represeted a vector Y of dmeso x where f x x Y =, x 0 o. w. (20) The umber of attrbute values for attrbutes le DOMAIN, IP, COUNTRY ad BANNER ca be very large ad as a result the bary vector Y would be large as well. Sce trag a Logstc Regresso ad SVMs requre estmato of weghts for each of the J features, creasg the sze of J by troducg largey requres the models to estmate a larger weght vector. As a result, the umber of teratos requred by the estmato method creases as well, thereby creasg the computatoal complexty of the estmato method. Therefore, to prevet the model from sufferg from curse of dmesoalty, such attrbutes eed to be grouped to categores. However, a umber of attrbutes le IP ad DOMAIN caot be easly categorzed. We preset the followg algorthm such cases whch uses Naïve Bayes posteror probabltes to group attrbute values. Step : Compute the prors ad the lelhood for Step 2: Estmate the clc probablty for attrbute ' ' as ℵ usg equatos [5][7]ad[8]

29 Dmesoalty Reducto & Model Selecto for Clc Predcto 29 p( cl = al ) = p( cl = ) p( al cl = ) al x, l x, x ℵ (2) Z Step 3: Dvde p( c l ) to ' m ' groups ad map x ℵ to each of these groups It s mportat to ote that the choce of m wll determe how well the target algorthm (logstc regresso or support vector mache) performs. Therefore, search heurstc methods combed wth cross valdato as outled secto [3.7.3] ca be used to estmate m. 3.9 Trasformato of Features For attrbutes defed o the ordal or terval scale, large umerc rages cause umercal dffcultes the algorthms used to estmate the weghts logstc regresso ad support vector maches. Moreover attrbutes wth large umerc rages ted to domate attrbutes smaller umerc rages ad therefore would lead to correct feature selecto. It s mportat to ote that both trag ad testg data have to be scaled usg the same method [50]. A umber of pacages for SVM scale the umerc attrbute values to zero mea ad ut varace. However, for the logstc regresso model, the put data has to be scaled ad the gve to the algorthm to tra or predct. A umber of trasformatos le the Box-Cox Trasform ad the Iverse Hyperbolc Se Trasform have bee proposed. We preset a smplfed verso of the Box-Cox trasform equato [22] that scales umerc attrbutes for whch the bouds are uow, to the terval[0,]. g( x ) = log( x + ) (22).. O the other had, o-egatve attrbutes for whch the upper bouds are ow, we scale the data to the terval [0,] by computg the rato of the attrbute value to the upper boud. 3.0 Recever Operatg Characterstcs (ROC) Aalyss A recever operatg characterstcs graph (ROC) s a techque for vsualzg, orgazg ad selectg classfers based o ther performace [3]. I a ROC graph, the true postve rate True Postves tp rate Total Postves s plotted agast the false postve rate False Postves fp rate Total Negatves.We ca oly compute the true postve rate ad the false postve rate case of dscrete classfers for whch the output s a class label. As defed secto [3.2], the cross-valdato step results a class c {0,} such that C ( x ) = argmax f ( x, c ) x X \ X, c C where the scorg fucto f : X C R s the c c = probablty that a mpresso wll result a clc. However, as the CTR, 0, the estmated clc I probablty f ( x, c = ) s small whle the probablty of ot clcg a mpresso, f ( x, c = 0) = f ( x, c = ) s large. Ths results the classfer, msclassfyg clcs as o-clcs I

30 Dmesoalty Reducto & Model Selecto for Clc Predcto 30 ad as a result f >> tp.therefore, stead of classfyg a mpresso the valdato set as clc/o-clc, we ra the mpressos by ther estmated clc probablty.e R : f R > R f ( x, c = ) > f ( x, c = ).Ths mples that the mpressos wth the hghest ra have the hghest probablty ( probablty ( 0 ) of beg clced, whle the mpresso wth the lowest ra have the lowest ) of beg clced. Let c C \ C be the actual clc for mpresso' ' '. I case of perfect estmato of clc probabltes, we ca partto R to R R R R R R R ' ' < ad \ ad happes whe the model over fts the data. The cross valdato step wll therefore result cases there ' R c : = (false egatve) ad c c R R R ' ' R ad R \ R such that ' ' = 0, =, \. However, such a scearo oly ' \ : = 0 (false postve).it s therefore mportat R R c + Z such that It s therefore mportat to ote that umber of false postves wll deped o the parttog of R as decreasg the umber of false postves wll crease the umber of false egatves ad vce versa. Ths leads to the optmzato problem of fdg the optmal partto of R such that msclassfcato costs are mmzed wth dfferet costs assged to msclassfcato of clc as o-clc ad msclassfcato of o-clc as clc. However, ths s out of scope of ths research ad s a topc that should be vestgated future wor. To costruct the ROC, we tae the rug sum of the actual clcs ordered by the ra of the mpressos.e = c, =,.., I : R R + (23) Fgure 5: ROC Graphs.. The dotted les represet the estmated clc probabltes sorted ascedg order. The dashed les represet the actual outcomes sorted by the clc probabltes. Fgure [5] shows three dfferet schematcs of ROC where the dotted le represets the estmated clc probabltes for each mpresso whle the dashed le represets the rug sum of the clcs ordered by the ra of each mpresso. I a perfect predcto or cases of over fttg, the dashed le wll super mpose the dotted le. I (a), the ROC shows that there exsts some false egatves where a clc probablty, f ( x, c = ) 0 was estmated eve for a clc. However there exst a substatal proporto of clcs whch have bee predcted correctly. Ths s represeted by the part of the curve where the dotted le ad the dashed le overlap. O the other had (b) the classfer performs worse tha (a) as

31 Dmesoalty Reducto & Model Selecto for Clc Predcto 3 the false egatves are large.e the scorg fucto has estmated a clc probablty, f ( x, c = ) 0 for all clcs. I (c), the classfer performs eve worse where a eve lower clc probablty s estmated for each clc as s evdet from the steep rse clc cout for mpresso wth low ra. We ow tur our atteto to the Area uder a ROC Curve (AUC) whch s defed as probablty that a classfer wll ra a radomly chose postve stace hgher tha a radomly chose egatve stace ad ca be used as a measure of classfer performace [3]. From Fgure [5], the total area of the ROC space ca be defed as the area of the rectagle, where the legth s the umber of mpressos ad breadth the total umber of clcs the valdato set. The ROC curve occupes a certa proporto of ths area ad therefore the area occuped by the ROC curve ca be expressed as a percetage of the total area. Now, as descrbed the propertes of the ROC curve, lesser the proporto of area occuped by the ROC curve, lower s the umber of msclassfcato by the algorthm ad as a result better the performace. Therefore, the proporto of space occuped by the ROC curve Fgure [5.a] s lesser tha the proporto of space occuped Fgure [5.b] ad Fgure [5.c] ad therefore represets better performace. However, stadard formulatos of ROC curves ad AUC suggest that larger the AUC, better s the performace of the model. Therefore, order to stadardze our aalyss, we tae the complemet of the proporto of space occuped by the ROC curve. Formally, The AUC s estmated as I c 0 =,,.., : I AUC = = I R R I c = + (24) The defte tegral I c s approxmated usg the trapezodal rule [42]. Moreover, as descrbed 0 = secto [3.3], whe there are K dstct folds o whch we tra ad valdate the model, we wll have K ROC curves. 3. Estmato of Model Performace For trag set ℵ ad valdato set ℵ \ ℵ, let us deote by p, the probablty that the classfer classfes a clc correctly C ( x ) = x X \ X, c =. Therefore we defe a Beroull radom varable f ( x ) x X \ X, c M C = = =. We defe N = c as the total umber of clcs the 0 o.w valdato set. Therefore, = N M = M ~ B( N, p) ad we test whether p s less tha or equal to a predefed probablty p 0.We test the followg hypothess

32 Dmesoalty Reducto & Model Selecto for Clc Predcto 32 H : p p vs H : p > p (25) N N m N m Thus the bomal test reects the ull hypothess f P( M ) = p0 ( p0) < α where α m= m s the sgfcace level. Sce, M s the sum of depedet radom varables from the same dstrbuto, by the Cetral Lmt Theorem, for large N, M N s approxmately ormal wth mea p0 ad varace p 0 ( p 0 ). The M p 0 N p ( p ) 0 0 ~ N(0,). Furthermore, f we tra C o X ad valdate o X \ X, K tmes the we obta the probablty of correctly classfyg clcs from each of the K valdato sets.e p, =,..., K where p = N = N M K p ( p p) = 2 = ad p = ; S = K K K 2. Therefore, K p S ( p0) ~ t (26) We reect the alterate hypothess that the classfcato algorthm ca predct clcs wth probablty more tha p 0 whe [26] s less tha t α, [44]. K Smlarly, we ca compare the performace of two classfers, valdated o K valdato sets, usg a pared t-test [44]. We defe K p = p p as the dfferece the proporto of clcs classfed by two 2 K 2 p ( p p) = 2 = classfers. Therefore p = ; S = ad we test the ull hypothess H 0 K K : µ = 0 that the dstrbuto of the dfferece of proporto of correct classfcatos across two classfers has mea 0. We reect the ull hypothess f K p S les outsde the terval ( t /2,, t /2, ). 3.2 Feature Selecto The feature selecto problem s defed as the process of selectg a subset A A of orgal features accordg to certa crtera [7] [9]. Therefore, the feature selecto process results a ℵ = I ( K + ) : K J where J s the umber of features the orgal feature space. O the other had feature extracto s defed as the process of costructg a ew set of K dmesoal features space out of the orgal J features. Some of the wdely used feature extracto techques are Prcpal Compoet Aalyss (PCA) ad Lear Dscrmat Aalyss (LDA) [44].However feature extracto s out of scope for ths research ad we oly focus o the problem of feature selecto. α K α K

6.7 Network analysis. 6.7.1 Introduction. References - Network analysis. Topological analysis

6.7 Network analysis. 6.7.1 Introduction. References - Network analysis. Topological analysis 6.7 Network aalyss Le data that explctly store topologcal formato are called etwork data. Besdes spatal operatos, several methods of spatal aalyss are applcable to etwork data. Fgure: Network data Refereces

More information

Statistical Pattern Recognition (CE-725) Department of Computer Engineering Sharif University of Technology

Statistical Pattern Recognition (CE-725) Department of Computer Engineering Sharif University of Technology I The Name of God, The Compassoate, The ercful Name: Problems' eys Studet ID#:. Statstcal Patter Recogto (CE-725) Departmet of Computer Egeerg Sharf Uversty of Techology Fal Exam Soluto - Sprg 202 (50

More information

APPENDIX III THE ENVELOPE PROPERTY

APPENDIX III THE ENVELOPE PROPERTY Apped III APPENDIX III THE ENVELOPE PROPERTY Optmzato mposes a very strog structure o the problem cosdered Ths s the reaso why eoclasscal ecoomcs whch assumes optmzg behavour has bee the most successful

More information

IDENTIFICATION OF THE DYNAMICS OF THE GOOGLE S RANKING ALGORITHM. A. Khaki Sedigh, Mehdi Roudaki

IDENTIFICATION OF THE DYNAMICS OF THE GOOGLE S RANKING ALGORITHM. A. Khaki Sedigh, Mehdi Roudaki IDENIFICAION OF HE DYNAMICS OF HE GOOGLE S RANKING ALGORIHM A. Khak Sedgh, Mehd Roudak Cotrol Dvso, Departmet of Electrcal Egeerg, K.N.oos Uversty of echology P. O. Box: 16315-1355, ehra, Ira sedgh@eetd.ktu.ac.r,

More information

Simple Linear Regression

Simple Linear Regression Smple Lear Regresso Regresso equato a equato that descrbes the average relatoshp betwee a respose (depedet) ad a eplaator (depedet) varable. 6 8 Slope-tercept equato for a le m b (,6) slope. (,) 6 6 8

More information

Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering

Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering Moder Appled Scece October, 2009 Applcatos of Support Vector Mache Based o Boolea Kerel to Spam Flterg Shugag Lu & Keb Cu School of Computer scece ad techology, North Cha Electrc Power Uversty Hebe 071003,

More information

Average Price Ratios

Average Price Ratios Average Prce Ratos Morgstar Methodology Paper August 3, 2005 2005 Morgstar, Ic. All rghts reserved. The formato ths documet s the property of Morgstar, Ic. Reproducto or trascrpto by ay meas, whole or

More information

ANOVA Notes Page 1. Analysis of Variance for a One-Way Classification of Data

ANOVA Notes Page 1. Analysis of Variance for a One-Way Classification of Data ANOVA Notes Page Aalss of Varace for a Oe-Wa Classfcato of Data Cosder a sgle factor or treatmet doe at levels (e, there are,, 3, dfferet varatos o the prescrbed treatmet) Wth a gve treatmet level there

More information

Preprocess a planar map S. Given a query point p, report the face of S containing p. Goal: O(n)-size data structure that enables O(log n) query time.

Preprocess a planar map S. Given a query point p, report the face of S containing p. Goal: O(n)-size data structure that enables O(log n) query time. Computatoal Geometry Chapter 6 Pot Locato 1 Problem Defto Preprocess a plaar map S. Gve a query pot p, report the face of S cotag p. S Goal: O()-sze data structure that eables O(log ) query tme. C p E

More information

Numerical Methods with MS Excel

Numerical Methods with MS Excel TMME, vol4, o.1, p.84 Numercal Methods wth MS Excel M. El-Gebely & B. Yushau 1 Departmet of Mathematcal Sceces Kg Fahd Uversty of Petroleum & Merals. Dhahra, Saud Araba. Abstract: I ths ote we show how

More information

Online Appendix: Measured Aggregate Gains from International Trade

Online Appendix: Measured Aggregate Gains from International Trade Ole Appedx: Measured Aggregate Gas from Iteratoal Trade Arel Burste UCLA ad NBER Javer Cravo Uversty of Mchga March 3, 2014 I ths ole appedx we derve addtoal results dscussed the paper. I the frst secto,

More information

Models for Selecting an ERP System with Intuitionistic Trapezoidal Fuzzy Information

Models for Selecting an ERP System with Intuitionistic Trapezoidal Fuzzy Information JOURNAL OF SOFWARE, VOL 5, NO 3, MARCH 00 75 Models for Selectg a ERP System wth Itutostc rapezodal Fuzzy Iformato Guwu We, Ru L Departmet of Ecoomcs ad Maagemet, Chogqg Uversty of Arts ad Sceces, Yogchua,

More information

An IG-RS-SVM classifier for analyzing reviews of E-commerce product

An IG-RS-SVM classifier for analyzing reviews of E-commerce product Iteratoal Coferece o Iformato Techology ad Maagemet Iovato (ICITMI 205) A IG-RS-SVM classfer for aalyzg revews of E-commerce product Jaju Ye a, Hua Re b ad Hagxa Zhou c * College of Iformato Egeerg, Cha

More information

Chapter Eight. f : R R

Chapter Eight. f : R R Chapter Eght f : R R 8. Itroducto We shall ow tur our atteto to the very mportat specal case of fuctos that are real, or scalar, valued. These are sometmes called scalar felds. I the very, but mportat,

More information

The Analysis of Development of Insurance Contract Premiums of General Liability Insurance in the Business Insurance Risk

The Analysis of Development of Insurance Contract Premiums of General Liability Insurance in the Business Insurance Risk The Aalyss of Developmet of Isurace Cotract Premums of Geeral Lablty Isurace the Busess Isurace Rsk the Frame of the Czech Isurace Market 1998 011 Scetfc Coferece Jue, 10. - 14. 013 Pavla Kubová Departmet

More information

Report 52 Fixed Maturity EUR Industrial Bond Funds

Report 52 Fixed Maturity EUR Industrial Bond Funds Rep52, Computed & Prted: 17/06/2015 11:53 Report 52 Fxed Maturty EUR Idustral Bod Fuds From Dec 2008 to Dec 2014 31/12/2008 31 December 1999 31/12/2014 Bechmark Noe Defto of the frm ad geeral formato:

More information

An Approach to Evaluating the Computer Network Security with Hesitant Fuzzy Information

An Approach to Evaluating the Computer Network Security with Hesitant Fuzzy Information A Approach to Evaluatg the Computer Network Securty wth Hestat Fuzzy Iformato Jafeg Dog A Approach to Evaluatg the Computer Network Securty wth Hestat Fuzzy Iformato Jafeg Dog, Frst ad Correspodg Author

More information

CIS603 - Artificial Intelligence. Logistic regression. (some material adopted from notes by M. Hauskrecht) CIS603 - AI. Supervised learning

CIS603 - Artificial Intelligence. Logistic regression. (some material adopted from notes by M. Hauskrecht) CIS603 - AI. Supervised learning CIS63 - Artfcal Itellgece Logstc regresso Vasleos Megalookoomou some materal adopted from otes b M. Hauskrecht Supervsed learg Data: D { d d.. d} a set of eamples d < > s put vector ad s desred output

More information

Optimal multi-degree reduction of Bézier curves with constraints of endpoints continuity

Optimal multi-degree reduction of Bézier curves with constraints of endpoints continuity Computer Aded Geometrc Desg 19 (2002 365 377 wwwelsevercom/locate/comad Optmal mult-degree reducto of Bézer curves wth costrats of edpots cotuty Guo-Dog Che, Guo-J Wag State Key Laboratory of CAD&CG, Isttute

More information

Settlement Prediction by Spatial-temporal Random Process

Settlement Prediction by Spatial-temporal Random Process Safety, Relablty ad Rs of Structures, Ifrastructures ad Egeerg Systems Furuta, Fragopol & Shozua (eds Taylor & Fracs Group, Lodo, ISBN 978---77- Settlemet Predcto by Spatal-temporal Radom Process P. Rugbaapha

More information

SHAPIRO-WILK TEST FOR NORMALITY WITH KNOWN MEAN

SHAPIRO-WILK TEST FOR NORMALITY WITH KNOWN MEAN SHAPIRO-WILK TEST FOR NORMALITY WITH KNOWN MEAN Wojcech Zelńsk Departmet of Ecoometrcs ad Statstcs Warsaw Uversty of Lfe Sceces Nowoursyowska 66, -787 Warszawa e-mal: wojtekzelsk@statystykafo Zofa Hausz,

More information

Speeding up k-means Clustering by Bootstrap Averaging

Speeding up k-means Clustering by Bootstrap Averaging Speedg up -meas Clusterg by Bootstrap Averagg Ia Davdso ad Ashw Satyaarayaa Computer Scece Dept, SUNY Albay, NY, USA,. {davdso, ashw}@cs.albay.edu Abstract K-meas clusterg s oe of the most popular clusterg

More information

1. The Time Value of Money

1. The Time Value of Money Corporate Face [00-0345]. The Tme Value of Moey. Compoudg ad Dscoutg Captalzato (compoudg, fdg future values) s a process of movg a value forward tme. It yelds the future value gve the relevat compoudg

More information

An Effectiveness of Integrated Portfolio in Bancassurance

An Effectiveness of Integrated Portfolio in Bancassurance A Effectveess of Itegrated Portfolo Bacassurace Taea Karya Research Ceter for Facal Egeerg Isttute of Ecoomc Research Kyoto versty Sayouu Kyoto 606-850 Japa arya@eryoto-uacp Itroducto As s well ow the

More information

Fractal-Structured Karatsuba`s Algorithm for Binary Field Multiplication: FK

Fractal-Structured Karatsuba`s Algorithm for Binary Field Multiplication: FK Fractal-Structured Karatsuba`s Algorthm for Bary Feld Multplcato: FK *The authors are worg at the Isttute of Mathematcs The Academy of Sceces of DPR Korea. **Address : U Jog dstrct Kwahadog Number Pyogyag

More information

The simple linear Regression Model

The simple linear Regression Model The smple lear Regresso Model Correlato coeffcet s o-parametrc ad just dcates that two varables are assocated wth oe aother, but t does ot gve a deas of the kd of relatoshp. Regresso models help vestgatg

More information

T = 1/freq, T = 2/freq, T = i/freq, T = n (number of cash flows = freq n) are :

T = 1/freq, T = 2/freq, T = i/freq, T = n (number of cash flows = freq n) are : Bullets bods Let s descrbe frst a fxed rate bod wthout amortzg a more geeral way : Let s ote : C the aual fxed rate t s a percetage N the otoal freq ( 2 4 ) the umber of coupo per year R the redempto of

More information

Abraham Zaks. Technion I.I.T. Haifa ISRAEL. and. University of Haifa, Haifa ISRAEL. Abstract

Abraham Zaks. Technion I.I.T. Haifa ISRAEL. and. University of Haifa, Haifa ISRAEL. Abstract Preset Value of Autes Uder Radom Rates of Iterest By Abraham Zas Techo I.I.T. Hafa ISRAEL ad Uversty of Hafa, Hafa ISRAEL Abstract Some attempts were made to evaluate the future value (FV) of the expected

More information

Chapter 3. AMORTIZATION OF LOAN. SINKING FUNDS R =

Chapter 3. AMORTIZATION OF LOAN. SINKING FUNDS R = Chapter 3. AMORTIZATION OF LOAN. SINKING FUNDS Objectves of the Topc: Beg able to formalse ad solve practcal ad mathematcal problems, whch the subjects of loa amortsato ad maagemet of cumulatve fuds are

More information

The Gompertz-Makeham distribution. Fredrik Norström. Supervisor: Yuri Belyaev

The Gompertz-Makeham distribution. Fredrik Norström. Supervisor: Yuri Belyaev The Gompertz-Makeham dstrbuto by Fredrk Norström Master s thess Mathematcal Statstcs, Umeå Uversty, 997 Supervsor: Yur Belyaev Abstract Ths work s about the Gompertz-Makeham dstrbuto. The dstrbuto has

More information

Maintenance Scheduling of Distribution System with Optimal Economy and Reliability

Maintenance Scheduling of Distribution System with Optimal Economy and Reliability Egeerg, 203, 5, 4-8 http://dx.do.org/0.4236/eg.203.59b003 Publshed Ole September 203 (http://www.scrp.org/joural/eg) Mateace Schedulg of Dstrbuto System wth Optmal Ecoomy ad Relablty Syua Hog, Hafeg L,

More information

ECONOMIC CHOICE OF OPTIMUM FEEDER CABLE CONSIDERING RISK ANALYSIS. University of Brasilia (UnB) and The Brazilian Regulatory Agency (ANEEL), Brazil

ECONOMIC CHOICE OF OPTIMUM FEEDER CABLE CONSIDERING RISK ANALYSIS. University of Brasilia (UnB) and The Brazilian Regulatory Agency (ANEEL), Brazil ECONOMIC CHOICE OF OPTIMUM FEEDER CABE CONSIDERING RISK ANAYSIS I Camargo, F Fgueredo, M De Olvera Uversty of Brasla (UB) ad The Brazla Regulatory Agecy (ANEE), Brazl The choce of the approprate cable

More information

A Parallel Transmission Remote Backup System

A Parallel Transmission Remote Backup System 2012 2d Iteratoal Coferece o Idustral Techology ad Maagemet (ICITM 2012) IPCSIT vol 49 (2012) (2012) IACSIT Press, Sgapore DOI: 107763/IPCSIT2012V495 2 A Parallel Trasmsso Remote Backup System Che Yu College

More information

The Digital Signature Scheme MQQ-SIG

The Digital Signature Scheme MQQ-SIG The Dgtal Sgature Scheme MQQ-SIG Itellectual Property Statemet ad Techcal Descrpto Frst publshed: 10 October 2010, Last update: 20 December 2010 Dalo Glgorosk 1 ad Rue Stesmo Ødegård 2 ad Rue Erled Jese

More information

A Study of Unrelated Parallel-Machine Scheduling with Deteriorating Maintenance Activities to Minimize the Total Completion Time

A Study of Unrelated Parallel-Machine Scheduling with Deteriorating Maintenance Activities to Minimize the Total Completion Time Joural of Na Ka, Vol. 0, No., pp.5-9 (20) 5 A Study of Urelated Parallel-Mache Schedulg wth Deteroratg Mateace Actvtes to Mze the Total Copleto Te Suh-Jeq Yag, Ja-Yuar Guo, Hs-Tao Lee Departet of Idustral

More information

Green Master based on MapReduce Cluster

Green Master based on MapReduce Cluster Gree Master based o MapReduce Cluster Mg-Zh Wu, Yu-Chag L, We-Tsog Lee, Yu-Su L, Fog-Hao Lu Dept of Electrcal Egeerg Tamkag Uversty, Tawa, ROC Dept of Electrcal Egeerg Tamkag Uversty, Tawa, ROC Dept of

More information

A New Bayesian Network Method for Computing Bottom Event's Structural Importance Degree using Jointree

A New Bayesian Network Method for Computing Bottom Event's Structural Importance Degree using Jointree , pp.277-288 http://dx.do.org/10.14257/juesst.2015.8.1.25 A New Bayesa Network Method for Computg Bottom Evet's Structural Importace Degree usg Jotree Wag Yao ad Su Q School of Aeroautcs, Northwester Polytechcal

More information

Robust Realtime Face Recognition And Tracking System

Robust Realtime Face Recognition And Tracking System JCS& Vol. 9 No. October 9 Robust Realtme Face Recogto Ad rackg System Ka Che,Le Ju Zhao East Cha Uversty of Scece ad echology Emal:asa85@hotmal.com Abstract here s some very mportat meag the study of realtme

More information

Integrating Production Scheduling and Maintenance: Practical Implications

Integrating Production Scheduling and Maintenance: Practical Implications Proceedgs of the 2012 Iteratoal Coferece o Idustral Egeerg ad Operatos Maagemet Istabul, Turkey, uly 3 6, 2012 Itegratg Producto Schedulg ad Mateace: Practcal Implcatos Lath A. Hadd ad Umar M. Al-Turk

More information

Forecasting Trend and Stock Price with Adaptive Extended Kalman Filter Data Fusion

Forecasting Trend and Stock Price with Adaptive Extended Kalman Filter Data Fusion 2011 Iteratoal Coferece o Ecoomcs ad Face Research IPEDR vol.4 (2011 (2011 IACSIT Press, Sgapore Forecastg Tred ad Stoc Prce wth Adaptve Exteded alma Flter Data Fuso Betollah Abar Moghaddam Faculty of

More information

Credibility Premium Calculation in Motor Third-Party Liability Insurance

Credibility Premium Calculation in Motor Third-Party Liability Insurance Advaces Mathematcal ad Computatoal Methods Credblty remum Calculato Motor Thrd-arty Lablty Isurace BOHA LIA, JAA KUBAOVÁ epartmet of Mathematcs ad Quattatve Methods Uversty of ardubce Studetská 95, 53

More information

Projection model for Computer Network Security Evaluation with interval-valued intuitionistic fuzzy information. Qingxiang Li

Projection model for Computer Network Security Evaluation with interval-valued intuitionistic fuzzy information. Qingxiang Li Iteratoal Joural of Scece Vol No7 05 ISSN: 83-4890 Proecto model for Computer Network Securty Evaluato wth terval-valued tutostc fuzzy formato Qgxag L School of Software Egeerg Chogqg Uversty of rts ad

More information

CHAPTER 13. Simple Linear Regression LEARNING OBJECTIVES. USING STATISTICS @ Sunflowers Apparel

CHAPTER 13. Simple Linear Regression LEARNING OBJECTIVES. USING STATISTICS @ Sunflowers Apparel CHAPTER 3 Smple Lear Regresso USING STATISTICS @ Suflowers Apparel 3 TYPES OF REGRESSION MODELS 3 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION The Least-Squares Method Vsual Exploratos: Explorg Smple

More information

Cyber Journals: Multidisciplinary Journals in Science and Technology, Journal of Selected Areas in Telecommunications (JSAT), January Edition, 2011

Cyber Journals: Multidisciplinary Journals in Science and Technology, Journal of Selected Areas in Telecommunications (JSAT), January Edition, 2011 Cyber Jourals: Multdscplary Jourals cece ad Techology, Joural of elected Areas Telecommucatos (JAT), Jauary dto, 2011 A ovel rtual etwork Mappg Algorthm for Cost Mmzg ZHAG hu-l, QIU Xue-sog tate Key Laboratory

More information

Dynamic Two-phase Truncated Rayleigh Model for Release Date Prediction of Software

Dynamic Two-phase Truncated Rayleigh Model for Release Date Prediction of Software J. Software Egeerg & Applcatos 3 63-69 do:.436/jsea..367 Publshed Ole Jue (http://www.scrp.org/joural/jsea) Dyamc Two-phase Trucated Raylegh Model for Release Date Predcto of Software Lafe Qa Qgchua Yao

More information

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. x, where. = y - ˆ " 1

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. x, where. = y - ˆ  1 STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS Recall Assumpto E(Y x) η 0 + η x (lear codtoal mea fucto) Data (x, y ), (x 2, y 2 ),, (x, y ) Least squares estmator ˆ E (Y x) ˆ " 0 + ˆ " x, where ˆ

More information

CHAPTER 2. Time Value of Money 6-1

CHAPTER 2. Time Value of Money 6-1 CHAPTER 2 Tme Value of Moey 6- Tme Value of Moey (TVM) Tme Les Future value & Preset value Rates of retur Autes & Perpetutes Ueve cash Flow Streams Amortzato 6-2 Tme les 0 2 3 % CF 0 CF CF 2 CF 3 Show

More information

Banking (Early Repayment of Housing Loans) Order, 5762 2002 1

Banking (Early Repayment of Housing Loans) Order, 5762 2002 1 akg (Early Repaymet of Housg Loas) Order, 5762 2002 y vrtue of the power vested me uder Secto 3 of the akg Ordace 94 (hereafter, the Ordace ), followg cosultato wth the Commttee, ad wth the approval of

More information

A COMPARATIVE STUDY BETWEEN POLYCLASS AND MULTICLASS LANGUAGE MODELS

A COMPARATIVE STUDY BETWEEN POLYCLASS AND MULTICLASS LANGUAGE MODELS A COMPARATIVE STUDY BETWEEN POLYCLASS AND MULTICLASS LANGUAGE MODELS I Ztou, K Smaïl, S Delge, F Bmbot To cte ths verso: I Ztou, K Smaïl, S Delge, F Bmbot. A COMPARATIVE STUDY BETWEEN POLY- CLASS AND MULTICLASS

More information

Reinsurance and the distribution of term insurance claims

Reinsurance and the distribution of term insurance claims Resurace ad the dstrbuto of term surace clams By Rchard Bruyel FIAA, FNZSA Preseted to the NZ Socety of Actuares Coferece Queestow - November 006 1 1 Itroducto Ths paper vestgates the effect of resurace

More information

Statistical Intrusion Detector with Instance-Based Learning

Statistical Intrusion Detector with Instance-Based Learning Iformatca 5 (00) xxx yyy Statstcal Itruso Detector wth Istace-Based Learg Iva Verdo, Boja Nova Faulteta za eletroteho raualštvo Uverza v Marboru Smetaova 7, 000 Marbor, Sloveja va.verdo@sol.et eywords:

More information

Security Analysis of RAPP: An RFID Authentication Protocol based on Permutation

Security Analysis of RAPP: An RFID Authentication Protocol based on Permutation Securty Aalyss of RAPP: A RFID Authetcato Protocol based o Permutato Wag Shao-hu,,, Ha Zhje,, Lu Sujua,, Che Da-we, {College of Computer, Najg Uversty of Posts ad Telecommucatos, Najg 004, Cha Jagsu Hgh

More information

Research on Cloud Computing and Its Application in Big Data Processing of Railway Passenger Flow

Research on Cloud Computing and Its Application in Big Data Processing of Railway Passenger Flow 325 A publcato of CHEMICAL ENGINEERING TRANSACTIONS VOL. 46, 2015 Guest Edtors: Peyu Re, Yacag L, Hupg Sog Copyrght 2015, AIDIC Servz S.r.l., ISBN 978-88-95608-37-2; ISSN 2283-9216 The Itala Assocato of

More information

of the relationship between time and the value of money.

of the relationship between time and the value of money. TIME AND THE VALUE OF MONEY Most agrbusess maagers are famlar wth the terms compoudg, dscoutg, auty, ad captalzato. That s, most agrbusess maagers have a tutve uderstadg that each term mples some relatoshp

More information

ADAPTATION OF SHAPIRO-WILK TEST TO THE CASE OF KNOWN MEAN

ADAPTATION OF SHAPIRO-WILK TEST TO THE CASE OF KNOWN MEAN Colloquum Bometrcum 4 ADAPTATION OF SHAPIRO-WILK TEST TO THE CASE OF KNOWN MEAN Zofa Hausz, Joaa Tarasńska Departmet of Appled Mathematcs ad Computer Scece Uversty of Lfe Sceces Lubl Akademcka 3, -95 Lubl

More information

Proceedings of the 2010 Winter Simulation Conference B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yücesan, eds.

Proceedings of the 2010 Winter Simulation Conference B. Johansson, S. Jain, J. Montoya-Torres, J. Hugan, and E. Yücesan, eds. Proceedgs of the 21 Wter Smulato Coferece B. Johasso, S. Ja, J. Motoya-Torres, J. Huga, ad E. Yücesa, eds. EMPIRICAL METHODS OR TWO-ECHELON INVENTORY MANAGEMENT WITH SERVICE LEVEL CONSTRAINTS BASED ON

More information

Optimal Packetization Interval for VoIP Applications Over IEEE 802.16 Networks

Optimal Packetization Interval for VoIP Applications Over IEEE 802.16 Networks Optmal Packetzato Iterval for VoIP Applcatos Over IEEE 802.16 Networks Sheha Perera Harsha Srsea Krzysztof Pawlkowsk Departmet of Electrcal & Computer Egeerg Uversty of Caterbury New Zealad sheha@elec.caterbury.ac.z

More information

AP Statistics 2006 Free-Response Questions Form B

AP Statistics 2006 Free-Response Questions Form B AP Statstcs 006 Free-Respose Questos Form B The College Board: Coectg Studets to College Success The College Board s a ot-for-proft membershp assocato whose msso s to coect studets to college success ad

More information

Three Dimensional Interpolation of Video Signals

Three Dimensional Interpolation of Video Signals Three Dmesoal Iterpolato of Vdeo Sgals Elham Shahfard March 0 th 006 Outle A Bref reve of prevous tals Dgtal Iterpolato Bascs Upsamplg D Flter Desg Issues Ifte Impulse Respose Fte Impulse Respose Desged

More information

A Bayesian Networks in Intrusion Detection Systems

A Bayesian Networks in Intrusion Detection Systems Joural of Computer Scece 3 (5: 59-65, 007 ISSN 549-3636 007 Scece Publcatos A Bayesa Networs Itruso Detecto Systems M. Mehd, S. Zar, A. Aou ad M. Besebt Electrocs Departmet, Uversty of Blda, Algera Abstract:

More information

An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques

An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques Proceedgs of the 2007 IEEE Workshop o Iformato Assurace Uted tates Mltary Academy, West Pot, Y 20-22 Jue 2007 A Evaluato of aïve Bayesa At-pam Flterg Techques Vkas P. Deshpade, Robert F. Erbacher, ad Chrs

More information

On Error Detection with Block Codes

On Error Detection with Block Codes BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 9, No 3 Sofa 2009 O Error Detecto wth Block Codes Rostza Doduekova Chalmers Uversty of Techology ad the Uversty of Gotheburg,

More information

Efficient Traceback of DoS Attacks using Small Worlds in MANET

Efficient Traceback of DoS Attacks using Small Worlds in MANET Effcet Traceback of DoS Attacks usg Small Worlds MANET Yog Km, Vshal Sakhla, Ahmed Helmy Departmet. of Electrcal Egeerg, Uversty of Souther Calfora, U.S.A {yogkm, sakhla, helmy}@ceg.usc.edu Abstract Moble

More information

Report 19 Euroland Corporate Bonds

Report 19 Euroland Corporate Bonds Rep19, Computed & Prted: 17/06/2015 11:38 Report 19 Eurolad Corporate Bods From Dec 1999 to Dec 2014 31/12/1999 31 December 1999 31/12/2014 Bechmark 100% IBOXX Euro Corp All Mats. TR Defto of the frm ad

More information

A DISTRIBUTED REPUTATION BROKER FRAMEWORK FOR WEB SERVICE APPLICATIONS

A DISTRIBUTED REPUTATION BROKER FRAMEWORK FOR WEB SERVICE APPLICATIONS L et al.: A Dstrbuted Reputato Broker Framework for Web Servce Applcatos A DISTRIBUTED REPUTATION BROKER FRAMEWORK FOR WEB SERVICE APPLICATIONS Kwe-Jay L Departmet of Electrcal Egeerg ad Computer Scece

More information

Dynamic Provisioning Modeling for Virtualized Multi-tier Applications in Cloud Data Center

Dynamic Provisioning Modeling for Virtualized Multi-tier Applications in Cloud Data Center 200 IEEE 3rd Iteratoal Coferece o Cloud Computg Dyamc Provsog Modelg for Vrtualzed Mult-ter Applcatos Cloud Data Ceter Jg B 3 Zhlag Zhu 2 Ruxog Ta 3 Qgbo Wag 3 School of Iformato Scece ad Egeerg College

More information

Web Service Composition Optimization Based on Improved Artificial Bee Colony Algorithm

Web Service Composition Optimization Based on Improved Artificial Bee Colony Algorithm JOURNAL OF NETWORKS, VOL. 8, NO. 9, SEPTEMBER 2013 2143 Web Servce Composto Optmzato Based o Improved Artfcal Bee Coloy Algorthm Ju He The key laboratory, The Academy of Equpmet, Beg, Cha Emal: heu0123@sa.com

More information

Agent-based modeling and simulation of multiproject

Agent-based modeling and simulation of multiproject Aget-based modelg ad smulato of multproject schedulg José Alberto Araúzo, Javer Pajares, Adolfo Lopez- Paredes Socal Systems Egeerg Cetre (INSISOC) Uversty of Valladold Valladold (Spa) {arauzo,pajares,adolfo}ssoc.es

More information

Bayesian Network Representation

Bayesian Network Representation Readgs: K&F 3., 3.2, 3.3, 3.4. Bayesa Network Represetato Lecture 2 Mar 30, 20 CSE 55, Statstcal Methods, Sprg 20 Istructor: Su-I Lee Uversty of Washgto, Seattle Last tme & today Last tme Probablty theory

More information

10.5 Future Value and Present Value of a General Annuity Due

10.5 Future Value and Present Value of a General Annuity Due Chapter 10 Autes 371 5. Thomas leases a car worth $4,000 at.99% compouded mothly. He agrees to make 36 lease paymets of $330 each at the begg of every moth. What s the buyout prce (resdual value of the

More information

Classic Problems at a Glance using the TVM Solver

Classic Problems at a Glance using the TVM Solver C H A P T E R 2 Classc Problems at a Glace usg the TVM Solver The table below llustrates the most commo types of classc face problems. The formulas are gve for each calculato. A bref troducto to usg the

More information

n. We know that the sum of squares of p independent standard normal variables has a chi square distribution with p degrees of freedom.

n. We know that the sum of squares of p independent standard normal variables has a chi square distribution with p degrees of freedom. UMEÅ UNIVERSITET Matematsk-statstska sttutoe Multvarat dataaalys för tekologer MSTB0 PA TENTAMEN 004-0-9 LÖSNINGSFÖRSLAG TILL TENTAMEN I MATEMATISK STATISTIK Multvarat dataaalys för tekologer B, 5 poäg.

More information

A particle swarm optimization to vehicle routing problem with fuzzy demands

A particle swarm optimization to vehicle routing problem with fuzzy demands A partcle swarm optmzato to vehcle routg problem wth fuzzy demads Yag Peg, Ye-me Qa A partcle swarm optmzato to vehcle routg problem wth fuzzy demads Yag Peg 1,Ye-me Qa 1 School of computer ad formato

More information

DIGITAL AUDIO WATERMARKING: SURVEY

DIGITAL AUDIO WATERMARKING: SURVEY DIGITAL AUDIO WATERMARKING: SURVEY MIKDAM A. T. ALSALAMI * MARWAN M. AL-AKAIDI ** * Computer Scece Dept. Zara Prvate Uversty / Jorda ** School of Egeerg ad Techology - De Motfort Uversty / UK Abstract:

More information

Using Data Mining Techniques to Predict Product Quality from Physicochemical Data

Using Data Mining Techniques to Predict Product Quality from Physicochemical Data Usg Data Mg Techques to Predct Product Qualty from Physcochemcal Data A. Nachev 1, M. Hoga 1 1 Busess Iformato Systems, Cares Busess School, NUI, Galway, Irelad Abstract - Product qualty certfcato s sometmes

More information

where p is the centroid of the neighbors of p. Consider the eigenvector problem

where p is the centroid of the neighbors of p. Consider the eigenvector problem Vrtual avgato of teror structures by ldar Yogja X a, Xaolg L a, Ye Dua a, Norbert Maerz b a Uversty of Mssour at Columba b Mssour Uversty of Scece ad Techology ABSTRACT I ths project, we propose to develop

More information

ROULETTE-TOURNAMENT SELECTION FOR SHRIMP DIET FORMULATION PROBLEM

ROULETTE-TOURNAMENT SELECTION FOR SHRIMP DIET FORMULATION PROBLEM 28-30 August, 2013 Sarawak, Malaysa. Uverst Utara Malaysa (http://www.uum.edu.my ) ROULETTE-TOURNAMENT SELECTION FOR SHRIMP DIET FORMULATION PROBLEM Rosshary Abd. Rahma 1 ad Razam Raml 2 1,2 Uverst Utara

More information

DECISION MAKING WITH THE OWA OPERATOR IN SPORT MANAGEMENT

DECISION MAKING WITH THE OWA OPERATOR IN SPORT MANAGEMENT ESTYLF08, Cuecas Meras (Meres - Lagreo), 7-9 de Septembre de 2008 DECISION MAKING WITH THE OWA OPERATOR IN SPORT MANAGEMENT José M. Mergó Aa M. Gl-Lafuete Departmet of Busess Admstrato, Uversty of Barceloa

More information

The impact of service-oriented architecture on the scheduling algorithm in cloud computing

The impact of service-oriented architecture on the scheduling algorithm in cloud computing Iteratoal Research Joural of Appled ad Basc Sceces 2015 Avalable ole at www.rjabs.com ISSN 2251-838X / Vol, 9 (3): 387-392 Scece Explorer Publcatos The mpact of servce-oreted archtecture o the schedulg

More information

The analysis of annuities relies on the formula for geometric sums: r k = rn+1 1 r 1. (2.1) k=0

The analysis of annuities relies on the formula for geometric sums: r k = rn+1 1 r 1. (2.1) k=0 Chapter 2 Autes ad loas A auty s a sequece of paymets wth fxed frequecy. The term auty orgally referred to aual paymets (hece the ame), but t s ow also used for paymets wth ay frequecy. Autes appear may

More information

Study on prediction of network security situation based on fuzzy neutral network

Study on prediction of network security situation based on fuzzy neutral network Avalable ole www.ocpr.com Joural of Chemcal ad Pharmaceutcal Research, 04, 6(6):00-06 Research Artcle ISS : 0975-7384 CODE(USA) : JCPRC5 Study o predcto of etwork securty stuato based o fuzzy eutral etwork

More information

Algorithm Optimization of Resources Scheduling Based on Cloud Computing

Algorithm Optimization of Resources Scheduling Based on Cloud Computing JOURNAL OF MULTIMEDIA, VOL. 9, NO. 7, JULY 014 977 Algorm Optmzato of Resources Schedulg Based o Cloud Computg Zhogl Lu, Hagu Zhou, Sha Fu, ad Chaoqu Lu Departmet of Iformato Maagemet, Hua Uversty of Face

More information

A Security-Oriented Task Scheduler for Heterogeneous Distributed Systems

A Security-Oriented Task Scheduler for Heterogeneous Distributed Systems A Securty-Oreted Tas Scheduler for Heterogeeous Dstrbuted Systems Tao Xe 1 ad Xao Q 2 1 Departmet of Computer Scece, Sa Dego State Uversty, Sa Dego, CA 92182, USA xe@cs.sdsu.edu 2 Departmet of Computer

More information

Fault Tree Analysis of Software Reliability Allocation

Fault Tree Analysis of Software Reliability Allocation Fault Tree Aalyss of Software Relablty Allocato Jawe XIANG, Kokch FUTATSUGI School of Iformato Scece, Japa Advaced Isttute of Scece ad Techology - Asahda, Tatsuokuch, Ishkawa, 92-292 Japa ad Yaxag HE Computer

More information

CH. V ME256 STATICS Center of Gravity, Centroid, and Moment of Inertia CENTER OF GRAVITY AND CENTROID

CH. V ME256 STATICS Center of Gravity, Centroid, and Moment of Inertia CENTER OF GRAVITY AND CENTROID CH. ME56 STTICS Ceter of Gravt, Cetrod, ad Momet of Ierta CENTE OF GITY ND CENTOID 5. CENTE OF GITY ND CENTE OF MSS FO SYSTEM OF PTICES Ceter of Gravt. The ceter of gravt G s a pot whch locates the resultat

More information

Regression Analysis. 1. Introduction

Regression Analysis. 1. Introduction . Itroducto Regresso aalyss s a statstcal methodology that utlzes the relato betwee two or more quattatve varables so that oe varable ca be predcted from the other, or others. Ths methodology s wdely used

More information

Load Balancing Control for Parallel Systems

Load Balancing Control for Parallel Systems Proc IEEE Med Symposum o New drectos Cotrol ad Automato, Chaa (Grèce),994, pp66-73 Load Balacg Cotrol for Parallel Systems Jea-Claude Heet LAAS-CNRS, 7 aveue du Coloel Roche, 3077 Toulouse, Frace E-mal

More information

Performance Attribution. Methodology Overview

Performance Attribution. Methodology Overview erformace Attrbuto Methodology Overvew Faba SUAREZ March 2004 erformace Attrbuto Methodology 1.1 Itroducto erformace Attrbuto s a set of techques that performace aalysts use to expla why a portfolo's performace

More information

Discrete-Event Simulation of Network Systems Using Distributed Object Computing

Discrete-Event Simulation of Network Systems Using Distributed Object Computing Dscrete-Evet Smulato of Network Systems Usg Dstrbuted Object Computg Welog Hu Arzoa Ceter for Itegratve M&S Computer Scece & Egeerg Dept. Fulto School of Egeerg Arzoa State Uversty, Tempe, Arzoa, 85281-8809

More information

A particle Swarm Optimization-based Framework for Agile Software Effort Estimation

A particle Swarm Optimization-based Framework for Agile Software Effort Estimation The Iteratoal Joural Of Egeerg Ad Scece (IJES) olume 3 Issue 6 Pages 30-36 204 ISSN (e): 239 83 ISSN (p): 239 805 A partcle Swarm Optmzato-based Framework for Agle Software Effort Estmato Maga I, & 2 Blamah

More information

RUSSIAN ROULETTE AND PARTICLE SPLITTING

RUSSIAN ROULETTE AND PARTICLE SPLITTING RUSSAN ROULETTE AND PARTCLE SPLTTNG M. Ragheb 3/7/203 NTRODUCTON To stuatos are ecoutered partcle trasport smulatos:. a multplyg medum, a partcle such as a eutro a cosmc ray partcle or a photo may geerate

More information

VIDEO REPLICA PLACEMENT STRATEGY FOR STORAGE CLOUD-BASED CDN

VIDEO REPLICA PLACEMENT STRATEGY FOR STORAGE CLOUD-BASED CDN Joural of Theoretcal ad Appled Iformato Techology 31 st Jauary 214. Vol. 59 No.3 25-214 JATIT & S. All rghts reserved. ISSN: 1992-8645 www.att.org E-ISSN: 1817-3195 VIDEO REPICA PACEMENT STRATEGY FOR STORAGE

More information

Optimal replacement and overhaul decisions with imperfect maintenance and warranty contracts

Optimal replacement and overhaul decisions with imperfect maintenance and warranty contracts Optmal replacemet ad overhaul decsos wth mperfect mateace ad warraty cotracts R. Pascual Departmet of Mechacal Egeerg, Uversdad de Chle, Caslla 2777, Satago, Chle Phoe: +56-2-6784591 Fax:+56-2-689657 rpascual@g.uchle.cl

More information

MDM 4U PRACTICE EXAMINATION

MDM 4U PRACTICE EXAMINATION MDM 4U RCTICE EXMINTION Ths s a ractce eam. It does ot cover all the materal ths course ad should ot be the oly revew that you do rearato for your fal eam. Your eam may cota questos that do ot aear o ths

More information

A Smart Machine Vision System for PCB Inspection

A Smart Machine Vision System for PCB Inspection A Smart Mache Vso System for PCB Ispecto Te Q Che, JaX Zhag, YouNg Zhou ad Y Lu Murphey Please address all correspodece to Departmet of Electrcal ad Computer Egeerg Uversty of Mchga - Dearbor, Dearbor,

More information

Using Phase Swapping to Solve Load Phase Balancing by ADSCHNN in LV Distribution Network

Using Phase Swapping to Solve Load Phase Balancing by ADSCHNN in LV Distribution Network Iteratoal Joural of Cotrol ad Automato Vol.7, No.7 (204), pp.-4 http://dx.do.org/0.4257/jca.204.7.7.0 Usg Phase Swappg to Solve Load Phase Balacg by ADSCHNN LV Dstrbuto Network Chu-guo Fe ad Ru Wag College

More information

How To Balance Load On A Weght-Based Metadata Server Cluster

How To Balance Load On A Weght-Based Metadata Server Cluster WLBS: A Weght-based Metadata Server Cluster Load Balacg Strategy J-L Zhag, We Qa, Xag-Hua Xu *, Ja Wa, Yu-Yu Y, Yog-Ja Re School of Computer Scece ad Techology Hagzhou Daz Uversty, Cha * Correspodg author:xhxu@hdu.edu.c

More information

Common p-belief: The General Case

Common p-belief: The General Case GAMES AND ECONOMIC BEHAVIOR 8, 738 997 ARTICLE NO. GA97053 Commo p-belef: The Geeral Case Atsush Kaj* ad Stephe Morrs Departmet of Ecoomcs, Uersty of Pesylaa Receved February, 995 We develop belef operators

More information

Software Aging Prediction based on Extreme Learning Machine

Software Aging Prediction based on Extreme Learning Machine TELKOMNIKA, Vol.11, No.11, November 2013, pp. 6547~6555 e-issn: 2087-278X 6547 Software Agg Predcto based o Extreme Learg Mache Xaozh Du 1, Hum Lu* 2, Gag Lu 2 1 School of Software Egeerg, X a Jaotog Uversty,

More information

Time Series Forecasting by Using Hybrid. Models for Monthly Streamflow Data

Time Series Forecasting by Using Hybrid. Models for Monthly Streamflow Data Appled Mathematcal Sceces, Vol. 9, 215, o. 57, 289-2829 HIKARI Ltd, www.m-hkar.com http://dx.do.org/1.12988/ams.215.52164 Tme Seres Forecastg by Usg Hybrd Models for Mothly Streamflow Data Sraj Muhammed

More information