Scalable Biomedical Named Entity Recognition: Investigation of a Database-Supported SVM Approach

Mona Soliman Habib* and Jugal Kalita
Department of Computer Science, University of Colorado, 1420 Austin Bluffs Pkwy, Colorado Springs, CO 80917 USA
* Contact author: mhabib@eas.uccs.edu

Abstract

This paper explores the scalability issues associated with solving the Named Entity Recognition (NER) problem using Support Vector Machines (SVM) and high-dimensional features and presents two implementations to address these issues. The NER domain chosen for these experiments is the biomedical publications domain, especially selected due to its importance and inherent challenges. The performance results of a set of experiments conducted using existing binary and multi-class SVM with increasing training data sizes are examined and compared to results obtained using our new implementations. Our baseline machine learning approach eliminates prior language or domain-specific knowledge and achieves good out-of-the-box accuracy measures that are comparable to those obtained using more complex approaches. The training time of multi-class SVM is reduced by several orders of magnitude, which would make support vector machines a more viable and practical machine learning solution for real-world problems with large datasets. The first implementation, SVM-PerfMulti, is a new instantiation of SVM-Struct v3.0 built as a standalone C executable. The second implementation, SVM-MultiDB, is an embedded database solution for both binary and multi-class SVM, built as a server-side extension of PostgreSQL.

Index Terms

Named entity recognition, support vector machines, database extension, bioinformatics.

I. INTRODUCTION

Named entity recognition (NER) is one of the important tasks in information extraction, which involves the identification and classification of words or sequences of words denoting a concept or entity. Examples of named entities in general text are names of persons, locations, or organizations. Domain-specific named entities are those terms or phrases that denote concepts relevant to one particular domain.
For example, protein and gene names are named entities which are of interest to the domain of molecular biology and medicine. The massive growth of textual information available in the literature and on the Web necessitates the automation of identification and management of named entities in text.

The task of identifying named entities in a particular language is often accomplished by incorporating knowledge about the language taxonomy in the method used. In the English language, such knowledge may include capitalization of proper names, known titles, common prefixes or suffixes, part of speech tagging, and/or identification of noun phrases in text. Techniques that rely on language-specific knowledge may not be suitable for porting to other languages. Moreover, the composition of named entities in literature pertaining to specific domains follows different rules in each, which may or may not benefit from those relevant to general NER.

In previous work [8], a simple architecture that eliminates language and domain-specific knowledge from the named entity recognition process is applied to the English biomedical entity recognition task, as a baseline for other languages and domains. Biomedical NER remains a challenging task due to growing nomenclature, ambiguity in the left boundary of entities caused by descriptive naming, difficulty of manually annotating large sets of training data, and strong overlap among different entities, to cite a few of the NER challenges in this domain. The approach used reduces the pre- and post-processing of the textual data to a minimum and capitalizes on SVM's strong generalization ability to classify the named entities. The accuracy measures achieved are comparable to those obtained using more complex techniques, which encourages us to explore ways to improve the scalability of multi-class support vector machines.

In this paper, we present a new instantiation of SVM-Struct v3.0 [13] that extends the improved binary SVM algorithm SVM-Perf [14] with a multi-class cutting plane algorithm that reduces the multi-class training time by several orders of magnitude.
We refer to the new multi-class implementation as SVM-PerfMulti. We also present a new database-supported implementation that incorporates the learning and optimization algorithms of both SVM-Perf and SVM-PerfMulti. For the sake of simplicity, we will refer to the binary classification component of the database implementation as SVM-PerfDB and will refer to the multi-class component as SVM-MultiDB. Finally, the results of a set of scalability experiments using existing and new SVM solutions are reported. These experiments use binary and multi-class SVM with a large set of real-world data from the biomedical literature.

In Section II, the theory of binary and multi-class support vector machines is briefly introduced. Section III describes the experiment design and summarizes the results of a baseline experiment conducted during the previous work [8] in order to assess the feasibility of our language and domain-independent machine learning NER approach using SVM and high-dimensional features. The baseline experiment design reduces pre-processing to feature extraction and eliminates the use of prior language or domain knowledge. The training time and performance results of the baseline experiment are vastly improved using the new SVM-PerfMulti and SVM-MultiDB implementations, which achieve good out-of-the-box accuracy and performance measures. We briefly describe the new SVM-Perf structural SVM formulation and the cutting plane algorithm of SVM-PerfMulti in Section IV. The database architecture and schema of SVM-PerfDB and SVM-MultiDB are presented in Section V. Finally, we report the results of several sets of single-class and multi-class scalability tests using existing and new SVM implementations and increasing training data size in Section VI.

II. SUPPORT VECTOR MACHINES

The Support Vector Machine (SVM) is a powerful machine learning tool based on firm statistical and mathematical foundations concerning generalization and optimization theory. SVM is based on Vapnik's statistical learning theory [28] and falls at the intersection of kernel methods and maximum margin classifiers. Support vector machines have been successfully applied to many real-world problems such as face detection, intrusion detection, handwriting recognition, information extraction, and others.

The Support Vector Machine is an attractive method due to its high generalization capability and its ability to handle high-dimensional input data. Compared to neural networks or decision trees, SVM does not suffer from the local minima problem, it has fewer learning parameters to select, and it produces stable and reproducible results. However, SVM suffers from slow training, especially with non-linear kernels and with large input data size. Support vector machines are primarily binary classifiers. Extensions to multi-class problems are most often performed by combining several binary machines in order to produce the final multi-classification results. The more difficult problem of training one SVM to classify all classes uses much more complex optimization algorithms and is much slower to train than binary classifiers.

A. Binary Support Vector Classification

Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not. Many applications take advantage of binary classification tasks, where the answer to some question is either a yes or no. The mathematical foundation of Support Vector Machines and the underlying Vapnik-Chervonenkis dimension (VC Dimension) is described in detail in the literature covering the statistical learning theory [2, 3, 12, 28] and many other sources.

The main objective of support vector machines is to find the optimal hyperplane separating positive and negative examples by maximizing the margin between the two classes. In mathematical terms, the problem is to find $f(x) = w \cdot x + b$ with maximal margin, such that:

$$y_i (w \cdot x_i + b) = 1 \quad \text{for data points that are support vectors}$$
$$y_i (w \cdot x_i + b) > 1 \quad \text{for other data points}$$

Assuming a linearly separable dataset, the task of learning the coefficients $w$ and $b$ of the support vector machine $f(x) = w \cdot x + b$ reduces to solving the following constrained optimization problem: find $w$ and $b$ that minimize

$$\frac{1}{2}\, w \cdot w \tag{1}$$
$$\text{s.t. } y_i (w \cdot x_i + b) \ge 1, \quad \forall i$$

In the non-linearly separable case, the margin maximization technique may be relaxed by a degree of error in the separation. Slack variables $\xi_i$ are introduced to represent the error degree for each input data point. The optimization goal in this case is to maximize the margin while minimizing the slack variables, i.e., to find $w$ and $b$ that minimize

$$\frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \tag{2}$$
$$\text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i$$

B. Multi-class Support Vector Classification

For classification problems with multiple classes, different approaches are developed in order to decide whether a given data point belongs to one of the classes or not. The most common approaches are those that combine several binary classifiers and use a voting technique to make the final classification decision. These include: One-Against-All [28], One-Against-One [17], Directed Acyclic Graph (DAG) [22], and the Half-Against-Half method [19].
A more complex approach is one that attempts to build one Support Vector Machine that separates all classes at the same time. In this section, these multi-class SVM approaches are briefly introduced. Fig. 1 compares the decision boundaries for three classes using a One-Against-All SVM, a One-Against-One SVM, and an All-Together SVM [2]. Overlapping areas represent unclassifiable regions, where a point x is either positively identified as belonging to more than one class, or is negatively identified relative to all classes.

Fig. 1 Comparison of Multi-Class Boundaries

1) One-Against-All Multi-Class SVM

One-Against-All [28] is the earliest and simplest multi-class SVM. For a k-class problem, One-Against-All maximizes k hyperplanes separating each class from all the rest by constructing k binary SVMs. The ith SVM is trained with all the samples from the ith class against all the samples from the other classes. To classify a sample x, x is evaluated by all of the k SVMs and the label of the class that has the largest value of the decision function is selected.
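
The One-Against-All decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the machines are taken to be linear, and the class names, weights, and function names are made up for the example rather than taken from any trained model or from the implementations discussed in this paper.

```python
from typing import Dict, List, Tuple

# One linear binary machine per class: (weight vector w, bias b).
# The representation and all names here are hypothetical.
Machine = Tuple[List[float], float]


def decision_value(machine: Machine, x: List[float]) -> float:
    """Return w . x + b for a single binary machine."""
    w, b = machine
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def one_against_all_predict(machines: Dict[str, Machine], x: List[float]) -> str:
    """Evaluate x with all k machines; return the label with the largest value."""
    return max(machines, key=lambda label: decision_value(machines[label], x))


# Toy 3-class model over 2 features (made-up numbers, not trained weights).
toy_machines = {
    "protein": ([1.0, 0.0], -0.5),
    "DNA": ([0.0, 1.0], -0.5),
    "other": ([-1.0, -1.0], 0.5),
}
```

Each call evaluates all k machines once, so prediction cost grows linearly with the number of classes.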

2) One-Against-One or Pairwise SVM

One-Against-One [17] constructs one binary machine between pairs of classes. For a k-class problem, it constructs $k(k-1)/2$ binary classifiers. To classify a sample x, the sample is evaluated by each of the $k(k-1)/2$ machines. The class that gets the largest value of the decision function by most machines is chosen as the classification of x.

3) Half-Against-Half SVM

Half-Against-Half multi-class SVM [19] is useful for problems where there is a close similarity between groups of classes. Using Half-Against-Half SVM, a binary classifier is built that evaluates one group of classes against another group. The trained model consists of at most $2^{\lceil \log_2 k \rceil}$ binary SVMs. To classify a sample x, this technique identifies the group of classes where the sample x belongs, then continues to evaluate x with a subgroup, and so on, until the final class label is found. The classification process is similar to a decision tree that requires $\lceil \log_2 k \rceil$ evaluations at most.

4) All-Together or All-At-Once SVM

An All-Together multi-classification approach is computationally more expensive yet usually more accurate than all other multi-classification methods. Hsu and Lin [9] note that as it is computationally more expensive to solve multi-class problems, comparisons of these methods using large-scale problems have not been seriously conducted. The experiments reported in this paper are an attempt to classify a large-scale problem using this approach. The All-Together approach builds one SVM that maximizes all separating hyperplanes at the same time. Training data representing all classes is used to generate the trained model. With this approach, there are no unclassifiable regions as each data point belongs to some class represented in the training dataset. Fig. 1 illustrates the elimination of unclassifiable regions in this case.

The All-Together multi-class SVM poses a complex optimization problem as it maximizes all decision functions at the same time [5]. The idea is similar to the One-Against-All approach.
It constructs k two-class rules where the mth function $w_m \cdot \phi(x) + b$ separates training vectors of the class m from the other vectors. There are k decision functions but all are obtained by solving one problem. The primal formulation of the optimization problem [2, 9] is to find:

$$\min_{w_m,\, \xi_i} \; \frac{1}{2} \sum_{m=1}^{k} w_m \cdot w_m + C \sum_{i=1}^{l} \xi_i \tag{3}$$
$$\text{s.t. } w_{y_i} \cdot \phi(x_i) - w_m \cdot \phi(x_i) \ge e_i^m - \xi_i, \quad i = 1, \ldots, l$$

where $e_i^m = 0$ if $y_i = m$, and $e_i^m = 1$ if $y_i \ne m$, and the decision function is

$$\arg\max_{m = 1, \ldots, k} \; w_m \cdot \phi(x).$$

Algorithms to decompose the problem [9] and to solve the optimization problem [26] have been developed; however, the All-Together multi-class SVM approach remains a daunting task. The training time is very slow, which makes the approach so far unusable for real-world problems with a large data set and/or a high number of classes. In this paper, we present an improved multi-class solution and use it with a large set of real-world data to identify biomedical named entities.

C. SVM Scalability and Usability Challenges

Bennett and Campbell [4] discuss the common usability and scalability issues of support vector machines. In this section we summarize the SVM scalability challenges noted in the literature and in practice, which include:

• Optimization requires $O(n^3)$ time and $O(n^2)$ memory for single-class training, where n is the input size (depending on the algorithm used).
• Multi-class training time is much higher, especially for All-Together optimization and a larger number of classes.
• Slow training, especially with large input datasets and/or non-linear kernels.

In addition to the scalability issues, tuning support vector machines requires the selection of a suitable kernel function and model parameters. Model parameters are often selected using a grid search, cross-validation, or heuristic-based methods. Selection of a suitable kernel function for the problem at hand is another designer-determined factor.
Moreover, management and organization of learning and classification results in a manner that fosters their reusability and integration with other pre- or post-processing modules is currently not easily achievable.

III. BASELINE EXPERIMENTS

Our baseline experiment [8] aims to identify biomedical named entities using Support Vector Machines (SVM) [28], due to their generalization capability and their ability to handle high-dimensional feature and input space. The training and testing data use the JNLPBA-04 shared task [16] data, which is a subset of the GENIA annotated corpus [15] of MEDLINE articles. The names of proteins, cell lines, cell types, DNA and RNA entities are previously labeled. The named entities are often composed of a sequence of words. The training data includes 2,000 annotated abstracts (consisting of 492,551 tokens). The testing data includes 404 abstracts (consisting of 101,039 tokens) annotated for the same classes of entities. The fraction of positive examples with respect to the total number of tokens in the training set varies from about 0.2% to about 6%. Basic statistics about the data sets as well as the absolute and relative frequencies for named entities within each set can be found in [16].

The training and test data pre-processing involves morphological and contextual feature extraction only, using the JFEX software [6]. No language-specific pre-processing such as part-of-speech or noun phrase tagging is used. No dictionaries, gazetteers, or other domain-specific knowledge are used. The generated feature space is very large, including over a million different features. All features are binary, i.e., each feature denotes whether the current token possesses this feature (one) or not (zero). The morphological features extracted include checking whether a token is capitalized, is numeric, is a punctuation mark, is all in uppercase, is all in lowercase, is a single character, is a special character, includes a hyphen, includes a slash, is alphanumeric, contains caps and digits, and a general regular expression summarizing word shape. In addition, a contextual collocation of tokens active over three positions around the token itself is used in order to provide a moving window of consecutive tokens which describes the context of the token relative to its surroundings.

A comparison of the performance of our baseline multi-class experiment to other systems using SVM for biomedical NER is presented in [8]. Using a low value of the regularization factor C=0.01, the overall recall measure achieved is 62.43%, with a precision measure of 64.50%, and a final F-score of 63.45%. While the baseline experiment achieved performance measures that are comparable to those obtained using more complex approaches such as [6, 25, 29], the training time using the All-Together multi-class SVM implementation, SVM-Multiclass [5, 26], was very high. Learning with the complete training dataset completed in 97 hours on a Xeon quad-processor 3.6 GHz machine.

TABLE I
SVM-PERFMULTI MULTI-CLASS EXPERIMENT RESULTS VS. SYSTEMS USING JNLPBA-04 DATA

System          Overall Performance (Recall / Precision / F-Score)
Zhou [30]       76.0 / 69.4 / 72.6
Habib           67.9 / 66.4 / 67.2
Giuliano [6]    64.4 / 69.8 / 67.0
Song [25]       67.8 / 64.8 / 66.3
Rössler [23]    67.4 / 61.0 / 64.0
Habib [8]       62.3 / 64.5 / 63.4
Park [21]       66.5 / 59.8 / 63.0
Lee [18]        50.8 / 47.6 / 49.1
Baseline [16]   52.6 / 43.6 / 47.7

The training time is reduced by several orders of magnitude using our new SVM-PerfMulti cutting plane algorithm, briefly described in Section IV (see Table III for a sample comparison). The performance measures are also improved to reach an overall recall measure of 67.9, with a precision of 66.4 and an F-score of 67.2, an almost 4% out-of-the-box performance improvement. Protein named entities are identified with a 75.00 recall, 64.19 precision, and 69.17 F-score.
This places the performance of SVM-PerfMulti in second place as compared to other published results. Table I compares the performance measures attained with different systems. It is important to note that our out-of-the-box performance is attained with no pre- or post-processing other than feature extraction and labeling. No external knowledge is used. The baseline performance in [30] is 60.3 F-score, which is boosted to 72.6 during post-processing with the use of additional dictionaries to relabel misclassified entities. We consider our performance of 67.2 F-score to be a baseline measure, which may be boosted in post-processing with the use of external dictionaries and lists of known named entities, if available.

IV. IMPROVED LINEAR SVM TRAINING TIME

While conducting the scalability experiments, presented in detail in Section VI, we examine the effect of the differences between the learning algorithm of SVM-Light [10-12] and that of SVM-Perf [13, 14, 26, 27] on the total training time, memory usage, and number of support vectors. Since SVM-Multiclass [5, 26] uses SVM-Light's learning algorithm, and given the similar observations noted in the experiments using SVM-Light and SVM-Multiclass, we explore the possibility of extending SVM-Perf's structural learning algorithm to the multi-class classification problem.

A. SVM-Perf Structural Formulation

SVM-Perf [14] improves the training time for linear binary classification problems by using a new SVM formulation that combines constraints from the original SVM formulation (2):

$$\min_{w,\, \xi \ge 0} \; \frac{1}{2}\, w \cdot w + C \xi \tag{4}$$
$$\text{s.t. } \forall c \in \{0,1\}^n : \; \frac{1}{n}\, w \cdot \sum_{i=1}^{n} c_i y_i x_i \ge \frac{1}{n} \sum_{i=1}^{n} c_i - \xi$$

Each constraint in the structural formulation is indexed by a vector $c$ and corresponds to the sum of a subset of constraints, where $\frac{1}{n}\sum_{i} c_i$ can be seen as the maximum fraction of training errors possible over each subset [14]. Only one slack variable $\xi$ is shared across all constraints; it is an upper bound on the fraction of training errors. SVM-Perf's training algorithm iteratively constructs a set of most violated constraints (one per iteration) and adds it to the working set used by the optimization algorithm.
The algorithm repeats until no more constraints can be found that are violated by more than the desired precision $\epsilon$. The cutting plane algorithm finds the most violated constraint after each iteration, which corresponds to

$$c = \arg\max_{c \in \{0,1\}^n} \left\{ \frac{1}{n} \sum_{i=1}^{n} c_i - \frac{1}{n} \sum_{i=1}^{n} c_i y_i (w \cdot x_i) \right\} \tag{5}$$

where

$$c_i = \begin{cases} 1 & \text{if } y_i (w \cdot x_i) < 1 \\ 0 & \text{otherwise} \end{cases}$$

B. SVM-PerfMulti Cutting Plane Algorithm

In order to investigate the potential improvement in training for multi-class learning using the learning algorithms implemented in SVM-Perf [14] for training linear machines, we developed an initial prototype, and the preliminary experiments resulted in a tremendous improvement of the training time while achieving the same or better accuracy measures as SVM-Multiclass. A sample of the training time improvement is reported in Table III. The initial prototype is motivated by the observation of the number of support vectors produced in the binary case using SVM-Light and in the multi-class case using SVM-Multiclass. During the initial scalability experiments, we noted that the number of support vectors in both cases is $O(n^{0.8})$ w.r.t. the training data size. Using the SVM-Perf combined constraints learning algorithm, the number of support vectors in the binary case is reduced from several thousands to less than a hundred.

SVM-PerfMulti [7] expands the structural formulation of SVM-Perf to solve the multi-class classification case. Using the one-slack formulation, the linear multi-class optimization problem in (3) is replaced by

$$\min_{w_m,\, \xi} \; \frac{1}{2} \sum_{m=1}^{k} w_m \cdot w_m + C \xi \tag{6}$$
$$\text{s.t. } w_{y_i} \cdot x_i - w_m \cdot x_i \ge e_i^m - \xi, \quad i = 1, \ldots, n$$

For the multi-class problem, we use a stack of constraint vectors $c = (c_1, c_2, \ldots, c_k)$, where k is the number of classes. The algorithm iteratively finds the stacked vector of most violated constraints and repeats until no more constraints can be found which are violated by more than the desired precision $\epsilon$. The cutting plane algorithm finds the difference between the classification score of the correct class, $w_{y_i} \cdot x_i$, and the best classification score among all other classes, $w_m \cdot x_i$, where $m \ne y_i$. A constraint is violated if the difference is greater than a fraction of the combined training error. To accelerate the learning process, the algorithm increases the training error threshold value after each iteration using an acceleration factor, thereby increasing the gap between the violated and the non-violated constraints. The acceleration factor is a fraction of the maximum correct classification score, which provides a reasonable indication of the decision boundary. Acceleration can be disabled by using a zero-valued factor. The default acceleration factor in SVM-PerfMulti is 0.0001.

TABLE II
SVM-PERFMULTI & SVM-MULTIDB PERFORMANCE MEASURES PER NAMED ENTITY TYPE

Named Entity    Performance (Recall / Precision / F-Score)
Protein         75.00 / 64.19 / 69.17
DNA             57.89 / 70.64 / 63.63
RNA             59.30 / 62.96 / 61.08
Cell Type       58.49 / 78.58 / 67.06
Cell Line       52.43 / 51.19 / 51.80
Overall         67.93 / 66.39 / 67.15
Correct Right   81.37 / 79.52 / 80.43
Correct Left    74.62 / 72.93 / 73.77

V. DATABASE-SUPPORTED SVM IMPLEMENTATION

In this section, we present an SVM solution assisted by a special database schema and embedded database modules. The solution incorporates the learning and optimization algorithms of SVM-Perf and SVM-PerfMulti. The database schema design allows storage of input data, evolving training model(s), precomputed kernel outputs and dot products, and output data.
The aim of this approach is to improve scalability by reducing the online memory requirements and to foster SVM usability by providing a framework for easy reusability and manageability of the learning environment and experimentation results.

Using a relational database to support SVM has been attempted in [24], and a more complete yet different solution is included in the Oracle 10g data mining product (ODM) [20]. MySvmDB [24] addresses the high memory requirements by using a relational database to store the input data and parameters. It does not handle the computational time limitations. In fact, communicating constantly with the database system is known to negatively impact the performance due to the cost of fetching stored data.

The only SVM database implementation that tackles usability and scalability issues is Oracle 10g's commercialized SVM integration into the Oracle Data Mining (ODM) product [20]. Oracle's approach to reducing the number of data points considered for training uses adaptive learning, where a small model is built and then used to classify all input data. New informative data points are selected from all remaining input data and the process is repeated until convergence or until reaching the maximum allowed number of support vectors. Our approach does not reduce the input data size, in order to evaluate the efficacy of the database-embedded modules in providing a scalable solution.

Oracle's multi-class implementation uses a one-against-all classification method where several binary machines are built and scoring is performed by voting for the best classification. The number of binary machines in this case is equal to the number of classes in the training data. Our SVM-MultiDB approach uses all-together training and classification where only one machine is built and used for classification. The new embedded database modules supporting both the single-class case as well as the all-together multi-class case can be used to implement the other multi-class learning approaches combining binary machines, if needed.
Building a growing list of previously identified and annotated named entities will be made possible by the database repository, which would provide a valuable resource to constantly improve the classification performance. The evolving gazetteer list can be used during pre-processing or post-processing to annotate and/or correct the classification of newly discovered named entities, thereby boosting the overall performance.

A. SVM Database Architecture

The PostgreSQL [1] open-source database management system is chosen due to its rich features, adherence to standards, and the flexible options to extend the DBMS via internal or embedded functions. In order to reduce the communication overhead with the database backend, we extend the database server with embedded C functions. This also provides a better integration of all components. Database triggers are used for frequently updated values to ensure data integrity and improve the potential parallelization of the learning and database processes.

Fig. 2 presents the architecture used in the current implementation. Pre-processing (feature extraction) and post-processing (evaluation) modules are kept outside of the database modules for simplicity. Additional supporting modules exist to import/export training and test examples, import/export training model(s), and trigger functions to compute derived data fields. To improve the usability of the SVM solution, we will provide a web-based user-friendly interface that allows the user to define new learning problems and parameters, import/export training and testing data and/or model(s), and monitor the execution of the learning process.

Fig. 2 Database Architecture with Embedded SVM
[Figure not reproduced. Components shown: pre-processing (training data, test data, lexicon, formatting for feature extractor, class mapping); feature extraction (orthographic feature extractor, contextual feature extractor, input data vectors); embedded database modules (SVM training modules: kernel selection, parameters selection, kernel evaluation, optimization, support vector selection, model shrinking; support vector machine classification; import/export examples; import/export model); data repository (features lexicon, input data vectors, kernel evaluations, support vectors, trained model, classified documents); class labeling, evaluation, and final annotated documents.]

For the sake of simplicity, we will refer to the binary classification component of the database implementation as SVM-PerfDB and will refer to the multi-class component as SVM-MultiDB. The embedded database modules are written in C using PostgreSQL's Server Programming Interface (SPI) to access and manipulate the data. The main objectives of using a database-supported solution are:

• Use improved learning algorithms in order to reduce the training time. This is achieved by using SVM-Perf and SVM-PerfMulti as a basis of the implementation.
• Reduce the online memory requirements by storing input examples and generated constraint vectors. A known concern of using a database in place of memory-based data structures is the potentially negative impact on computational time due to the need for frequent access to permanent storage. To remedy this adverse effect, one needs to minimize the need to refetch data, possibly by storing intermediate results of smaller size in memory.
• Improve usability of the SVM solution and provide a practical framework for SVM learning and classification.

B. SVM-MultiDB Database Schema

The database schema design of SVM-MultiDB aims to provide a practical framework for binary and multi-class SVM learning and classification. Fig. 3 presents the main database schema.
The objectives of the schema are the following:

• Reduce online memory needs of a learning exercise by storing input examples and generated constraint vectors.
• Provide a way to store training and/or testing example datasets independent of a learning exercise.
• Be able to define multiple SVM experiments using the same training and/or testing datasets.
• Be able to use a subset of existing datasets for a learning experiment. This would be useful to conduct a grid search of the best learning parameters.
• Be able to reuse the same SVM exercise definition with different learning parameters.
• Be able to label the same example dataset differently in different learning exercises. For example, the same dataset may be used for binary classification of different named entities or for multi-class classification using a different number of classes as part of individual learning exercises without the need to reload the example dataset.
• Provide a way to store intermediate kernel evaluations and dot products of constraint feature vectors.
• Provide a way to store the generated most violated constraint vectors and learned model(s) for future classification use and potentially for new incremental learning algorithms.
• Be able to classify different test datasets at any time using existing learned model(s).
• Easily maintain the definition of learning experiments and parameters and examine their results.

As presented in Fig. 3, the main table definitions supporting SVM learning and classification are the following:

• Example_set: defines a new set of training and/or testing examples.
• Example: individual example input vectors that belong to a given example set. Note that the label stored is the original textual label, e.g., B-protein, and not a given class number, in order to facilitate the use of class identifications within different exercises.
• Label_set: defines a set of class labels.
• Label: individual textual-to-class-number mapping that belongs to a given label_set.
• Kernel: a lookup table of valid kernel types.
• SVM: defines a selection of all or part of an example set and a given label set, to be used for a learning exercise.
• Learn_Param: defines a set of learning parameters.
• SVM_Learn: defines a specific learning exercise for an SVM definition and a set of learning parameters.
• SVM_Model: a set of generated constraints belonging to a given learning exercise. Support vectors that are selected for the final learned model are marked using the selected Boolean field. Computed alphas of the final learned model for each selected support vector are stored.
• SVM_Model_Kernel: may be used to store kernel evaluations (dot products for the linear case) in order to reduce computational redundancy and the need to refetch feature vectors for kernel computation.
• SVM_Classify: defines a classification exercise.
• Classified_example: stores computed predictions of classified examples for a given classification exercise.
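
The idea of storing textual labels with each example and mapping them to class numbers per label set can be illustrated with a small sketch. It is not the SVM-MultiDB schema itself: Python's built-in sqlite3 module stands in for PostgreSQL, only three of the tables listed above are modelled, and every column name, feature string, and helper function is an assumption made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example   (example_id INTEGER PRIMARY KEY, text_label TEXT, features TEXT);
CREATE TABLE label_set (label_set_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE label     (label_set_id INTEGER, text_label TEXT, class_no INTEGER);
""")

# Examples keep their original textual label (e.g. B-protein), not a class number.
conn.executemany(
    "INSERT INTO example (example_id, text_label, features) VALUES (?, ?, ?)",
    [(1, "B-protein", "12:1 40:1"), (2, "B-DNA", "7:1 40:1"), (3, "O", "3:1")],
)

# Two label sets over the same examples: binary (protein vs. rest) and multi-class.
conn.executemany("INSERT INTO label_set VALUES (?, ?)",
                 [(1, "protein-binary"), (2, "multi-class")])
conn.executemany("INSERT INTO label VALUES (?, ?, ?)", [
    (1, "B-protein", 1), (1, "B-DNA", -1), (1, "O", -1),
    (2, "B-protein", 1), (2, "B-DNA", 2), (2, "O", 3),
])


def class_numbers(label_set_id):
    """Map each stored example to its class number under the given label set."""
    rows = conn.execute(
        "SELECT e.example_id, l.class_no FROM example e "
        "JOIN label l ON l.text_label = e.text_label "
        "WHERE l.label_set_id = ? ORDER BY e.example_id",
        (label_set_id,),
    )
    return rows.fetchall()
```

Calling `class_numbers(1)` and `class_numbers(2)` relabels the same three stored examples for a binary and a multi-class exercise without reloading them, which is the reuse property the schema objectives describe.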

[Fig. 4 plot not reproduced. Title: SVM-PerfMulti Memory Size vs. Number of Examples (-c 0.01 -w 3 -l 2, 64-bit). Y-axis: Memory Size (MBytes), 0 to 6,000. X-axis: # of Training Examples. Series: SVM-PerfMulti, SVM-Perf C=0.01, SVM-Perf C=0.14, SVM-Perf C=1.0.]
Fig. 4 SVM-Perf and SVM-PerfMulti Memory Usage vs. Data Size. Top to Bottom: SVM-PerfMulti, SVM-Perf (C=1.0, C=0.14, C=0.01)

Fig. 3 SVM-MultiDB Database Schema

C. Tradeoff of Training Time vs. Online Memory Needs

Using the SVM structural formulation for either binary or multi-class learning, the training time is improved by combining feature vectors into vector(s) of most violated constraints. The generated vectors require larger memory as the size of each constraint vector is O(f) in the binary case and O(fm) in the multi-class case, where f is the number of features in the training set and m is the number of classes. For example, for a training set with 1,000,000 features and 10 classes, and assuming 8 bytes per feature to store the feature number and its weight, a binary constraint vector may need up to 8MB of memory while a multi-class vector may need up to 80MB. These estimates constitute a worst-case scenario, where all features are represented in each vector for all classes. In practice, using the JNLPBA-04 training dataset with over a million features and 11 classes, the multi-class constraint vector size was about 0.5MB.

Fig. 4 presents the online memory requirements with varying training data size for binary training (using regularization factor C=0.01, 0.14, and 1.0) as well as for the multi-class training. Note that although the multi-class constraint vector size is potentially 11 times larger in this experiment, the actual memory needs are less than this estimate, and not much larger than that needed for binary classification with a larger regularization factor C. By examining the time spent in different parts of the learning algorithm, it is noted that about 50% of the time is spent computing argmax to find the most violated constraints using the original input vectors, and the other 50% is spent optimizing the model using the constraint vectors.
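
The worst-case constraint vector sizes quoted earlier in this subsection follow from simple arithmetic, reproduced here as a minimal sketch. It assumes decimal megabytes (1 MB = 10^6 bytes) and the 8 bytes per feature stated in the text; the function name and its parameters are illustrative only.

```python
def worst_case_constraint_vector_mb(num_features, num_classes=1, bytes_per_feature=8):
    """Upper bound on one dense constraint vector, in decimal megabytes."""
    return num_features * num_classes * bytes_per_feature / 1_000_000


# Binary case, O(f): one weight slot per feature.
binary_mb = worst_case_constraint_vector_mb(1_000_000)
# Multi-class case, O(f*m): one weight slot per feature per class.
multi_mb = worst_case_constraint_vector_mb(1_000_000, num_classes=10)
```

These are dense upper bounds; the roughly 0.5MB observed on the JNLPBA-04 data reflects how sparse the actual constraint vectors are.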
We will examine the impact of storing each vector type in the database on the overall training time.

1) Effect of Fetching Examples from Database

SVM-MultiDB provides configurable example caching with three different options: no caching (i.e., examples are always fetched from the database), full caching (all examples are prefetched into memory), and a limited cache size where a predefined number of example records is fetched as needed. As expected, increasing the cache size minimizes the time impact up to a certain size, after which we see an impact due to longer prefetch time. However, since the example vectors are requested only once and in a sequential manner to compute the most violated constraint, keeping example vectors in the database and fetching them as needed had a minimal impact on training time in the binary case and almost no impact in the multi-class case. A comparison of the training time using an example cache size of 500 is presented in Fig. 5 for the binary case, and in Fig. 6 and Fig. 7 for the multi-class case.

2) Effect of Fetching Support Vectors from Database

Constraint and support vectors occupy more memory than example vectors, and maintaining them in the database would result in a large saving of online memory. However, the existing C implementations of SVM-Perf and SVM-PerfMulti require frequent access to the feature vectors during the optimization process, mostly to compute kernel products and update linear weights. Moreover, using a variable support vector cache size may not be useful due to the frequent non-sequential access to the support vectors. An initial direct port of the C implementations to a database implementation without an in-memory support vector cache negatively impacted the overall training time, as expected. An efficient database implementation requires optimizing access to the constraint feature vectors by caching kernel evaluations in memory and minimizing the number of loops used to compute other intermediate results. An initial test of kernel product caching (no constraint caching) resulted in about 50% time reduction.

VI. SVM SCALABILITY EXPERIMENTS

In this section, the results of several sets of scalability experiments using single-class and multi-class SVM are examined. These experiments use the same training and test datasets described in Section III. The datasets represent a real-world problem, namely biomedical named entity recognition, to identify the names of proteins, DNA, RNA, cell lines, and cell types in biomedical abstracts. The approach used promotes language and domain independence by eliminating the use of prior language-specific and domain-specific knowledge. Pre-processing of the training and test datasets is limited to extracting morphological and contextual features describing words in the biomedical abstracts and representing each word with a high-dimensional binary vector. The input dimensionality of the training data exceeds a million features. The training data is composed of 492,551 examples and the test data includes 101,039 tokens.

The scalability experiments train single-class and multi-class support vector machines using chunks of the training dataset of increasing size. The trained model is then used to classify named entities in the complete test dataset. The training time is noted in each experiment, as well as the number of support vectors and the accuracy measures achieved. Several sets of experiments are conducted using different training data sizes, which include:

- Single-class experiments identifying protein names using Thorsten Joachims' popular SVM-Light [10-12], and a regularization factor C=0.01 and 0.1.
- Single-class experiments identifying protein names only using the new SVM implementation, SVM-Perf [13, 14, 26, 27], and a regularization factor C=0.01, 0.14, and 0.1.
- Single-class experiments identifying protein names only using our database embedded solution, SVM-PerfDB, and a regularization factor C=0.01, 0.14, and 0.1.
- Multi-class experiments identifying all five named entities (protein, DNA, RNA, cell line, cell type) using Joachims' multi-class implementation, SVM-Multiclass [5, 26], with a regularization factor C=0.01.
- Multi-class experiments identifying all five named entities using our new multi-class instantiation, SVM-PerfMulti [7], with an acceleration factor of 0.0001.
- Multi-class experiments identifying all five named entities using our database embedded solution, SVM-MultiDB, with an acceleration factor of 0.0001.

The training data chunks range from 1,000 examples to 492,551 examples (the complete training dataset). Each set of experiments consists of 51 tests. All experiments use a linear kernel and a margin error of 0.1. The tests run on an Intel Core 2 quad-processor 2.66 GHz machine and a Xeon quad-processor 3.6 GHz machine. Running the same test on both machines completed in similar training time.

A. Single-Class Results

Using SVM-Light [10-12], a single-class support vector machine is trained to recognize protein name sequences. The trained machine is then used to classify proteins in the test data. Since no pre-processing was performed on the training and testing data besides feature extraction, the positive examples in the data sets remained scarce. Training the SVM-Light machine with the complete training dataset and a regularization factor C=0.01 completed in about 28.5 minutes. The recall, precision, and F-score achieved in this case are 62.72, 56.12, and 59.23, respectively. Increasing C to 1.0 raised the training time to about 269 minutes, and improved the accuracy measures to 68.92, 58.58, and 63.33, respectively.

The same set of experiments is repeated using SVM-Perf [13, 14, 26, 27], which improves the training time of linear machines to be linear w.r.t. the training data size. The training time improvement using SVM-Perf is several orders of magnitude as compared to that using SVM-Light, with the same classification results when trained with the same learning parameters. Fig. 5 compares the training time using SVM-Light, SVM-Perf, and SVM-PerfDB with the same data and learning parameters. The training time using SVM-Light is polynomial, O(n^2), while being linear using SVM-Perf and SVM-PerfDB. The number of support vectors using SVM-Light is O(n^0.8) w.r.t. the training data size. However, using SVM-Perf, the number of support vectors was only a very small fraction of the training data size and increased slightly with increased data size. The reduced number of support vectors is the main basis for the improved training time of SVM-Perf. The best recall, precision, and F-score measures are achieved using C=0.14 and are 73.10, 62.30, and 67.27, respectively, where learning completes in about 15 minutes.

B. Multi-Class Results

The SVM-Multiclass implementation by T. Joachims is based on [5] and uses a different quadratic optimization algorithm described in [26]. Hsu and Lin [9] note that as it is computationally more expensive to solve multi-class problems, comparisons of these methods using large-scale problems have not been seriously conducted. Especially for methods solving multi-class SVM in one step, a much larger optimization problem is required, so up to now experiments are limited to small data sets. The multi-class experiments presented herein attempt to solve a real-world large-scale problem using an All-Together classification method. The training data is composed of 11 classes, where each named entity is represented by two classes, one denoting the beginning of an entity and the other denoting a continuation token within the same entity, in addition to one class denoting non-named entity tokens.
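The class layout described above (two classes per entity type plus one class for all other tokens) can be sketched as follows. The B-/I-/O label names follow a common tagging convention and are an assumption here, as are the function names and the span format; the text only states the role of each class.

```python
# The five entity types named in the text.
ENTITY_TYPES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]


def build_class_labels(entity_types):
    """Two classes per entity type (beginning, continuation) plus one 'O' class."""
    labels = []
    for entity in entity_types:
        labels.append("B-" + entity)  # first token of an entity
        labels.append("I-" + entity)  # continuation token within the same entity
    labels.append("O")                # token outside any named entity
    return labels


def label_tokens(num_tokens, entity_spans):
    """Assign one class label per token.

    entity_spans is a list of (start, end, entity_type) with end exclusive;
    this span format is hypothetical and used only for illustration.
    """
    tags = ["O"] * num_tokens
    for start, end, entity in entity_spans:
        tags[start] = "B-" + entity
        for pos in range(start + 1, end):
            tags[pos] = "I-" + entity
    return tags


labels = build_class_labels(ENTITY_TYPES)
print(len(labels))  # 5 entity types * 2 + 1 = 11 classes
print(label_tokens(5, [(1, 3, "protein")]))
```

Separating beginning and continuation classes lets the classifier distinguish two adjacent entities of the same type from one multi-token entity, at the cost of roughly doubling the number of classes.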

Fig. 5 Comparison of SVM-Light (Top), SVM-Perf (Bottom), and SVM-PerfDB (Middle) Training Time vs. Training Data Size. Plot title: SVM-PerfDB vs. SVM-Perf & SVM-Light Training Time, -c 0.01 -w 3 -l 2 (cache size = 500); axes: training time (sec) vs. number of training examples.

To explore the scalability issues of the All-Together multi-class SVM implementation, a series of experiments using different training data sizes is conducted with a low value of 0.01 for the C learning parameter. The training time with 1,000 examples was 3.187 seconds, and it increased considerably with increased data size to reach 416,264.251 seconds (6,937.738 minutes or 4.8 days) on the same machine. The SVM-Multiclass [5, 26] implementation is based on the learning implementation in SVM-Light [10-12]. The training time remains polynomial, O(n^2), w.r.t. the training data size, with a factor of O(k^2) increase in time as compared to the single-class SVM-Light time, where k is the number of classes. The training time required for All-Together multi-class training makes this approach prohibitive for large datasets. SVM-PerfMulti and SVM-MultiDB address this issue by using an improved cutting plane algorithm in conjunction with the linear learning algorithm of SVM-Perf. Table III and Fig. 6 compare the training time using the three methods. Fig. 7 takes a closer look at the impact of example caching in SVM-MultiDB using a cache size of 500 examples. Note the minimal impact on training time in this case.

TABLE III
COMPARISON OF SVM-PERFMULTI, SVM-MULTIDB, AND SVM-MULTICLASS TRAINING TIME (SECONDS) VS. TRAINING DATA SIZE

Training Data Size    SVM-Multiclass    SVM-PerfMulti    SVM-MultiDB (Examples Cache Size=500)
5,000                 94.213            15.370           15.553
10,000                355.101           38.110           34.798
20,000                1,453.887         75.724           79.963
50,000                7,784.148         254.326          268.901
100,000               23,531.920        788.916          716.954
200,000               91,632.975        1,866.653        1,830.997
300,000               165,178.301       2,528.715        2,594.110
400,000               310,408.398       4,488.086        4,523.788
492,551               416,264.251       5,890.031        6,292.167

Fig. 6 Comparison of SVM-Multiclass (Top), SVM-PerfMulti, and SVM-MultiDB Training Time vs. Training Data Size. Plot title: Comparison of SVM-PerfMulti, SVM-MultiDB & SVM-Multiclass Training Time; axes: training time (sec, logarithmic scale) vs. number of training examples.

Fig. 7 Comparison of SVM-MultiDB (Ending Up) and SVM-PerfMulti Training Time vs. Training Data Size. Plot title: Comparison of SVM-MultiDB and SVM-PerfMulti Training Time vs. Number of Training Examples; axes: training time (sec) vs. number of training examples.

Fig. 8 presents the impact of the training data size on the multi-class classification performance measures in terms of precision, recall, and Fβ=1-score. Fig. 9 presents the protein classification performance variation with training data size using the multi-class approach. Note that the protein performance measures in this case are superior to the best achieved using binary classification. The final protein F-score in the binary case with C=0.14 is 67.27, as compared to an F-score of 69.17 in the multi-class case.

Fig. 8 SVM-MultiDB Overall Performance vs. Training Data Size (Very Close Recall, F-Score, and Precision Measures). Plot title: SVM-PerfMulti Overall Performance Measures, C=0.01 (Recall / Precision / F-score); axes: performance (%) vs. number of training examples.

VII. CONCLUSION AND FUTURE WORK

In this paper, we present an improved multi-class cutting plane algorithm that extends the new SVM structural formulation [14] to improve multi-class linear training time. We also present a database-supported implementation of the structural binary and multi-class algorithms that aims to combine the enhanced training time with a reduction of online memory needs, and to provide a practical and usable framework for SVM learning.

Fig. 9 SVM-MultiDB Protein Performance vs. Training Data Size (From Top to Bottom: Recall, F-Score, Precision). Plot title: SVM-MultiDB Protein Performance Measures, C=0.01 (Recall / Precision / F-score); axes: performance (%) vs. number of training examples.

A series of experiments is presented in order to explore the scalability issues associated with solving the named entity recognition problem using multi-class support vector machines and high-dimensional features, and to compare the results using the different learning methods. Baseline experiment results have shown that the proposed language and domain-independent approach is capable of successfully recognizing and classifying named entities with reasonable accuracy measures. The new SVM-PerfMulti cutting plane algorithm offers good out-of-the-box performance measures achieved in a training time that is several orders of magnitude shorter than that of SVM-Multiclass.

The new database implementation of the binary and multi-class SVM classifiers offers a practical framework for SVM learning that fosters reusability of learned model(s) and enhanced usability of the solution. Fetching stored example vectors from the database is shown to have minimal impact on the training time. However, frequent access to stored constraint vectors negatively impacts training time, unless efficient in-memory caching of kernel products and intermediate results is used. Initial attempts to cache kernel evaluations have proven to be effective. Future work will attempt to minimize access to the constraint vectors during the optimization process, thereby providing further reduction in online memory requirements.

REFERENCES

[1] PostgreSQL Open Source Database, 8.3.3 ed: PostgreSQL Global Development Group, 1996-2008. Online: http://www.postgresql.org.
[2] S. Abe, Support Vector Machines for Pattern Classification. London: Springer-Verlag, 2005.
[3] E. Alpaydin, Introduction to Machine Learning. Cambridge, MA: The MIT Press, 2004.
[4] K. P. Bennett and C. Campbell, "Support Vector Machines: Hype or Hallelujah?," SIGKDD Explor. Newsl., vol. 2, pp. 1-13, 2000.
[5] K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multi-class SVMs," Journal of Machine Learning Research, vol. 2, pp. 265-292, 2001.
[6] C. Giuliano, A. Lavelli, et al., Simple Information Extraction (SIE): ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, 2005. Online: http://tcc.itc.it/research/textec/tools-resources/sie/giuliaosie.pdf.
[7] M. S. Habib, "Addressing Scalability Issues of Named Entity Recognition Using Multi-Class Support Vector Machines," Intl Journal of Computational Intelligence, vol. 4, pp. 230-239, 2008.
[8] M. S. Habib and J. Kalita, "Language and Domain-Independent Named Entity Recognition: Experiment using SVM and High-Dimensional Features," in Proc. of the 4th Biotechnology and Bioinformatics Symposium (BIOT-2007), Colorado Springs, CO, 2007.
[9] C.-W. Hsu and C.-J. Lin, "A Comparison of Methods for Multi-Class Support Vector Machines," IEEE Transactions on Neural Networks, vol. 13, pp. 415-425, 2002.
[10] T. Joachims, "Making Large-Scale SVM Learning Practical," in Advances in Kernel Methods - Support Vector Learning: Chapter 11, B. Schölkopf, C. Burges, and A. Smola, Eds.: MIT-Press, 1998.
[11] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in Proc. of the European Conference on Machine Learning, 1998.
[12] T. Joachims, Learning to Classify Text Using Support Vector Machines. Norwell, MA: Kluwer Academic, 2002.
[13] T. Joachims, "A Support Vector Method for Multivariate Performance Measures," in Proc. of the International Conference on Machine Learning (ICML), 2005.
[14] T. Joachims, "Training Linear SVMs in Linear Time," in Proc. of the ACM Conf. on Knowledge Discovery and Data Mining (KDD), 2006.
[15] J. D. Kim, T. Ohta, et al., "GENIA Corpus--Semantically Annotated Corpus for Bio-Textmining," Bioinformatics, vol. 19 Suppl 1, pp. 180-182, 2003.
[16] J.-D. Kim, T. Ohta, et al., "Introduction to the Bio-Entity Recognition Task at JNLPBA," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.
[17] U. H.-G. Kreßel, "Pairwise Classification and Support Vector Machines," in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999, pp. 255-268.
[18] K.-J. Lee, Y.-S. Hwang, et al., "Biomedical Named Entity Recognition using Two-Phase Model Based on SVMs," Journal of Biomedical Informatics, vol. 37, pp. 436-447, 2004.
[19] H. Lei and V. Govindaraju, "Half-Against-Half Multi-class Support Vector Machines," in Proc. of the 6th International Workshop on Multiple Classifier Systems, Seaside, CA, USA, 2005.
[20] B. L. Milenova, J. S. Yarmus, et al., "SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines," in Proc. of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, 2005.
[21] K.-M. Park, S.-H. Kim, et al., "Incorporating Lexical Knowledge into Biomedical NE Recognition," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.
[22] J. C. Platt, N. Cristianini, et al., "Large Margin DAGs for Multiclass Classification," in Advances in Neural Information Processing Systems, vol. 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds. Cambridge, MA: MIT Press, 2000, pp. 547-553.
[23] M. Rössler, "Adapting an NER-System for German to the Biomedical Domain," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.
[24] S. Rüping, "Support Vector Machines in Relational Databases," in Pattern Recognition with Support Vector Machines - First International Workshop, 2002.
[25] Y. Song, E. Kim, et al., "POSBIOTM-NER in the Shared Task of BioNLP/NLPBA 2004," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.
[26] I. Tsochantaridis, T. Hofmann, et al., "Support Vector Learning for Interdependent and Structured Output Spaces," in Proc. of the 21st Intl Conf. on Machine Learning (ICML), Alberta, Canada, 2004.
[27] I. Tsochantaridis, T. Joachims, et al., "Large Margin Methods for Structured and Interdependent Output Variables," Journal of Machine Learning Research (JMLR), vol. 6, pp. 1453-1484, 2005.
[28] V. N. Vapnik, Statistical Learning Theory. New York, NY: John Wiley & Sons, 1998.
[29] G. Zhou, "Recognizing Names in Biomedical Texts using Hidden Markov Model and SVM plus Sigmoid," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.
[30] G. Zhou and J. Su, "Exploring Deep Knowledge Resources in Biomedical Name Recognition," in Proc. of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'2004), Geneva, Switzerland, 2004.