Machine Learning Classification Algorithms to Recognize Chart Types in Portable Document Format (PDF) Files

International Journal of Computer Applications (0975-8887)
Volume 39, No. 2, February 2012

V. Karthikeyan
Department of Computer Science, Government Arts College, Salem-8, Tamilnadu, India

S. Nagarajan
Department of Computer Science, K.S.R College of Arts and Science, Tiruchengode, Namakkal (Dist)-17, Tamil Nadu, India

ABSTRACT
Chart recognition from PDF files is a relatively young research field in which techniques and algorithms are proposed to identify the type of a chart and to interpret it. This paper focuses on recognizing the type of a chart that is part of a PDF document using texture features and classification algorithms. Eleven types of texture features and three classifiers, namely Multilayer Perceptron, Support Vector Machine and K-Nearest Neighbour, are used. Performance analysis of the proposed chart type recognition systems shows that texture features are promising for chart type recognition and produce the best results when used with the KNN and SVM algorithms.

Keywords
Chart Classification, Texture Feature, Neural Network, Support Vector Machine, K-Nearest Neighbour Classifier.

1. INTRODUCTION
Portable Document Format, commonly referred to as PDF, is an open standard for document exchange created by Adobe Systems in 1993. A typical PDF file encapsulates many objects which contain text (in different fonts and sizes), graphics, tables, figures and other information needed to display the content of a document. Using PDF files offers two main advantages. The first is that the format preserves the layout and design of the document as determined by the author. The second is that it is entirely self-contained, that is, all information needed to display the file, such as the various fonts, is integrated inside the format itself. Moreover, a PDF file represents documents in a format that is independent of application, operating system, software and hardware. These advantages have made PDF a widely used format that is now considered a universal document format.
Document analysis is a field of research which discovers knowledge from a scanned document image. Owing to the wide usage of PDF files by the general public, researchers and industry, document analysis has also been extended to PDF files. As a consequence, the need for conversion tools that can extract text, tables, figures and graphs from PDF files is also growing. This need has arisen because many devices, such as embedded devices, cannot handle PDF formats, and online users often have difficulty reading multi-column documents. Several researchers have focused on text knowledge extraction (data mining) from PDF and image documents ([10], [13]). This field is termed text mining, and many organizations internationally have already realized its potential. Text mining extracts useful business knowledge from unstructured documents by first converting them into structured text and then applying data mining techniques such as clustering and classification to derive valuable insights. The accuracy of these converters depends on the efficiency of the segmentation algorithms that separate the different objects in a PDF. On the other hand, only a few studies have been devoted to extracting images from PDF/image documents, as this is more complicated and challenging. The difficulty arises because graph objects consist of several small components whose features are similar to those of text [2].

Identification of graphs in a PDF file is composed of three steps. The first step is to locate the chart object, the second step is to extract the graph object and the third step is to identify the type of chart. The first two steps are dealt with in [11]. This paper focuses on the third step, that is, identifying the type of graph located in the PDF file using machine learning classification algorithms. Graph or chart classification is an area of image processing where the primary goal is to separate a set of chart images, according to their visual content, into one of a number of predefined categories.
Eight types of charts are considered, namely 2D and 3D bar charts, 2D and 3D pie charts, 2D and 3D doughnut charts, line charts and mixed charts. The present work analyzes the applicability of three classifiers, namely Multi-Layer Perceptron (MLP), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN), for the recognition of the eight chart types. The visual content of the graph image is identified in a feature extraction step, where the texture features that best represent the graph image are extracted and stored as a feature vector. These feature vectors are then used to train and test the selected classifiers. The rest of the paper is organized as follows. Section 2 provides a brief discussion of previous work in the related area, Section 3 presents the proposed methodology and Section 4 presents the results of experimentation. The study is concluded with future research ideas in Section 5.

2. PREVIOUS STUDIES
Chart recognition is an area of research that has gained attention only in the past few decades. The literature review showed that studies related to scientific chart recognition are few, even though the problem has been studied since as early as 1990. In 1992, mining of figure information from x-y data graphs and gene diagrams was proposed by [6]. Later, [19] presented a schema-based model that extracts bar charts using horizontal and vertical layout projection and relationship information. Zhou and Tan [20] analyzed the usage of the Hough transform with a Hidden Markov Model for recognizing bar charts in document images. Other segmentation techniques like Hough transformation [3], curvature estimation [15] and vector-based techniques [5] are also used for line graph recognition. It is well known that the Hough transformation is computationally expensive and does not work well with all types of charts. To solve this problem, a raster-to-vector conversion algorithm was used to identify three types of charts, namely 2D bar chart, 2D pie chart and 2D line chart [18]. Futrelle et al. [7] and [16] proposed a scheme for recognizing and classifying vector format graphics in PDF documents using techniques like spatial analysis, and classified graphs into five categories, namely line, bar, curve, tree and other charts. Another method is based on pattern discovery algorithms that find frequently appearing local structures ([12], [9]), and these structures are used as features. The pattern-discovery-based method has the advantage that it can make use of unlabelled data. Yet another approach is to use kernel methods such as Support Vector Machines (SVMs).

The literature study showed that existing algorithms have two main drawbacks. First, most of the methods are designed for a specific chart type only. Second, the existing techniques assume the availability of predefined structural models and constraints for all types of charts. To solve these problems, classifiers that use texture features are used to recognize the chart type. The methodology used is discussed in the following section.

3. METHODOLOGY
A chart classification system involves two main tasks: feature extraction (which extracts image features and forms feature vectors) and classification (which uses the extracted features to discriminate the classes). Both processes take place after the chart image has been located in the PDF document. The feature extraction task identifies a set of texture features from the located graph images.
It is a well-known fact that when small portions of a bigger unit are processed independently, texture features provide a better description of the selected region [8]. They capture the spatial variations in the intensities of an image which form certain repeated patterns. These features are extracted for all chart images in a PDF database. The three classifiers analyzed in this paper are MLP, SVM and KNN, which are used to perform a multi-class classification of chart images. Each of these steps is explained below.

3.1. Features Extracted
The GLCM (Gray Level Co-Occurrence Matrix) features were used as texture features in this study. The selected features are area, median, minimum and maximum intensity, contrast, homogeneity, energy, entropy, mean, variance, standard deviation and correlation. A brief explanation of each of these features is given in this section.

Since its invention, the GLCM has played a vital role in many texture based image analysis applications ([14], [1]). The GLCM uses a co-occurrence matrix to extract texture features of an image using statistical equations. A co-occurrence matrix is a matrix or distribution that is calculated from the distribution of co-occurring values of an image at a given offset. Features generated using this technique are usually called Haralick features, named after their originator.

The area of an image in square pixels is calculated by multiplying the number of rows by the number of columns of the image. The minimum intensity, maximum intensity and median values are calculated by considering all the pixels in the image. Equation (1) is used to calculate the contrast of an image,

$$\text{Contrast} = \sum_{i,j=0}^{N-1} P_{i,j}\,(i-j)^2 \tag{1}$$

where $P_{i,j}$ is the normalized co-occurrence matrix entry for gray levels $i$ and $j$, and $N$ is the number of gray levels. In this equation three conditions arise. The first is when $i$ and $j$ are equal, which indicates that the entry lies on the diagonal of the matrix, the pixel and its neighbour are similar, and $(i-j) = 0$. The second condition is when $i$ and $j$ differ by 1. This indicates a small contrast difference between the pixels, and a weight value of 1 is used. The third condition is when the difference between $i$ and $j$ is 2. This indicates that the contrast is increasing, in which case the weight is assigned a value of 4.
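The co-occurrence matrix and the contrast weighting of Equation (1) can be sketched in C as follows. This is only an illustrative sketch and not the implementation used in the study: the number of gray levels (`LEVELS`), the horizontal offset of one pixel, the non-symmetric matrix and the function names `glcm_horizontal` and `glcm_contrast` are all assumptions made for the example.

```c
#include <stddef.h>

/* Assumed number of gray levels after quantization (not given in the paper). */
#define LEVELS 8

/* Build a normalized co-occurrence matrix for the assumed offset
 * "one pixel to the right". P[i][j] becomes the fraction of horizontal
 * pixel pairs whose left pixel has level i and right pixel has level j.
 * img is a row-major rows x cols image with values in 0..LEVELS-1. */
void glcm_horizontal(const unsigned char *img, size_t rows, size_t cols,
                     double P[LEVELS][LEVELS])
{
    size_t pairs = 0;

    for (size_t i = 0; i < LEVELS; i++)
        for (size_t j = 0; j < LEVELS; j++)
            P[i][j] = 0.0;

    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c + 1 < cols; c++) {
            unsigned char a = img[r * cols + c];
            unsigned char b = img[r * cols + c + 1];
            if (a >= LEVELS || b >= LEVELS)
                continue;               /* skip values outside the assumed range */
            P[a][b] += 1.0;
            pairs++;
        }
    }

    if (pairs == 0)
        return;                          /* no pairs: leave the matrix at zero */

    for (size_t i = 0; i < LEVELS; i++)
        for (size_t j = 0; j < LEVELS; j++)
            P[i][j] /= (double)pairs;
}

/* Contrast as in Equation (1): each entry is weighted by (i - j)^2,
 * so the weights are 0 on the diagonal, 1 for a difference of 1,
 * 4 for a difference of 2, and so on. */
double glcm_contrast(double P[LEVELS][LEVELS])
{
    double contrast = 0.0;

    for (int i = 0; i < LEVELS; i++) {
        for (int j = 0; j < LEVELS; j++) {
            double diff = (double)i - (double)j;
            contrast += diff * diff * P[i][j];
        }
    }
    return contrast;
}
```

For a 2 x 2 image with rows (0, 1) and (1, 1), the horizontal pairs are (0, 1) and (1, 1), so `P[0][1]` and `P[1][1]` are both 0.5 and the contrast is 0.5, since only the first pair carries a non-zero weight.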
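Thus the weights continue to increase quadratically as $(i-j)$ increases. The homogeneity feature is calculated as

$$\text{Homogeneity} = \sum_{i,j=0}^{N-1} \frac{P_{i,j}}{1+(i-j)^2} \tag{2}$$

When the contrast in an image window is low, energy is best calculated using a measure called homogeneity. The energy of an image is calculated as described below. To calculate energy (also called uniformity), first the Angular Second Moment (ASM) is calculated. Both ASM and energy use each $P_{i,j}$ as a weight for itself.

$$\text{ASM} = \sum_{i,j=0}^{N-1} P_{i,j}^{2} \tag{3}$$

Energy is then calculated as the square root of the ASM (Equation 4), and the entropy is calculated using the formula given in Equation 5.

$$\text{Energy} = \sqrt{\text{ASM}} \tag{4}$$

$$\text{Entropy} = \sum_{i,j=0}^{N-1} P_{i,j}\,(-\ln P_{i,j}) \tag{5}$$

The GLCM mean, variance and standard deviation for the horizontal and vertical directions are calculated as below.

$$\mu_i = \sum_{i,j=0}^{N-1} i\,P_{i,j}, \qquad \mu_j = \sum_{i,j=0}^{N-1} j\,P_{i,j} \tag{6}$$

$$\sigma_i^{2} = \sum_{i,j=0}^{N-1} P_{i,j}\,(i-\mu_i)^2, \qquad \sigma_j^{2} = \sum_{i,j=0}^{N-1} P_{i,j}\,(j-\mu_j)^2 \tag{7}$$

$$\sigma_i = \sqrt{\sigma_i^{2}}, \qquad \sigma_j = \sqrt{\sigma_j^{2}} \tag{8}$$

The correlation feature is calculated using Equation (9).

$$\text{Correlation} = \sum_{i,j=0}^{N-1} P_{i,j}\,\frac{(i-\mu_i)(j-\mu_j)}{\sqrt{\sigma_i^{2}\,\sigma_j^{2}}} \tag{9}$$

The features thus extracted are stored using a 2-dimensional matrix (vector) data structure having 13 columns and n rows, where n is the number of images in the dataset. The first 12 columns are used to store the features, while the last one is used to indicate the target (label) of the chart type. The structure used is given below: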

    struct FeatureVector {
        float feature1;
        float feature2;
        float feature3;
        float feature4;
        float feature5;
        float feature6;
        float feature7;
        float feature8;
        float feature9;
        float feature10;
        float feature11;
        float feature12;
        int target;    /* label of the chart type */
    };

3.2. Classifiers
As mentioned earlier, three classifiers are used to perform a multi-class classification during the chart recognition process. The working of the three classifiers is discussed in this section.

3.2.1 SVM Classification
SVM is a classification algorithm based on optimization theory and initially developed by [4]. Here an object is viewed as an n-dimensional vector, and such objects are separated with an (n-1)-dimensional hyperplane. This is called a linear classifier. Many hyperplanes can separate the data, and this paper focuses on finding the one with the maximum margin between the two data sets (Figure 1). The figure shows three hyperplanes in 2-dimensional space. H3 does not separate the two classes; H1 does, with a small margin, and H2 with the maximum margin.

Figure 1: Example of SVM

3.2.2 MLP Classification
The MLP neural network has a feedforward architecture with an input layer, a hidden layer and an output layer. A Multi-Layer Perceptron (MLP) with a back propagation learning algorithm is chosen for the proposed system because of its simplicity, robustness and high computation rates. It is assumed that the training dataset consists of l pairs $(x_i, y_i)$, where $x_i$ is a vector containing the pattern while $y_i$ is the class of the corresponding pattern. In our case, an 8-class task, $y_i$ can be coded 1 to 8 (for identifying eight different charts) [17]. The MLP model consists of an input layer that accepts the input neurons used in the classification, hidden layers and an output layer. For each neuron $j$ in the hidden layer, the inputs $x_i$ are multiplied by the connection weights $w_{ij}$ and summed, and the output $y_j$ is an activation function of this sum, that is

$$y_j = f\Big(\sum_{i} w_{ij}\,x_i\Big) \tag{10}$$

where $f$ is the sigmoid or hyperbolic tangent transfer function. Using the back propagation training algorithm, the weights are adjusted to minimize the squared differences between the actual and desired output values in the output neurons, given by

$$E = \frac{1}{2}\sum_{j} (d_j - y_j)^2 \tag{11}$$

where $y_j$ is the actual output of neuron $j$ and $d_j$ is the desired output of neuron $j$.

3.2.3 KNN Classification
The K-Nearest Neighbour (KNN) machine learning algorithm is frequently used in many applications. This algorithm uses distance measures during classification and assigns a data object to the category which is closest to the data being examined. When K is 1, the KNN algorithm works like the nearest neighbour algorithm. In the general scenario, the Euclidean distance measure is used to calculate the distance between two data points and is given in Equation (12),

$$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{12}$$

where $d$ is the distance and $p_i$ (or $q_i$) is the coordinate of $p$ (or $q$) in dimension $i$.

3.3. Chart Classification System
The schematic block diagram of the scientific chart image recognition system consists of various stages, as shown in Figure 2.

[Figure 2: Proposed Chart Classification Model. Block diagram with the blocks: PDF Document Database, Locate Chart Image, Extract Features, Creation of Feature Vector, Training Set, Learning Algorithm, Learning Model, Input PDF, Locate Chart, Classification Result.]

The proposed chart classification system thus considers the use of the three machine learning algorithms to classify the charts into eight types. The input data for a classification task is a set of 11 texture features arranged in row-wise fashion (records). Each record, otherwise termed an instance or example, is described by (X, y), where X is the attribute set and y is a special attribute designated as the class label (also known as the category or target attribute). The classification step is then defined as the task of learning a target function f that maps each attribute set X to one of the predefined class labels y. The target function is also known informally as a classification model and is useful for classification purposes. The classifier then uses a systematic approach to build the classification learning model from an input data set using a learning algorithm.
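As an illustration of how a record (X, y) is classified, the K-Nearest Neighbour rule of Section 3.2.3 can be sketched in C as follows. This is a minimal sketch under stated assumptions rather than the implementation evaluated in this paper: the features are held in an array instead of the twelve named fields, and `NUM_FEATURES`, `NUM_CLASSES`, `MAX_K`, `squared_distance` and `knn_classify` are hypothetical names. The sketch ranks neighbours by the squared form of Equation (12), which yields the same ordering as the Euclidean distance itself.

```c
#include <stddef.h>

#define NUM_FEATURES 12   /* assumed: one slot per stored feature column */
#define NUM_CLASSES  8    /* eight chart types, coded 1..8 */
#define MAX_K        16   /* assumed upper bound on K for this sketch */

/* Array-based variant of the feature record (an assumption made for brevity). */
typedef struct {
    float features[NUM_FEATURES];
    int   target;                      /* class label, 1..NUM_CLASSES */
} FeatureVector;

/* Square of the Euclidean distance of Equation (12). */
static double squared_distance(const float *p, const float *q)
{
    double sum = 0.0;
    for (size_t i = 0; i < NUM_FEATURES; i++) {
        double diff = (double)p[i] - (double)q[i];
        sum += diff * diff;
    }
    return sum;
}

/* Return the majority label among the k training records nearest to query,
 * or -1 if there is nothing to vote on. Ties go to the lowest label. */
int knn_classify(const FeatureVector *train, size_t n,
                 const float *query, size_t k)
{
    double best_dist[MAX_K];
    int    best_label[MAX_K];
    int    votes[NUM_CLASSES + 1] = {0};
    size_t found = 0;

    if (n == 0 || k == 0)
        return -1;
    if (k > MAX_K) k = MAX_K;
    if (k > n)     k = n;

    for (size_t i = 0; i < n; i++) {
        double d = squared_distance(train[i].features, query);

        /* Find the slot for d in the sorted list of nearest records. */
        size_t pos = found;
        while (pos > 0 && best_dist[pos - 1] > d)
            pos--;
        if (pos >= k)
            continue;                  /* farther than all k current neighbours */

        /* Shift farther neighbours down, dropping the last one if full. */
        size_t last = (found < k) ? found : k - 1;
        for (size_t j = last; j > pos; j--) {
            best_dist[j]  = best_dist[j - 1];
            best_label[j] = best_label[j - 1];
        }
        best_dist[pos]  = d;
        best_label[pos] = train[i].target;
        if (found < k)
            found++;
    }

    for (size_t i = 0; i < found; i++)
        if (best_label[i] >= 1 && best_label[i] <= NUM_CLASSES)
            votes[best_label[i]]++;

    int best = -1, best_votes = 0;
    for (int c = 1; c <= NUM_CLASSES; c++)
        if (votes[c] > best_votes) {
            best_votes = votes[c];
            best = c;
        }
    return best;
}
```

Because KNN has no separate training phase, the "learning model" in this sketch is simply the stored array of labelled records. In practice the features would probably need to be scaled to comparable ranges first, since area in square pixels would otherwise dominate the distance.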
The main goal of the learning algorithm is to identify a model that captures the best relationship between the feature sets and the class categories of the input data. Satisfying this goal provides dual advantages. The first is that it makes sure that the input data and the learning algorithm fit each other in an efficient manner, and the second is that it improves prediction performance when new records are supplied. The classifier is trained using a data set (training set) that consists of records with the target category provided. The test dataset consists of records with no knowledge of the target category. The classifier uses the trained knowledge and performs the classification.

4. EXPERIMENTAL RESULTS
Experiments were carried out with a dataset having 155 images belonging to all eight kinds of charts (Table 1). All the images are 256*256 RGB color images. Experiments were conducted using a Pentium IV dual processor with 512MB RAM. Zhou and Tan [21] used a feed forward backpropagation neural network for chart type recognition. This model, referred to as the Zhou model, used a model based matching algorithm for chart recognition. The performance of the classifiers proposed in this paper is compared with the Zhou model.

Table 1: Details on Dataset

    Chart Type       No. of Charts    Chart Type       No. of Charts
    2D Bar chart     40               Doughnut 2D      7
    3D Bar chart     16               Doughnut 3D      11
    2D Pie chart     13               Line             35
    3D Pie chart     20               Mixed chart      13

The performance of the system is analyzed based on error rate, classification accuracy and speed of classification. During the experiments a 10-fold cross-validation method is used, and the average results were taken as the final outcome. As a preprocessing step, all the image features were calculated prior to classification and converted to a feature vector, which was given as input to the classifiers. The formula for calculating the error rate is given below.

$$\text{Error Rate} = \frac{\text{No. of incorrect predictions}}{\text{Training Size}} \times 100$$

The accuracy of the classifiers is calculated as 1 - Error Rate. An effective classifier should reduce the error rate while increasing the accuracy. The time taken by a classifier to classify an input chart image into any one of the selected eight chart types is taken as the speed of the classifier.

4.1. Error Rate
Table 2 shows the error rates obtained by the selected classifiers using the 11 derived texture features.

Table 2: Error Rate

    Classifier    Error Rate (fraction)
    MLP           0.30
    KNN           0.22
    SVM           0.23
    Zhou          0.19

One of the primary aims of automatic chart recognition systems is to achieve low error rates. In this regard, the results show that among the proposed classifiers the KNN classifier produces the lowest error rate, followed by SVM and then MLP. In terms of the gain with respect to error rate, the KNN classifier reduced the error rate by 26.67% when compared with MLP, and by 4.35% when compared with SVM.

4.2. Classification Accuracy
The next performance metric used to evaluate the proposed classification models is accuracy. Figure 3 shows the results obtained by all the proposed chart classifier systems.

[Figure 3: Classification Accuracy. Bar chart of accuracy (%) for the MLP, KNN, SVM and Zhou classifiers.]

The results with regard to classification accuracy again show that the KNN classifier produces improved classification results when compared with MLP and SVM while using texture features. The accuracy obtained by the KNN classifier (78.06%) is higher than that of MLP (69.68%) and SVM (76.77%), which would make a considerable difference when used in a chart recognition system.

4.3. Classification Speed
The classification time of a model is calculated as the sum of training and testing time. The results obtained with respect to classification time are shown in Table 3.

Table 3: Classification Speed (Seconds)

    Classifier    Time Taken
    MLP           8.38
    KNN           0.26
    SVM           0.31
    Zhou          0.5

In line with the previous results, the execution time of the KNN classifier based system is lower when compared to MLP and SVM. Moreover, the experimental results indicate that the usage of the MLP, KNN and SVM algorithms showed significant improvement when compared with the base model (Zhou model). Thus, from the various results, it can be understood that KNN classifiers using texture features produce the best PDF chart classification results.

5. CONCLUSION
Research on chart recognition is a relatively young field, and this paper analyzes the use of texture features with three frequently used classifiers.
While all three classifiers produce high accuracy and a low error rate, the KNN classifier shows the most promising results. In future, more features with respect to shape and text are to be considered, and methods for ensemble classification in chart classification are also to be probed.

6. REFERENCES
[1] Caylak E. (2010) The studies about phonological deficit theory in children with developmental dyslexia: Review, Am. J. Neurosci., Vol. 1, Pp. 1-12.
[2] Chowdhury S.P., Mandal S., Das A.K. and Chanda B. (2007) Segmentation of Text and Graphics from Document Images, Ninth International Conference on Document Analysis and Recognition, ICDAR 2007, Pp. 619-623.
[3] Conker R.S. (1988) Dual Plane Variation of the Hough Transform for Detecting Non-concentric Circles of Different Radii, CVGIP, Vol. 43, Pp. 115-132.
[4] Cortes C. and Vapnik V. (1995) Support Vector Networks, Machine Learning, Vol. 20, Pp. 273-297.
[5] Dori D. (1995) Vector-Based Arc Segmentation in the Machine Drawing Understanding System Environment, IEEE Transactions on PAMI, Vol. 17, No. 11, Pp. 1057-1068.
[6] Futrelle R.P., Kakadiaris I.A., Alexander J., Carriero C.M., Nikolakis N. and Futrelle J.M. (1992) Understanding diagrams in technical documents, IEEE Computer, Vol. 25, Issue 7, Pp. 75-78.
[7] Futrelle R.P., Shao M., Cieslik C. and Grimes A.E. (2003) Extraction, layout analysis and classification of diagrams in PDF documents, Intl. Conf. Document Analysis & Recognition, Edinburgh, Scotland, Pp. 1007-1014.
[8] Haralick R.M., Shanmugam K. and Dinstein I. (1973) Textural features for image classification, IEEE Transactions on Systems, Man and Cybernetics, Vol. SMC-3, No. 6, Pp. 610-621.
[9] Inokuchi A., Washio T. and Motoda H. (2000) An Apriori-based algorithm for mining frequent substructures from graph data, Proceedings of the 4th PKDD, Pp. 13-23.
[10] Islam R., Saha R.S. and Hossain A.R. (2009) Automatic Reading from Bangla PDF Document Using Rule Based Concatenative Synthesis, International Conference on Signal Processing Systems, IEEE Computer Society, Pp. 51-55.
[11] Karthikeyan V. and Nagarajan S. (2011) Scientific Chart Image Property Identification using Connected Component Labeling in PDF document, 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, Vol. 4, Pp. 09-1.
[12] Kramer S. and Raedt L.D. (2001) Feature construction with version spaces for biochemical application, Proceedings of the 18th ICML Conference.
[13] Martinez-Alvarez R.P., Costas-Rodriguez S., Gonzalez-Castaño F.J. and Gil-Castiñeira F. (2010) Automated Document Conversion System for Simple Multimedia Platforms, 7th IEEE Consumer Communications and Networking Conference (CCNC), Pp. 1-2.
[14] Omaima N.A. (2010) Improving the performance of backpropagation neural network algorithm for image compression/decompression system, J. Comput. Sci., Vol. 6, Pp. 1347-1354.
[15] Rosin P.L. and West G.A. (1989) Segmentation of Edges into Lines and Arcs, Image and Vision Computing, Vol. 7, No. 2, Pp. 109-114.
[16] Shao M. and Futrelle R.P. (2006) Recognition and Classification of Figures in PDF Documents, W. Liu and J. Lladós (Eds.): Selected papers from Workshop on Graphics Recognition, GREC 2005, LNCS 3926, Springer, Pp. 231-242.
[17] Smach F., Atri M., Miteran J. and Abid M. (2005) Design of a Neural Networks Classifier for Face Detection, World Academy of Science, Engineering and Technology, Vol. 11, Pp. 13-17.
[18] Song J., Su F., Chen J., Tai C.L. and Cai S. (2000) Line net global vectorization: an algorithm and its performance analysis, IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, Pp. 383-388.
[19] Yokokura N. and Watanabe T. (1997) Layout-Based Approach for extracting constructive elements of bar-charts, GREC'97, Pp. 163-174.
[20] Zhou Y. and Tan C.L. (2001a) Hough-based Model for Recognizing Bar Charts in Document Images, SPIE conference on Document image and retrieval, Vol. 4307, Pp. 333-340.
[21] Zhou Y. and Tan C.L. (2001b) Learning-based scientific chart recognition, 4th International Workshop on Graphics Recognition, GREC2001, Pp. 48-49.