Recognition of Handwritten Textual Annotations using Tesseract Open Source OCR Engine for information Just In Time (iJIT)

Sandip Rakshit 1, Subhadip Basu 2, Hisashi Ikeda 3

1 Techno India College of Technology, Kolkata, India
2 Computer Science and Engineering Department, Jadavpur University, India
3 Intelligent Media Systems Department, Central Research Laboratory, Hitachi Limited, Japan

Corresponding author. E-mail: subhadip@ieee.org

Abstract

The objective of the current work is to develop an Optical Character Recognition (OCR) engine for the information Just In Time (iJIT) system that can be used for recognition of handwritten textual annotations of lower case Roman script. The Tesseract open source OCR engine, released under Apache License 2.0, is used to develop user-specific handwriting recognition models, viz., the language sets, for the said system, where each user is identified by a unique identification tag associated with the digital pen. To generate the language set for any user, Tesseract is trained with labeled handwritten data samples of isolated and free-flow texts of Roman script, collected exclusively from that user. The designed system is tested on five different language sets with free-flow handwritten annotations as test samples. The system could successfully segment and subsequently recognize 87.92%, 81.53%, 92.88%, 86.75% and 90.80% of the handwritten characters in the test samples of five different users.

1. Introduction

In online character recognition, the trajectories of pen tip movements are recorded and analyzed to identify the linguistic information expressed. With the latest technological advancements in pen input devices, new interfaces are designed to capture the precise pen-trajectory information for subsequent analysis of online handwritten data, while keeping the user comfortable in writing. It is now possible to write on ordinary paper and immediately transmit the handwritten annotations wirelessly to a remote server [1]. With these technological advances, handwritten annotations in digital notebooks may be digitized in no time.

Traditionally, documents containing handwritten information are difficult to archive in digital form. Even with the help of the latest optical scanners, content based indexing techniques and search tools, it is difficult to find digitized versions of document pages based on user queries. Some work has recently been done on content based retrieval of handwritten documents [2-4]. In [2], Bertrand et al. have developed a technique for structural document recognition and recognition of handwritten names. In another work, Matthew et al. [3] developed a stroke feature based technique for retrieval of handwritten Chinese annotations based on typed/handwritten query. Srihari et al. [4] used stroke/shape features for indexing and retrieval of handwritten documents based on writer characteristics, textual content and writer profile. In one of our earlier works [5], a recognition based indexing technique was discussed for real-time retrieval of handwritten annotations based on typed/handwritten query. David Doermann, in his survey [6], highlighted key issues involved in indexing and retrieval of document images.

In any recognition based indexing technique, the overall performance predominantly depends on the accuracy of the underlying recognition engine. Development of a handwritten OCR engine with high recognition accuracy is still an open problem for the research community. A lot of research effort has already been reported [7-9] on different key aspects of handwritten character recognition systems. In this work, we have used Tesseract 2.01 [10], an open source OCR engine under Apache License 2.0, for segmentation and subsequent recognition of handwritten textual annotations of lower case Roman script.

The objective of the current work is to develop an Optical Character Recognition (OCR) engine for the iJIT system that can be used for recognition of handwritten textual annotations of lower case Roman script. Tesseract is used to develop user-specific handwriting recognition models, viz., the language sets, for the said system. Each user of the iJIT system may be identified by a unique identification tag associated with the Anoto digital pen [1]. The Tesseract OCR engine is customized to perform user-specific training on labeled handwriting samples of both isolated and free-flow texts, written using lower case Roman script. The performance is evaluated on both categories of document pages to observe the segmentation and character recognition accuracies.

The following sections give an overview of the existing iJIT system, an overview of the Tesseract OCR engine, and the present experiment on designing an OCR engine for segmentation and recognition of handwritten textual annotations.

2. The iJIT system

Just in time availability of meaningful information is the key to any real-time information retrieval system. The information Just In Time (iJIT) system [5], developed at the Hitachi Central Research Laboratory, keeps track of all the digital documents stored in the iJIT server. Using the proposed system, handwritten annotations made on the printed digital document pages with an Anoto digital pen [1] may be viewed/shared/searched based on typed/handwritten query. Fig. 1 shows a schematic overview of the recognition based query retrieval scheme designed for the iJIT system.

The iJIT system uses ordinary paper, printed with the digitally legible, patent-protected dot-pattern from Anoto [1], for the printout of each digital document through the server. The Anoto pattern consists of numerous nearly invisible, intelligent black dots that can be read by a digital pen. The pattern on each paper is unique, so that each page can be kept separate from the others.

Fig. 1. A schematic architecture of the recognition based query retrieval system.

An Anoto digital pen [1] looks like its normal ballpoint counterpart, but contains an integrated digital camera, an advanced image microprocessor and a wireless communication device. The pen can take around 50 digital snapshots per second, can store up to 50 full A4/letter size pages of handwritten data, and can then securely send the information to the iJIT server through wireless communication or Universal Serial Bus (USB).
Every snapshot contains enough data to determine the exact position of the pen on the paper, the time of the pen-stroke and the unique identification number of the Anoto paper. Each pen also has a unique identification number, so that the iJIT system can distinguish between the handwriting of individual users.

3. Overview of the Tesseract OCR engine

Tesseract is an open source (under Apache License 2.0) offline optical character recognition engine, originally developed at Hewlett Packard from 1984 to 1994. Tesseract is now partially funded by Google [10] and released under the Apache license, version 2.0. The latest version, Tesseract 2.03, was released in April 2008. In the current work, we have used Tesseract version 2.01, released in August 2007.

Like any standard OCR engine, Tesseract is built on top of key functional modules such as a line and word finder, a word recognizer, a static character classifier, a linguistic analyzer and an adaptive classifier. However, it does not support document layout analysis, output formatting or a graphical user interface. Currently, Tesseract can recognize printed text written in English, Spanish, French, Italian, Dutch, German and various other languages.

To train Tesseract for the English language, 8 data files are required in the tessdata sub-directory. The 8 files used for English are to be generated as follows:

    tessdata/eng.freq-dawg
    tessdata/eng.word-dawg
    tessdata/eng.user-words
    tessdata/eng.inttemp
    tessdata/eng.normproto
    tessdata/eng.pffmtable
    tessdata/eng.unicharset
    tessdata/eng.DangAmbigs

4. The present work

In the current work, Tesseract 2.01 is used for developing user-specific handwriting recognition models, viz., the language sets, for the iJIT system. To generate the language set for each user, Tesseract is trained with labeled handwritten data samples of isolated and free-flow texts of lower case Roman script. Key functional modules of the developed system are discussed in the following sub-sections.

4.1. Collection of the dataset

To prepare the dataset for the current experiment, digitized handwritten samples of lower case Roman script were collected from five different users. Six handwritten document pages, consisting of isolated characters and free-flow words, were collected from each user of the designed system. The pages are categorized into two datasets. Dataset-1 consists of four pages of isolated handwritten lower case Roman characters, and Dataset-2 consists of two pages of free-flow handwritten words, copied from technical articles.

For each user, three pages from Dataset-1 and one page from Dataset-2 were used for training the Tesseract OCR engine. The remaining two pages, one from each dataset, constitute the test set for the current experiment. The overall distribution of the character samples in the training and the test sets for the five users is shown in Table 1.

Table 1. Composition of the training and test set character samples for different users

    User  Set        Dataset-1  Dataset-2  Total
    1     Train set  1185       659        1844
    1     Test set   442        691        1133
    2     Train set  1006       529        1535
    2     Test set   468        718        1186
    3     Train set  992        884        1876
    3     Test set   546        1004       1550
    4     Train set  619        578        1197
    4     Test set   260        751        1011
    5     Train set  467        255        722
    5     Test set   234        277        511

4.2. Labeling training data

For labeling the training samples of each user with Tesseract, we have taken the help of a tool named bbTesseract [13]. To generate the training files for a specific user, we need to prepare the box file for each training image using the following command:

    tesseract fontfile.tif fontfile batch.nochop makebox

The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. Incorrect labels in the training set may be manually corrected using the bbTesseract tool. We then have to rename the box file from fontfile.txt to fontfile.box. Fig. 2 shows a screenshot of the bbTesseract tool.
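
The manual correction step can also be pictured programmatically. The following is a minimal Python sketch, not part of the described system: it assumes each box-file line has the form `<char> <left> <bottom> <right> <top>` (the column order is an assumption, since the text above only says that each line carries a character and its bounding-box coordinates), and the helper names `parse_box_lines`, `relabel` and `format_box_lines` are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One labeled character with its bounding box (assumed column order)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: List[str]) -> List[Box]:
    """Parse box-file lines, skipping blank lines and rejecting malformed ones."""
    boxes = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) < 5:
            raise ValueError(f"malformed box line: {raw!r}")
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def relabel(boxes: List[Box], index: int, new_char: str) -> List[Box]:
    """Return a copy of the list with the label at `index` corrected."""
    fixed = list(boxes)
    fixed[index] = fixed[index]._replace(char=new_char)
    return fixed


def format_box_lines(boxes: List[Box]) -> List[str]:
    """Serialize boxes back to one text line per character."""
    return [f"{b.char} {b.left} {b.bottom} {b.right} {b.top}" for b in boxes]


# Illustrative (made-up) data: the second character was mislabeled as 'l'.
sample = ["a 10 20 30 40", "l 35 20 40 60"]
corrected = format_box_lines(relabel(parse_box_lines(sample), 1, "i"))
```

Keeping the coordinates untouched and changing only the label mirrors what a manual correction in a box-editing tool does: the segmentation is accepted and only the character identity is fixed.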

Fig. 2. A sample screenshot of a segmented training page using the bbTesseract tool.

4.3. Training the data using Tesseract OCR engine

To train a new language set for any user, we first have to put in the effort to get one good box file for a handwritten document page, and then run the rest of the training process, discussed below, to create a new language set. Tesseract is then run again with the newly created language set to label the rest of the box files corresponding to the remaining training images, using the process discussed in Section 4.2.

For each of our training image and box file pairs, we run Tesseract in training mode using the following command:

    tesseract fontfile.tif junk nobatch box.train

The output of this step is fontfile.tr, which contains the features of each character of the training page. The character shape features can be clustered using the mftraining and cntraining programs:

    mftraining fontfile_1.tr fontfile_2.tr...

This will output three data files: inttemp, pffmtable and Microfeat. The second clustering command is:

    cntraining fontfile_1.tr fontfile_2.tr...

This will output the normproto data file. To generate the unicharset data file, the unicharset_extractor program is used as follows:

    unicharset_extractor fontfile_1.box fontfile_2.box...

Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. Each word list is formatted as a UTF-8 text file with one word per line. The corresponding commands are:

    wordlist2dawg frequent_words_list freq-dawg
    wordlist2dawg words_list word-dawg

The third dictionary file is named user-words and is usually empty. The final data file of Tesseract is the DangAmbigs file. This file cannot be used to translate characters from one set to another. The DangAmbigs file may also be empty.

We then have to collect all 8 files, rename them with a lang. prefix, where lang is the 3-letter code for our language, and put them in our tessdata directory. Tesseract can then recognize text in our language set using the command:

    tesseract image.tif output -l lang
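
The order of the training steps above can be summarized in a short Python sketch. It only builds the command lines quoted in this section and does not run them; the function names `training_commands` and `tessdata_names` and the list `DATA_FILES` are hypothetical, and the page base names are placeholders for one user's training images.

```python
from typing import List

# The 8 data files of a language set, as listed in Section 3.
DATA_FILES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def training_commands(pages: List[str]) -> List[List[str]]:
    """Build, without executing, the training commands for one language set.

    `pages` holds base names such as 'fontfile_1'; each is assumed to have a
    matching .tif image and a corrected .box file in the working directory.
    """
    if not pages:
        raise ValueError("at least one training page is required")
    # Step 1: feature extraction, one run per image/box pair (writes <page>.tr).
    cmds = [["tesseract", f"{p}.tif", "junk", "nobatch", "box.train"] for p in pages]
    tr_files = [f"{p}.tr" for p in pages]
    # Step 2: clustering of the character shape features.
    cmds.append(["mftraining", *tr_files])
    cmds.append(["cntraining", *tr_files])
    # Step 3: character set extraction from the box files.
    cmds.append(["unicharset_extractor", *[f"{p}.box" for p in pages]])
    # Step 4: dictionary files coded as DAWGs.
    cmds.append(["wordlist2dawg", "frequent_words_list", "freq-dawg"])
    cmds.append(["wordlist2dawg", "words_list", "word-dawg"])
    return cmds


def tessdata_names(lang: str) -> List[str]:
    """Target paths of the 8 data files after renaming with the lang prefix."""
    return [f"tessdata/{lang}.{name}" for name in DATA_FILES]


for cmd in training_commands(["fontfile_1", "fontfile_2"]):
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where the Tesseract 2.x training tools are installed; building the list first makes it easy to inspect the sequence before anything is executed.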

5. Experimental results

Proc. Int. Conf. on Information Technology and Business Intelligence (2009) 117-125

For conducting the current experiment, five user-specific language sets are generated using the Tesseract open source OCR engine. The training and test patterns of each individual user are spread over two types of datasets, as described in Sec. 4.1. The experiment is focused on testing the segmentation and core recognition accuracy of the Tesseract OCR engine on free-flow handwritten annotations written with a digital pen by different users. The linguistic analysis module of Tesseract, involving the language files freq-dawg, word-dawg, user-words and DangAmbigs, is not utilized in the current experiment.

To evaluate the performance of the present technique, the following expression is used:

$$\text{Recognition accuracy} = \frac{C_t}{C_t + C_m + C_s} \times 100$$

where $C_t$ is the number of character segments producing a true classification result, $C_m$ is the number of misclassified character segments, and $C_s$ is the number of characters Tesseract fails to segment, i.e., those producing under-segmentation. The rejected character/word samples are excluded from the computation of the recognition accuracy of the designed system.

Table 2(a-e) shows an analysis of successful classification (SC), misclassification (Misc), segmentation failure (SF) and rejection (Rej) results on the test samples of the five users. Fig. 3 shows a character-wise distribution of success and failure cases on the overall test dataset.

As observed from the experimentation, a significant proportion of the rejection cases arise from word segmentation failure. This is because Tesseract was originally designed to recognize printed document pages with uniform baselines and character/word spacing. Another source of error is the internal segmentation of some of the characters. More specifically, the character 'i' often gets internally segmented into two parts, leading to a high individual error rate.

[Figure 3: bar chart of Success and Failure counts per test character; x-axis: Label of Test Character (a to z), y-axis: Frequency (0 to 450).]

Fig. 3. Distribution of success and failure cases over the free-flow test pages.

Table 2. Analysis of recognition performance of the developed system (all values in %)

(a) Recognition performance on the User-1 test dataset

          Dataset-1  Dataset-2  Overall
    SC    95.42      83.2       87.92
    Misc  4.1        16.19      11.52
    SF    0.48       0.61       0.56
    Rej   6.10       4.34       5.03

(b) Recognition performance on the User-2 test dataset

          Dataset-1  Dataset-2  Overall
    SC    91.62      76.45      81.53
    Misc  8.38       18.31      15.00
    SF    0.00       5.24       3.47
    Rej   26.07      4.18       12.82

(c) Recognition performance on the User-3 test dataset

          Dataset-1  Dataset-2  Overall
    SC    96.78      90.94      92.88
    Misc  3.22       6.18       5.19
    SF    0.00       2.88       1.93
    Rej   8.97       0.00       3.16

(d) Recognition performance on the User-4 test dataset

          Dataset-1  Dataset-2  Overall
    SC    90.38      85.49      86.75
    Misc  8.85       7.32       7.72
    SF    0.77       7.19       6.03
    Rej   0          0          0

(e) Recognition performance on the User-5 test dataset

          Dataset-1  Dataset-2  Overall
    SC    91.88      89.89      90.80
    Misc  8.12       10.11      9.20
    SF    0          0          0
    Rej   0          0          0

Fig. 4. Some of the successfully segmented and recognized word images.
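
The accuracy measure and the aggregation over users can be illustrated with a small Python sketch. Two assumptions are made here and are not stated in the paper: the accuracy is read as the share of correctly classified segments among correctly classified, misclassified and unsegmented ones (the reading under which the SC, Misc and SF values of Table 2 add up to 100), and the overall figure is taken as the unweighted mean of the five per-user overall SC values. The function names and the counts in the example call are hypothetical.

```python
from typing import Dict


def recognition_accuracy(c_t: int, c_m: int, c_s: int) -> float:
    """Percentage of accepted character segments that are classified correctly.

    c_t: correctly classified segments, c_m: misclassified segments,
    c_s: characters that could not be segmented. Rejected samples are
    assumed to be excluded before these counts are taken.
    """
    total = c_t + c_m + c_s
    if total == 0:
        raise ValueError("no accepted character segments")
    return 100.0 * c_t / total


def mean_accuracy(per_user: Dict[str, float]) -> float:
    """Unweighted mean of per-user accuracies (an assumed aggregation)."""
    if not per_user:
        raise ValueError("no users given")
    return sum(per_user.values()) / len(per_user)


# Overall SC values copied from Table 2(a)-(e).
OVERALL_SC = {
    "user-1": 87.92,
    "user-2": 81.53,
    "user-3": 92.88,
    "user-4": 86.75,
    "user-5": 90.80,
}

# Made-up counts, only to show the call: 88 correct, 10 misclassified, 2 unsegmented.
print(recognition_accuracy(88, 10, 2))
print(round(mean_accuracy(OVERALL_SC), 2))
```

Under these assumptions the unweighted mean of the Table 2 values comes to about 87.98, which may help when relating the per-user tables to a single overall figure; a mean weighted by the number of test characters per user would generally give a different value.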

Fig. 5. Some of the misclassified word images: (a) recognition error in the 3rd character; (b) internal segmentation in the 8th character.

As shown in Table 2(a-e), the overall character-level recognition accuracy of the developed system is around 87.98%. The overall character misclassification rate is observed to be around 9.73%. Segmentation failures in the document pages account for around 2.29% of the error cases. The high segmentation failure rate is due to the over-segmentation of some of the constituent characters like 'i' and 'j', and also to the under-segmentation of cursive words in the document pages. The designed system rejects around 9.24% of the characters in the test dataset. This is mainly due to the presence of multi-skewed handwritten text lines in the test documents. Completely cursive words were also rejected outright in many cases during the experimentation. Some of the sample word images successfully segmented and recognized by Tesseract are shown in Fig. 4. Fig. 5(a-b) shows some of the word images with erroneous segmentation and recognition results.

A major drawback of the current system is its failure to avoid over-segmentation in some of the characters. The system also fails to segment cursive words in many cases, leading to under-segmentation and rejection. The recognition performance of the designed system may be further improved by incorporating more training samples for each user and by including word-level dictionary matching techniques.

Despite these limitations, the designed recognition engine is successfully integrated with the iJIT system for online interpretation of handwritten textual annotations. The word-level recognition time of the OCR engine, as observed on reasonably powered computer hardware, is also found to be satisfactory. In a nutshell, the current work effectively customizes an open source OCR engine for segmentation and recognition of handwritten textual annotations of multiple users within the designed iJIT system.

6. References

[1] www.anoto.com
[2] Bertrand Coüasnon, Jean Camillerapp, Ivan Leplumey, Access by content to handwritten archive documents: generic document recognition method and platform for annotations, IJDAR (2007) 9: 223-242.
[3] Matthew Ma, Chi Zhang and Patrick Wang, Studies of Radical Model for Retrieval of Cursive Chinese Handwritten Annotations, SSPR&SPR 2000, LNCS 1876, pp. 407-416, 2000.
[4] Sargur Srihari, Anantharaman Ganesh, Catalin Tomai, Yong-Chul Shin, and Chen Huang, Information Retrieval System for Handwritten Documents, DAS 2004, LNCS 3163, pp. 298-309, 2004.
[5] S. Basu, K. Konishi, N. Furukawa, H. Ikeda, A novel scheme for retrieval of handwritten textual annotations for information Just In Time (iJIT), proceedings (CD) of IEEE Region 10 Conference (TENCON)-2008.
[6] David Doermann, The Indexing and Retrieval of Document Images: A Survey, Computer Vision and Image Understanding archive, Volume 70, Issue 3.
[7] R.M. Bozinovic and S.N. Srihari, Off-line Cursive Script Word Recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 68-83, 1989.
[8] B. B. Chaudhuri and U. Pal, A Complete Printed Bangla OCR System, Pattern Recognition, vol. 31, No. 5, pp. 531-549, 1998.
[9] S. Basu, C. Chawdhuri, M. Kundu, M. Nasipuri, D. K. Basu, A Two-pass Approach to Pattern Classification, N.R. Pal et al. (Eds.), ICONIP, LNCS 3316, pp. 781-786.
[10] http://code.google.com/p/tesseract-ocr