Reading LCD/LED Displays with a Camera Cell Phone

Huiying Shen and James Coughlan
Smith-Kettlewell Eye Research Institute
San Francisco, CA 94115
{hshen, coughlan}@ski.org

Abstract

Being able to read LCD/LED displays would be a very important step towards greater independence for persons who are blind or have low vision. A fast graphical model based algorithm is proposed for reading 7-segment digits in LCD/LED displays. The algorithm is implemented for Symbian camera cell phones in Symbian C++. The software reads one display in about 2 seconds by a push of a button on the cell phone (Nokia 6681, 220 MHz ARM CPU).

1. Introduction

Electronic appliances with LCD/LED displays have become ubiquitous in our daily lives. While they offer many conveniences to sighted people, blind people and those with low vision have difficulty using them. One way to alleviate this problem is to develop a device that will read the display aloud for the targeted users.

One candidate for implementing such a device is the camera cell phone, or "smart phone". So-called smart phones are actually small computers with formidable computational power. For example, the Nokia 6681 has a 220 MHz ARM CPU and 22 MB RAM. It also has a 1.3M pixel camera. Compared to the processing power afforded by typical desktop computers used in computer vision, however, the smart phone has substantially less processing power. In our experience, integer-based calculations are over an order of magnitude slower on a cell phone than on a typical desktop computer. Moreover, cell phones do not have a floating point processing unit (FPU) but use a software-simulated FPU instead to do floating point calculations, which are slower still. Thus, a computer vision algorithm implemented on a cell phone must work within significant computational constraints in order to be practical.
We address these constraints by choosing an application that is less computationally demanding than typical state-of-the-art computer vision applications designed to run on non-embedded systems: our domain is restricted to close-up images of LCD/LED numeric displays, with only modest amounts of clutter that is typically confined to areas in the image a small distance away from the LED/LCD characters. We are developing a software application for Symbian cell phones, e.g. Nokia 7610, Nokia 6681/6682, to read seven-segment LCD displays. The user will push the OK button to take a picture, and the application will read out the digits on the display in digitized or synthetic speech.

2. Choice of Platform

Using the cell phone as a platform for this application offers many important advantages. The first is that it is inexpensive and most people already have one; no additional hardware needs to be purchased. This is particularly important since many visually impaired people have limited financial resources (unemployment among the blind is estimated at 70% [8]). The camera cell phone is also portable and becoming nearly ubiquitous; it is multi-purpose and doesn't burden the user with the need to carry an additional device. Another advantage of the cell phone is that it is a mainstream consumer product which raises none of the cosmetic concerns that might arise with other assistive technology requiring custom hardware [9]. Our past experience with blind people shows that they can hold a cell phone camera roughly horizontal and still enough to avoid motion blur, so that satisfactory images can be taken without the need for a tripod or other mounting.

We have chosen to use cell phones using the Symbian operating system for several reasons. First, Symbian cell phones (most produced by Nokia) have the biggest market share. Second, the Symbian operating system and C++ compiler are open and well documented, so that anyone can develop software for Symbian OS. In the future we plan to allow open access to our source code, which will allow other researchers and developers to modify or improve our software.
Finally, the camera API is an integrated part of the OS, which allows straightforward control of the image acquisition process.

We note that the cell phone platform allows us to bypass the need for manufacturing and distributing a physical product altogether (which is necessary even for custom hardware assembled using off-the-shelf components). Our final product will ultimately be an executable file that can be downloaded for free from our website and installed on any Symbian camera phone.

3. Related Work

We are aware of no published work specifically tackling the problem of reading images of LCD/LED displays, although this function has been proposed for a visual-to-auditory sensory substitution device called The vOICe [10], and a commercial product to perform this task is under development at Blindsight [1]. A large body of work addresses the more general problem of detecting and reading printed text, but so far this problem is considered solved only in the domain of OCR (optical character recognition). This domain is limited to the analysis of high-resolution, high-contrast images of printed text with little background clutter. Recently we have developed a camera cell phone-based system to help blind/low vision users navigate indoor environments [4], but this system requires the use of special machine-readable barcodes.

The broader challenge of detecting and reading text in highly cluttered scenes, such as indoor or outdoor scenes with informational signs, is much more difficult and is a topic of ongoing research. We draw on a common algorithmic framework used in this field of research, in which bottom-up processes are used to group text features into candidate text regions using features such as edges, color or texture [5,6,7,14], in some cases using a filter cascade learned from a manually segmented image database [2].

Our approach combines a bottom-up search for likely digit features, based on grouping sets of simple, rapidly detected features, with a graphical model framework that allows us to group the candidate features into figure (i.e. target digits) and ground (clutter). This framework is based on graphical models that are data-driven in that their structure and connectivity are determined by the set of candidate text features detected in each image. Such a model provides a way of pruning out false candidates using the context of nearby candidates.
Besides providing a natural framework for modeling the role of context in segmentation, another benefit of the graphical model framework is the ability to learn the model parameters automatically from labeled data (though we have not done this in our preliminary experiments).

Recent work related to ours also uses a graphical model framework for text segmentation in documents [18] and in natural scenes [17]. Unlike our approach, these works require either images with little clutter or colored text to initiate the segmentation. By contrast, we have designed our algorithm to process cluttered grayscale images without relying on color cues, since digits come in a variety of colors (black for LCDs and green, blue or red for LEDs).

4. Algorithm

An example of a picture of an LCD display is shown in Fig. 1. The display has low contrast, and the LCD digits are surrounded by clutter such as the display case and controls. Our goal is to construct an algorithm to find and read the group of 7-segment digits in the image.

Figure 1: An electronic current/voltage meter.

It can be seen from Fig. 1 that 1) all the digits are of similar height (h) and width (w), 2) digits are horizontally next to each other and 3) neighboring digits are approximately at the same level. One can also see that for each digit, the ratio w/h is a number around 0.5. Our algorithm will exploit these observations.

4.1. Feature Extraction and Building

Compared to today's powerful desktop computers, a cell phone has very limited computational resources. Complex feature extraction algorithms and those using extensive floating point computations must be avoided. Therefore, we will only extract simple features, and build up needed features hierarchically.

The basic features we are extracting from the image are horizontal and vertical edge pixels. Each has two polarities: from light to dark, and from dark to light. Fig. 2 shows horizontal edge pixels of two polarities: green pixels are edge transitions from light to dark (traversing the image downwards), and the blue are ones from dark to light.
The edge pixels are determined by finding local maxima and minima in the horizontal and vertical derivatives of the image intensity.

Figure 2: Horizontal edge pixels of two polarities: green for edges from light to dark, going downwards, and blue for ones from dark to light.
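
The edge-pixel step can be illustrated with a minimal integer-only sketch for a single image column. This is not the implementation used on the phone: the names `Edge` and `HorizontalEdgesInColumn` and the contrast threshold `min_step` are assumptions introduced for the example, and a real detector would apply the same idea to every column (and, transposed, to every row for vertical edges).

```cpp
#include <cstddef>
#include <vector>

// Polarity of a horizontal edge when traversing the image downwards.
enum class Edge { None, LightToDark, DarkToLight };

// Marks horizontal edge pixels along one image column (top to bottom).
// `column` holds grayscale intensities; `min_step` is a hypothetical
// contrast threshold that suppresses weak extrema caused by noise.
std::vector<Edge> HorizontalEdgesInColumn(const std::vector<int>& column,
                                          int min_step) {
    const std::size_t n = column.size();
    std::vector<Edge> edges(n, Edge::None);
    if (n < 4) return edges;  // the derivative has no interior sample

    // Vertical derivative d[y] = I(y + 1) - I(y), using integers only.
    std::vector<int> d(n - 1);
    for (std::size_t y = 0; y + 1 < n; ++y) d[y] = column[y + 1] - column[y];

    for (std::size_t y = 1; y + 1 < d.size(); ++y) {
        if (d[y] <= -min_step && d[y] <= d[y - 1] && d[y] < d[y + 1]) {
            // Local minimum of the derivative: intensity drops.
            edges[y] = Edge::LightToDark;
        } else if (d[y] >= min_step && d[y] >= d[y - 1] && d[y] > d[y + 1]) {
            // Local maximum of the derivative: intensity rises.
            edges[y] = Edge::DarkToLight;
        }
    }
    return edges;
}
```

For a column such as {200, 200, 200, 50, 50, 50, 200, 200, 200} with `min_step` set to 30, the sketch marks a light-to-dark edge at index 2 and a dark-to-light edge at index 5, which is the pattern produced by a dark LCD segment on a light background.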
When two edge pixels of opposite polarities are next to each other, we construct an edge pair pixel. In Fig. 2, when there is a green pixel right above a blue pixel, one can find a horizontal edge pair pixel, shown in yellow in Fig. 3.

Figure 3: Horizontal edge pair pixels: when two edge pixels of opposite polarities are next to each other, an edge pair pixel is constructed, located between them.

We can group horizontal edge pair pixels into horizontal strokes. Similarly, we can find vertical strokes. Fig. 4 shows both horizontal strokes (yellow) and vertical ones (red). Note that long strokes are not shown in Fig. 4, as they are too large for the scale of digits we are looking for and are eliminated from further consideration.

Figure 4: Horizontal (yellow) and vertical (red) strokes.

When vertical and horizontal strokes are sufficiently close, we can construct stroke clusters, as shown in Fig. 5. These stroke clusters serve as candidates for 7-segment digits.

Figure 5: Stroke clusters: when vertical and horizontal strokes are close to each other, stroke clusters are constructed.

4.2. Figure-Ground Segmentation

While simple clustering gives good segmentation results in many cases, there are still false positives that need to be eliminated (as well as some false negatives to be "filled in"). We use a figure-ground segmentation algorithm to eliminate the false positives from the clustering results, building on our previous work on detecting pedestrian crosswalks [3]. This approach was inspired by work on clustering with graphical models [11], normalized cut-based segmentation [12] and object-specific figure-ground segmentation [16]. In this study, a data-driven graphical model is constructed for each image, and belief propagation is used for figure-ground segmentation of stroke clusters. This technique may be overly complex for the images shown in this paper, but we anticipate that it will be useful for noisier images taken by blind users, and it will be straightforward to extend to alphanumeric characters in the future.
Each stroke cluster, represented by its bounding rectangle $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$, defines a node in the data-driven graph. Two nodes interact with each other when they are close enough. The goal of the figure-ground process is to assign "figure" labels to the nodes that belong to the target (LED/LCD digits) and "ground" labels to the other nodes.

4.3. Belief Propagation for Fixed Point Computation

Most embedded systems, including handheld computers and smart cell phones, do not have a floating point processing unit (FPU). Symbian cell phones are no exception. Symbian OS does have a software-simulated FPU, but this is one to two orders of magnitude slower than integer computation.

Traditional belief propagation (BP) algorithms are computationally intensive, and typically require floating point computation. In this study, we perform max-product BP [15] in the log domain so that all message updates can be performed with addition and subtraction. Further, the messages can be approximated as integers by a suitable rescaling factor, so that only integer arithmetic is required.

The max-product message update equation is expressed as follows:

$$ m_{ij}(x_j) = c \max_{x_i} \Big\{ \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \Big\} $$

where $m_{ij}(x_j)$ is the message from node $i$ to node $j$ about state $x_j$ of node $j$. $\psi_{ij}(x_i, x_j)$ is the compatibility function between state $x_i$ of node $i$ and state $x_j$ of node $j$. $\psi_i(x_i)$ is the unary potential of node $i$ for state $x_i$. $N(i)$ is the set of nodes neighboring (i.e. directly connected to) node $i$, and $N(i) \setminus j$ denotes the set of nodes neighboring $i$ except for $j$. $c$ is an arbitrary normalization constant.

Taking the log of both sides of the equation, we have:

$$ L_{ij}(x_j) = \max_{x_i} \Big\{ E_{ij}(x_i, x_j) + E_i(x_i) + \sum_{k \in N(i) \setminus j} L_{ki}(x_i) \Big\} + z $$

where $L_{ij}(x_j) = \log(m_{ij}(x_j))$, $E_{ij}(x_i, x_j) = \log(\psi_{ij}(x_i, x_j))$, $E_i(x_i) = \log(\psi_i(x_i))$, and $z = \log(c)$. $z$ is chosen such that $L_{ij}(x_j)$ will not over- or underflow.

In our figure-ground segmentation application, each node has only two possible states: $x_i = 0$ for the ground state and $x_i = 1$ for the figure state. One can see from the equation above that only addition/subtraction is needed for message updating. For C++, which we choose to use on the Symbian cell phone, we can perform the addition/subtraction using only integer calculations and no floating point. This allows the algorithm to run fast enough to be practical on the cell phone.

4.4. Unary Energy

The unary energy $E_i(x_i)$ represents how likely node $i$ is to be at state $x_i$. Without loss of generality, we set $E_i(x_i = 0) = 0$ for all nodes in the graph, since only the difference between $E_i(x_i = 0)$ and $E_i(x_i = 1)$ matters. As stated previously, each stroke cluster is represented by a rectangle $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$, and its width and height are $w = x_{\max} - x_{\min}$ and $h = y_{\max} - y_{\min}$, respectively. For the figure state, $E_i(x_i = 1)$ represents how likely a stroke cluster is a 7-segment digit by looking at the cluster itself. We use the width/height ratio ($R_{wh}$) to determine this value:

$$ E_i(x_i = 1) = \begin{cases} 0 & \text{when } R_{wh} > 0.3 \text{ and } R_{wh} < 0.6, \\ 0.5 & \text{when } R_{wh} > 0.6 \text{ and } R_{wh} < 1.0, \\ 2.0 & \text{otherwise.} \end{cases} $$

4.5. Binary Energy

The binary energy $E_{ij}(x_i, x_j)$ represents the compatibility of node $i$ having state $x_i$ and node $j$ having state $x_j$. Since $E_{ij}(x_i = 0, x_j = 0)$, the ground-ground energy, and $E_{ij}(x_i = 0, x_j = 1)$, the ground-figure energy, are difficult to learn, we will set them to the same constant, $E_b$ (say, 1.5) for all the nodes. $E_{ij}(x_i = 1, x_j = 1)$ represents how likely it is that nodes $i$ and $j$ are both figure.
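
The integer-only computations of Sections 4.3 and 4.4 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the code that runs on the phone: the fixed-point scale `kScale`, the names `Message`, `Pairwise`, `UnaryFigureEnergy` and `UpdateMessage`, and the particular choice of z (shifting each message so that its ground entry is zero) are all introduced for the example. The update mirrors the log-domain equation as written, so the sign convention of the energies is left to the caller.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Hypothetical fixed-point scale: an energy E is stored as round(E * kScale).
constexpr int kScale = 100;

// Log-domain message over the two states {0: ground, 1: figure}.
using Message = std::array<int, 2>;
// Pairwise energies indexed as pairwise[x_i][x_j].
using Pairwise = std::array<std::array<int, 2>, 2>;

// Scaled unary term for the figure state, from the width/height thresholds
// of Section 4.4. The ratio tests are cross-multiplied so that no division
// or floating point is needed (w/h > 0.3 becomes 10*w > 3*h). Boundaries
// follow the text literally, so a ratio of exactly 0.6 falls to "otherwise".
int UnaryFigureEnergy(int w, int h) {
    if (w <= 0 || h <= 0) return 2 * kScale;
    if (10 * w > 3 * h && 10 * w < 6 * h) return 0;
    if (10 * w > 6 * h && w < h) return kScale / 2;
    return 2 * kScale;
}

// One log-domain max-product update from node i to node j:
// L_ij(x_j) = max over x_i of { E_ij(x_i, x_j) + E_i(x_i) + sum_k L_ki(x_i) } + z.
// `incoming` holds the messages L_ki from all neighbors k of i except j.
Message UpdateMessage(const Pairwise& pairwise, const Message& unary,
                      const std::vector<Message>& incoming) {
    // Unary term plus the sum of the incoming messages, per state of node i.
    Message base = unary;
    for (const Message& m : incoming) {
        base[0] += m[0];
        base[1] += m[1];
    }
    Message out;
    for (int xj = 0; xj < 2; ++xj) {
        out[xj] = std::max(pairwise[0][xj] + base[0],
                           pairwise[1][xj] + base[1]);
    }
    // Assumed choice of z: subtract the ground entry so values stay bounded.
    out[1] -= out[0];
    out[0] = 0;
    return out;
}
```

Only the difference between the two entries of a message affects the final labeling, which is why shifting both entries by the same amount is a harmless way to keep the integers small over many iterations.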
The figure-figure energy is computed as

$$ E_{ij}(x_i = 1, x_j = 1) = c_x \Delta x + c_y \Delta y + c_h \Delta h + c_w \Delta w $$

where

$$ \Delta x = \min\big( |x_{\min}^{i} - x_{\max}^{j}|,\ |x_{\max}^{i} - x_{\min}^{j}| \big), $$
$$ \Delta y = \min\big( |y_{\min}^{i} - y_{\min}^{j}|,\ |y_{\max}^{i} - y_{\max}^{j}| \big), $$
$$ \Delta h = \min\big( |h_i - h_j|,\ |h_i - 2h_j|,\ |2h_i - h_j| \big), $$
$$ \Delta w = |h_i - h_j|. $$

The $c$'s are coefficients to be determined by experience and/or statistical learning. There is a cutoff value for $E_{ij}(x_i = 1, x_j = 1)$: when it is greater than $E_b$, it is set to $E_b$. In other words, when nodes $i$ and $j$ can't send positive messages to help each other be classified as figure, they don't say anything negative either.

4.6. Read the Digits

After stroke clusters are identified as figure, they are mapped to the 7-segment template, see Fig. 6.

Figure 6: Seven-segment digit template. The numbers in the image indicate the ordering of the segments.

A mapping result is a series of seven 0's and 1's, with 1 indicating that the stroke exists and 0 indicating that the stroke is missing. For example, a mapping result of 1110101 indicates that strokes 4 and 6 are missing, which consequently means the digit is 3. 1111011 means the digit is a 6. To determine each digit, each string of 0's and 1's is matched to the digit with the most similar sequence.

Sometimes a segment can be missing (i.e. false negative). In this case the cluster is then mapped to the closest digit. For example, the cluster on top of the digit 3 in Fig. 5 is missing segment 1, and the mapping result will be 0110101. Still it is best mapped to digit 3.
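
The matching step can be sketched as a nearest-template search. Two assumptions are made here and are not taken from the paper. First, "most similar" is interpreted as smallest Hamming distance. Second, since Fig. 6 is not reproduced, the segment ordering is a guess chosen to agree with the three examples above: segments 1 to 3 are the top, middle and bottom horizontal strokes, and segments 4 to 7 are the upper-left, upper-right, lower-left and lower-right vertical strokes. The names `kTemplates` and `ReadDigit` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical templates for digits 0-9 under the assumed segment ordering
// (top, middle, bottom, upper-left, upper-right, lower-left, lower-right).
constexpr const char* kTemplates[10] = {
    "1011111",  // 0
    "0000101",  // 1
    "1110110",  // 2
    "1110101",  // 3
    "0101101",  // 4
    "1111001",  // 5
    "1111011",  // 6
    "1000101",  // 7
    "1111111",  // 8
    "1111101",  // 9
};

// Returns the digit whose template differs from `observed` in the fewest
// positions, or -1 if `observed` is not a 7-character string.
// Ties are resolved in favor of the smaller digit.
int ReadDigit(const std::string& observed) {
    if (observed.size() != 7) return -1;
    int best_digit = -1;
    int best_distance = 8;  // larger than any possible Hamming distance
    for (int digit = 0; digit < 10; ++digit) {
        int distance = 0;
        for (std::size_t s = 0; s < 7; ++s) {
            if (observed[s] != kTemplates[digit][s]) ++distance;
        }
        if (distance < best_distance) {
            best_distance = distance;
            best_digit = digit;
        }
    }
    return best_digit;
}
```

With these templates, `ReadDigit("0110101")` returns 3, matching the example of the digit 3 with segment 1 missing. A practical reader would probably also reject clusters whose best distance exceeds some threshold, so that clutter is not forced onto a digit.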
5. Results

The algorithm is implemented and installed on a Nokia 6681 cell phone. The executable .sis file (compiled on a desktop computer) is only about 73 KB, which means that it leaves plenty of space on the cell phone's flash memory for other applications and data.

After the application is launched, it is in video preview mode: the screen shows what the camera is capturing. (The display is used for debugging purposes but obviously may not be useful for a low vision or blind user.) When the user pushes the OK button, the software will take a picture, run the display reader algorithm, and read aloud the numbers on the screen. (This is currently done using pre-recorded .wav files for each digit, but a text-to-speech system suitable for the Symbian OS will be used in the future.) The whole process takes approximately 2 seconds.

We show several results in Fig. 7. Note that the displays are only roughly horizontal in the images. There are few false positives, and those that occur (as in the last image in Fig. 7) are rejected by the digit-reading algorithm.

Figure 7: Experimental results for LCD displays. Stroke clusters assigned to figure are shown in green and ground in blue. The false positive at the bottom of the last image is rejected by the algorithm for reading individual digits.

We also show a result for an LED display in Fig. 8. In order to read this display, the image contrast was manually inverted so that the digits became dark on a light background, the same as for LCD digits. In the future we will search for digits with both image polarities so that both types of display are accommodated.

Figure 8: Experimental result for LED display. Left: original image. Right: results (same convention as in previous figure).

6. Summary and Discussion

Being able to read LCD/LED displays would be a very important step to help blind/low vision persons gain more independence. This paper presents an algorithm to perform this task, implemented on a cell phone. It reads 7-segment LCD/LED digits in about 2 seconds by the push of a button on the phone.

The algorithm extracts only very simple features (edges in four orientations) from the image, and builds up complex features hierarchically: edge pairs, vertical and horizontal strokes, and stroke clusters. A data-driven graph is constructed and a belief propagation (BP) algorithm is used to classify stroke clusters as figure or ground. The stroke clusters labeled as figure are read by matching them to digit templates (0 through 9).

Future work will include thorough testing of the algorithm by blind and visually impaired users, who will furnish a dataset of display images that will be useful for improving and tuning the algorithm. We also are in the process of extending the figure-ground framework to handle alphanumeric displays, as well as to detect text signs in natural scenes, such as street names and addresses.

Acknowledgments

We would like to thank John Brabyn for many helpful discussions. The authors were supported by the National Institute on Disability and Rehabilitation Research (grant no. H133G030080), the National Science Foundation (grant no. IIS0415310) and the National Eye Institute (grant no. EY015187-01A2).

References

[1] http://www.blindsight.com
[2] X. Chen and A. L. Yuille. "Detecting and Reading Text in Natural Scenes." CVPR 2004.
[3] J. Coughlan and H. Shen. "A Fast Algorithm for Finding Crosswalks using Figure-Ground Segmentation." 2nd Workshop on Applications of Computer Vision, in conjunction with ECCV 2006. Graz, Austria. May 2006.
[4] J. Coughlan, R. Manduchi and H. Shen. "Cell Phone-based Wayfinding for the Visually Impaired." 1st International Workshop on Mobile Vision, in conjunction with ECCV 2006. Graz, Austria. May 2006.
[5] J. Gao and J. Yang. "An Adaptive Algorithm for Text Detection from Natural Scenes." CVPR 2001.
[6] A.K. Jain and B. Tu. "Automatic Text Localization in Images and Video Frames." Pattern Recognition. 31(12), pp 2055-2076. 1998.
[7] H. Li, D. Doermann and O. Kia. "Automatic text detection and tracking in digital videos." IEEE Transactions on Image Processing, 9(1):147-156, January 2000.
[8] The National Federation for the Blind. "What is the National Federation of the Blind?" http://www.nfb.org/whatis.htm
[9] M. J. Scherer. "Living in the State of Stuck: How Assistive Technology Impacts the Lives of People With Disabilities." Brookline Books. 4th edition. 2005.
[10] http://www.seeingwithsound.com/ocr.htm
[11] N. Shental, A. Zomet, T. Hertz and Y. Weiss. "Pairwise Clustering and Graphical Models." NIPS 2003.
[12] J. Shi and J. Malik. "Normalized Cuts and Image Segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
[13] http://www.seeingwithsound.com/voice.htm
[14] V. Wu, R. Manmatha, and E. M. Riseman. "Finding Text In Images." Proc. of the 2nd intl. conf. on Digital Libraries. Philadelphia, PA, pages 1-10, July 1997.
[15] J.S. Yedidia, W.T. Freeman, Y. Weiss. "Bethe Free Energies, Kikuchi Approximations, and Belief Propagation Algorithms." 2001. MERL Cambridge Research Technical Report TR 2001-16.
[16] S. X. Yu and J. Shi. "Object-Specific Figure-Ground Segregation." CVPR 2003.
[17] D.Q. Zhang and S.F. Chang. "Learning to Detect Scene Text Using a Higher-Order MRF with Belief Propagation." CVPR 04.
[18] Y. Zheng, H. Li and D. Doermann. "Text Identification in Noisy Document Images Using Markov Random Field." Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003).