Boosted Decision Trees for Word Recognition in Handwritten Document Retrieval


1 5 February 2009, Padova, Italy. Boosted Decision Trees for Word Recognition in Handwritten Document Retrieval. Howe, N.R., Rath, T.M. and Manmatha, R., Department of Computer Science, University of Massachusetts. SIGIR 2005, published by ACM, New York. Information Management Research Group (IMS), Department of Information Engineering, University of Padua, Italy.

2 Outline: Introduction to recognition and retrieval of handwritten documents; Classification Algorithm: AdaBoost and Decision Trees; Classification Experiments; Language Models for Retrieval; Conclusions.

3 Introduction. Recognition and retrieval of off-line handwritten documents based upon word classification. Decision trees with normalized pixels as features form the basis for AdaBoost. The main difficulty is the skewed distribution of class frequencies. Experiments are done on the GW20 and GW100 corpora, and retrieval is done using a language model over the recognized words.

4 Introduction. The main goal is to offer access to the world's historical handwritten documents. Handwriting recognition often works on limited vocabularies (e.g. postal addresses); historical documents add complexity due to ink bleeding or dirt on the paper. The approach uses pixels of the normalized word image at multiple scales (image pyramids) as features, and proposes an innovative procedure to create additional training data.

5 The Boosting Approach. Boosting is a classification technique that determines its prediction via the weighted vote of a diverse set of base classifiers, each of which has been trained on a different weighting of the training data. AdaBoost trains successive versions of its base classifier, focusing on hard-to-classify examples. It can use a simple base classifier, but stronger classifiers get better results.

6 AdaBoost in brief. Introduced in 1995 by Freund and Schapire in "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, 1997.

7 Reference: Freund, Y. and Schapire, R. E. "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

8 The algorithm is presented for the binary case.

9 Initially, all example weights are set equally.

10 At each round t, find a weak hypothesis ht appropriate for the distribution Dt.

11 The error εt measures the goodness of the hypothesis.

12 AdaBoost chooses the parameter αt, which measures the importance assigned to ht; αt ≥ 0 if εt ≤ 1/2.

13 Dt is updated by increasing the weight of misclassified examples, so that the algorithm concentrates on hard examples.

14 The final hypothesis H is a weighted majority vote of the T weak hypotheses, where αt is the weight assigned to ht.
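
A minimal sketch of this binary AdaBoost loop in Python (assuming NumPy and scikit-learn, labels in {-1, +1}, and decision stumps standing in for the weak learner; the paper boosts full pixel-grid decision trees instead):

```python
# Minimal binary AdaBoost sketch following the steps annotated above;
# X is an (n_samples, n_features) array, y holds labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=200):
    n = len(y)
    D = np.full(n, 1.0 / n)                  # all weights set equally
    hypotheses, alphas = [], []
    for t in range(T):
        # weak hypothesis h_t trained on the current distribution D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()             # weighted error of h_t
        if eps >= 0.5:                       # weak learner must beat chance
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D *= np.exp(-alpha * y * pred)       # raise weight of misclassified examples
        D /= D.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(X, hypotheses, alphas):
    # final hypothesis H: weighted majority vote of the T weak hypotheses
    votes = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(votes)
```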

15 AdaBoost in brief. Schapire, R. E. and Singer, Y., "Improved boosting algorithms using confidence-rated predictions", Machine Learning, 37(3), 1999, show how AdaBoost can handle weak hypotheses that output real values: for an example x, ht outputs ht(x) ∈ R, whose sign is the predicted label (-1 or +1) and whose magnitude |ht(x)| gives the measure of confidence in the prediction. AdaBoost.M1 is the extension to the multi-class case; it is adequate when the weak learner is strong enough to achieve an accuracy of at least 50%. The extensions AdaBoost.MH and AdaBoost.MR reduce the multi-class problem to a larger binary problem.

16 Choices and Problems. The recognition process uses values sampled directly from the word image at varying resolutions. The choice is to segment word images rather than individual letters: recognizing letters becomes a limiting step, while segmentation of individual word images is easier (it becomes an image classification problem). The remaining problems are the skewed distribution of class frequencies (a Zipfian distribution) and the paucity of training data for most word classes.

17 Classification Algorithm. Handwritten words belonging to a single class have similar (but not identical) ink distributions; the position of individual features within the word shifts from example to example. The pixel representation contains information about word identity that can be amplified by boosting: clearer areas will contain more reliable features, while blurring indicates areas of inconsistency.

18 Figure: composite image of 21 examples of the word "Instructions"; straightforward use of the raw pixels is ineffective.

19 Common framework. Pixels are used as features for word image classification, with each word image mapped into a common pixel grid. Images are scaled and translated so that a horizontal reference line through the word spans from (0,0) to (1,0); resampling each image onto a common grid then produces a common pixel representation. Words vary in length (horizontal and vertical dimensions), and full-resolution grids would lead to astronomic data sizes, hence the pyramid approach.

20 Pyramid Approach. Define a family of standard grids: the base grid Φ0 covers ([0,1], [-0.5, 0.5]) and is broken into 32x32 pixels; refined grids cover the same square region at double resolution (64x64 pixels, and so on). This is like a tree in which each pixel of Φk has 4 children in Φk+1. The standard image usually doesn't cover the full vertical extent of the grid, so the portions above and below the edges of the standard image may be represented using a single default value. Data need only be stored for Φk with resolution up to that of the reference image.

21 Note: this square grid area captures all the detail of interest for most words.
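
A rough sketch of how such a grid pyramid might be built (an assumption about the data layout, not the authors' code): the word image is assumed already scaled so its width fills the grid, skimage handles the resampling, levels above the image's native resolution are not stored, and rows outside the image's vertical extent keep a single default value.

```python
# Sketch: family of standard grids Phi_0 (32x32), Phi_1 (64x64), ...
# `word` is a 2-D grayscale array (floats in [0, 1] assumed).
import numpy as np
from skimage.transform import resize

def grid_pyramid(word, base=32, default=1.0):
    levels, size = [], base
    while not levels or size <= max(word.shape):  # always build Phi_0, then finer levels
        grid = np.full((size, size), default)     # default value outside the image
        # resample to `size` columns, keeping the word's aspect ratio
        rows = min(size, max(1, round(word.shape[0] * size / word.shape[1])))
        img = resize(word, (rows, size), anti_aliasing=True)
        top = (size - rows) // 2                  # place the word inside the square
        grid[top:top + rows, :] = img
        levels.append(grid)
        size *= 2
    return levels
```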

23 Boosting and Decision Trees. Word image recognition has many potential classes; to use AdaBoost, a base classifier with at least 50% accuracy is needed, and decision trees are the foremost option: they are well understood and in practice can achieve arbitrary accuracy on the training data. At each node the training examples are split into 2 sub-groups by comparing the value of a chosen pixel to a chosen threshold. Growth of a tree branch is stopped when the contained subset is dominated by a majority class. (Note: if growth continues until there is a single training example per leaf, 100% training accuracy is reached; such a tree overfits and must be pruned by removing statistically weak branches.)

24 C4.5. C4.5 provides the algorithm for building the decision tree, with some modifications designed to support the grid pyramid data structure. C4.5 builds decision trees from a set of training data using the concept of information entropy. The training data is a set S = {s1, s2, ..., sn} of already classified samples, where each si = (x1, x2, ..., xm) is a vector of feature values xj. The training data is augmented with a vector C = (c1, c2, ..., cn), where ci is the class each sample belongs to. C4.5 uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets.

25 Reference: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
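
The entropy-based split criterion described above can be written down in a few lines; a sketch (not taken from the paper) of entropy and information gain for a single pixel/threshold split, assuming integer class labels in a NumPy array:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of the class distribution of `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels, threshold):
    # entropy reduction from splitting on feature <= threshold vs > threshold
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children
```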

26 C4.5 for the image pyramid. At each node a feature (i.e. a pixel location) and a threshold value must be chosen as the split; an exhaustive search is not possible. Only Φ0 is exhaustively examined, and the location and threshold offering the greatest information gain is retained. The search then proceeds selectively to its children in Φ1, from there to the children of the best of those locations, and so on until the maximum resolution available is reached. The grid level, location and threshold with the highest information gain becomes the decision criterion for the node.
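
How that coarse-to-fine search might look in code; a speculative sketch reusing information_gain() from the previous block, assuming `pyramids` is a list with one grid_pyramid() output per training example (all with the same number of levels) and 8-bit pixel values, hence the example thresholds:

```python
import numpy as np

def best_at(values, labels, thresholds=(64, 128, 192)):
    # best (gain, threshold) pair for one pixel location over all examples
    return max((information_gain(values, labels, t), t) for t in thresholds)

def selective_split_search(pyramids, labels):
    n_levels = len(pyramids[0])
    level0 = np.stack([p[0] for p in pyramids])      # shape (n_examples, 32, 32)
    # level 0 is examined exhaustively
    gains = {(0, r, c): best_at(level0[:, r, c], labels)
             for r in range(level0.shape[1]) for c in range(level0.shape[2])}
    _, r, c = max(gains, key=lambda key: gains[key])
    # descend only into the 4 children of the best location at each finer level
    for k in range(1, n_levels):
        levelk = np.stack([p[k] for p in pyramids])
        for rr in (2 * r, 2 * r + 1):
            for cc in (2 * c, 2 * c + 1):
                gains[(k, rr, cc)] = best_at(levelk[:, rr, cc], labels)
        _, r, c = max((key for key in gains if key[0] == k),
                      key=lambda key: gains[key])
    # the (level, row, col) and threshold with the highest gain become the split
    split = max(gains, key=lambda key: gains[key])
    return split, gains[split][1], gains[split][0]
```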

27 Boosting. Single trees do not generalize well for handwritten word images. (1) The base classifier is generated from the training data in the normal way. (2) AdaBoost raises the weights of misclassified examples, forcing the base classifier to work harder on them. (3) After many rounds of boosting, a weighted vote classifies the training set perfectly and shows good generalization to unseen examples. (4) In practice, after a certain number of rounds (here: 200) the results don't improve significantly.
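
For the boosting loop itself, an off-the-shelf equivalent (a generic multi-class stand-in, not the authors' C4.5/pyramid implementation) would look roughly like this, assuming X_train/y_train hold flattened pixel-grid features and word-class labels:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# `estimator=` is called `base_estimator=` in scikit-learn versions before 1.2
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=8),  # decision tree base learner
    n_estimators=200,                               # rounds of boosting
    algorithm="SAMME",                              # multi-class AdaBoost variant
)
clf.fit(X_train, y_train)              # X_train, y_train assumed to exist
predicted_labels = clf.predict(X_test) # most likely word class per image
```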

28 Supplementary Training Examples. Problem: the paucity of training examples for many classes makes generalization difficult; by Zipf's law there are few examples for most words, and 57% of the words appear only once in the test collection. Solution: generate new training examples for low-frequency classes via stochastic distortion of the available examples, improving overall word classification accuracy.

29 Supplementary Training Examples. Sample from the original image using a grid of points whose positions have been perturbed from a uniform lattice; nearby points should be perturbed by similar amounts. The new image is a distortion of the old one.
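
One plausible way to implement this perturbed-grid sampling (a sketch under the stated assumptions, not the authors' exact procedure): draw a random displacement field, smooth it so nearby points move by similar amounts, and resample the original image along the displaced grid with SciPy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def distort(image, max_shift=2.0, smoothness=8.0, rng=None):
    # Resample `image` on a grid whose points are smoothly perturbed
    # from a uniform lattice, yielding a synthetic training example.
    rng = np.random.default_rng(rng)
    rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]].astype(float)
    # random displacements, smoothed so nearby points shift by similar amounts
    dr = gaussian_filter(rng.uniform(-1, 1, image.shape), smoothness)
    dc = gaussian_filter(rng.uniform(-1, 1, image.shape), smoothness)
    dr *= max_shift / (np.abs(dr).max() + 1e-12)
    dc *= max_shift / (np.abs(dc).max() + 1e-12)
    return map_coordinates(image, [rows + dr, cols + dc], order=1, mode="nearest")
```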

30 Classification Experiments. Test collections: GW20 (previously used) and GW100 (non-overlapping with GW20, written by multiple hands). Pages are manually segmented to extract images of individual words (4856 in GW20 and in GW100), and all images are labeled with their ASCII equivalent. GW20 experiments: 19 pages for training and 1 for testing.

31 Single decision tree: standard C4.5 grown to completion, then pruned.

32 AdaBoost + decision tree as base learner.

33 AdaBoost + decision tree + synthetic data. No experiments were run with AdaBoost and a simple base classifier, because 50% accuracy cannot be achieved.

34 GW100: performance is lower, due to out-of-vocabulary (OOV) words and image quality.

35 Retrieval. Language modeling approach to retrieval (Ref: Ponte, J. and Croft, W.B., A language modeling approach to Information Retrieval, SIGIR 1998). The query likelihood formulation is used, where documents are ranked according to P(Q|D). AdaBoost provides classifications rather than probabilities: only the most likely label for each word image is preserved. One approach is to set word probabilities equal to their frequencies in each recognized document, but many words can be misclassified.
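
Query likelihood ranks documents by P(Q|D) = ∏ P(q|D) over the query terms q; a minimal sketch with Jelinek-Mercer smoothing against a collection model, assuming each recognized document is just a list of word labels (Lemur's exact smoothing may differ):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    # log P(Q|D), smoothing the document model with the collection model
    doc, coll = Counter(doc_terms), Counter(collection_terms)
    n_doc, n_coll = max(len(doc_terms), 1), max(len(collection_terms), 1)
    score = 0.0
    for q in query_terms:
        p = lam * doc[q] / n_doc + (1 - lam) * coll[q] / n_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score
```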

36 Retrieval: Regularization Scheme. A regularization scheme based upon classification rank information. Hypothesis: rank information may be more important than the actual probabilities (the top terms are very important, some are moderately important, and so on). Probabilities are inferred from the rank-ordered output of the AdaBoost classification algorithm: rank the top n classes according to their scores, then associate a probability with each class by fitting a Zipfian distribution to the ranked classes.
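
A sketch of that rank-based regularization (the exact Zipfian fit used by the authors may differ): keep the top-n classes by boosted score and assign each rank a probability proportional to 1/rank.

```python
import numpy as np

def rank_to_probabilities(class_scores, top_n=20, s=1.0):
    # class_scores: 1-D array of AdaBoost scores for one word-image position.
    # Keep the top_n classes by score, give rank r the weight 1 / r**s,
    # and normalize to a probability distribution over those classes.
    order = np.argsort(class_scores)[::-1][:top_n]
    weights = 1.0 / np.arange(1, len(order) + 1) ** s
    weights /= weights.sum()
    return dict(zip(order.tolist(), weights.tolist()))
```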

37 Retrieval: Regularization Scheme. Instead of having one possible word at each position, a document now contains a probability distribution at each position. Tests use Lemur with the query-likelihood ranking method. Because of the limited size of GW20, line retrieval is performed (a relevant line is one containing all query terms; stop-words are removed). GW100 allows for full-page retrieval, with GW20 used as training examples.

39 Conclusions. Learning algorithms are typically not designed to deal with training data that exhibits a highly skewed distribution of class frequencies. The methodology described does not always work well, because the synthetic training data are not truly independent of the originals. Performance is good on GW20; the problem remains challenging on GW100 (larger dataset, more noise). Using soft classification decisions can improve the results for shorter queries.
