Integrating unstructured data into relational databases


Imran R. Mansuri (IIT Bombay) and Sunita Sarawagi (IIT Bombay)

Abstract

In this paper we present a system for automatically integrating unstructured text into a multi-relational database using state-of-the-art statistical models for structure extraction and matching. We show how to extend current high-performing models, Conditional Random Fields and their semi-Markov counterparts, to effectively exploit a variety of recognition clues available in a database of entities, thereby significantly reducing the dependence on manually labeled training data. Our system is designed to load unstructured records into columns spread across multiple tables in the database while resolving the relationship of the extracted text with existing column values, and preserving the cardinality and link constraints of the database. We show how to combine the inference algorithms of statistical models with the database-imposed constraints for optimal data integration.

1 Introduction

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. Although a lot of research has gone into extracting structured data from unstructured sources in the NLP, machine learning [8, 18] and web mining [9] communities, most of the proposed methods depend on tediously labeled unstructured data. The database is at best treated as a store for the structured data. In this paper we attempt to bridge this gap by more centrally exploiting existing large databases of multi-relational entities to aid the extraction process on one hand, and on the other hand, augmenting the database itself with information extracted from unstructured text in a way that allows multiple noisy variants of entities to co-exist and aid future extraction tasks.

1.1 Applications

There are a number of applications where unstructured data needs to be integrated with structured databases on an ongoing basis, so that at the time of extraction a large database is available. Consider the example of publication portals like Citeseer and Google Scholar. When integrating publication data from personal homepages, it does not make sense to perform the extraction task in isolation. Structured databases from, say, the ACM digital library or DBLP are readily available and should be exploited for better integration. Another example is resume databases in HR departments of large organizations. Resumes are often stored with several structured fields like experience, education and references. When a new resume arrives in unstructured format, for example via email, we might wish to integrate it into the existing database, and the extraction task can be made easier by consulting the database. Another interesting example is personal information management (PIM) systems, where the goal is to organize personal data like documents, emails, projects and people in a structured inter-linked format. The success of such systems will depend on being able to automatically extract structure from the existing, predominantly file-based unstructured sources. Thus, for example, we should be able to automatically extract from a PowerPoint file the author of a talk and link the person to the presenter of a talk announced in an email. Again, this is a scenario where we will already have plenty of existing structured data into which we have to integrate a new unstructured file.
1.2 Challenges and Contribution

A number of challenges need to be addressed in designing an effective solution to the data integration task. Useful clues are scattered in various ways across the structured database and the unstructured text records, and exploiting them effectively is a challenge. First, the existing structured databases of entities are organized very differently from labeled unstructured text.

Most prior work has been on extracting information from unstructured text, which can be modeled as a sequence of tokens. The input labeled data is thus a collection of token sequences. In contrast, the database is a graph of entity relationships, and there is no information in the database about inter-entity sequencing. A good extraction system should exploit both the inter-entity sequencing information from the labeled unstructured text and the semantic relationships between entities in the database. Second, there is significant format variation between the names of entities in the database and in the unstructured text. The entities in a structured database provide useful pattern-level information that could help in recognizing entities in the unstructured text. Finally, in most cases the database will be large whereas the labeled text data will be small. Features designed from the database should be efficient to apply and should not dominate features that capture contextual words and positional information from the limited labeled data.

We design a data integration system that builds upon state-of-the-art semi-Markov models [7, 16] for information extraction to exploit useful information in both structured data and labeled unstructured data, in spite of their format, structure and size variations. Our experiments show that our proposed model significantly improves the accuracy of data integration tasks by exploiting databases more effectively than has been done before.

Outline. In Section 2, we describe our problem setup and present the overall architecture of the system. In Section 3, we present background on statistical models for extraction and matching. In Section 4 we show how we augment these models to exploit features from the database, and integrate data so as to meet the cardinality and link constraints in the database. In Section 5, we present integration performance on real-life datasets, and we discuss related work in Section 6.

2 Setup and Architecture

2.1 Relational data representation

A relational database represents a normalized set of entities with the relationships between them expressed through foreign keys. An example of such a multi-relational database is shown in Figure 1, representing bibtex entries for journal articles. The database consists of three different entities: articles, journals and authors. The foreign key from the articles table to the journals table shows how journal names have been normalized across different articles. The author name is a set-valued attribute of articles, and this relationship is stored in the writes table.

Figure 1. A sample normalized database for storing journal bibitems.

Figure 2. Integrating an unstructured citation into the database of Figure 1.

In real life the same physical entity is often represented by multiple names that can be thought of as noisy variants of each other. The database allows these variants to co-exist, and canonical links are placed from a noisy variant of an entry to its canonical form.
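To make the multi-variant representation of Figures 1 and 2 concrete, the following is a minimal in-memory sketch; the identifiers, values and helper function are hypothetical illustrations, not the actual schema or code of the system (which stores these as Postgres tables linked by foreign keys).

```python
# Minimal in-memory sketch of the normalized, multi-variant layout of Figure 1.
# All ids and values below are hypothetical; the real system keeps these as
# relational tables (articles, writes, authors, journals) linked by foreign keys.
authors = {
    1: {"name": "M Y Vardi",      "canonical": 1},
    2: {"name": "J. Ullman",      "canonical": 2},
    3: {"name": "Ron Fagin",      "canonical": 3},
    4: {"name": "Jeffrey Ullman", "canonical": 2},   # noisy variant of "J. Ullman"
}
journals = {
    10: {"name": "ACM TODS",             "canonical": 10},
    11: {"name": "AI",                   "canonical": 11},
    12: {"name": "ACM Trans. Databases", "canonical": 10},  # variant of "ACM TODS"
}
articles = {100: {"title": "Update Semantics", "journal": 10}}   # other columns omitted
writes = [(100, 1), (100, 2)]   # (article id, author id) pairs

def add_variant(table, name, canonical_of=None):
    """Insert a (possibly noisy) variant; link it to an existing canonical entry if given."""
    new_id = max(table) + 1
    table[new_id] = {"name": name,
                     "canonical": canonical_of if canonical_of is not None else new_id}
    return new_id

# Integrating the citation of Figure 2: "R Fagin" resolves to a variant of
# "Ron Fagin", while "J Helpern" matches nothing and becomes its own canonical entry.
fagin_id = add_variant(authors, "R Fagin", canonical_of=3)
helpern_id = add_variant(authors, "J Helpern")
print(authors[fagin_id], authors[helpern_id])
```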
In the sample database of Figure 1, we show such canonical links between the two journal entries ACM TODS and ACM Trans. Databases, and between the entries Jeffrey Ullman and J. Ullman in the authors table. Each variant of an entity has a unique identifier, and thus the canonical link is a foreign key to this unique identifier. These foreign keys can be within the same table, as in the examples in Figure 1, or they could be across tables. This data representation allows a user to preserve an entity in a format that is different from its canonical form. For example, a user might want author names in a citation to be abbreviated although the canonical entry might contain the full name. In this paper we do not address the issue of which entry to pick as the canonical entry, or of how to construct a merged canonical entry.

These can be handled as part of follow-up operations once the canonical links are created.

A bit of notation to summarize the above: the database consists of a set of entities (each of which maps to a database table), where each entity has a set of attributes or column values (these exclude columns like primary or foreign keys). We refer to the union of all attributes over all entities as the attribute set; in Figure 1 it includes title, year, journal name and author name. Each entity in the database has a unique identifier and, optionally, a canonical identifier that is a foreign key to the identifier of another entity of which it is assumed to be a noisy variant.

2.2 Labeled unstructured text

The unstructured data is a text record representing a single top-level entity like articles, but embedded in a background of irrelevant text. A labeled set L is provided by someone manually labeling portions of the text with the name of the entity to which it belongs. We will show in the experimental section that the size of L can be very small (5 to 10).

2.3 The task

We start with an existing multi-entity, multi-variant database and a small labeled set L. We use these to train models for information extraction and matching so that, given an unstructured text string, we can perform the following:

- Extract from the string attribute values corresponding to all attributes in the entity database. We call this the information extraction problem.
- Map the extracted entities to existing entries in the database if they match; otherwise, create new entries. Assign relationships through foreign keys when attributes span multiple linked tables. We call this the matching problem.

As a result of these operations the entry is integrated across the multiple tables of the database with the canonical links resolved correctly. In actual deployment a user might also wish to inspect the extracted entries and correct any errors of the automated integration step.

We continue with our sample database of Figure 1 and show how to integrate into it a noisy unstructured citation "R. Fagin and J. Helpern, Belief, awareness and reasoning. In AI 1988 [10] also" (see Figure 2). We extract from this two authors, a title, a journal name and a year field and integrate them with the database as shown in Figure 2. On matching with the database, we find the journal AI to be already present and the author R. Fagin to be a variant of an existing author name Ron Fagin. This leads to the canonical link from the new entry to the existing author name.

2.4 Available clues

The database, with its entities and canonical links, and the labeled data L can be exploited in a variety of ways for learning how to integrate an incoming unstructured entry into the database. We go over these first and later show how they are exploited by the statistical models.

2.4.1 Clues in the database

As a database gets large, in most cases the entity to be extracted from unstructured text will already exist in the database. We therefore need models that can effectively exploit this valuable piece of information. The entities in the database could be useful even when the unstructured text does not match a database entity: different entity attributes have very distinctive patterns that can be generalized to recognize these entities in unstructured text. For example, by inspecting entries in a publications database we can detect that author names often follow a pattern of the form [A-Z]. [A-Z]. [A-Z][a-z]+. The schema of the database provides information on the number of entities that are allowed in each record.
In Figure 1, the table structure makes it clear that an article can have at most one year, journal and title field, but it can have multiple author fields. Also, a foreign key in the database provides information on which entities are allowed to go together. For example, if we label a text segment as a city name and the segment matches the entry San Francisco in the cities column, then any text segment that is labeled as a state needs to match the entry CA. In data integration tasks, respecting cardinality and link constraints could be a requirement even if it does not improve accuracy.

2.4.2 Information in the labeled unstructured data

Labeled unstructured data is the classical source of information for extraction models. Traditionally, three kinds of information have been derived from unstructured records. First, the labeled entities hold the same kind of pattern information as the database of entities, as discussed above. Second, the context in which an entity appears is a valuable clue that is present solely in the labeled unstructured data. Context is typically captured via a few words before or after the labeled entity. For example, the word "In" often appears before a journal name. Finally, the likely ordering of entity labels within a sequence is also useful. For example, author names usually appear before or after a title.
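As a small illustration of how the schema-derived cardinality clue of Section 2.4.1 can be turned into a hard check on a candidate labeling, here is a hedged sketch; the constraint table and labels are assumptions modeled on the bibtex example, not the system's actual code.

```python
# Sketch: reject candidate labelings that violate schema-derived cardinality
# constraints (at most one title/journal/year per article, many authors allowed).
# The constraint table below is an assumption based on the schema of Figure 1.
MAX_PER_RECORD = {"title": 1, "journal": 1, "year": 1, "author": None}  # None = unbounded

def violates_cardinality(labeled_segments):
    """labeled_segments: list of (label, text) pairs proposed for one citation."""
    counts = {}
    for label, _text in labeled_segments:
        counts[label] = counts.get(label, 0) + 1
        limit = MAX_PER_RECORD.get(label)
        if limit is not None and counts[label] > limit:
            return True
    return False

candidate = [("author", "R. Fagin"), ("author", "J. Helpern"),
             ("title", "Belief, awareness and reasoning"), ("journal", "AI")]
assert not violates_cardinality(candidate)
assert violates_cardinality(candidate + [("journal", "AAAI")])
```

A link-constraint check (for example, San Francisco forcing the state to be CA) would take the same shape, with the allowed combinations looked up through the foreign keys.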

3 Background

In this section we review state-of-the-art statistical models for extraction and matching. These models are designed to exploit clues only in labeled unstructured data; in Section 4 we will discuss how to extend them to include clues from the database.

The problem of extracting structured entities from unstructured data is an extensively researched topic. A number of models have been proposed, ranging from the earliest rule-learning models to probabilistic approaches based on generative models like HMMs [17] and conditional models like maxent taggers [14]. Almost all the probabilistic models treat extraction as a sequence labeling problem. The input unstructured text is treated as a sequence of tokens which needs to be assigned a corresponding sequence of labels from a fixed label set. A promising recently proposed model for information extraction is Conditional Random Fields (CRFs). CRFs have been shown [8, 18] to out-perform previous generative models like Hidden Markov Models and local conditional models like maxent taggers [14].

3.1 Conditional Random Fields

CRFs [8] model a conditional probability distribution over a label sequence y = y_1 ... y_n given the input sequence x. The label assigned to the i-th word can be a function of any property of the i-th word, of the words defining its context, and of the label of the word before it. These properties are represented as a vector of feature functions f(y_i, y_{i-1}, x, i), where each component is a scalar function. An example of such a feature is an indicator that is 1 if the i-th word is capitalized and y_i is the Author label, and 0 otherwise. The CRF model associates a weight with each of these features; thus there is a weight vector W corresponding to the feature vector f. These are used to design a global conditional model that assigns a probability to each possible label sequence y given (the features of) the input sequence x as follows:

Pr(y | x) = (1 / Z(x)) exp( sum_{i=1..n} W . f(y_i, y_{i-1}, x, i) )     (1)

where Z(x) is a normalizing factor equal to the sum of the exponentiated scores over all possible label sequences.

When deployed for extraction of entities from a text sequence, CRFs find the best label sequence as the one maximizing Pr(y | x), equivalently maximizing sum_i W . f(y_i, y_{i-1}, x, i). An efficient inference algorithm is possible because all features depend on just the previous label. Let V(i, y) denote the largest score of any partial labeling of positions 1 through i whose i-th label is y. The following recursive calculation (called the Viterbi algorithm) finds the best sequence:

V(i, y) = max_{y'} V(i-1, y') + W . f(y, y', x, i)  for i > 0, with V(0, y) = 0.

The best label sequence corresponds to the path traced by max_y V(n, y). This algorithm is linear in the length of the sequence.

Learning is performed by setting the parameters to maximize the log-likelihood of the training set, which is the sum over training pairs (x, y) of sum_i W . f(y_i, y_{i-1}, x, i) - log Z(x). We wish to find a W that maximizes this objective. The objective is convex and can thus be maximized by gradient ascent or one of many related methods. (In our implementation, we have used a limited-memory quasi-Newton method [10, 11].) More computational details of training can be found in [18]. The training algorithm makes multiple linear scans through the data sequences and for each sequence computes the value of the likelihood function and its gradient in two scans over the features of the sequence.

3.2 Semi-Markov CRFs (Semi-CRFs)

Semi-CRFs [16, 7] extend CRFs so that features can be defined over multi-word segments of text instead of individual words. This provides a more natural encoding of the information extraction problem, which is now defined as segmenting a text sequence such that each segment corresponds to a whole entity.
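To make the Viterbi recursion of Section 3.1 concrete, the sketch below decodes the best label sequence under an arbitrary caller-supplied scoring function standing in for W . f; this is a generic illustration of the standard algorithm, not the paper's Java implementation, and the toy scorer at the end is purely hypothetical.

```python
# Minimal first-order Viterbi decoder (sketch of the recursion in Section 3.1).
# score(prev_label, label, x, i) should return W . f(label, prev_label, x, i);
# here it is an arbitrary callable supplied by the caller.
def viterbi(x, labels, score):
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n)]
    back = [{y: None for y in labels} for _ in range(n)]
    for y in labels:                        # base case: first token, no previous label
        V[0][y] = score(None, y, x, 0)
    for i in range(1, n):                   # recurrence over token positions
        for y in labels:
            for yp in labels:
                s = V[i - 1][yp] + score(yp, y, x, i)
                if s > V[i][y]:
                    V[i][y], back[i][y] = s, yp
    y = max(labels, key=lambda lab: V[n - 1][lab])   # best final label
    path = [y]
    for i in range(n - 1, 0, -1):           # trace back the best path
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

# Toy usage: label tokens as 'author' if capitalized, else 'other'.
toy_score = lambda yp, y, x, i: 1.0 if (y == "author") == x[i][0].isupper() else 0.0
print(viterbi(["R.", "Fagin", "and", "J.", "Helpern"], ["author", "other"], toy_score))
```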
In CRFs, all contiguous words which get assigned the same label implicitly represent an entity. This is not very intuitive, since entities mostly consist of multiple words, and higher accuracy is achieved if features can be defined over the entire proposed entity, as shown in [16, 7]. More formally, a segmentation s = s_1, ..., s_p of an input sequence x is a sequence of segments such that the first segment starts at position 1, the last segment ends at position n, and segment s_{j+1} begins right after segment s_j ends.

Each segment consists of a start position, an end position, and a label. A Semi-CRF models the conditional probability distribution over a segmentation s of a given input sequence x as follows:

Pr(s | x) = (1 / Z(x)) exp( sum_{j=1..p} W . f(y_j, y_{j-1}, x, t_j, u_j) )     (5)

where W is again a weight vector for the feature vector f, Z(x) is the normalizing factor, and t_j, u_j and y_j denote the start position, end position and label of segment s_j. As in the case of normal CRFs, the label of a segment depends on the label of the previous segment and on the properties of the tokens comprising the segment. An example of such a feature, for a segment starting at position 3 and ending at position 5, is an indicator that the token span x_3 ... x_5 appears in a journal list and the segment label is journal.

During inference, the goal is to find a segmentation s of the input sequence such that Pr(s | x) (as defined by Equation 5) is maximized. Let L be an upper bound on segment length, and let V(i, y) denote the largest score of any partial segmentation of positions 1 through i whose last segment ends at i and has label y. Omitting the feature arguments, the following recursive calculation implements a semi-Markov analog of the usual Viterbi algorithm:

V(i, y) = max_{y', d = 1..L} V(i-d, y') + W . f(y, y', x, i-d+1, i)  for i > 0, with V(0, y) = 0 and V(i, y) = -infinity for i < 0.     (6)

The best segmentation corresponds to the path traced by max_y V(n, y). The running time of semi-Markov Viterbi (Equation 6) is just L times the running time of Markov Viterbi, and L is usually a small number. The training of Semi-CRFs generalizes the training algorithm of CRFs in a manner analogous to the generalization of the Viterbi algorithm. We refer to [16] for details.

3.3 Matching

Matching or deduplicating entities using statistical learning models is also a popular research topic [15]. At the core of most of these models is a binary classifier over various similarity measures between a pair of records, where the prediction is 1 if the records match and 0 otherwise. This is a straightforward binary classification problem where the features typically denote various kinds of attribute-level similarity functions like Edit distance, Soundex, N-gram overlap, Jaccard, Jaro-Winkler and Subset match [6]. Thus, we can use any binary classifier like SVMs, decision trees or logistic regression. We use a CRF with a single variable for extensibility. Thus, given a pair of text segments s_1 and s_2, the CRF predicts a label y that can take the values 0 or 1 as

Pr(y | s_1, s_2) = (1 / Z(s_1, s_2)) exp( W . f(s_1, s_2, y) )     (7)

The feature vector f(s_1, s_2, y) corresponds to various similarity measures between the text segments when y = 1, i.e., when s_1 and s_2 match.

4 Our models for data integration

In this section we describe how all the clues available across the labeled data and the multi-relational database can be combined to solve our data integration task. Let x denote the input unstructured string with tokens x_1 ... x_n obtained from the text using any tokenization method. Typically, text is tokenized based on white space and a limited set of delimiters like "," and ".". Our goal is to extract from x segments of text belonging to one of the entity attributes in the database. We also need to assign to each segment labeled with an attribute a canonical id of an entry in the corresponding column of the database with which it matches. If it matches none of the ids, then the canonical id is a special value 0 denoting none-of-the-above. Also, canonical ids connected through foreign links need to respect the foreign keys in the database, and the labels need to follow the cardinality constraints imposed by the schema.

4.1 Model for extraction

We deploy Semi-CRFs since they provide high accuracy and an elegant mechanism for extracting entities in the presence of databases. The main challenge in their usage is how to incorporate clues obtained from the existing entities in the database.
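Before turning to the database clues, a minimal sketch of the segment-level (semi-Markov) Viterbi recursion of Equation 6 is given below; the segment scorer is a caller-supplied stand-in for W . f over whole segments, and the code is an illustrative implementation of the standard recursion rather than the system's own.

```python
# Sketch of semi-Markov Viterbi (Section 3.2): segments of length up to max_len are
# scored as whole units by seg_score(prev_label, label, x, start, end).
def semimarkov_viterbi(x, labels, seg_score, max_len):
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n)]   # V[j][y]: best score of a
    back = [{y: None for y in labels} for _ in range(n)]         # segmentation ending at j with label y
    for j in range(n):
        for y in labels:
            for start in range(max(0, j - max_len + 1), j + 1):  # enumerate segment start positions
                prevs = labels if start > 0 else [None]
                for yp in prevs:
                    prev_score = V[start - 1][yp] if start > 0 else 0.0
                    s = prev_score + seg_score(yp, y, x, start, j)
                    if s > V[j][y]:
                        V[j][y], back[j][y] = s, (start, yp)
    y = max(labels, key=lambda lab: V[n - 1][lab])
    segments, j = [], n - 1
    while j >= 0:                                                # trace segments backwards
        start, yp = back[j][y]
        segments.append((start, j, y))
        j, y = start - 1, yp
    return list(reversed(segments))

# Toy usage: favour multi-token 'title' segments and single-token 'other' segments.
toy = lambda yp, y, x, s, e: (e - s + 1) if y == "title" else (1.0 if e == s else -5.0)
print(semimarkov_viterbi("belief awareness and reasoning .".split(), ["title", "other"], toy, 4))
```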
A straightforward mechanism would be to add the existing list of entity values as additional labeled training examples. However, in practice we found this approach to be very sensitive to the training set and its size, and it often leads to a drop in accuracy. We therefore designed models that separate out the roles of database entities and unstructured records. We use the database to add three types of entity-level features to the sequential model trained over unstructured text.

4.1.1 Similarity to a word-level dictionary

Very often an entity in an unstructured record may already exist in the database, albeit in a slightly different form. A number of similarity functions have been proposed in the record-linkage literature for defining efficient and effective similarity functions that are tolerant to commonly occurring variations in named entities. We can exploit these to perform better extraction by adding similarity features as proposed in [7]. An example of such a feature assigns as its value the maximum cosine similarity that a segment has to the entries of the Author name column of the database, firing when the segment is labeled Author.

4.1.2 Similarity to a pattern-level dictionary

Each entity in the database contains valuable information at the pattern level that can be exploited even if the entity in the unstructured text is not the same. We design a succinct set of features to exploit such patterns as follows. We first choose a sequence of regular expression patterns. Then, for each entry in the database, we generalize each of its tokens to the first pattern it matches, concatenate the pattern-ids over all tokens, and insert the concatenated pattern-id sequence into an index. For example, with a suitable three-pattern sequence, a text string like "R. L, Stevenson" reduces to a three-id pattern sequence, assuming the text is tokenized on space. We find the pattern representation of all entities in the database and index them separately for each entity type. We then design boolean features for segments that check for the presence of the segment's pattern in each entity type's pattern dictionary.

4.1.3 Entity classifier

The database contains a variety of other clues, including the typical length of entities, characteristic words and multiple patterns per token, that are not captured by the above two dictionaries. We capture all of these by building an entity classifier, where an entity is characterized by as many such features as required and a label that denotes its type. In addition, we also include subsets of each entity with a negative label. This defines a multi-class classification problem, and the CRF model (or any other classifier like SVM) can be used to train the classifier. When the classifier is applied to a text segment, it returns a vector of scores denoting the probability that the segment is of each of the given entity types. We use this classifier to define a new set of semi-Markov features for the extraction model that we are constructing from the unstructured examples: if the number of distinct entity types is m, we add m features, where the k-th feature captures the strength of the k-th score obtained from the classifier in identifying the k-th entity type.

4.2 Model for matching

Given a segment of x whose attribute label resolves to one of the database attributes, we need to find a canonical id that denotes which existing value of that attribute it matches in the database, with the id equal to 0 if it matches none of the existing entities. We adopt the pair-wise match model of Section 3.3, where for each canonically labeled segment in the training data we form pairs with the database entities. The entities with the same canonical id get a label of 1 and the others get a label of 0. In addition to examples in the training data, we can also easily generate pairs of records from the database and assign a label of 1 to those with the same canonical id and 0 otherwise.
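As a hedged sketch of what the pair-wise match features of Sections 3.3 and 4.2 might look like, the fragment below compares an extracted segment with a database entry using a couple of simple string similarities; the particular similarity functions are illustrative stand-ins for the Edit distance, Jaccard, JaroWinkler and related measures named above, not the system's exact feature set.

```python
# Sketch of the pair-wise match model's feature vector: a candidate segment is
# compared to a database entry with a few cheap string similarities, and a binary
# classifier is trained over such vectors (label 1 for pairs that share a
# canonical id, 0 otherwise). The similarity functions below are simple stand-ins.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def prefix_overlap(a, b):
    n = 0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def match_features(segment, db_entry):
    """Feature vector for one (extracted segment, database entry) pair."""
    return [
        jaccard(segment, db_entry),
        prefix_overlap(segment, db_entry),
        1.0 if segment.lower() == db_entry.lower() else 0.0,   # exact-match indicator
    ]

print(match_features("R Fagin", "Ron Fagin"))
print(match_features("ACM Trans. Databases", "ACM TODS"))
```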
An important concern about the pair-wise matching process is efficiency. We cannot afford to attempt a potential match against each and every canonical id for which there is an entry in the database. We solve this problem by using a top-k index for each database column. Given a text segment, we probe the index to get a candidate set of database entries and evaluate the match classifier for each candidate. If the predicted value is 0 for every candidate, we conclude that the segment resolves to none of the existing entities. When presented with a record with multiple extracted attribute segments, we need to simultaneously assign canonical ids to each of them so as not to violate any of the integrity constraints. We first search the attribute index to find the top-k candidate canonical ids for each attribute segment. We then pose database queries to find all canonical ids which have direct or indirect foreign key link(s) from each of the top-k canonical ids associated with the attribute segments. We denote the resulting set of link constraints as a set of tuples of the form (a, r, a', r'), where a is the attribute type of an attribute segment, r is one of the top-k canonical ids for the segment, a' is an attribute having a foreign key relationship with a, and r' is the canonical id of the entry of attribute a' that is referred to by the entry r. Assuming that a segment sequence has two attribute segments s and s' of attribute types a and a' respectively, a link constraint is said to be violated in either of the following two cases.

First, attribute a has a foreign link to attribute a' in the database schema, and the canonical id r assigned to segment s points, through the link-constraint tuples, to a canonical id other than the one assigned to s'. Second, the same condition holds with the roles of s and s' reversed.

4.3 Overall training and deployment process

The overall training and deployment process for extraction and matching appears in Figures 3 and 4.

Figure 3. Training process: unstructured labeled data (L) and DB records are passed through similarity indices, pattern indices and the entity classifier to collect DB features, which are combined with other extraction features to train the extraction model and the matching model (both semi-CRFs).

Figure 4. Deploy process: unstructured unlabeled data is augmented with DB features and decoded with the extraction and matching models under the cardinality and link constraints, using A* inference, to produce extracted and matched record(s).

Figure 5. Constrained inference using A*:
1: Perform a backward Viterbi scan and cache all the possible suffix upper bound scores (see Equation 8).
2: Initialize the priority queue Q with the start state.
3: Initialize global-lb.
4: while Q is not empty do
5:   Retrieve the state with the maximum sequence upper bound estimate from the queue.
6:   if it is a goal state then
7:     Return the solution.
8:   end if
9:   Generate successors of the state, and add valid successors to Q provided their upper bound is at least global-lb.
10:  Periodically update global-lb via beam search from the current state.
11: end while
12: Return failure.

4.3.1 Training process

During the offline training phase, we first train the entity classifier from the entities in the database. The entity classifier could be trained incrementally, but we do not support that currently. For the similarity and pattern features, we maintain an inverted index over each of the entity columns that is always kept up-to-date with the entries in the database. Next, for each labeled document we create its normal intrinsic features and then probe the entity classifier and the indices to create the three kinds of database-derived features. The index probes for computing the various dictionary features are performed using a top-k search on the inverted index. Also, since this requires multiple index probes on overlapping segments of the input, we have designed a batch top-k index search algorithm [5] where, instead of independently finding top-k matches for all candidate segments, we batch the top-k queries over overlapping segments. This significantly reduces the time required for finding matches. We do not consider such optimizations in this paper because of lack of space and because our main focus here is accuracy of extraction via skillful feature design. We cache the index lookups but compute all the other features on the fly for each training iteration. We then train a semi-CRF model for extraction that outputs the weight values for each feature type. We train the match model by forming pairs both out of top-k matches of the training segments with the database entities and out of pairs formed purely from database entities. The trained models are stored on disk for the deployment stage.

4.3.2 Deployment process

While making inferences on the labels of an unstructured text, we also need to follow the cardinality constraints imposed by the database schema. The Viterbi algorithm presented earlier cannot easily meet these constraints, because they upset the Markovian property that Viterbi depends on.
Section 4.4 presents an A*-based search algorithm that can efficiently find the optimal segmentation even with constraints. After extracting the attribute segments, we assign canonical ids while maintaining the link constraints using another run of the same A* algorithm.

4.4 The A* inference algorithm

A simple mechanism to extend Viterbi for constrained labeling is via a beam search which maintains not just the best path but the best k paths at any position. When we extend a path by one more segment, we check for conflicts and drop invalid extensions. This is guaranteed to find the optimal solution when k is very large, and the time and space requirement is multiplied by k.

In practice, conflicts are rare, and the beam search algorithm might be unnecessarily maintaining alternatives that are never used. We propose a better optimized inference algorithm based on the A* search paradigm. This algorithm is guaranteed to find the optimal solution given unbounded space, but because of the non-Markovian constraints we cannot guarantee optimality in polynomial time. The algorithm allows us to trade off optimality and efficiency by controlling the number of search nodes expanded. Even when it is made to terminate early, the solution will be no worse than a normal Viterbi run with a fixed beam size.

A* is a branch-and-bound search algorithm that depends on upper and lower bound guarantees associated with states in the search space to reach the optimal solution via the minimum possible number of states. An outline of the inference algorithm based on the A* technique is given in Figure 5. Our state space consists of the set of partial segmentations starting from the beginning of the sequence and ending at an intermediate position. Thus, each state in the state space can be expressed as a sequence of segments, using the segment notation from Section 3.2, such that the first one starts at 1 and the last one ends at some position i. The start state is the empty segmentation and a goal state is any segmentation whose last segment ends at the last position of the sequence. Each state has associated with it an upper bound score based on which it is ordered in a priority queue Q. Initially, Q contains only the start state. In each iteration we dequeue the state with the largest upper bound and generate its successors as follows.

Generating successors. The successors of a given state are all the states which extend the segmentation with one more segment: the new segment starts right after the last segment of the state ends, ends at a position no more than the maximum segment length away, and has a label that does not violate any constraint given the sequence of labels in the segmentation of the state. The constraints imposed can be any arbitrary constraints on label assignments. In our case, cardinality constraints derived from the database schema are imposed during extraction, while for matching we enforce link constraints derived from the database entities, as discussed in Section 4.2.

Calculating the upper bound. We now discuss how to compute the upper bound score associated with a state. This score consists of two parts. The first part is the actual score over the partial segment sequence of the state; this value can be computed incrementally when a state is generated from its parent by the extension of a single segment. The second part is an upper bound on the maximum score possible if the state is extended with segments that take it to a goal state. The sum is thus an upper bound on the maximum possible score of the entire sequence from start position 1 to the final position, with the prefix of the path constrained to be the state's partial segmentation. The suffix bound is computed for all states in a single backward Viterbi pass that ignores all constraints, and thus provides an upper bound on the maximum possible score. Formally, let B(j, y) denote the largest score obtainable by segmenting positions j+1 through n, given that the segment ending at position j has label y. Omitting the feature arguments, the following recursive calculation implements a semi-Markov, backward Viterbi algorithm:

B(j, y) = max_{y', d = 1..L} W . f(y', y, x, j+1, j+d) + B(j+d, y')  for j < n, with B(n, y) = 0.     (8)

The value B(j, y) is the suffix upper bound score for any partial segmentation that ends at position j with last label y.

4.4.1 Pruning on lower bounds

For most inference tasks, the number of state expansions before reaching a goal state is small. But sometimes, with many label constraints, the number of expansions can be large.
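A compact sketch of the search loop of Figure 5 is shown below, before we turn to these pruning refinements; the state representation, successor generation and upper bound computation are left abstract, all names are hypothetical, and the dynamic lower-bound updates of this subsection are omitted.

```python
import heapq

# Compact sketch of the A* search of Figure 5. A state is a partial segmentation;
# upper_bound(state) must return (exact score of the prefix) + (suffix bound from
# the backward Viterbi pass of Equation 8); successors(state) must drop extensions
# that violate cardinality or link constraints. All names here are hypothetical.
def astar_decode(start_state, successors, is_goal, upper_bound, max_expansions=10000):
    counter = 0                                    # tie-breaker; also a crude expansion budget
    queue = [(-upper_bound(start_state), counter, start_state)]
    while queue and counter < max_expansions:
        _, _, state = heapq.heappop(queue)
        if is_goal(state):
            return state                           # with an admissible bound this is the optimum
        for succ in successors(state):
            counter += 1
            heapq.heappush(queue, (-upper_bound(succ), counter, succ))
    return None                                    # budget exhausted: fall back to beam search
```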
In such cases, we limit the number of expansions and the size of the priority queue to some predefined maximum value. Another idea is to prune nodes which will never be part of an optimal solution. We can determine a lower bound solution using the beam search Viterbi for constrained inference described earlier; here the beam size need not be very large, and a small value suffices in practice. The score of this solution can be used as a lower bound, and states whose sequence upper bound is less than the lower bound can be pruned. The lower bound can be dynamically updated by extending the partial solution of the current best state using the beam search Viterbi technique. The frequency of lower bound updates is chosen as a function of the number of expansions and the queue size so as to optimize overall performance. We skip details due to lack of space.

5 Experiments

We evaluated our system on the following datasets.

Personal Bibtex. In this case our goal was to simulate the scale of databases as would arise in a PIM. One of the authors of this paper had a set of journal entries in her .bib files collected over 10 years. This was loaded into a normalized relational database following a schema similar to that shown in Figure 1, with one intermediate table, Journal-issues.

The Journal-issues table further normalizes journals at the level of individual issues: the articles table points to Journal-issues, which in turn points to Journals. The set of attributes consisted of title and page numbers from articles, year and volume from journal-issues, author names, and journal names. Of these, year, page numbers and volume are fairly easy to detect and provide close to 95% accuracy with or without a DB. We therefore concentrate only on the remaining three fields in reporting accuracy values. The unstructured records consisted of 100 citations obtained by searching Citeseer for citations of a random subset of authors appearing in the bibtex database. These were manually tagged with attribute and match ids. Of these, 5% of titles, 78% of journals and 75% of authors already appeared in the database. These were not exact matches but noisy inexact matches that needed manual resolution. We have published the dataset for others to use (sunita/data/personalbib.tar.gz).

Address data. The Address dataset consists of 95 home addresses of students at IIT Bombay, India. These addresses are less regular than US addresses, and extracting even fields like city names is challenging. We purchased a postal database that contains a hierarchy of India pin codes, cities and states arranged over three tables linked via foreign keys. All state names and 95% of the city names appeared in the database, but the database was clean and contained no canonical variants, whereas in the unstructured text there were several noisy variants. The labeled data also had other fields, like road names and building names, for a total of 16 labels. We trained the extraction model over all sixteen labels but report accuracy over only the three labels for which we had columns in the database.

Platform. We used Postgres to store the structured database and Lucene to index the text columns of the database. Our extraction and matching code was in Java. The learner was the open-source CRF and semi-CRF package that we have contributed to sourceforge.

Features. Features followed the template described earlier. Each token contributed two types of features: (1) the token itself if it was encountered in the training set, otherwise a special feature called Unknown, and (2) the set of regular expressions, like digit or not and capitalized or not, that the token matches. In the semi-Markov model, for each segment we invoke features on the first, last and middle tokens of the segment. The context features consist of features on the tokens one to the left and one to the right of the segment. These are all Markov features that can be handled by any Begin, Continue, End state encoding. In addition, edge features captured the dependency between labels assigned to adjacent segments. The semi-Markov non-database features were the segment length and counts for each possible regular expression that fires. The semi-Markov database features were of the three types described in Section 4.1. We used two similarity functions, JaroWinkler and SoftTFIDF [6], in addition to Lucene scores.

Experiment settings. Since our system is targeted at settings where labeled data (L) is limited, by default we used roughly 5% of the available data for training. The rest was used for testing. We also report numbers with an increasing fraction of training data. All numbers are averaged over five random selections of training and test data. We measure accuracy in terms of the correctness of the entire extracted entity (i.e., partial extraction gets no credit).
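The token-level part of the feature template described under Features above might be sketched as follows; the regular-expression bank and feature names are illustrative assumptions rather than the exact set used in the experiments.

```python
import re

# Illustrative sketch of the token-level feature template: (1) the token identity
# if seen in training (else "UNKNOWN"), and (2) which of a small bank of regular
# expressions it matches. The regexes below are assumptions, not the exact set used.
REGEX_BANK = {
    "ALL_DIGITS":  re.compile(r"^\d+$"),
    "CAPITALIZED": re.compile(r"^[A-Z]"),
    "INITIAL_DOT": re.compile(r"^[A-Z]\.$"),
    "HAS_HYPHEN":  re.compile(r"-"),
}

def token_features(token, training_vocabulary):
    feats = ["TOKEN=" + (token if token in training_vocabulary else "UNKNOWN")]
    for name, rx in REGEX_BANK.items():          # one boolean feature per firing regex
        if rx.search(token):
            feats.append("REGEX=" + name)
    return feats

vocab = {"In", "and", "ACM", "TODS"}
print(token_features("R.", vocab))      # ['TOKEN=UNKNOWN', 'REGEX=CAPITALIZED', 'REGEX=INITIAL_DOT']
print(token_features("1988", vocab))    # ['TOKEN=UNKNOWN', 'REGEX=ALL_DIGITS']
```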
We report for each label the recall, precision and F1 values, where F1 is defined as 2*precision*recall/(precision+recall). Our main focus in this paper is accuracy, and we study that in Sections 5.1 to 5.4. However, to establish the practical feasibility of using a learning-based system, we also report summarized results on training and test time in Section 5.5.

5.1 Overall data integration performance

In this section, we first present the accuracy of the full integration task (extraction + matching) when the database is more centrally exploited. We summarize the results in Table 1, where we report precision, recall and F1 values of integration for each text field of the datasets introduced earlier. The first set of numbers is for the case where the database does not contribute to the learning models at all and the models are trained using the unstructured data alone. The second set of numbers is due to our model trained using both database-specific features and features from unstructured text, for both extraction and matching. The table shows that for all labels there is a significant improvement in F1 accuracy. For some fields, like journals of the PersonalBib dataset and state names of the Address dataset, F1 improves by more than 50%. The table also shows separate extraction and matching accuracies. For all labels there is a significant improvement in extraction performance, with journal and state showing some of the largest F1 jumps. The overall average F1 went up by 10 points, a substantial reduction in error. In the case of matching with databases, we added approximately 100 examples from the database as training instances. Matching results for the IITB database with and without the database are almost the same.

Table 1. Accuracy of overall integration, extraction and matching with and without an external database (P=precision, R=recall).

For the personal bibtex database, there is a significant improvement in the accuracy of title when the database tuples are added in training, while for the other labels (journal and author) the accuracies are comparable. We now present a finer-grained analysis of the results and study the effect of various factors like data sizes, feature sets and databases on extraction, since that is what contributes most of the difference.

5.2 Effect of various features

We assess the usefulness of various kinds of clues for aiding the extraction task by evaluating performance with various feature combinations for two different training sizes of PersonalBib. In Figure 6, we show F1 numbers, starting with an extraction model trained only on features of L and then adding four different database feature types in sequence: schema-derived cardinality information, similarity features, the entity classifier, and regex-level match features. Accuracy improves with each additional feature type. The first significant improvement comes from the similarity features of the database, followed by the improvement due to the regex features. Next we evaluate the importance of features derived purely from L by dropping features of three types, entity, context and label order, while keeping all the database features. Dropping the entity features of L raises accuracy slightly, probably because given the small training set these tend to overfit the data. The context and edge (label dependency) features are important, as seen by the drop in accuracy when they are removed. Interestingly, if context and edge features are removed in parallel before removing entity features (not shown in the figure), accuracy does not drop much. Thus, one of the context and edge features is important, but the two together are not needed.

Figure 6. Effect of various features from the database (db) and labeled data (L) on F1 accuracy, for training fractions of 5% and 10%.

5.3 Effect of increasing training data

We show how the relative accuracy of extraction with and without a database is affected by increasing labeled training data. For these experiments we fixed the test set to 50% of the available data and selected different random subsets for training. Figure 7 shows average F1 with increasing training sizes. Accuracy increases as expected in both cases, and the database continues to be useful even for larger training sets.

5.4 Effect of changing database

In order to evaluate the sensitivity of our results to particular databases, we report numbers where our personal bibtex database is changed to the publicly available DBLP database while keeping L unchanged. In Table 2, we report performance on the two databases. We notice that title performance remains unchanged, whereas author and journal accuracy drop by a few F1 points. However, there is still an overall reduction in extraction error due to the use of the database.

5.5 Running time

As the time required to train an extraction model dominates the integration process, we report performance numbers for extraction only. We compare the time required for extraction with and without databases.

Figure 7. Increasing fraction of labeled data (L) versus F1 with and without a database: (a) Personal Bibtex, (b) Address data.

Table 2. Accuracy of extraction with changing databases (P=precision, R=recall): Only-L, L + PersonalBib, and L + DBLP.

We report training times for extraction models in three settings: without a database, with the smaller personal bibtex database, and with the larger DBLP database. The personal bibtex database has a few hundred article and journal tuples and 1800 author entries, while the DBLP database is relatively large and contains many times more articles, authors and journals. Figure 8 shows the average time required to train an extraction model for various training set sizes under these three settings. Even with 50 training records and the larger DBLP database, the time required is no more than 15 minutes, and less than double the time required by the no-database setting. It is possible to reduce the gap further by more efficient index probes, as discussed in [5].

Figure 8. Increasing fraction of labeled data (L) versus running time (seconds) with and without databases.

Once an extraction model is trained, it is applied to unstructured instances using the A* algorithm discussed in Section 4.4. For extraction without databases, inference on a single bibtex entry took on average 800 ms. For extraction with the personal bibtex database the average inference time is around 1800 ms, while with the DBLP database it is 600 ms. This time can also be reduced considerably via efficient index lookups [5].

6 Related work

External dictionaries have been exploited to improve the accuracy of NER tasks based on conditional models. A common scheme is to define boolean features based on exact match of a word with terms that appear in an entity dictionary. This is extended in [7] to handle noisy dictionaries through features that capture various forms of similarity measures with dictionary entries. In both these cases the goal is to exploit entity lists in isolation, whereas our goal is to handle multiple inter-linked entity lists. Also, we exploit entity lists more effectively by building an entity classifier and pattern dictionaries. Another mechanism is to treat the dictionary as a collection of training examples, and this has been explored for the case of generative models in [17]. This method suffers from a number of drawbacks: there is no obvious way to apply it in a conditional setting; it is highly sensitive to misspellings within a token; and when the dictionary is too large or too different from the training text, it may degrade performance. In [1], some of these drawbacks are addressed by more carefully training an HMM so as to allow small variations in the extracted entities, but this approach, being generative, is still restricted in the variety of features that it can handle compared to recent conditional models. Also, it cannot effectively combine information available in both labeled unstructured data and external databases. Recently there has also been work on exploiting links for matching entities in a relational database, including conditional graphical models for grouped matching of multi-attribute records [13] and CRFs for matching a group of records enforcing the transitivity constraint [12].
These are techniques for batched deduplication of a group of records, whereas we are addressing an online scenario where unstructured records get inserted one at a time into the database.

7 Conclusions

In this paper we showed how to integrate unstructured text records into existing multi-relational databases.

Our models combine clues from both existing entities in the database and labeled unstructured text. We extend state-of-the-art semi-Markov CRFs with a succinct set of features to capture pattern-level and entity-level information in the database, and use these to extract entities and integrate them into a database while respecting its key constraints. Experiments show that our proposed set of techniques leads to significant improvements in accuracy on real-life datasets. This is one of the first papers on statistical models for integrating unstructured records into existing databases, and there is a lot of opportunity for future work. We plan to develop other ways of exploiting a database of entities, particularly inter-entity correlation expressed via soft constraints. Another important aspect is creating standardized versions of entities, either by choosing the best of the existing variants or by merging several noisy variants.

Acknowledgements

This work was supported by Microsoft Research and an IBM faculty award. The last author thanks them for their generous support.

References

[1] E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA, 2004.
[2] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003.
[3] V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. In Proc. ACM SIGMOD International Conf. on Management of Data, Santa Barbara, USA, 2001.
[4] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics, 1998.
[5] A. Chandel, P. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In Proc. of the 22nd IEEE Int'l Conference on Data Engineering (ICDE), 2006.
[6] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.
[7] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[8] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML-2001), Williamstown, MA, 2001.
[9] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.
[10] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45:503-528, 1989.
[11] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49-55, 2002.
[12] A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79-86, Acapulco, Mexico, Aug. 2003.
[13] Parag and P. Domingos. Multi-relational record linkage. In Proceedings of the 3rd Workshop on Multi-Relational Data Mining at ACM SIGKDD, Seattle, WA, Aug. 2004.
[14] A. Ratnaparkhi. Learning to parse natural language with maximum entropy models. Machine Learning, 1999.
[15] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Canada, July 2002.
[16] S. Sarawagi and W. W. Cohen. Semi-Markov conditional random fields for information extraction. In NIPS, 2004.
[17] K. Seymore, A. McCallum, and R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
[18] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.


More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including

More information

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets

More information

Cluster Description Formats, Problems and Algorithms

Cluster Description Formats, Problems and Algorithms Cluster Description Formats, Problems and Algorithms Byron J. Gao Martin Ester School of Computing Science, Simon Fraser University, Canada, V5A 1S6 bgao@cs.sfu.ca ester@cs.sfu.ca Abstract Clustering is

More information

Automatic segmentation of text into structured records

Automatic segmentation of text into structured records Automatic segmentation of text into structured records Vinayak Borkar Kaustubh Deshmukh Sunita Sarawagi Indian Institute of Technology, Bombay ABSTRACT In this paper we present a method for automatically

More information

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Modeling System Calls for Intrusion Detection with Dynamic Window Sizes

Modeling System Calls for Intrusion Detection with Dynamic Window Sizes Modeling System Calls for Intrusion Detection with Dynamic Window Sizes Eleazar Eskin Computer Science Department Columbia University 5 West 2th Street, New York, NY 27 eeskin@cs.columbia.edu Salvatore

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

More information

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured

More information

Network (Tree) Topology Inference Based on Prüfer Sequence

Network (Tree) Topology Inference Based on Prüfer Sequence Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 vanniarajanc@hcl.in,

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database. Physical Design Physical Database Design (Defined): Process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

1-04-10 Configuration Management: An Object-Based Method Barbara Dumas

1-04-10 Configuration Management: An Object-Based Method Barbara Dumas 1-04-10 Configuration Management: An Object-Based Method Barbara Dumas Payoff Configuration management (CM) helps an organization maintain an inventory of its software assets. In traditional CM systems,

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609.

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609. Data Integration using Agent based Mediator-Wrapper Architecture Tutorial Report For Agent Based Software Engineering (SENG 609.22) Presented by: George Shi Course Instructor: Dr. Behrouz H. Far December

More information

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Integrity Preservation and Privacy Protection for Digital Medical Images M.Krishna Rani Dr.S.Bhargavi IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Abstract- In medical treatments, the integrity

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

How To Cluster On A Search Engine

How To Cluster On A Search Engine Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Tracking Groups of Pedestrians in Video Sequences

Tracking Groups of Pedestrians in Video Sequences Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal

More information

Load Balancing and Switch Scheduling

Load Balancing and Switch Scheduling EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

More information

SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

More information

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications IBM Software Information Management Scaling strategies for mission-critical discovery and navigation applications Scaling strategies for mission-critical discovery and navigation applications Contents

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Reasoning Component Architecture

Reasoning Component Architecture Architecture of a Spam Filter Application By Avi Pfeffer A spam filter consists of two components. In this article, based on my book Practical Probabilistic Programming, first describe the architecture

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Integration of Data Mining Techniques in Database Management Systems Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

ANALYTICS IN BIG DATA ERA

ANALYTICS IN BIG DATA ERA ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut

More information

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

More information

Modeling Guidelines Manual

Modeling Guidelines Manual Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe john.doe@johnydoe.com Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.

More information

KNOWLEDGE FACTORING USING NORMALIZATION THEORY

KNOWLEDGE FACTORING USING NORMALIZATION THEORY KNOWLEDGE FACTORING USING NORMALIZATION THEORY J. VANTHIENEN M. SNOECK Katholieke Universiteit Leuven Department of Applied Economic Sciences Dekenstraat 2, 3000 Leuven (Belgium) tel. (+32) 16 28 58 09

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Database Application Developer Tools Using Static Analysis and Dynamic Profiling Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

More information

Tensor Factorization for Multi-Relational Learning

Tensor Factorization for Multi-Relational Learning Tensor Factorization for Multi-Relational Learning Maximilian Nickel 1 and Volker Tresp 2 1 Ludwig Maximilian University, Oettingenstr. 67, Munich, Germany nickel@dbs.ifi.lmu.de 2 Siemens AG, Corporate

More information

The Real Challenges of Configuration Management

The Real Challenges of Configuration Management The Real Challenges of Configuration Management McCabe & Associates Table of Contents The Real Challenges of CM 3 Introduction 3 Parallel Development 3 Maintaining Multiple Releases 3 Rapid Development

More information

Private Record Linkage with Bloom Filters

Private Record Linkage with Bloom Filters To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

MapReduce/Bigtable for Distributed Optimization

MapReduce/Bigtable for Distributed Optimization MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. kbhall@google.com Scott Gilpin Google Inc. sgilpin@google.com Gideon Mann Google Inc. gmann@google.com Abstract With large data

More information