Integrating unstructured data into relational databases


Imran R. Mansuri (IIT Bombay) and Sunita Sarawagi (IIT Bombay)

Abstract

In this paper we present a system for automatically integrating unstructured text into a multi-relational database using state-of-the-art statistical models for structure extraction and matching. We show how to extend current high-performing models, Conditional Random Fields and their semi-Markov counterparts, to effectively exploit a variety of recognition clues available in a database of entities, thereby significantly reducing the dependence on manually labeled training data. Our system is designed to load unstructured records into columns spread across multiple tables in the database while resolving the relationship of the extracted text with existing column values, and preserving the cardinality and link constraints of the database. We show how to combine the inference algorithms of statistical models with the database-imposed constraints for optimal data integration.

1 Introduction

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. Although a lot of research has gone into extracting structured data from unstructured sources in the NLP, machine learning [8, 18] and web mining [9] communities, most of the proposed methods depend on tediously labeled unstructured data. The database is at best treated as a store for the structured data. In this paper we attempt to bridge this gap by more centrally exploiting existing large databases of multi-relational entities to aid the extraction process on one hand, and on the other hand, augmenting the database itself with information extracted from unstructured text in a way that allows multiple noisy variants of entities to co-exist and aid future extraction tasks.

1.1 Applications

There are a number of applications where unstructured data needs to be integrated with structured databases on an ongoing basis, so that at the time of extraction a large database is available. Consider the example of publication portals like Citeseer and Google Scholar. When integrating publication data from personal homepages, it does not make sense to perform the extraction task in isolation. Structured databases from, say, the ACM digital library or DBLP are readily available and should be exploited for better integration. Another example is resume databases in HR departments of large organizations. Resumes are often stored with several structured fields like experience, education and references. When a new resume arrives in unstructured format, for example via email, we might wish to integrate it into the existing database, and the extraction task can be made easier by consulting the database. Another interesting example is personal information management (PIM) systems, where the goal is to organize personal data like documents, emails, projects and people in a structured inter-linked format. The success of such systems will depend on being able to automatically extract structure from the existing, predominantly file-based unstructured sources. Thus, for example, we should be able to automatically extract from a PowerPoint file the author of a talk and link the person to the presenter of a talk announced in an email. Again, this is a scenario where we will already have plenty of existing structured data into which we have to integrate a new unstructured file.
1.2 Challenges and Contribution

A number of challenges need to be addressed in designing an effective solution to the data integration task. Useful clues are scattered in various ways across the structured database and the unstructured text records, and exploiting them effectively is a challenge. First, the existing structured databases of entities are organized very differently from labeled unstructured text.

Most prior work has been on extracting information from unstructured text, which can be modeled as a sequence of tokens. The input labeled data is thus a collection of token sequences. In contrast, the database is a graph of entity relationships, and there is no information in the database about inter-entity sequencing. A good extraction system should exploit both the inter-entity sequencing information from the labeled unstructured text and the semantic relationships between entities in the database. Second, there is significant format variation between the names of entities in the database and in the unstructured text. The entities in a structured database provide useful pattern-level information that could help in recognizing entities in the unstructured text. Finally, in most cases the database will be large whereas the labeled text data will be small. Features designed from the database should be efficient to apply and should not dominate features that capture contextual words and positional information from the limited labeled data.

We design a data integration system that builds upon state-of-the-art semi-Markov models [7, 16] for information extraction to exploit useful information in both structured data and labeled unstructured data, in spite of their format, structure and size variations. Our experiments show that our proposed model significantly improves the accuracy of data integration tasks by exploiting databases more effectively than has been done before.

Outline. In Section 2, we describe our problem setup and present the overall architecture of the system. In Section 3, we present background on statistical models for extraction and matching. In Section 4 we show how we augment these models to exploit features from the database, and integrate data so as to meet the cardinality and link constraints in the database. In Section 5, we present integration performance on real-life datasets, and we discuss related work in Section 6.

2 Setup and Architecture

2.1 Relational data representation

A relational database represents a normalized set of entities with the relationships between them expressed through foreign keys. An example of such a multi-relational database is shown in Figure 1, representing bibtex entries for journal articles. The database consists of three different entities: articles, journals and authors. The foreign key from the articles table to the journals table shows how journal names have been normalized across different articles. The author name is a set-valued attribute of articles, and this relationship is stored in the writes table.

Figure 1. A sample normalized database for storing journal bibitems.

Figure 2. Integrating an unstructured citation into the database of Figure 1.

In real life the same physical entity is often represented by multiple names that can be thought of as noisy variants of each other. The database allows these variants to co-exist, and canonical links are placed from a noisy variant of an entry to its canonical form.
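To make the multi-variant representation of Figures 1 and 2 concrete, the following is a minimal in-memory sketch; the identifiers, values and helper function are hypothetical illustrations, not the actual schema or code of the system (which stores these as Postgres tables linked by foreign keys).

```python
# Minimal in-memory sketch of the normalized, multi-variant layout of Figure 1.
# All ids and values below are hypothetical; the real system keeps these as
# relational tables (articles, writes, authors, journals) linked by foreign keys.
authors = {
    1: {"name": "M Y Vardi",      "canonical": 1},
    2: {"name": "J. Ullman",      "canonical": 2},
    3: {"name": "Ron Fagin",      "canonical": 3},
    4: {"name": "Jeffrey Ullman", "canonical": 2},   # noisy variant of "J. Ullman"
}
journals = {
    10: {"name": "ACM TODS",             "canonical": 10},
    11: {"name": "AI",                   "canonical": 11},
    12: {"name": "ACM Trans. Databases", "canonical": 10},  # variant of "ACM TODS"
}
articles = {100: {"title": "Update Semantics", "journal": 10}}   # other columns omitted
writes = [(100, 1), (100, 2)]   # (article id, author id) pairs

def add_variant(table, name, canonical_of=None):
    """Insert a (possibly noisy) variant; link it to an existing canonical entry if given."""
    new_id = max(table) + 1
    table[new_id] = {"name": name,
                     "canonical": canonical_of if canonical_of is not None else new_id}
    return new_id

# Integrating the citation of Figure 2: "R Fagin" resolves to a variant of
# "Ron Fagin", while "J Helpern" matches nothing and becomes its own canonical entry.
fagin_id = add_variant(authors, "R Fagin", canonical_of=3)
helpern_id = add_variant(authors, "J Helpern")
print(authors[fagin_id], authors[helpern_id])
```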
In the sample database of Figure 1, we show such canonical links between the two journal entries ACM TODS and ACM Trans. Databases, and between the entries Jeffrey Ullman and J. Ullman in the authors table. Each variant of an entity has a unique identifier, and thus the canonical link is a foreign key to this unique identifier. These foreign keys can be within the same table, as in the examples in Figure 1, or they could be across tables. This data representation allows a user to preserve an entity in a format that is different from its canonical form. For example, a user might want author names in a citation to be abbreviated although the canonical entry might contain the full name. In this paper we do not address the issue of which entry to pick as the canonical entry, or of how to construct a merged canonical entry.

These can be handled as part of follow-up operations once the canonical links are created.

A bit of notation to summarize the above: the database consists of a set of entities (each of which maps to a database table), where each entity has a set of attributes or column values (these exclude columns like primary or foreign keys). We refer to the union of all attributes over all entities as the attribute set; in Figure 1 it includes title, year, journal name and author name. Each entity in the database has a unique identifier and, optionally, a canonical identifier that is a foreign key to the identifier of another entity of which it is assumed to be a noisy variant.

2.2 Labeled unstructured text

The unstructured data is a text record representing a single top-level entity like articles, but embedded in a background of irrelevant text. A labeled set L is provided by someone manually labeling portions of the text with the name of the entity to which it belongs. We will show in the experimental section that the size of L can be very small (5 to 10).

2.3 The task

We start with an existing multi-entity, multi-variant database and a small labeled set L. We use these to train models for information extraction and matching so that, given an unstructured text string, we can perform the following:

- Extract from the string attribute values corresponding to all attributes in the entity database. We call this the information extraction problem.
- Map the extracted entities to existing entries in the database if they match; otherwise, create new entries. Assign relationships through foreign keys when attributes span multiple linked tables. We call this the matching problem.

As a result of these operations the entry is integrated across the multiple tables of the database with the canonical links resolved correctly. In actual deployment a user might also wish to inspect the extracted entries and correct any errors of the automated integration step.

We continue with our sample database of Figure 1 and show how to integrate into it a noisy unstructured citation "R. Fagin and J. Helpern, Belief, awareness and reasoning. In AI 1988 [10] also" (see Figure 2). We extract from this two authors, a title, a journal name and a year field and integrate them with the database as shown in Figure 2. On matching with the database, we find the journal AI to be already present and the author R. Fagin to be a variant of an existing author name Ron Fagin. This leads to the canonical link from the new entry to the existing author name.

2.4 Available clues

The database, with its entities and canonical links, and the labeled data L can be exploited in a variety of ways for learning how to integrate an incoming unstructured entry into the database. We go over these first and later show how they are exploited by the statistical models.

2.4.1 Clues in the database

As a database gets large, in most cases the entity to be extracted from unstructured text will already exist in the database. We therefore need models that can effectively exploit this valuable piece of information. The entities in the database could be useful even when the unstructured text does not match a database entity: different entity attributes have very distinctive patterns that can be generalized to recognize these entities in unstructured text. For example, by inspecting entries in a publications database we can detect that author names often follow a pattern of the form [A-Z]. [A-Z]. [A-Z][a-z]+. The schema of the database provides information on the number of entities that are allowed in each record.
In Figure 1, the table structure makes it clear that an article can have at most one year, journal and title field, but it can have multiple author fields. Also, a foreign key in the database provides information on which entities are allowed to go together. For example, if we label a text segment as a city name and the segment matches the entry San Francisco in the cities column, then any text segment that is labeled as a state needs to match the entry CA. In data integration tasks, respecting cardinality and link constraints could be a requirement even if it does not improve accuracy.

2.4.2 Information in the labeled unstructured data

Labeled unstructured data is the classical source of information for extraction models. Traditionally, three kinds of information have been derived from unstructured records. First, the labeled entities hold the same kind of pattern information as the database of entities, as discussed above. Second, the context in which an entity appears is a valuable clue that is present solely in the labeled unstructured data. Context is typically captured via a few words before or after the labeled entity. For example, the word "In" often appears before a journal name. Finally, the likely ordering of entity labels within a sequence is also useful. For example, author names usually appear before or after a title.
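As a small illustration of how the schema-derived cardinality clue of Section 2.4.1 can be turned into a hard check on a candidate labeling, here is a hedged sketch; the constraint table and labels are assumptions modeled on the bibtex example, not the system's actual code.

```python
# Sketch: reject candidate labelings that violate schema-derived cardinality
# constraints (at most one title/journal/year per article, many authors allowed).
# The constraint table below is an assumption based on the schema of Figure 1.
MAX_PER_RECORD = {"title": 1, "journal": 1, "year": 1, "author": None}  # None = unbounded

def violates_cardinality(labeled_segments):
    """labeled_segments: list of (label, text) pairs proposed for one citation."""
    counts = {}
    for label, _text in labeled_segments:
        counts[label] = counts.get(label, 0) + 1
        limit = MAX_PER_RECORD.get(label)
        if limit is not None and counts[label] > limit:
            return True
    return False

candidate = [("author", "R. Fagin"), ("author", "J. Helpern"),
             ("title", "Belief, awareness and reasoning"), ("journal", "AI")]
assert not violates_cardinality(candidate)
assert violates_cardinality(candidate + [("journal", "AAAI")])
```

A link-constraint check (for example, San Francisco forcing the state to be CA) would take the same shape, with the allowed combinations looked up through the foreign keys.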

3 Background

In this section we review state-of-the-art statistical models for extraction and matching. These models are designed to exploit clues only in labeled unstructured data; in Section 4 we will discuss how to extend them to include clues from the database.

The problem of extracting structured entities from unstructured data is an extensively researched topic. A number of models have been proposed, ranging from the earliest rule-learning models to probabilistic approaches based on generative models like HMMs [17] and conditional models like maxent taggers [14]. Almost all the probabilistic models treat extraction as a sequence labeling problem. The input unstructured text is treated as a sequence of tokens which needs to be assigned a corresponding sequence of labels from a fixed label set. A promising recently proposed model for information extraction is Conditional Random Fields (CRFs). CRFs have been shown [8, 18] to out-perform previous generative models like Hidden Markov Models and local conditional models like maxent taggers [14].

3.1 Conditional Random Fields

CRFs [8] model a conditional probability distribution over a label sequence y = y_1 ... y_n given the input sequence x. The label assigned to the i-th word can be a function of any property of the i-th word, of the words defining its context, and of the label of the word before it. These properties are represented as a vector of feature functions f(y_i, y_{i-1}, x, i), where each component is a scalar function. An example of such a feature is an indicator that is 1 if the i-th word is capitalized and y_i is the Author label, and 0 otherwise. The CRF model associates a weight with each of these features; thus there is a weight vector W corresponding to the feature vector f. These are used to design a global conditional model that assigns a probability to each possible label sequence y given (the features of) the input sequence x as follows:

Pr(y | x) = (1 / Z(x)) exp( sum_{i=1..n} W . f(y_i, y_{i-1}, x, i) )     (1)

where Z(x) is a normalizing factor equal to the sum of the exponentiated scores over all possible label sequences.

When deployed for extraction of entities from a text sequence, CRFs find the best label sequence as the one maximizing Pr(y | x), equivalently maximizing sum_i W . f(y_i, y_{i-1}, x, i). An efficient inference algorithm is possible because all features depend on just the previous label. Let V(i, y) denote the largest score of any partial labeling of positions 1 through i whose i-th label is y. The following recursive calculation (called the Viterbi algorithm) finds the best sequence:

V(i, y) = max_{y'} V(i-1, y') + W . f(y, y', x, i)  for i > 0, with V(0, y) = 0.

The best label sequence corresponds to the path traced by max_y V(n, y). This algorithm is linear in the length of the sequence.

Learning is performed by setting the parameters to maximize the log-likelihood of the training set, which is the sum over training pairs (x, y) of sum_i W . f(y_i, y_{i-1}, x, i) - log Z(x). We wish to find a W that maximizes this objective. The objective is convex and can thus be maximized by gradient ascent or one of many related methods. (In our implementation, we have used a limited-memory quasi-Newton method [10, 11].) More computational details of training can be found in [18]. The training algorithm makes multiple linear scans through the data sequences and for each sequence computes the value of the likelihood function and its gradient in two scans over the features of the sequence.

3.2 Semi-Markov CRFs (Semi-CRFs)

Semi-CRFs [16, 7] extend CRFs so that features can be defined over multi-word segments of text instead of individual words. This provides a more natural encoding of the information extraction problem, which is now defined as segmenting a text sequence such that each segment corresponds to a whole entity.
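To make the Viterbi recursion of Section 3.1 concrete, the sketch below decodes the best label sequence under an arbitrary caller-supplied scoring function standing in for W . f; this is a generic illustration of the standard algorithm, not the paper's Java implementation, and the toy scorer at the end is purely hypothetical.

```python
# Minimal first-order Viterbi decoder (sketch of the recursion in Section 3.1).
# score(prev_label, label, x, i) should return W . f(label, prev_label, x, i);
# here it is an arbitrary callable supplied by the caller.
def viterbi(x, labels, score):
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n)]
    back = [{y: None for y in labels} for _ in range(n)]
    for y in labels:                        # base case: first token, no previous label
        V[0][y] = score(None, y, x, 0)
    for i in range(1, n):                   # recurrence over token positions
        for y in labels:
            for yp in labels:
                s = V[i - 1][yp] + score(yp, y, x, i)
                if s > V[i][y]:
                    V[i][y], back[i][y] = s, yp
    y = max(labels, key=lambda lab: V[n - 1][lab])   # best final label
    path = [y]
    for i in range(n - 1, 0, -1):           # trace back the best path
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

# Toy usage: label tokens as 'author' if capitalized, else 'other'.
toy_score = lambda yp, y, x, i: 1.0 if (y == "author") == x[i][0].isupper() else 0.0
print(viterbi(["R.", "Fagin", "and", "J.", "Helpern"], ["author", "other"], toy_score))
```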
In CRFs, all contiguous words which get assigned the same label implicitly represent an entity. This is not very intuitive, since entities mostly consist of multiple words, and higher accuracy is achieved if features can be defined over the entire proposed entity, as shown in [16, 7]. More formally, a segmentation s = s_1, ..., s_p of an input sequence x is a sequence of segments such that the first segment starts at position 1, the last segment ends at position n, and segment s_{j+1} begins right after segment s_j ends.

Each segment consists of a start position, an end position, and a label. A Semi-CRF models the conditional probability distribution over a segmentation s of a given input sequence x as follows:

Pr(s | x) = (1 / Z(x)) exp( sum_{j=1..p} W . f(y_j, y_{j-1}, x, t_j, u_j) )     (5)

where W is again a weight vector for the feature vector f, Z(x) is the normalizing factor, and t_j, u_j and y_j denote the start position, end position and label of segment s_j. As in the case of normal CRFs, the label of a segment depends on the label of the previous segment and on the properties of the tokens comprising the segment. An example of such a feature, for a segment starting at position 3 and ending at position 5, is an indicator that the token span x_3 ... x_5 appears in a journal list and the segment label is journal.

During inference, the goal is to find a segmentation s of the input sequence such that Pr(s | x) (as defined by Equation 5) is maximized. Let L be an upper bound on segment length, and let V(i, y) denote the largest score of any partial segmentation of positions 1 through i whose last segment ends at i and has label y. Omitting the feature arguments, the following recursive calculation implements a semi-Markov analog of the usual Viterbi algorithm:

V(i, y) = max_{y', d = 1..L} V(i-d, y') + W . f(y, y', x, i-d+1, i)  for i > 0, with V(0, y) = 0 and V(i, y) = -infinity for i < 0.     (6)

The best segmentation corresponds to the path traced by max_y V(n, y). The running time of semi-Markov Viterbi (Equation 6) is just L times the running time of Markov Viterbi, and L is usually a small number. The training of Semi-CRFs generalizes the training algorithm of CRFs in a manner analogous to the generalization of the Viterbi algorithm. We refer to [16] for details.

3.3 Matching

Matching or deduplicating entities using statistical learning models is also a popular research topic [15]. At the core of most of these models is a binary classifier over various similarity measures between a pair of records, where the prediction is 1 if the records match and 0 otherwise. This is a straightforward binary classification problem where the features typically denote various kinds of attribute-level similarity functions like Edit distance, Soundex, N-gram overlap, Jaccard, Jaro-Winkler and Subset match [6]. Thus, we can use any binary classifier like SVMs, decision trees or logistic regression. We use a CRF with a single variable for extensibility. Thus, given a pair of text segments s_1 and s_2, the CRF predicts a label y that can take the values 0 or 1 as

Pr(y | s_1, s_2) = (1 / Z(s_1, s_2)) exp( W . f(s_1, s_2, y) )     (7)

The feature vector f(s_1, s_2, y) corresponds to various similarity measures between the text segments when y = 1, i.e., when s_1 and s_2 match.

4 Our models for data integration

In this section we describe how all the clues available across the labeled data and the multi-relational database can be combined to solve our data integration task. Let x denote the input unstructured string with tokens x_1 ... x_n obtained from the text using any tokenization method. Typically, text is tokenized based on white space and a limited set of delimiters like "," and ".". Our goal is to extract from x segments of text belonging to one of the entity attributes in the database. We also need to assign to each segment labeled with an attribute a canonical id of an entry in the corresponding column of the database with which it matches. If it matches none of the ids, then the canonical id is a special value 0 denoting none-of-the-above. Also, canonical ids connected through foreign links need to respect the foreign keys in the database, and the labels need to follow the cardinality constraints imposed by the schema.

4.1 Model for extraction

We deploy Semi-CRFs since they provide high accuracy and an elegant mechanism for extracting entities in the presence of databases. The main challenge in their usage is how to incorporate clues obtained from the existing entities in the database.
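Before turning to the database clues, a minimal sketch of the segment-level (semi-Markov) Viterbi recursion of Equation 6 is given below; the segment scorer is a caller-supplied stand-in for W . f over whole segments, and the code is an illustrative implementation of the standard recursion rather than the system's own.

```python
# Sketch of semi-Markov Viterbi (Section 3.2): segments of length up to max_len are
# scored as whole units by seg_score(prev_label, label, x, start, end).
def semimarkov_viterbi(x, labels, seg_score, max_len):
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n)]   # V[j][y]: best score of a
    back = [{y: None for y in labels} for _ in range(n)]         # segmentation ending at j with label y
    for j in range(n):
        for y in labels:
            for start in range(max(0, j - max_len + 1), j + 1):  # enumerate segment start positions
                prevs = labels if start > 0 else [None]
                for yp in prevs:
                    prev_score = V[start - 1][yp] if start > 0 else 0.0
                    s = prev_score + seg_score(yp, y, x, start, j)
                    if s > V[j][y]:
                        V[j][y], back[j][y] = s, (start, yp)
    y = max(labels, key=lambda lab: V[n - 1][lab])
    segments, j = [], n - 1
    while j >= 0:                                                # trace segments backwards
        start, yp = back[j][y]
        segments.append((start, j, y))
        j, y = start - 1, yp
    return list(reversed(segments))

# Toy usage: favour multi-token 'title' segments and single-token 'other' segments.
toy = lambda yp, y, x, s, e: (e - s + 1) if y == "title" else (1.0 if e == s else -5.0)
print(semimarkov_viterbi("belief awareness and reasoning .".split(), ["title", "other"], toy, 4))
```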
A straightforward mechanism would be to add the existing list of entity values as additional labeled training examples. However, in practice we found this approach to be very sensitive to the training set and its size, and it often leads to a drop in accuracy. We therefore designed models that separate out the roles of database entities and unstructured records. We use the database to add three types of entity-level features to the sequential model trained over unstructured text.

4.1.1 Similarity to a word-level dictionary

Very often an entity in an unstructured record may already exist in the database, albeit in a slightly different form. A number of similarity functions have been proposed in the record-linkage literature for defining efficient and effective similarity functions that are tolerant to commonly occurring variations in named entities. We can exploit these to perform better extraction by adding similarity features as proposed in [7]. An example of such a feature assigns as its value the maximum cosine similarity that a segment has to the entries of the Author name column of the database, firing when the segment is labeled Author.

4.1.2 Similarity to a pattern-level dictionary

Each entity in the database contains valuable information at the pattern level that can be exploited even if the entity in the unstructured text is not the same. We design a succinct set of features to exploit such patterns as follows. We first choose a sequence of regular expression patterns. Then, for each entry in the database, we generalize each of its tokens to the first pattern it matches, concatenate the pattern-ids over all tokens, and insert the concatenated pattern-id sequence into an index. For example, with a suitable three-pattern sequence, a text string like "R. L, Stevenson" reduces to a three-id pattern sequence, assuming the text is tokenized on space. We find the pattern representation of all entities in the database and index them separately for each entity type. We then design boolean features for segments that check for the presence of the segment's pattern in each entity type's pattern dictionary.

4.1.3 Entity classifier

The database contains a variety of other clues, including the typical length of entities, characteristic words and multiple patterns per token, that are not captured by the above two dictionaries. We capture all of these by building an entity classifier, where an entity is characterized by as many such features as required and a label that denotes its type. In addition, we also include subsets of each entity with a negative label. This defines a multi-class classification problem, and the CRF model (or any other classifier like SVM) can be used to train the classifier. When the classifier is applied to a text segment, it returns a vector of scores denoting the probability that the segment is of each of the given entity types. We use this classifier to define a new set of semi-Markov features for the extraction model that we are constructing from the unstructured examples: if the number of distinct entity types is m, we add m features, where the k-th feature captures the strength of the k-th score obtained from the classifier in identifying the k-th entity type.

4.2 Model for matching

Given a segment of x whose attribute label resolves to one of the database attributes, we need to find a canonical id that denotes which existing value of that attribute it matches in the database, with the id equal to 0 if it matches none of the existing entities. We adopt the pair-wise match model of Section 3.3, where for each canonically labeled segment in the training data we form pairs with the database entities. The entities with the same canonical id get a label of 1 and the others get a label of 0. In addition to examples in the training data, we can also easily generate pairs of records from the database and assign a label of 1 to those with the same canonical id and 0 otherwise.
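As a hedged sketch of what the pair-wise match features of Sections 3.3 and 4.2 might look like, the fragment below compares an extracted segment with a database entry using a couple of simple string similarities; the particular similarity functions are illustrative stand-ins for the Edit distance, Jaccard, JaroWinkler and related measures named above, not the system's exact feature set.

```python
# Sketch of the pair-wise match model's feature vector: a candidate segment is
# compared to a database entry with a few cheap string similarities, and a binary
# classifier is trained over such vectors (label 1 for pairs that share a
# canonical id, 0 otherwise). The similarity functions below are simple stand-ins.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def prefix_overlap(a, b):
    n = 0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def match_features(segment, db_entry):
    """Feature vector for one (extracted segment, database entry) pair."""
    return [
        jaccard(segment, db_entry),
        prefix_overlap(segment, db_entry),
        1.0 if segment.lower() == db_entry.lower() else 0.0,   # exact-match indicator
    ]

print(match_features("R Fagin", "Ron Fagin"))
print(match_features("ACM Trans. Databases", "ACM TODS"))
```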
An important concern about the pair-wise matching process is efficiency. We cannot afford to attempt a potential match against each and every canonical id for which there is an entry in the database. We solve this problem by using a top-k index for each database column. Given a text segment, we probe the index to get a candidate set of database entries and evaluate the match classifier for each candidate. If the predicted value is 0 for every candidate, we conclude that the segment resolves to none of the existing entities. When presented with a record with multiple extracted attribute segments, we need to simultaneously assign canonical ids to each of them so as not to violate any of the integrity constraints. We first search the attribute index to find the top-k candidate canonical ids for each attribute segment. We then pose database queries to find all canonical ids which have direct or indirect foreign key link(s) from each of the top-k canonical ids associated with the attribute segments. We denote the resulting set of link constraints as a set of tuples of the form (a, r, a', r'), where a is the attribute type of an attribute segment, r is one of the top-k canonical ids for the segment, a' is an attribute having a foreign key relationship with a, and r' is the canonical id of the entry of attribute a' that is referred to by the entry r. Assuming that a segment sequence has two attribute segments s and s' of attribute types a and a' respectively, a link constraint is said to be violated in either of the following two cases.

First, attribute a has a foreign link to attribute a' in the database schema, and the canonical id r assigned to segment s points, through the link-constraint tuples, to a canonical id other than the one assigned to s'. Second, the same condition holds with the roles of s and s' reversed.

4.3 Overall training and deployment process

The overall training and deployment process for extraction and matching appears in Figures 3 and 4.

Figure 3. Training process: unstructured labeled data (L) and DB records are passed through similarity indices, pattern indices and the entity classifier to collect DB features, which are combined with other extraction features to train the extraction model and the matching model (both semi-CRFs).

Figure 4. Deploy process: unstructured unlabeled data is augmented with DB features and decoded with the extraction and matching models under the cardinality and link constraints, using A* inference, to produce extracted and matched record(s).

Figure 5. Constrained inference using A*:
1: Perform a backward Viterbi scan and cache all the possible suffix upper bound scores (see Equation 8).
2: Initialize the priority queue Q with the start state.
3: Initialize global-lb.
4: while Q is not empty do
5:   Retrieve the state with the maximum sequence upper bound estimate from the queue.
6:   if it is a goal state then
7:     Return the solution.
8:   end if
9:   Generate successors of the state, and add valid successors to Q provided their upper bound is at least global-lb.
10:  Periodically update global-lb via beam search from the current state.
11: end while
12: Return failure.

4.3.1 Training process

During the offline training phase, we first train the entity classifier from the entities in the database. The entity classifier could be trained incrementally, but we do not support that currently. For the similarity and pattern features, we maintain an inverted index over each of the entity columns that is always kept up-to-date with the entries in the database. Next, for each labeled document we create its normal intrinsic features and then probe the entity classifier and the indices to create the three kinds of database-derived features. The index probes for computing the various dictionary features are performed using a top-k search on the inverted index. Also, since this requires multiple index probes on overlapping segments of the input, we have designed a batch top-k index search algorithm [5] where, instead of independently finding top-k matches for all candidate segments, we batch the top-k queries over overlapping segments. This significantly reduces the time required for finding matches. We do not consider such optimizations in this paper because of lack of space and because our main focus here is accuracy of extraction via skillful feature design. We cache the index lookups but compute all the other features on the fly for each training iteration. We then train a semi-CRF model for extraction that outputs the weight values for each feature type. We train the match model by forming pairs both out of top-k matches of the training segments with the database entities and out of pairs formed purely from database entities. The trained models are stored on disk for the deployment stage.

4.3.2 Deployment process

While making inferences on the labels of an unstructured text, we also need to follow the cardinality constraints imposed by the database schema. The Viterbi algorithm presented earlier cannot easily meet these constraints, because they upset the Markovian property that Viterbi depends on.
Section 4.4 presents an A*-based search algorithm that can efficiently find the optimal segmentation even with constraints. After extracting the attribute segments, we assign canonical ids while maintaining the link constraints using another run of the same A* algorithm.

4.4 The A* inference algorithm

A simple mechanism to extend Viterbi for constrained labeling is via a beam search which maintains not just the best path but the best k paths at any position. When we extend a path by one more segment, we check for conflicts and drop invalid extensions. This is guaranteed to find the optimal solution when k is very large, and the time and space requirement is multiplied by k.

In practice, conflicts are rare, and the beam search algorithm might be unnecessarily maintaining alternatives that are never used. We propose a better optimized inference algorithm based on the A* search paradigm. This algorithm is guaranteed to find the optimal solution given unbounded space, but because of the non-Markovian constraints we cannot guarantee optimality in polynomial time. The algorithm allows us to trade off optimality and efficiency by controlling the number of search nodes expanded. Even when it is made to terminate early, the solution will be no worse than a normal Viterbi run with a fixed beam size.

A* is a branch-and-bound search algorithm that depends on upper and lower bound guarantees associated with states in the search space to reach the optimal solution via the minimum possible number of states. An outline of the inference algorithm based on the A* technique is given in Figure 5. Our state space consists of the set of partial segmentations starting from the beginning of the sequence and ending at an intermediate position. Thus, each state in the state space can be expressed as a sequence of segments, using the segment notation from Section 3.2, such that the first one starts at 1 and the last one ends at some position i. The start state is the empty segmentation and a goal state is any segmentation whose last segment ends at the last position of the sequence. Each state has associated with it an upper bound score based on which it is ordered in a priority queue Q. Initially, Q contains only the start state. In each iteration we dequeue the state with the largest upper bound and generate its successors as follows.

Generating successors. The successors of a given state are all the states which extend the segmentation with one more segment: the new segment starts right after the last segment of the state ends, ends at a position no more than the maximum segment length away, and has a label that does not violate any constraint given the sequence of labels in the segmentation of the state. The constraints imposed can be any arbitrary constraints on label assignments. In our case, cardinality constraints derived from the database schema are imposed during extraction, while for matching we enforce link constraints derived from the database entities, as discussed in Section 4.2.

Calculating the upper bound. We now discuss how to compute the upper bound score associated with a state. This score consists of two parts. The first part is the actual score over the partial segment sequence of the state; this value can be computed incrementally when a state is generated from its parent by the extension of a single segment. The second part is an upper bound on the maximum score possible if the state is extended with segments that take it to a goal state. The sum is thus an upper bound on the maximum possible score of the entire sequence from start position 1 to the final position, with the prefix of the path constrained to be the state's partial segmentation. The suffix bound is computed for all states in a single backward Viterbi pass that ignores all constraints, and thus provides an upper bound on the maximum possible score. Formally, let B(j, y) denote the largest score obtainable by segmenting positions j+1 through n, given that the segment ending at position j has label y. Omitting the feature arguments, the following recursive calculation implements a semi-Markov, backward Viterbi algorithm:

B(j, y) = max_{y', d = 1..L} W . f(y', y, x, j+1, j+d) + B(j+d, y')  for j < n, with B(n, y) = 0.     (8)

The value B(j, y) is the suffix upper bound score for any partial segmentation that ends at position j with last label y.

4.4.1 Pruning on lower bounds

For most inference tasks, the number of state expansions before reaching a goal state is small. But sometimes, with many label constraints, the number of expansions can be large.
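A compact sketch of the search loop of Figure 5 is shown below, before we turn to these pruning refinements; the state representation, successor generation and upper bound computation are left abstract, all names are hypothetical, and the dynamic lower-bound updates of this subsection are omitted.

```python
import heapq

# Compact sketch of the A* search of Figure 5. A state is a partial segmentation;
# upper_bound(state) must return (exact score of the prefix) + (suffix bound from
# the backward Viterbi pass of Equation 8); successors(state) must drop extensions
# that violate cardinality or link constraints. All names here are hypothetical.
def astar_decode(start_state, successors, is_goal, upper_bound, max_expansions=10000):
    counter = 0                                    # tie-breaker; also a crude expansion budget
    queue = [(-upper_bound(start_state), counter, start_state)]
    while queue and counter < max_expansions:
        _, _, state = heapq.heappop(queue)
        if is_goal(state):
            return state                           # with an admissible bound this is the optimum
        for succ in successors(state):
            counter += 1
            heapq.heappush(queue, (-upper_bound(succ), counter, succ))
    return None                                    # budget exhausted: fall back to beam search
```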
In such cases, we limit the number of expansions and the size of the priority queue to some predefined maximum value. Another idea is to prune nodes which will never be part of an optimal solution. We can determine a lower bound solution using the beam search Viterbi for constrained inference described earlier; here the beam size need not be very large, and a small value suffices in practice. The score of this solution can be used as a lower bound, and states whose sequence upper bound is less than the lower bound can be pruned. The lower bound can be dynamically updated by extending the partial solution of the current best state using the beam search Viterbi technique. The frequency of lower bound updates is chosen as a function of the number of expansions and the queue size so as to optimize overall performance. We skip details due to lack of space.

5 Experiments

We evaluated our system on the following datasets.

Personal Bibtex. In this case our goal was to simulate the scale of databases as would arise in a PIM. One of the authors of this paper had a set of journal entries in her .bib files collected over 10 years. This was loaded into a normalized relational database following a schema similar to that shown in Figure 1, with one intermediate table, Journal-issues.

The Journal-issues table further normalizes journals at the level of individual issues: the articles table points to Journal-issues, which in turn points to Journals. The set of attributes consisted of title and page numbers from articles, year and volume from journal-issues, author names, and journal names. Of these, year, page numbers and volume are fairly easy to detect and provide close to 95% accuracy with or without a DB. We therefore concentrate only on the remaining three fields in reporting accuracy values. The unstructured records consisted of 100 citations obtained by searching Citeseer for citations of a random subset of authors appearing in the bibtex database. These were manually tagged with attribute and match ids. Of these, 5% of titles, 78% of journals and 75% of authors already appeared in the database. These were not exact matches but noisy inexact matches that needed manual resolution. We have published the dataset for others to use (sunita/data/personalbib.tar.gz).

Address data. The Address dataset consists of 95 home addresses of students at IIT Bombay, India. These addresses are less regular than US addresses, and extracting even fields like city names is challenging. We purchased a postal database that contains a hierarchy of India pin codes, cities and states arranged over three tables linked via foreign keys. All state names and 95% of the city names appeared in the database, but the database was clean and contained no canonical variants, whereas in the unstructured text there were several noisy variants. The labeled data also had other fields, like road names and building names, for a total of 16 labels. We trained the extraction model over all sixteen labels but report accuracy over only the three labels for which we had columns in the database.

Platform. We used Postgres to store the structured database and Lucene to index the text columns of the database. Our extraction and matching code was in Java. The learner was the open-source CRF and semi-CRF package that we have contributed to sourceforge.

Features. Features followed the template described earlier. Each token contributed two types of features: (1) the token itself if it was encountered in the training set, otherwise a special feature called Unknown, and (2) the set of regular expressions, like digit or not and capitalized or not, that the token matches. In the semi-Markov model, for each segment we invoke features on the first, last and middle tokens of the segment. The context features consist of features on the tokens one to the left and one to the right of the segment. These are all Markov features that can be handled by any Begin, Continue, End state encoding. In addition, edge features captured the dependency between labels assigned to adjacent segments. The semi-Markov non-database features were the segment length and counts for each possible regular expression that fires. The semi-Markov database features were of the three types described in Section 4.1. We used two similarity functions, JaroWinkler and SoftTFIDF [6], in addition to Lucene scores.

Experiment settings. Since our system is targeted at settings where labeled data (L) is limited, by default we used roughly 5% of the available data for training. The rest was used for testing. We also report numbers with an increasing fraction of training data. All numbers are averaged over five random selections of training and test data. We measure accuracy in terms of the correctness of the entire extracted entity (i.e., partial extraction gets no credit).
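The token-level part of the feature template described under Features above might be sketched as follows; the regular-expression bank and feature names are illustrative assumptions rather than the exact set used in the experiments.

```python
import re

# Illustrative sketch of the token-level feature template: (1) the token identity
# if seen in training (else "UNKNOWN"), and (2) which of a small bank of regular
# expressions it matches. The regexes below are assumptions, not the exact set used.
REGEX_BANK = {
    "ALL_DIGITS":  re.compile(r"^\d+$"),
    "CAPITALIZED": re.compile(r"^[A-Z]"),
    "INITIAL_DOT": re.compile(r"^[A-Z]\.$"),
    "HAS_HYPHEN":  re.compile(r"-"),
}

def token_features(token, training_vocabulary):
    feats = ["TOKEN=" + (token if token in training_vocabulary else "UNKNOWN")]
    for name, rx in REGEX_BANK.items():          # one boolean feature per firing regex
        if rx.search(token):
            feats.append("REGEX=" + name)
    return feats

vocab = {"In", "and", "ACM", "TODS"}
print(token_features("R.", vocab))      # ['TOKEN=UNKNOWN', 'REGEX=CAPITALIZED', 'REGEX=INITIAL_DOT']
print(token_features("1988", vocab))    # ['TOKEN=UNKNOWN', 'REGEX=ALL_DIGITS']
```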
We report for each label the recall, precision and F1 values, where F1 is defined as 2*precision*recall/(precision+recall). Our main focus in this paper is accuracy, and we study that in Sections 5.1 to 5.4. However, to establish the practical feasibility of using a learning-based system, we also report summarized results on training and test time in Section 5.5.

5.1 Overall data integration performance

In this section, we first present the accuracy of the full integration task (extraction + matching) when the database is more centrally exploited. We summarize the results in Table 1, where we report precision, recall and F1 values of integration for each text field of the datasets introduced earlier. The first set of numbers is for the case where the database does not contribute to the learning models at all and the models are trained using the unstructured data alone. The second set of numbers is due to our model trained using both database-specific features and features from unstructured text, for both extraction and matching. The table shows that for all labels there is a significant improvement in F1 accuracy. For some fields, like journals of the PersonalBib dataset and state names of the Address dataset, F1 improves by more than 50%. The table also shows separate extraction and matching accuracies. For all labels there is a significant improvement in extraction performance, with journal and state showing some of the largest F1 jumps. The overall average F1 went up by 10 points, a substantial reduction in error. In the case of matching with databases, we added approximately 100 examples from the database as training instances. Matching results for the IITB database with and without the database are almost the same.

Table 1. Accuracy of overall integration, extraction and matching with and without an external database (P=precision, R=recall).

For the personal bibtex database, there is a significant improvement in the accuracy of title when the database tuples are added in training, while for the other labels (journal and author) the accuracies are comparable. We now present a finer-grained analysis of the results and study the effect of various factors like data sizes, feature sets and databases on extraction, since that is what contributes most of the difference.

5.2 Effect of various features

We assess the usefulness of various kinds of clues for aiding the extraction task by evaluating performance with various feature combinations for two different training sizes of PersonalBib. In Figure 6, we show F1 numbers, starting with an extraction model trained only on features of L and then adding four different database feature types in sequence: schema-derived cardinality information, similarity features, the entity classifier, and regex-level match features. Accuracy improves with each additional feature type. The first significant improvement comes from the similarity features of the database, followed by the improvement due to the regex features. Next we evaluate the importance of features derived purely from L by dropping features of three types, entity, context and label order, while keeping all the database features. Dropping the entity features of L raises accuracy slightly, probably because given the small training set these tend to overfit the data. The context and edge (label dependency) features are important, as seen by the drop in accuracy when they are removed. Interestingly, if context and edge features are removed in parallel before removing entity features (not shown in the figure), accuracy does not drop much. Thus, one of the context and edge features is important, but the two together are not needed.

Figure 6. Effect of various features from the database (db) and labeled data (L) on F1 accuracy, for training fractions of 5% and 10%.

5.3 Effect of increasing training data

We show how the relative accuracy of extraction with and without a database is affected by increasing labeled training data. For these experiments we fixed the test set to 50% of the available data and selected different random subsets for training. Figure 7 shows average F1 with increasing training sizes. Accuracy increases as expected in both cases, and the database continues to be useful even for larger training sets.

5.4 Effect of changing database

In order to evaluate the sensitivity of our results to particular databases, we report numbers where our personal bibtex database is changed to the publicly available DBLP database while keeping L unchanged. In Table 2, we report performance on the two databases. We notice that title performance remains unchanged, whereas author and journal accuracy drop by a few F1 points. However, there is still an overall reduction in extraction error due to the use of the database.

5.5 Running time

As the time required to train an extraction model dominates the integration process, we report performance numbers for extraction only. We compare the time required for extraction with and without databases.

Figure 7. Increasing fraction of labeled data (L) versus F1 with and without a database: (a) Personal Bibtex, (b) Address data.

Table 2. Accuracy of extraction with changing databases (P=precision, R=recall): Only-L, L + PersonalBib, and L + DBLP.

We report training times for extraction models in three settings: without a database, with the smaller personal bibtex database, and with the larger DBLP database. The personal bibtex database has a few hundred article and journal tuples and 1800 author entries, while the DBLP database is relatively large and contains many times more articles, authors and journals. Figure 8 shows the average time required to train an extraction model for various training set sizes under these three settings. Even with 50 training records and the larger DBLP database, the time required is no more than 15 minutes, and less than double the time required by the no-database setting. It is possible to reduce the gap further by more efficient index probes, as discussed in [5].

Figure 8. Increasing fraction of labeled data (L) versus running time (seconds) with and without databases.

Once an extraction model is trained, it is applied to unstructured instances using the A* algorithm discussed in Section 4.4. For extraction without databases, inference on a single bibtex entry took on average 800 ms. For extraction with the personal bibtex database the average inference time is around 1800 ms, while with the DBLP database it is 600 ms. This time can also be reduced considerably via efficient index lookups [5].

6 Related work

External dictionaries have been exploited to improve the accuracy of NER tasks based on conditional models. A common scheme is to define boolean features based on exact match of a word with terms that appear in an entity dictionary. This is extended in [7] to handle noisy dictionaries through features that capture various forms of similarity measures with dictionary entries. In both these cases the goal is to exploit entity lists in isolation, whereas our goal is to handle multiple inter-linked entity lists. Also, we exploit entity lists more effectively by building an entity classifier and pattern dictionaries. Another mechanism is to treat the dictionary as a collection of training examples, and this has been explored for the case of generative models in [17]. This method suffers from a number of drawbacks: there is no obvious way to apply it in a conditional setting; it is highly sensitive to misspellings within a token; and when the dictionary is too large or too different from the training text, it may degrade performance. In [1], some of these drawbacks are addressed by more carefully training an HMM so as to allow small variations in the extracted entities, but this approach, being generative, is still restricted in the variety of features that it can handle compared to recent conditional models. Also, it cannot effectively combine information available in both labeled unstructured data and external databases. Recently there has also been work on exploiting links for matching entities in a relational database, including conditional graphical models for grouped matching of multi-attribute records [13] and CRFs for matching a group of records enforcing the transitivity constraint [12].
These are techniques for batched deduplication of a group of records, whereas we are addressing an online scenario where unstructured records get inserted one at a time into the database.

7 Conclusions

In this paper we showed how to integrate unstructured text records into existing multi-relational databases.

Our models combine clues from both existing entities in the database and labeled unstructured text. We extend state-of-the-art semi-Markov CRFs with a succinct set of features to capture pattern-level and entity-level information in the database, and use these to extract entities and integrate them into a database while respecting its key constraints. Experiments show that our proposed set of techniques leads to significant improvements in accuracy on real-life datasets. This is one of the first papers on statistical models for integrating unstructured records into existing databases, and there is a lot of opportunity for future work. We plan to develop other ways of exploiting a database of entities, particularly inter-entity correlation expressed via soft constraints. Another important aspect is creating standardized versions of entities, either by choosing the best of the existing variants or by merging several noisy variants.

Acknowledgements

This work was supported by Microsoft Research and an IBM faculty award. The last author thanks them for their generous support.

References

[1] E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA, 2004.
[2] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003.
[3] V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. In Proc. ACM SIGMOD International Conf. on Management of Data, Santa Barbara, USA, 2001.
[4] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics, 1998.
[5] A. Chandel, P. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In Proc. of the 22nd IEEE Int'l Conference on Data Engineering (ICDE), 2006.
[6] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.
[7] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[8] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML-2001), Williamstown, MA, 2001.
[9] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.
[10] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45:503-528, 1989.
[11] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49-55, 2002.
[12] A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79-86, Acapulco, Mexico, Aug. 2003.
[13] Parag and P. Domingos. Multi-relational record linkage. In Proceedings of the 3rd Workshop on Multi-Relational Data Mining at ACM SIGKDD, Seattle, WA, Aug. 2004.
[14] A. Ratnaparkhi. Learning to parse natural language with maximum entropy models. Machine Learning, 1999.
[15] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Canada, July 2002.
[16] S. Sarawagi and W. W. Cohen. Semi-Markov conditional random fields for information extraction. In NIPS, 2004.
[17] K. Seymore, A. McCallum, and R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
[18] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.


More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including

More information

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets

More information

Cluster Description Formats, Problems and Algorithms

Cluster Description Formats, Problems and Algorithms Cluster Description Formats, Problems and Algorithms Byron J. Gao Martin Ester School of Computing Science, Simon Fraser University, Canada, V5A 1S6 bgao@cs.sfu.ca ester@cs.sfu.ca Abstract Clustering is

More information

Automatic segmentation of text into structured records

Automatic segmentation of text into structured records Automatic segmentation of text into structured records Vinayak Borkar Kaustubh Deshmukh Sunita Sarawagi Indian Institute of Technology, Bombay ABSTRACT In this paper we present a method for automatically

More information

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Modeling System Calls for Intrusion Detection with Dynamic Window Sizes

Modeling System Calls for Intrusion Detection with Dynamic Window Sizes Modeling System Calls for Intrusion Detection with Dynamic Window Sizes Eleazar Eskin Computer Science Department Columbia University 5 West 2th Street, New York, NY 27 eeskin@cs.columbia.edu Salvatore

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

More information

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured

More information

Network (Tree) Topology Inference Based on Prüfer Sequence

Network (Tree) Topology Inference Based on Prüfer Sequence Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 vanniarajanc@hcl.in,

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database. Physical Design Physical Database Design (Defined): Process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

1-04-10 Configuration Management: An Object-Based Method Barbara Dumas

1-04-10 Configuration Management: An Object-Based Method Barbara Dumas 1-04-10 Configuration Management: An Object-Based Method Barbara Dumas Payoff Configuration management (CM) helps an organization maintain an inventory of its software assets. In traditional CM systems,

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609.

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609. Data Integration using Agent based Mediator-Wrapper Architecture Tutorial Report For Agent Based Software Engineering (SENG 609.22) Presented by: George Shi Course Instructor: Dr. Behrouz H. Far December

More information

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Integrity Preservation and Privacy Protection for Digital Medical Images M.Krishna Rani Dr.S.Bhargavi IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Abstract- In medical treatments, the integrity

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

How To Cluster On A Search Engine

How To Cluster On A Search Engine Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Tracking Groups of Pedestrians in Video Sequences

Tracking Groups of Pedestrians in Video Sequences Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal

More information

Load Balancing and Switch Scheduling

Load Balancing and Switch Scheduling EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

More information

SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

More information

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications IBM Software Information Management Scaling strategies for mission-critical discovery and navigation applications Scaling strategies for mission-critical discovery and navigation applications Contents

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Reasoning Component Architecture

Reasoning Component Architecture Architecture of a Spam Filter Application By Avi Pfeffer A spam filter consists of two components. In this article, based on my book Practical Probabilistic Programming, first describe the architecture

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Integration of Data Mining Techniques in Database Management Systems Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

ANALYTICS IN BIG DATA ERA

ANALYTICS IN BIG DATA ERA ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut

More information

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

More information

Modeling Guidelines Manual

Modeling Guidelines Manual Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe john.doe@johnydoe.com Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.

More information

KNOWLEDGE FACTORING USING NORMALIZATION THEORY

KNOWLEDGE FACTORING USING NORMALIZATION THEORY KNOWLEDGE FACTORING USING NORMALIZATION THEORY J. VANTHIENEN M. SNOECK Katholieke Universiteit Leuven Department of Applied Economic Sciences Dekenstraat 2, 3000 Leuven (Belgium) tel. (+32) 16 28 58 09

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Database Application Developer Tools Using Static Analysis and Dynamic Profiling Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Optimization of ETL Work Flow in Data Warehouse

Optimization of ETL Work Flow in Data Warehouse Optimization of ETL Work Flow in Data Warehouse Kommineni Sivaganesh M.Tech Student, CSE Department, Anil Neerukonda Institute of Technology & Science Visakhapatnam, India. Sivaganesh07@gmail.com P Srinivasu

More information

Tensor Factorization for Multi-Relational Learning

Tensor Factorization for Multi-Relational Learning Tensor Factorization for Multi-Relational Learning Maximilian Nickel 1 and Volker Tresp 2 1 Ludwig Maximilian University, Oettingenstr. 67, Munich, Germany nickel@dbs.ifi.lmu.de 2 Siemens AG, Corporate

More information

The Real Challenges of Configuration Management

The Real Challenges of Configuration Management The Real Challenges of Configuration Management McCabe & Associates Table of Contents The Real Challenges of CM 3 Introduction 3 Parallel Development 3 Maintaining Multiple Releases 3 Rapid Development

More information

Private Record Linkage with Bloom Filters

Private Record Linkage with Bloom Filters To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

MapReduce/Bigtable for Distributed Optimization

MapReduce/Bigtable for Distributed Optimization MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. kbhall@google.com Scott Gilpin Google Inc. sgilpin@google.com Gideon Mann Google Inc. gmann@google.com Abstract With large data

More information