Random Walk Inference and Learning in A Large Scale Knowledge Base

Transcription

1 Random Walk Inferene and Learning in A Large Sale Knowledge Base Ni Lao Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA nlao@s.mu.edu Tom Mithell Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA tom.mithell@s.mu.edu William W. Cohen Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA wohen@s.mu.edu Abstrat We onsider the problem of performing learning and inferene in a large sale knowledge base ontaining imperfet knowledge with inomplete overage. We show that a soft inferene proedure based on a ombination of onstrained, weighted, random walks through the knowledge base graph an be used to reliably infer new beliefs for the knowledge base. More speifially, we show that the system an learn to infer different target relations by tuning the weights assoiated with random walks that follow different paths through the graph, using a version of the Path Ranking Algorithm (Lao and Cohen, 2010b). We apply this approah to a knowledge base of approximately 500,000 beliefs extrated imperfetly from the web by NELL, a never-ending language learner (Carlson et al., 2010). This new system improves signifiantly over NELL s earlier Horn-lause learning and inferene method: it obtains nearly double the preision at rank 100, and the new learning method is also appliable to many more inferene tasks. 1 Introdution Although there is a great deal of reent researh on extrating knowledge from text (Agihtein and Gravano, 2000; Etzioni et al., 2005; Snow et al., 2006; Pantel and Pennahiotti, 2006; Banko et al., 2007; Yates et al., 2007), muh less progress has been made on the problem of drawing reliable inferenes from this imperfetly extrated knowledge. In partiular, traditional logial inferene methods are too brittle to be used to make omplex inferenes from automatially-extrated knowledge, and probabilisti inferene methods (Rihardson and Domingos, 2006) suffer from salability problems. This paper onsiders the problem of onstruting inferene methods that an sale to large KB s, and that are robust to imperfet knowledge. The KB we onsider is a large triple store, whih an be represented as a labeled, direted graph in whih eah entity x is a node, eah binary relation R(x, y) is an edge labeled R between x and y, and unary onepts C(x) are represented as an edge labeled isa between the node for the entity x and a node for the onept C. We present a trainable inferene method that learns to infer relations by ombining the results of different random walks through this graph, and show that the method ahieves good saling properties and robust inferene in a KB ontaining over 500,000 triples extrated from the web by the NELL system (Carlson et al., 2010). 1.1 The NELL Case Study To evaluate our approah experimentally, we study it in the ontext of the NELL (Never Ending Language Learning) researh projet, whih is an effort to develop a never-ending learning system that operates 24 hours per day, for years, to ontinuously improve its ability to read (extrat strutured fats from) the web (Carlson et al., 2010). NELL began operation in January As of Marh 2011, NELL has built a knowledge base ontaining several million andidate beliefs whih it has extrated from the web with varying onfidene. Among these,

2 NELL has fairly high onfidene in approximately half a million, whih we refer to as NELL s (onfident) beliefs. NELL has lower onfidene in a few million others, whih we refer to as its andidate beliefs. NELL is given as input an ontology that defines hundreds of ategories (e.g., person, beverage, athlete, sport) and two-plae typed relations among these ategories (e.g., atheleteplayssport( athlete, sport )), whih it must learn to extrat from the web. It is also provided a set of 10 to 20 positive seed examples of eah suh ategory and relation, along with a downloaded olletion of 500 million web pages from the ClueWeb2009 orpus (Callan and Hoy, 2009) as unlabeled data, and aess to 100,000 queries eah day to Google s searh engine. Eah day, NELL has two tasks: (1) to extrat additional beliefs from the web to populate its growing knowledge base (KB) with instanes of the ategories and relations in its ontology, and (2) to learn to perform task 1 better today than it ould yesterday. We an measure its learning ompetene by allowing it to onsider the same text douments today as it did yesterday, and reording whether it extrats more beliefs, more aurately today than it did yesterday. NELL uses a large-sale semi-supervised multitask learning algorithm that ouples the training of over 1500 different lassifiers and extration methods (see (Carlson et al., 2010)). Although many of the details of NELL s learning method are not entral to this paper, two points should be noted. First, NELL is a multistrategy learning system, with omponents that learn from different views of the data (Blum and Mithell, 1998): for instane, one view uses orthographi features of a potential entity name (like ontains apitalized words ), and another uses free-text ontexts in whih the noun phrase is found (e.g., X frequently follows the bigram mayor of ). Seond, NELL is a bootstrapping system, whih self-trains on its growing olletion of onfident beliefs. 1.2 Knowledge Base Inferene: Horn Clauses Although NELL has now grown a sizable knowledge base, its ability to perform inferene over this knowledge base is urrently very limited. At present its only inferene method beyond simple inheritane HinesWard Eli Manning AthletePlays ForTeam AthletePlays ForTeam Steelers Giants TeamPlays InLeague TeamPlays InLeague TeamPlays InLeague Figure 1: An example subgraph. NFL MLB involves applying first order Horn lause rules to infer new beliefs from urrent beliefs. For example, it may use a Horn lause suh as AthletePlaysForTeam(x, y) (1) TeamPlaysInLeague(y, z) AthletePlaysInLeague(x,z) to infer that AthletePlaysInLeague(HinesWard,NFL), if it has already extrated the beliefs in the preonditions of the rule, with variables x, y and z bound to HinesWard, PittsburghSteelers and NFL respetively as shown in Figure 1. NELL urrently has a set of approximately 600 suh rules, whih it has learned by data mining its knowledge base of beliefs. Eah learned rule arries a onditional probability that its onlusion will hold, given that its preonditions are satisfied. NELL learns these Horn lause rules using a variant of the FOIL algorithm (Quinlan and Cameron-Jones, 1993), heneforth N-FOIL. N-FOIL takes as input a set of positive and negative examples of a rule s onsequent (e.g., +AthletePlaysInLeague(HinesWard,NFL), AthletePlaysInLeague(HinesWard,NBA)), and uses a separate-and-onquer strategy to learn a set of Horn lauses that fit the data well. Eah Horn lause is learned by starting with a general rule and progressively speializing it, so that it still overs many positive examples but overs few negative examples. After a lause is learned, the examples overed by that lause are removed from the training set, and the proess repeats until no positive examples remain. Learning first-order Horn lauses is omputationally expensive not only is the searh spae large, but some Horn lauses an be ostly to evaluate (Cohen and Page, 1995). N-FOIL uses two triks to improve its salability. First, it assumes that the onsequent prediate is funtional e.g., that eah Athlete plays in at most one League. This means that expliit negative examples need not be

3 provided (Zelle et al., 1995): e.g., if AthletePlaysIn- League(HinesWard,NFL) is a positive example, then AthletePlaysInLeague(HinesWard,z ) for any other value of z is negative. In general, this onstraint guides the searh algorithm toward Horn lauses that have fewer possible instantiations, and hene are less expensive to math. Seond, N-FOIL uses relational pathfinding (Rihards and Mooney, 1992) to produe general rules i.e., the starting point for a lause of the form R(x, y) is found by looking at positive instanes R(a, b) of the onsequent, and finding a lause that orresponds to a bounded-length path of binary relations that link a to b. In the example above, a start lause might be the lause (1). As in FOIL, the lause is then (potentially) speialized by greedily adding additional onditions (like ProfessionalAthlete(x)) or by replaing variables with onstants (eg, replaing z with NFL). For eah N-FOIL rule, an estimated onditional probability ˆP (onlusion preonditions) is alulated using a Dirihlet prior aording to ˆP = (N + + m prior)/(n + + N + m) (2) where N + is the number of positive instanes mathed by this rule in the FOIL training data, N is the number of negative instanes mathed, m = 5 and prior = 0.5. As the results below show, N-FOIL generally learns a small number of high-preision inferene rules. One important role of these inferene rules is that they ontribute to the bootstrapping proedure, as inferenes made by N-FOIL inrease either the number of andidate beliefs, or (if the inferene is already a andidate) improve NELL s onfidene in andidate beliefs. 1.3 Knowledge Base Inferene: Graph Random Walks In this paper, we onsider an alternative approah, based on the Path Ranking Algorithm (PRA) of Lao and Cohen (2010b), desribed in detail below. PRA learns to rank graph nodes y relative to a query node x. PRA begins by enumerating a large set of bounded-length edge-labeled path types, similar to the initial lauses used in NELL s variant of FOIL. These path types are treated as ranking experts, eah performing a random walk through the graph, onstrained to follow that sequene of edge types, and ranking nodes y by their weights in the resulting distribution. Finally PRA ombines these experts using logisti regression. As an example, onsider a path from x to y via the sequene of edge types isa, isa 1 (the inverse of isa), and AthletePlaysInLeague, whih orresponds to the Horn lause isa(x, ) isa 1 (, x ) (3) AthletePlaysInLeague(x, y) AthletePlaysInLeague(x, y) Suppose a random walk starts at a query node x (say x=hinesward). If HinesWard is linked to the single onept node ProfessionalAthlete via isa, the walk will reah that node with probability 1 after one step. If A is the set of ProfessionalAthlete s in the KB, then after two steps, the walk will have probability 1/ A of being at any x A. If L is the set of athleti leagues and l L, let A l be the set of athletes in league l: after three steps, the walk will have probability A l / A of being at any point y L. In short, the ranking assoiated with this path gives the prior probability of a value y being an athleti league for x whih is useful as a feature in a ombined ranking method, although not by itself a high-preision inferene rule. Note that the rankings produed by this expert will hange as the knowledge base evolves for instane, if the system learns about proportionally more soer players than hokey players over time, then the league rankings for the path of lause (3) will hange. Also, the ranking is speifi to the query node x. For instane, suppose the KB ontains fats whih reflet the ambiguity of the team name Giants 1 as in Figure 1. Then the path for lause (1) above will give lower weight to y = NFL for x = EliManning than to y = NFL for x = HinesWard. The main ontribution of this paper is to introdue and evaluate PRA as an algorithm for making probabilisti inferene in large KBs. Compared to Horn lause inferene, the key harateristis of this new inferene method are as follows: The evidene in support of inferring a relation instane R(a, b) is based on many existing paths between a and b in the urrent KB, ombined using a learned logisti funtion. 1 San Franiso s Major-League Baseball and New York s National Football League teams are both alled the Giants.

4 The onfidene in an inferene is sensitive to the urrent state of the knowledge base, and the speifi entities being queried (sine the paths used in the inferene have these properties). Experimentally, the inferene method yields many more moderately-onfident inferenes than the Horn lauses learned by N-FOIL. The learning and inferene are more effiient than N-FOIL, in part beause we an exploit effiient approximation shemes for random walks (Lao and Cohen, 2010a). The resulting inferene is as fast as 10 milliseonds per query on average. The Path Ranking Algorithm (PRA) we use is similar to that desribed elsewhere (Lao and Cohen, 2010b), exept that to ahieve effiient model learning, the paths between a and b are determined by the statistis from a population of training queries rather than enumerated ompletely. PRA uses random walks to generate relational features on graph data, and ombine them with a logisti regression model. Compared to other relational models (e.g. FOIL, Markov Logi Networks), PRA is extremely effiient at link predition or retrieval tasks, in whih we are interested in identifying top links from a large number of andidates, instead of fousing on a partiular node pair or joint inferenes. 1.4 Related Work The TextRunner system (Cafarella et al., 2006) answers list queries on a large knowledge base produed by open domain information extration. Spreading ativation is used to measure the loseness of any node to the query term nodes. This approah is similar to the random walk with restart approah whih is used as a baseline in our experiment. The FatRank system (Jain and Pantel, 2010) ompares different ways of onstruting random walks, and ombining them with extration sores. However, the shortoming of both approahes is that they ignore edge type information, whih is important for ahieving high auray preditions. The HOLMES system (Shoenmakers et al., 2008) derives new assertions using a few manually written inferene rules. A Markov network orresponding to the grounding of these rules to the knowledge base is onstruted for eah query, and then belief propagation is used for inferene. In omparison, our proposed approah disovers inferene rules automatially from training data. Below we desribe our approah in greater detail, provide experimental evidene of its value for performing inferene in NELL s knowledge base, and disuss impliations of this work and diretions for future researh. 2 Approah In this setion, we first desribe how we formulate link (relation) predition on a knowledge base as a ranking task. Then we review the Path Ranking Algorithm (PRA) introdued by Lao and Cohen (2010b; 2010a). After that, we desribe two improvement to the PRA method so that it is more suitable for the task of link predition on knowledge base. The first improvement helps PRA deal with the large number of relations whih is typial of large knowledge bases. The seond improvement aims at improving the quality of inferene by applying low variane sampling. 2.1 Learning with NELL s Knowledge Base For eah relation R in the knowledge base we train a model for the link predition task: given a onept x, find all other onepts y whih potentially have the relation R(x, y). This predition is made based on an existing knowledge base extrated imperfetly from the web. Although a model an potentially benefit from prediting multiple relations jointly, suh joint inferene is beyond the sope of this work. To ensure reasonable number of training instanes, we generate labeled training queries from 48 relations whih have more than 100 instanes in the knowledge base. We reate two tasks for eah relation i.e., prediting y given x and prediting x given y yielding 96 tasks in all. Eah node x whih has relation R in the knowledge base with any other node is treated as a training query, the atual nodes y in the knowledge base known to satisfy R(x, y) are treated as labeled positive examples, and any other nodes are treated as negative examples.

5 2.2 Path Ranking Algorithm Review We now review the Path Ranking Algorithm introdued by Lao and Cohen (2010b). A relation path P is defined as a sequene of relations R 1... R l, and in order to emphasize the types assoiated with eah step, P an also be written as R T 1 R 0... l Tl, where T i = range(r i ) = domain(r i+1 ), and we also define domain(p ) T 0, range(p ) T l. In the experiments in this paper, there is only one type of nodes alled onept, whih an be onneted through different types of relations. In this notation, relations like the team ertain player plays for, and the league ertain player s team is in an be expressed by the paths below (respetively): P 1 : onept AtheletePlayesForTeam onept P 2 : onept AtheletePlayesForTeam onept TeamPlaysInLeagure onept For any relation path P = R 1... R l and a seed node s domain(p ), a path onstrained random walk defines a distribution h s,p reursively as follows. If P is the empty path, then define { 1, if e = s h s,p (e) = (4) 0, otherwise If P = R 1... R l is nonempty, then let P = R 1... R l 1, and define h s,p (e) = h s,p (e ) P (e e ; R l ), (5) e range(p ) where P (e e ; R l ) = R l(e,e) R l (e, ) is the probability of reahing node e from node e with a one step random walk with edge type R l. R(e, e) indiates whether there exists an edge with type R that onnet e to e. More generally, given a set of paths P 1,..., P n, one ould treat eah h s,pi (e) as a path feature for the node e, and rank nodes by a linear model θ 1 h s,p1 (e) + θ 2 h s,p2 (e) +... θ n h s,pn (e) where θ i are appropriate weights for the paths. This gives a ranking of nodes e related to the query node s by the following soring funtion sore(e; s) = P P l h s,p (e)θ P, (6) where P l is the set of relation paths with length l. Given a relation R and a set of node pairs {(s i, t i )}, we an onstrut a training dataset D = {(x i, r i )}, where x i is a vetor of all the path features for the pair (s i, t i ) i.e., the j-th omponent of x i is h si,p j (t i ), and r i indiates whether R(s i, t i ) is true. Parameter θ is estimated by maximizing the following regularized objetive funtion O(θ) = i o i (θ) λ 1 θ 1 λ 2 θ 2, (7) where λ 1 ontrols L 1 -regularization to help struture seletion, and λ 2 ontrols L 2 -regularization to prevent overfitting. o i (θ) is a per-instane objetive funtion, whih is defined as o i (θ) = w i [r i ln p i + (1 y i ) ln(1 p i )], (8) where p i is the predited relevane defined as p i = p(r i = 1 x i ; θ) = exp(θt x i ) 1+exp(θ T x i ), and w i is an importane weight to eah example. A biased sampling proedure selets only a small subset of negative samples to be inluded in the objetive (see (Lao and Cohen, 2010b) for detail). 2.3 Data-Driven Path Finding In prior work with PRA, P l was defined as all relation paths of length at most l. When the number of edge types is small, one an generate P l by enumeration; however, for domains with a large number of relations (e.g., a knowledge base), it is impratial to enumerate all possible relation paths even for small l. For instane, if the number of relations related to eah node is 100, even the number of length three paths easily reahes millions. For other domains like parsed natural language sentenes, useful paths an be as long as ten relations (Minkov and Cohen, 2008). In this ase, even with smaller number of possible relations, the total number of relation path is still too large for systemati enumeration. In order to apply PRA to these domains, we modify the path generation proedure in PRA to produe only paths whih are potentially useful for the task. Define a query s to be supporting a path P if h s,p (e) 0 for any entity e. We require that any node reated during path finding needs to

6 Table 1: Number of paths in PRA models of maximum path length 3 and 4. Averaged over 96 tasks. l=3 l=4 all paths up to length L 15, 376 1, 906, 624 +query support α = ever reah a target entity L 1 regularization be supported by at least a fration α of the training queries s i, as well as being of length no more than l (In the experiments, we set α = 0.01) We also require that a path inluded in the PRA model only if it retrieves at least one target entity t i in the training set. As we an see from Table 1, together these two onstraints dramatially redue the number of paths that need to be onsidered, relative to systematially enumerating all possible paths. L1 regularization redues the size of the model even more. The idea of finding paths that onnets nodes in a graph is not new. It has been embodied previously in first-order learning systems (Rihards and Mooney, 1992) as well as N-FOIL, and relational database searhing systems (Bhalotia et al., 2002). These approahes onsider a single query during path finding. In omparison, the data-driven path finding method we desribed here uses statistis from a population of queries, and therefore an potentially determine the importane of a path more reliably. 2.4 Low-Variane Sampling Lao and Cohen (2010a) previously showed that sampling tehniques like finger printing and partile filtering an signifiantly speedup random walk without sarifiing retrieval quality. However, the sampling proedures an indue a loss of diversity in the partile population. For example, onsider a node in the graph with just two out links with equal weights, and suppose we are required to generate two walkers starting from this node. A disappointing result is that with 50 perent hane both walkers will follow the same branh, and leave the other branh with no probability mass. To overome this problem, we apply a tehnique alled Low-Variane Sampling (LVS) (Thrun et al., 2005), whih is ommonly used in robotis to improve the quality of sampling. Instead of generating independent samples from a distribution, LVS uses a single random number to generate all Table 2: Comparing PRA with RWR models. MRRs and training times are averaged over 96 tasks. l=2 l=3 MRR Time MRR Time RWR(no train) RWR s s PRA s s samples, whih are evenly distributed aross the whole distribution. Note that given a distribution P (x), any number r in [0, 1] points to exatly one x value, namely x = arg min j m=1..j P (m) r. Suppose we want to generate M samples from P (x). LVS first generates a random number r in the interval [0, M 1 ]. Then LVS repeatedly adds the fixed amount M 1 to r and hooses x values orresponding to the resulting numbers. 3 Results This setion reports empirial results of applying random walk inferene to NELL s knowledge base after the 165th iteration of its learning proess. We first investigate PRA s behavior by ross validation on the training queries. Then we ompare PRA and N-FOIL s ability to reliably infer new beliefs, by leveraging the Amazon Mehanial Turk servie. 3.1 Cross Validation on the Training Queries Random Walk with Restart (RWR) (also alled personalized PageRank (Haveliwala, 2002)) is a general-purpose graph proximity measure whih has been shown to be fairly suessful for many types of tasks. We ompare PRA to two versions of RWR on the 96 tasks of link predition with NELL s knowledge base. The two baseline methods are an untrained RWR model and a trained RWR model as desribed by Lao and Cohen (2010b). (In brief, in the trained RWR model, the walker will probabilistially prefer to follow edges assoiated with different labels, where the weight for eah edge label is hosen to minimize a loss funtion, suh as Equation 7. In the untrained model, edge weights are uniform.) We explored a range of values for the regularization parameters L 1 and L 2 using ross validation on the training data, and we fix both L 1 and L 2 parameters to for all tasks. The

7 MRR k 10k Exat 10k Independent Fingerprinting Low Variane Fingerprinting Independent Filtering Low Variane Filtering Random Walk Speedup Figure 2: Compare inferene speed and quality over 96 tasks. The speedup is relative to exat inferene, whih is on average 23ms per query. 1k 100 maximum path length is fixed to 3. 2 Table 2 ompares the three methods using 5-fold ross validation and the Mean Reiproal Rank (MRR) 3 measure, whih is defined as the inverse rank of the highest ranked relevant result in a set of results. If the the first returned result is relevant, then MRR is 1.0, otherwise, it is smaller than 1.0. Supervised training an signifiantly improve retrieval quality (p-value= omparing untrained and trained RWR), and leveraging path information an produe further improvement (p-value= omparing trained RWR with PRA). The average training time for a prediate is only a few seonds. We also investigate the effet of low-variane sampling on the quality of predition. Figure 2 ompares independent and low variane sampling when applied to finger printing and partile filtering (Lao and Cohen, 2010a). The horizontal axis orresponds to the speedup of random walk ompared with exat inferene, and the vertial axis measures the quality of predition by MRR with three fold ross validation on the training query set. Low-variane sampling an improve predition for both finger 2 Results with maximum length 4 are not reported here. Generally models with length 4 paths produe slightly better results, but are 4-5 times slower to train 3 For a set of queries Q, MRR = 1 Q q Q 1k 1 rank of the first orret answer for q printing and partile filtering. The numbers on the urves indiate the number of partiles (or walkers). When using a large number of partiles, the partile filtering methods onverge to the exat inferene. Interestingly, when using a large number of walkers, the finger printing methods produe even better predition quality than exat inferene. Lao and Cohen notied a similar improvement on retrieval tasks, and onjetured that it is beause the sampling inferene imposes a regularization penalty on longer relation paths (2010a). 3.2 Evaluation by Mehanial Turk The ross-validation result above assumes that the knowledge base is omplete and orret, whih we know to be untrue. To aurately ompare PRA and N-FOIL s ability to reliably infer new beliefs from an imperfet knowledge base, we use human assessments obtained from Amazon Mehanial Turk. To limit labeling osts, and sine our goal is to improve the performane of NELL, we do not inlude RWR-based approahes in this omparison. Among all the 24 funtional prediates, N-FOIL disovers onfident rules for 8 of them (it produes no result for the other 16 prediates). Therefore, we ompare the quality of PRA to N-FOIL on these 8 prediates only. Among all the 72 non-funtional prediates whih N-FOIL annot be applied to PRA exhibits a wide range of performane in ross-validation. The are 43 tasks for whih PRA obtains MRR higher than 0.4 and builds a model with more than 10 path features. We randomly sampled 8 of these prediates to be evaluated by Amazon Mehanial Turk. Table 3 shows the top two weighted PRA features for eah task on whih N-FOIL an suessfully learn rules. These PRA rules an be ategorized into broad overage rules whih behave like priors over orret answers (e.g. 1-2, 4-6, 15), aurate rules whih leverage speifi relation sequenes (e.g. 9, 11, 14), rules whih leverage information about the synonyms of the query node (e.g. 7-8, 10, 12), and rules whih leverage information from a loal neighborhood of the query node (e.g. 3, 12-13, 16). The synonym paths are useful, beause an entity may have multiple names on the web. We find that all 17 general rules (no speialization) learned by N-FOIL an be expressed as length two relation

8 Table 3: The top two weighted PRA paths for tasks on whih N-FOIL disovers onfident rules. stands for onept. ID athleteplaysforteam leagueplayers PRA Path (Comment) athleteplaysforteam 1 (teams with many players in the athlete s league) 2 leagueteams teamagainstteam (teams that play against many teams in the athlete s league) 3 athleteplayssport players (the league that players of a ertain sport belong to) 4 isa isa 1 athleteplayssport 5 isa isa 1 6 stadiumloatedincity 7 stadiumhometeam 8 latitudelongitude teamhomestadium 9 teamplaysincity 10 teammember teamplaysincity 11 teamhomestadium (popular leagues with many players) athleteplayssport (popular sports of all the athletes) superpartoforganization teamhomestadium teamplayssport (popular sports of a ertain league) stadiumloatedincity (ity of the stadium with the same team) latitudelongitudeof stadiumloatedincity (ity of the stadium with the same loation) itystadiums (stadiums loated in the same ity with the query team) athleteplaysforteam stadiumloatedincity (ity of the team s home stadium) 12 teamhomestadium stadiumhometeam teamplaysinleague 13 teamplayssport players teamplaysagainstteam 14 teamplayssport 15 isa isa 1 16 teamplaysinleague teamplayssport teamhomestadium (home stadium of teams whih share players with the query) teamplaysincity (ity of teams with the same home stadium as the query) (the league that the query team s members belong to) teamplaysinleague (the league that the query team s ompeting team belongs to) (sports played by many teams) leagueteams teamplayssport (the sport played by other teams in the league) Table 4: Amazon Mehanial Turk evaluation for the promoted knowledge. Using paired t-test at task level, PRA is not statistially different than N-FOIL for p@10 (p-value=0.3), but is signifiantly better for p@100 (p-value=0.003) PRA N-FOIL Task P majority #Paths p@10 p@100 p@1000 #Rules #Query p@10 p@100 p@1000 athleteplaysforteam (+1) (+30) athleteplayssport (+30) stadiumloatedincity (+0) teamhomestadium (+0) teamplaysincity (+0) teamplaysinleague (+151) teamplayssport (+86) average teammember ompaniesheadquarteredin publiationjournalist produedby N-FOIL does not produe results ompeteswith for non-funtional prediates hasoffieincity teamwontrophy worksfor average

9 Table 5: Compare Mehanial Turk workers voted assessments with our gold standard labels based on 100 samples. AMT=F AMT=T Gold=F 25% 15% Gold=T 11% 49% paths suh as path 11. In omparison, PRA explores a feature spae with many length three paths. For eah relation R to be evaluated, we generate test queries s whih belong to domain(r). Queries whih appear in the training set are exluded. For eah query node s, we applied a trained model (either PRA or N-FOIL) to generate a ranked list of andidate t nodes. For PRA, the andidates are sorted by their sores as in Eq. (6). For N-FOIL, the andidates are sorted by the estimated auraies of the rules as in Eq. (2) (whih generate the andidates). Sine there are about 7 thousand (and 13 thousand) test queries s for eah funtional (and non-funtional) prediate R, and there are (potentially) thousands of andidates t returned for eah query s, we annot evaluate all andidates of all queries. Therefore, we first sort the queries s for eah prediate R by the sores of their top ranked andidate t in desending order, and then alulate preisions at top 10, 100 and 1000 positions for the 1 ),..., where list of result R(s R,1, t R,1 1 ), R(s R,2, t R,2 s R,1 is the first query for prediate R, t R,1 1 is its first andidate, s R,2 is the seond query for prediate R, t R,2 1 is its first andidate, so on and so forth. To redue the labeling load, we judge all top 10 queries for eah prediate, but randomly sample 50 out of the top 100, and randomly sample 50 out of the top Eah belief is evaluated by 5 workers at Mehanial Turk, who are given assertions like Hines Ward plays for the team Steelers, as well as Google searh links for eah entity, and the ombination of both entities. Statistis shows that the workers spend on average 25 seonds to judge eah belief. We also remove some workers judgments whih are obviously inorret 4. We sampled 100 beliefs, and ompared their voted result to gold-standard labels produed by one author of this paper. Table 5 shows that 74% of the time the workers voted result agrees with our judgement. 4 Certain workers label all the questions with the same answer Table 4 shows the evaluation result. The P majority olumn shows for eah prediate the auray ahieved by the majority predition: given a query R(x,?), predit the y that most often satisfies R over all possible x in the knowledge base. Thus, the higher P majority is, the simpler the task. Prediting the funtional prediates is generally easier prediting the non-funtional prediates. The #Query olumn shows the number of queries on whih N-FOIL is able to math any of its rules, and hene produe a andidate belief. For most prediates, N-FOIL is only able to produe results for at most a few hundred queries. In omparison, PRA is able to produe results for 6,599 queries on average for eah funtional prediate, and 12,519 queries on average for eah non-funtional prediate. Although the preision at 10 (p@10) of N-FOIL is omparable to that of PRA, preision at 10 and at 1000 (p@100 and p@1000) are muh lower 5. The #Path olumn shows the number of paths learned by PRA, and the #Rule olumn shows the number of rules learned by N-FOIL, with the numbers before brakets orrespond to unspeialized rules, and the numbers in brakets orrespond to speialized rules. Generally, speialized rules have muh smaller reall than unspeialized rules. Therefore, the PRA approah ahieves high reall partially by ombining a large number of unspeialized paths, whih orrespond to unspeialized rules. However, learning more aurate speialized paths is part of our future work. A signifiant advantage of PRA over N-FOIL is that it an be applied to non-funtional prediates. The last eight rows of Table 4 show PRA s performane on eight of these prediates. Compared to the result on funtional prediates, preisions at 10 and at 100 of non-funtional prediates are slightly lower, but preisions at 1000 are omparable. We note that for some prediates preision at 1000 is better than at 100. After some investigation we found that for many relations, the top portion of the result list is more diverse: i.e. showing produts produed by different ompanies, journalist working at different publiations. 5 If a method makes k preditions, and k < n, then p@n is the number orret out of the k preditions, divided by n

10 While the lower half of the result list is more homogeneous: i.e. showing relations onentrated on one or two ompanies/publiations. On the other hand, through the proess of labeling the Mehanial Turk workers seem to build up a prior about whih ompany/publiation is likely to have orret beliefs, and their judgments are positively biased towards these ompanies/publiations. These two fators ombined together result in positive bias towards the lower portion of the result list. In future work we hope to design a labeling strategy whih avoids this bias. 4 Conlusions and Future Work We have shown that a soft inferene proedure based on a ombination of onstrained, weighted, random walks through the knowledge base graph an be used to reliably infer new beliefs for the knowledge base. We applied this approah to a knowledge base of approximately 500,000 beliefs extrated imperfetly from the web by NELL. This new system improves signifiantly over NELL s earlier Horn-lause learning and inferene method: it obtains nearly double the preision at rank 100. The inferene and learning are both very effiient our experiment shows that the inferene time is as fast as 10 milliseonds per query on average, and the training for a prediate takes only a few seonds. There are some prominent diretions for future work. First, inferene starting from both the query nodes and target nodes (Rihards and Mooney, 1992) an be muh more effiient in disovering long paths than just inferene from the query nodes. Seond, inferene starting from the target nodes of training queries is a potential way to disover speialized paths (with grounded nodes). Third, generalizing inferene paths to inferene trees or graphs an produe more expressive random walk inferene models. Overall, we believe that random walk is a promising way to sale up relational learning to domains with very large data sets. Aknowledgments This work was supported by NIH under grant R01GM081293, by NSF under grant IIS , by DARPA under awards FA and AF C-0179, and by a gift from Google. We thank Geoffrey J. Gordon for the suggestion of applying low variane sampling to random walk inferene. We also thank Bryan Kisiel for help with the NELL system. Referenes Eugene Agihtein and Luis Gravano Snowball: extrating relations from large plain-text olletions. In Proeedings of the fifth ACM onferene on Digital libraries - DL 00, pages 85 94, New York, New York, USA. ACM Press. Mihele Banko, Mihael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni Open Information Extration from the Web. In IJCAI, pages Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan Keyword searhing and browsing in databases using banks. ICDE, pages Avrim Blum and Tom Mithell Combining labeled and unlabeled data with o-training. In Proeedings of the eleventh annual onferene on Computational learning theory - COLT 98, pages , New York, New York, USA. ACM Press. MJ Cafarella, M Banko, and O Etzioni Relational Web Searh. In WWW. Jamie Callan and Mark Hoy Clueweb09 data set. Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hrushka Jr., and Tom M. Mithell Toward an Arhiteture for Never-Ending Language Learning. In AAAI. William W. Cohen and David Page Polynomial learnability and indutive logi programming: Methods and results. New Generation Comput., 13(3&4): Oren Etzioni, Mihael Cafarella, Doug Downey, Anamaria Popesu, Tal Shaked, Stephen Soderl, Daniel S. Weld, and Er Yates Unsupervised namedentity extration from the web: An experimental study. Artifiial Intelligene, 165: Taher H. Haveliwala Topi-sensitive pagerank. In WWW, pages Alpa Jain and Patrik Pantel Fatrank: Random walks on a web of fats. In COLING, pages Ni Lao and William W. Cohen. 2010a. Fast query exeution for retrieval models based on path-onstrained random walks. KDD. Ni Lao and William W. Cohen. 2010b. Relational retrieval using a ombination of path-onstrained random walks. Mahine Learning.

11 Einat Minkov and William W. Cohen Learning graph walk based similarity measures for parsed text. EMNLP, pages Patrik Pantel and Maro Pennahiotti Espresso: Leveraging Generi Patterns for Automatially Harvesting Semanti Relations. In ACL. J. Ross Quinlan and R. Mike Cameron-Jones FOIL: A Midterm Report. In ECML, pages Bradley L. Rihards and Raymond J. Mooney Learning relations by pathfinding. In Proeedings of the Tenth National Conferene on Artifiial Intelligene (AAAI-92), pages 50 55, San Jose, CA, July. Matthew Rihardson and Pedro Domingos Markov logi networks. Mahine Learning. Stefan Shoenmakers, Oren Etzioni, and Daniel S. Weld Saling Textual Inferene to the Web. In EMNLP, pages Rion Snow, Daniel Jurafsky, and Andrew Y. Ng Semanti Taxonomy Indution from Heterogenous Evidene. In ACL. Sebastian Thrun, Wolfram Burgard, and Dieter Fox Probabilisti Robotis (Intelligent Robotis and Autonomous Agents). The MIT Press. Alexander Yates, Mihele Banko, Matthew Broadhead, Mihael J. Cafarella, Oren Etzioni, and Stephen Soderland TextRunner: Open Information Extration on the Web. In HLT-NAACL (Demonstrations), pages John M. Zelle, Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney Induing logi programs without expliit negative examples. In Proeedings of the Fifth International Workshop on Indutive Logi Programming (ILP-95), pages , Leuven, Belgium.