Second Kyoto Workshop, January 2011, Gifu, Japan
Search and You Shall Find - and Teach Us All
Marius Paşca
Google Inc.
mars@google.com

Unweaving the World Wide Web of Facts
  The Web is a repository of implicitly-encoded human knowledge
    some text fragments contain easier-to-extract knowledge
  More knowledge leads to better answers
    acquire facts from a fraction of the knowledge on the Web
    exploit available facts during search
  Open-domain information extraction
    extract knowledge (facts, relations) applicable to a wide range, rather than closed, pre-defined set of domains (e.g., medical, financial etc.)
    no need to specify set of concepts and relations of interest in advance
    rely on as little manually-created input data as possible

Instances, Classes and Attributes
  A concept (class) is a placeholder for a set of instances (objects) that share similar properties
    set of instances
      {matrix, kill bill, ice age, pulp fiction, cidade de deus,...}
    class label
      movies, films
    definition
      a series of pictures projected on a screen in rapid succession with objects shown in successive positions slightly changed so as to produce the optical effect of a continuous picture in which the objects move (Merriam Webster)
      a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement (WordNet)

Instances, Classes and Attributes
  Attributes capture the types of facts that are relevant for a given instance or class
    relevant properties extracted from a text collection for a given class (e.g., stealth factor and top speed for SportsCar, or bestselling album and drummer for MusicBand, or author and genre for Book)
    as an alternative to manually pre-specifying relevant relations of a class (e.g., Currency-CurrencyOf-Country, or City-BirthPlaceOf-Actor)
  Applications
    augment results of search queries (zr1, black eyed peas, la sombra del viento) with class attributes and/or facts
    structured-search interfaces
    semantic query refinements
    acquisition of knowledge resources from text

Sources of Open-Domain Information
  Human-compiled knowledge resources
    resources created by experts
    resources created collaboratively by non-experts
  Sources of textual data
    semi-structured text
    unstructured text

Expert Resources: Cyc
  Collections and individuals
    collections correspond to classes (concepts)
    individuals correspond to instances
    collections have instances; individuals cannot have instances
  Attributes
    for individuals, capture properties and values

Non-Expert Resources: Wikipedia
  Wikipedia infobox
  Wikipedia article

Documents
  Unstructured text
  Semi-structured text

Documents
  Semi-structured text
  Semi-structured text

Beyond Documents
  Search: what is the weather like in in march

Characteristics of Documents vs. Queries
                         Data Source
  Characteristic         Document Sentences     Queries
  Type of medium         text                   text
  Purpose                convey info.           request info.
  Available context      surrounding text       self-contained
  Average quality        high (varies)          low
  Grammatical style      natural language       bag of keywords
  Average length         25 words or more       2-3 words

Extraction from Queries: Instances
  Input
    target classes, available as small sets of seed instances
      e.g., {phentermine, viagra, vicodin, vioxx, xanax} for Drug
  Data source
    anonymized search queries along with frequencies
  Output
    ranked (longer) lists of instances, one per class
      e.g., [viagra, phentermine, vicodin, xanax, vioxx, ambien, adderall, hydrocone, oxycontin, cialis, valium, lexapro, ritalin,...] for Drug

Instance Extraction
  Identify queries that contain a seed instance
    {phentermine, viagra, vicodin, vioxx, xanax} for Drug
  Sample queries
    side effects of generic birth control pills
    can low blood pressure make you tired
    prescription vicodin online
    long term xanax use
    propanolol and vicodin interaction
    causes of low blood pressure during pregnancy
    taking lipitor during pregnancy
    buy xanax in uk
    taking beta blockers during pregnancy
    how oxycontin works
    long term lamictal use
    propanolol and lamictal interaction
    is xanax habit forming
    can lamictal be crushed
    how does lipitor work
    effect of beta blockers on exercise
    does vioxx cause weight gain
    can phentermine make you tired
    buy vioxx in uk
    side effects of viagra pills
    long term vioxx use
    prescription lamictal online
    long term xanax use
    what effects does low blood pressure have
    buy lipitor online

Instance Extraction
  Collect query templates
    prefix and postfix around instance match
      prefix         postfix
      [long term]    [use]
      [buy]          [in uk]
      [can]          [make you tired]
  (same sample queries as on the previous slide)

Instance Extraction
  Identify queries that match the query templates
    collect and rank large pool of candidate instances
      prefix         postfix
      [long term]    [use]
      [buy]          [in uk]
      [can]          [make you tired]
  Candidate instances found in the sample queries
    phentermine, xanax, lamictal, vioxx, low blood pressure
  (same sample queries as on the previous slides)

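The three steps above (find seed matches, collect prefix/postfix templates, re-apply the templates to gather candidates) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the talk: the function names are hypothetical, seeds are assumed to be single tokens, the query list is a subset of the slide's sample, and the ranking criterion (number of distinct templates a candidate fills) is an assumption, since the slides do not say how candidates are ranked.

```python
from collections import defaultdict

# Seed instances for the class Drug, as given on the slide.
SEEDS = {"phentermine", "viagra", "vicodin", "vioxx", "xanax"}

# A subset of the sample queries shown on the slide.
QUERIES = [
    "can low blood pressure make you tired",
    "prescription vicodin online",
    "long term xanax use",
    "buy xanax in uk",
    "long term lamictal use",
    "can phentermine make you tired",
    "buy vioxx in uk",
    "long term vioxx use",
    "prescription lamictal online",
    "buy lipitor online",
    "how oxycontin works",
    "taking lipitor during pregnancy",
]


def collect_templates(queries, seeds):
    """Collect (prefix, postfix) token tuples around each single-token seed match."""
    templates = set()
    for query in queries:
        tokens = query.split()
        for i, token in enumerate(tokens):
            if token in seeds:
                templates.add((tuple(tokens[:i]), tuple(tokens[i + 1:])))
    return templates


def collect_candidates(queries, templates):
    """Map each phrase that fills a template to the set of templates it fills."""
    filled = defaultdict(set)
    for query in queries:
        tokens = tuple(query.split())
        for prefix, postfix in templates:
            end = len(tokens) - len(postfix)
            if end <= len(prefix):
                continue  # no room for a candidate between prefix and postfix
            if tokens[:len(prefix)] == prefix and tokens[end:] == postfix:
                phrase = " ".join(tokens[len(prefix):end])
                filled[phrase].add((prefix, postfix))
    return filled


templates = collect_templates(QUERIES, SEEDS)
candidates = collect_candidates(QUERIES, templates)

# Assumed ranking: candidates that fill more distinct templates rank higher.
scores = {phrase: len(matched) for phrase, matched in candidates.items()}
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for phrase, score in ranked:
    print(score, phrase)
```

On this small sample the candidate pool contains the seeds themselves plus lamictal and low blood pressure, which matches the pool shown on the slide and also shows why ranking matters: templates such as [can] [make you tired] admit phrases that are not drugs.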
Output Instances
  Class         Top Extracted Instances
  Newspaper     [new york times, le monde, washington post, usa today, wall street journal, ny times, chicago tribune, boston globe, toronto star,...]
  Person        [leonardo da vinci, rembrandt, andy warhol, pablo picasso, vincent van gogh, salvador dali, van gogh, frida kahlo, picasso,...]
  University    [university of chicago, stanford university, universty of texas at austin, columbia university, university of pennsylvania,...]
  VideoGame     [grand theft auto, warcraft, need for speed, quake, super maro bros., gta, world of warcraft, doom, need for speed underground,...]

Extraction from Queries: Attributes
  Input
    target classes, available as sets of representative instances
      e.g., {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota, Washington Mutual, Delta, Reuters, Target,...} for Company
    small sets of seed attributes, one per class
      e.g., {headquarters, stock price, ceo, location, chairman} for Company
  Data source
    anonymized search queries along with frequencies
  Output
    ranked lists of attributes, one per class
      e.g., {headquarters, mission statement, stock price, ceo, cio, code of conduct, stock symbol, organizational structure, corporate address,...} for Company

Class Attribute Extraction
  Target classes
    Company: {Delphi, Apple Computer, Honda, Oracle, Coca Cola, Toyota, Washington Mutual, Delta, Reuters, Target,...}
  Seed attributes
    Company: {headquarters, stock price, ceo, location, chairman}
  Pool of candidate attributes
    Company: {installing, stock price, accord, headquarters, mission statement,...}
  Query logs
    [figure: sample queries that mention Company instances; the query text is scrambled in the extracted slide and is not reproduced here]
  Search-signature vectors (one per candidate attribute), entries written as [prefix] [infix] [postfix]
    Company: installing
      [ ] [ ] [cressida water pump]
      [ ] [ ] [8.1-7 on solaris 8]
    Company: stock price; Company: accord; Company: headquarters; Company: mission statement
    (the assignment of the entries below to these four attributes is not recoverable from the slide layout)
      [ ] [company one year] [target]
      [ ] [ ] [1989 sei]
      [where is the world] [for] [corporation]
      [ ] [for the] [corporation]
      [ ] [air lines] [history]
      [new] [ ] [ ]
      [ ] [new] [impact]
      [ ] [for] [airlines]
  Reference search-signature vectors (one per class)
    Company
  Ranked list of extracted class attributes
    Company: {headquarters, mission statement, stock price, ceo, code of conduct, stock symbol, organizational structure, corporate address, cio,...}

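The pipeline on this slide (build one search-signature vector per candidate attribute, build a reference vector from the seed attributes, rank candidates against the reference) can be illustrated with a small Python sketch. Everything below is an assumption made for illustration: the function names are hypothetical, the queries are invented examples loosely modeled on the slide rather than the actual log entries, the seed set is cut down to two attributes, and cosine similarity stands in for the similarity measure, which the slide does not name.

```python
import math
from collections import Counter, defaultdict

# Illustrative inputs (assumptions, not the data behind the slide).
INSTANCES = ["delphi", "honda", "oracle", "toyota"]
SEED_ATTRIBUTES = ["headquarters", "stock price"]
CANDIDATES = ["headquarters", "stock price", "mission statement", "installing", "accord"]
QUERIES = [
    "where is the world headquarters for delphi corporation",
    "headquarters for oracle corporation",
    "stock price for oracle corporation",
    "toyota stock price",
    "mission statement for honda corporation",
    "toyota mission statement",
    "installing toyota cressida water pump",
    "honda accord 1989 sei",
    "new honda accord",
]


def find_span(tokens, phrase):
    """Return (start, end) of the first occurrence of phrase in tokens, or None."""
    words = phrase.split()
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i, i + len(words)
    return None


def signature(query, attribute, instances):
    """Return the (prefix, infix, postfix) around an attribute/instance pair, or None."""
    tokens = query.split()
    attr_span = find_span(tokens, attribute)
    if attr_span is None:
        return None
    for instance in instances:
        inst_span = find_span(tokens, instance)
        if inst_span is None:
            continue
        first, second = sorted([attr_span, inst_span])
        if first[1] > second[0]:
            continue  # attribute and instance overlap
        return (
            " ".join(tokens[:first[0]]),
            " ".join(tokens[first[1]:second[0]]),
            " ".join(tokens[second[1]:]),
        )
    return None


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# One search-signature vector per candidate attribute.
vectors = defaultdict(Counter)
for query in QUERIES:
    for attribute in CANDIDATES:
        sig = signature(query, attribute, INSTANCES)
        if sig is not None:
            vectors[attribute][sig] += 1

# Reference vector for the class: sum of the seed attributes' vectors.
reference = Counter()
for attribute in SEED_ATTRIBUTES:
    reference.update(vectors[attribute])

scores = {attribute: cosine(vectors[attribute], reference) for attribute in CANDIDATES}
ranked = sorted(CANDIDATES, key=lambda a: (-scores[a], a))
print(ranked)
```

In this toy setting, mission statement shares query contexts with the seed attributes and scores well, while installing and accord, which co-occur with company names but in unrelated contexts, score zero. That is the intuition behind comparing candidates to a reference vector instead of counting co-occurrences alone.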
Output Attributes
  Class             Top Extracted Attributes
  AircraftModel     [weight, length, history, fuel consumption, interior photos, specifications, photographs, interior pictures, seating arrangement, flight deck,...]
  CarModel          [transmission, top speed, acceleration, transmission problems, owners manual, gas mileage, towing capacity, stalling, maintenance schedule, performance parts,...]
  CellPhoneModel    [features, battery life, retail price, mobile review, specification, price list, functions, ratings, tips, tricks,...]
  Wine              [vintage, color, cost, style, taste, vintage chart, pronunciation, shelf life, wine ratings, wine reviews,...]

Conclusion
  If knowledge is generally prominent or relevant, people will (eventually) search for it
    anonymized query logs collectively capture knowledge, through requests that may be answered by knowledge asserted in document collections
  Queries contain multiple types of knowledge
    some of them are easier to extract than others
    instances, classes, attributes, relations