On-Demand Information Extraction. Summer/Fall 07. New York University Satoshi Sekine

Transcription

1 On-Demand Information Extraction Summer/Fall 07 New York University Satoshi Sekine

2 Introduction (

3 Research topics On-demand IE IE pattern Discovery Multi/Sing doc. sum. IE and Sum. Summarization Paraphrase Discovery Cross Ling. Pre-ODIE Relation Discovery Information needs Encyclopedia QA 2nd grade language test Cross Ling. QA IR, clustering & sum. QA IR English Analyzer (OAK) Extended NE Attribute for ENE Query log & class Web People Search Japanese Spelling

4 What is Information Extraction To automatically extract information on specific scenario from unstructured text and put it into table format Newspaper Ex. Management succession Date person Company Position 7/1/2003 John Smith Smith Trade corporation COE 7/2/2003 Bill Brown Bank of Manhattan President

5 Problem in conventional IE Preparation of knowledge for given scenario Pattern, lexicon and semantic knowledge Company announced Person s promotion to Position Writing rules by hand or creating training data Restriction Preparation time (one month in MUC) Specific/Limited scenario/domain

6 Our goal Make one month to one minute (=automatic) IE on the fly How? Use unsupervised learning methods Pattern discovery, paraphrase discovery Prepare as much knowledge/tools as we can for as many scenarios as possible Extended NE, semantic knowledge, relation Demo

7 ODIE Flow Chart Description of task Relevant Document IR system Language Analyzer Pattern discovery Pattern ENE tagger Paraphrase discovery Co-reference Pattern sets Table construction Table

8 Automatic Discovery of IE patterns (Sudo et.al HLT01) (Sudo et.al ACL03) Bottleneck of IE is knowledge creation IE pattern Company announced Person s promotion to Position Key Idea Retrieve documents on the given topic Collect relatively frequent patterns in the retrieved documents Research focus and Result Pattern format (predicate-argument, chain, sub-tree) Computation time (Tree mining algorithms) Near human performance was achieved

9 IE Pattern Discovery - Result Succession Train: 1 yr. newspaper Test: 148 (87 relevant, 61 irrelevant) Slots: Person:135 Org: 172 Pot: 215

10 Paraphrase Acquisition Automatic acquisition of paraphrase (Sekine Gengo01) Shinyama et al HLT02) (Shinyama et al IWP03) Purpose: Relationship between IE patterns Key idea Events are usually reported in different newspapers on the same day However, NE and numeric, date expressions are identical Use them as anchor to acquire paraphrase We observed encouraging results, but we found that it is not so easy a task

11 Paraphrase Acquisition Procedure Newspaper A Article A Find Comparable Articles Newspaper B Anchors identified NE tagging Coref. Article B Find Comparable Sentences Anchors identified phrase A Extract Paraphrase phrase B

12 Paraphrase Acquisition Evaluation Two techniques (co-reference, Argument Structure Database) One year of two Japanese newspaper 195 article pairs on arrest of murder suspects Coref ASD Sentence w/o (93) with Phrases w/o (>100) w/o with w/o with with obtained precision recall %(41) 44% 69%(52) 56% 24%(25) 56%(18) 62%(23)

13 Relation Discovery Relation discovery (Hasegawa et al ACL04) Motivation Discovering particular relations between NE s Country & President, Merged Companies Unsupervised method Not known the relationships in advance As many relationships as possible Idea Context based clustering Pairs of entities with similar context would be in the same relation Similar contexts could also characterize the relations

14 Relation Discovery Procedure Tag ENE on text, select two ENE types Collecting expressions within 5 words, and make the context between them as context vector Cluster ENE pairs based on context vectors Label each cluster based on frequent context words Named entity Company A is offering to buy s interest in is negotiating to acquire s planned purchase of Accumulated contexts 5 words or less Named entity Company B

15 Relation Discovery Result Major relations Ratio Common words (Relative frequency) President 17/23 President (1.0), president (0.415), Senator 19/21 Sen. (1.0), Republican (0.214), Prime Minister 15/16 Minister (1.0), minister (0.875), Prime (0.875), 15/16 Gov. (1.0), governor (0.458), Governor (0.3), Governor Secretary 6/7 Secretary (1.0), secretary (0.143), Republican 5/6 Rep. (1.0), Republican (0.667), Coach 5/5 coach (1.0), M&A 10/11 buy (1.0), bid (0.382), M&A 9/9 acquire (1.0), acquisition (0.583), buy (0.583), Parent 7/7 parent (1.0), unit (0.476), own (0.143),

16 Relation Discovery Evaluation and Error analysis Evaluation result Labeling accuracy: 10/10 = 100% (Cluster level) 108/121 = 89% (NE pair level) (cf. recall: 140/(177+65) = 58% (All clusters)) False Alarm (Mis-clustered NE pairs): Common words in another cluster which existed in 5 words contexts by accident (noise words) Chechnya war may exhaust President Boris Yeltsin Miss (Undetected NE pairs): Absence of common words in 5 words contexts Outer context might be helpful Boris Yeltsin on end the fighting in Chechnya

17 Another Paraphrase Discovery via Relation Discovery (Hasegawa et.al Gnengo-05) Expressions found in the same relation should contain paraphrase Two filters Used in more than one pair of instance Expressions contain frequent term Ex) A bought B / A has agreed to buy B / A, which is buying B / A s proposed acquisition of B / A s acquisition of B / A s agreement to buy B / A s purchase of B / A bid for B / A s takeover of B / A merger with B / A succeeded in buying B / B, which was acquired by A / B would become a subsidiary of A / B agreed to be bought by A

18 Yet Another Paraphrase Discovery using context and keywords (Sekine IPW-05) To eliminate the frequency threshold for the vector model in Hasegawa s method Algorithm Find keywords in context of each pair of NEs using TFIDF Cluster phrases based on NE instances (buy acquire / buy agree / buy purchase)

19 Extended Named Entity Sekine et.al LREC02) (Sekine et.al LREC04) Named Entity with 200 categories Definition (150 page) was released Automatic tagger 130K entries 2000 rules 70% accuracy Improving

20 Attributes for ENE (ver ) (example for PERSON) Attribute(20) Example of value Vocation professional baseball player, economist, poet Nationality Freq. (%) ENE 46(100) Vocation American, Chinese, Japanese 29(63) Country Career A professor at Yale University, The Princess of Wales 26(57) Vocation Masterpiece Guernica, Mona Lisa 25(54) Product, Facility Graduate M.A. in German at Cambridge, MK High School 20(44) School Hometown Paris, Manchester, Shanghai 19(41) City Native Province State of Illinois, Sichuan 18(39) Province Previous stay England, New York 12(26) Location Mentor Andrea del Verrocchio, Michelangelo di Lodovic Buonarroti Simoni 10(22) Person Death date 04,23,1704, 04/23/1704, unknown 10(22) date Era Edo period, the 11th century 8(17) Era Award Academy Award, MVP, Nobel Prize 8(17) Award Real Name Saint Nicholas 8(17) Person Another name Santa, father Christmas 8(17) Person Title Knight, an honorary degree at Yale 6(13) Title Competition World Series, 1955 piano competition in Paris 6(13) Game Place of birth New York, Birmingham 5(11) Location Father John B. Kelly, Sr. 5(11) Person Cause of death Car accident, Guillotine 5(11)

21 ODIE Flow Chart Description of task Relevant Document IR system Language Analyzer Pattern discovery Pattern ENE tagger Paraphrase discovery Co-reference Pattern sets Table construction Table

22 On-demand Information Extraction - ODIE Middle-size NSF project at NYU 03-08) Demo Description of the task (topic) acquire acquisition merge merger buy purchase Build a table in 1 minute Docid Company Money Date nyt Comdata Holdings Corp., Ceridian Corp. $900 million Last week nyt Chemicals Inc., $400 million Last year nyt Tenneco Inc., Mobil Corp. $1.27 billion nyt CoreStates Financial Corp., Merdian Bancorp $3.1 billion

23 nyt The move comes amid a flurry of acquisitions in the computer financial processing industry. Last week, check processor Ceridian Corp. agreed to buy Comdata Holdings Corp. for $900 million. nyt Last year, Terra Industries Inc., another large fertilizer maker, bought Agricultural Minerals & Chemicals Inc. for $400 million. ``Industry growth is primarily overseas,'' said Harvey Stober, an analyst with Donaldson, Lufkin & Jenrette in New York. nyt Tenneco Inc. said it agreed to buy Mobil Corp.'s plastics division for $1.27 billion as part of its strategy to expand into less cyclical, higher-growth businesses. nyt CoreStates Financial Corp. agreed to buy Meridian Bancorp for $3.1 billion in stock, creating a bank with a commanding role in Philadelphia and the industrial cities of eastern Pennsylvania.

24 ODIE - Evaluation Result Out of 20 topics, tables created for 2 topics are judged useful, 12 are useful, 6 are not useful. Correctness of the table fillers Evaluation Number of rows Correct 84 Partially correct 4 Incorrect 12

26 Preemptive Information Extraction (Shinyama, Sekine NAACL06) We can even delete queries 1. Find a pair (or triple) of entities that have a similar relation in articles. ex. Relation between Katrina - New Orleans in Article A is similar to relation between Longwang - Taiwan in Article B. 2. Cluster relations based on their similarity. 3. A cluster of similar relations = table.

27 Knowledge Discovery from Query Log (work at Microsoft Research, Summer 06) Query log contains a lot of knowledge Discover knowledge about ENE Query log (6month, 1.5B) bathroom+showers+pictures miss+mundo owen+mulligan+tyrone news+press free+wedding+planner marc+arther+glen preadolescente+follando gokmenler+agricultural+machinery+co.+erotika+com academy+award+winner ENE dictionary (245K) 100+greatest+britons AWARD 100+worst+britons AWARD aaass/orbis+books+prize+for+polish+studies AWARD abel+prize AWARD academy+award AWARD academy+awards AWARD acm+turing+award AWARD agatha+award AWARD agatha+awards AWARD air+medal AWARD

28 Problems to be solved Noise Problem 2 This is noise in ENE dictionary. I want to get rid of them!!! Populate the list Problem 3 The list may not have enough coverage. I want to populate ENE Important Context Problem 1 This is a very general context. I want to see important context for this category Clustering Problem 4 Individual entities are verbose. I want to cluster them.

29 Key Idea Find Important context for each category New Formula (not TFIDF, MI, log likelihood ratio ) Score(context) = ftype{context} * log( g(context) / C ); g(context) = ftype{context}/finst{context}; C = ftype{contexttop1000} / Finst{contexttop1000}; Result ACADEMIC #+syllabus, degrees+in+#, phd+in+#, masters+in+#, diploma+in+#, #+lecture+notes, masters+degree+in+#, courses+in+, AWARD #+winners, #+nominees, #+nominations, #+winner, #+award, who+won+#, winners+of+#, list+of+#+winners, winners+of+the+#

30 Results Delete Noise (Problem 2) Entities which does not co-occur with important context are noise BOOK: it, porno, space, night, we, working, candy, wheels, foundation, jazz, ghost, couples, the bible, giant, daddy, creation Find New entities (Problem 3) words co-occur with important context are entities AWARD: golden+globes, grammys, golden+globe, kentucky+derby, daytime+emmy, sag, sag+awards, american+idol, daytime+emmys

31 Web People Search a SemEval 2007 task Input Output First 100 results for a person name search Clustring according to the actual people John Smith 1 (Captain) John Smith 2 (Labour leader) Captain John Smith John Smith wikipedia en.eikipedia.. BBC: Labour leader John Smith news John Smith wikipedia en.wikipedia John Smith 3 (IBM researcher) John Smith 4 (Film director) John Smith 5 (Shoe company) John Smith 6 (Writer)

32 WePS SemEval - formally called Senseval since proposal / 18 final tasks >100 teams, > 125 systems, 110 papers WePS is the largest single task 16 participants (38 people in ML) Data Training source Test Av. entity Av. doc. Wikipedia ECDL WEB03* Total av. source Av. entity Av. doc. Wikipedia ACL Census Total av

33 WePS (Result) team CU_COM SEM COMBINED_BASELINE IRST-BP PSNUS UVA SHEF FICO UNN SCATTERED_BASELINE AUG SWAT-IV UA-ZSA TITPI JHU1-13 DFKI2 WIT UC3M_13 UBC-AS JOINED_BASELINE F α=0, purity inv_purity Presentation/Poster Tomorrow - PS2-11:15 12:30 Tomorrow - PS2 11:15 12:30 Next presentation - 17:00 17:15 Tomorrow - PS3 14:30 15:45 Tomorrow - PS2 11:15 12:30 Tomorrow - PS3 14:30 15:45 Tomorrow - PS2 11:15 12:30 Tomorrow - PS3 14:30 15:45 Tomorrow - PS3 14:30 15:45 Tomorrow - PS2 11:15 12:30 Tomorrow - PS2 11:15 12:30 Tomorrow - PS3 14:30 15:45 Tomorrow - PS3 14:30 15:45 -

34 WePS Systems Basic strategies are similar HTML analsys => pre-processing (POS, NE ) => feature selection (token, NE ) => calculate similarities => make clusters => produce output Summary is available Future Second event (08 fall or 09 spring) Attribute extraction (DOB, vocation )

35 Human Relationship Extraction (Goto NHK) Extract human relationships from movie/tv summary Display it as a network in a graph NE for the domain Attribute for NE Clustering the relationships

36 English Analyzer (OAK system) Chunked sentence (low level constituent) POS tagged sentence CONLL PTB/tag Tokenized sentence Dependency between Chunks and constituents PTB/tree CONL L, MUC PTB/tree PTB/tree Collins Plain Sentence PTB/tag PTB/tag PTB/tree NE tagged sentence (org,loc,person,time ) Plain NE tagger Chunker Dependency Analyzer POS tagger Parser Tokenizer Function Tagger Sentence Splitter Regularizer Triplet Plain Text OAK System PTB/tree Parse tree PTB/tree Parse tree with Function labels Glarf Regularized structure (predicate-argument)

37

39 Extra slides

40 IR, clustering and multi-doc. summaization Nobata et.al Gengo03) Nobata et.al DUC03) (Murakami et.al Gengo04) Large number of retrieved documents Automatic clustering by sub-topics 200 documents -> 5 clusters Summarize each cluster (MDS) Demo (

41

42

43

44 Cross Lingual Question Answering Sekine DARPA03) (Sekine TALIP04) Developed in a month in DARPA s SurpriseLanguage experiment Type in question in English Translate question to Hindi Find answer in Hindi newspaper Translate answer to English Demo (

45 NE tagged corpus Num. knowledge Q Examiner NE tagger Keywords in English Type of expected answer newspaper Relevant Documents CLIR Q in English Numeric tagger Keywords in Hindi User I/F Find answer Bi-ling. dict. Answer, context in English Answer, context in Hindi Translation (H->E)

46

47

48

49 Huge Text data & knowledge acquisition (Yoshihira et.al Gengo04) (Ando et.al LREC04) (Sekine Gengo04) Collecting huge corpus (untagged) WEB pages : (Japanese 350G, English 300G) Newspaper : 38 years, ~10G Encyclopedia Tool Suffix Array for the huge data Application Semantic knowledge acquisition by lexico-syntactic pattern Automatic acquisition of NE instances IE pattern, paraphrase acquisition Collect spelling variations KWIC system (

50

51 Survey on Multi-document summarization Sekine et.al DUC03) Purpose Multi-doc. summarization & IE Analyze summary/category of 100 doc. set Category of doc. set 13 category: (multi single)-(person org loc ), other Consistent tagging by people It should be useful for summarization strategy Table format summary 40-45%: Table is natural, 36-38%: Table is OK Encouraging results for On-demand IE

52 Spelling variation in Japanese (Masuyama et.al Gengo04) Serious problem in real world systems 30% of question for encyclopedia QA system 40% of zero hits on dictionary look up A system to find the closest entry 190,000 entries Edit distance after narrowing down candidates If the target is in the dictionary, 100% in top 3 Demo Planning to construct variation dictionary from WEB corpus

53

54 Encyclopedia QA (Sekine Gengo03) (Sekine et al. Gengo05) User can ask question in natural language and get the answer from encyclopedia System is working as a real demo ( Answer in the entry words or numeral in explanation Answer prepared in advance by IE and pattern based Q matcher pinpoint the answer

55

56 2nd Grade Language Test Objectives Realize the NLP technologies into the form which can be easily observed by ordinary people Observe the problems of the NLP technologies by degrading the level of target materials Results # of Q Known Correct Cor in known Cor. In total Kanji Word K Reading Total

57

59 My Research Interest things are the central People, organization, location, event name, vocation, product name How they are related How they act, are used or are evaluated It is knowledge How to create knowledge? Extract from huge corpus Fusion with the existing knowledge (e.g. encyclopedia)

60 Knowledge Belongs to

61 Knowledge Belongs to Belongs to A part of Belongs to A part of

62 Knowledge Belongs to Has a lunch Belongs to A part of Belongs to A part of

63 Knowledge Work in, commute in Belongs to Has a lunch Belongs to A part of Belongs to A part of

64 Knowledge features Largest city Work in, commute in Favorite team Belongs to Has a lunch Belongs to A part of Belongs to A part of

65 Knowledge features Largest city Work in, commute in Now visiting Favorite team Belongs to Has a lunch Belongs to A part of Belongs to A part of

66 Knowledge Largest city Friendship countries A member of features Work in, commute in Now visiting Favorite team Belongs to Has a lunch Belongs to A part of Belongs to A part of

67 Knowledge A member of A member of features Pay money A member of Friendship countries Work in, commute in Now visiting A member of Largest city Saw it play Favorite team Nationality A member of Belongs to Awarded Favorite team Nationality Live Born in Has a lunch Belongs to A part of Belongs to Same name A part of A member of

68 Boo! Chee r! Knowledge A member of A member of features Pay money A member of Friendship countries Work in, commute in Now visiting A member of Largest city Saw it play Favorite team Nationality A member of Belongs to Awarded Favorite team Nationality Live Born in Has a lunch Belongs to A part of Belongs to Same name A part of A member of

69 Knowledge Creation Not only by hand Cyc, EDR Learn from Corpus Frequency, TFIDF Pattern discovery, NE discovery Distributional similarity NE discovery, attribute for NE, Paraphrase Discovery Lexico-syntactic pattern Hyper-hyponym discovery, IE pattern Alignment Paraphrase Discovery

70 Topics Names Create categories (e.g. product name, event name) Find entities (e.g. instances of laptop names) Relation Find relations (e.g. pro/cons of products, sentiment, opinion) Categorize Relation Paraphrase, textual entailment (e.g. same kind of opinion) Knowledge Link and structure the extracted knowledge Human interface