A Wikipedia-based Naive Bayes Approach for Obtaining Related Phrases from A Natural Language Query


DEIM Forum 2012 D7-2

Masumi SHIRAKAWA, Kotaro NAKAYAMA, Takahiro HARA, and Shojiro NISHIO

Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka, Japan
The Center for Knowledge Structuring, The University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan
E-mail: {shirakawa.masumi,hara,nishio}@ist.osaka-u.ac.jp, nakayama@cks.u-tokyo.ac.jp

1. [Section title and body text not recovered. Surviving terms: Web, Wikipedia, 2006; an enumeration 1) 2) 3); citations [2], [16].]

2. [Section title and body text not recovered.] Surviving citations and names, in order of appearance: Wikipedia, 2006, Wiki, Web, URL [10]; [9]; [12]; Strube et al. [16], WordNet; [12]; Gabrilovich et al. [2], Explicit Semantic Analysis (ESA); Milne et al. [8]; [10], [11]; Ito et al. [4]; [10]; Twitter; Meij et al. [6]; Ferragina et al. [1]; Song et al. [14]; ESA; Yahoo! Content Analysis API.

3. [Section title not recovered; it mentions Wikipedia.]

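Table 1  Definition of symbols. (Meanings are reconstructed from the table captions and equations of this section; the original descriptions were not recovered.)

  Symbol         Meaning
  t              a term
  T              the set of keyphrases
  T'             one candidate value of T
  E              the set of entities
  e              an entity
  c              a related term (an element of E)
  P(t ∈ T)       probability that term t is a keyphrase
  P(e|t)         probability that term t is linked to entity e
  P(c|e)         probability of related term c given entity e
  P(c|t)         probability of related term c given term t
  P(c)           prior probability of related term c
  P(c|T)         probability of related term c given the keyphrase set T
  P(T = T')      probability that the keyphrase set T equals T'

Table 2  An example of the probability that a term is a keyphrase. Terms listed under P(t ∈ T): Apple, Apple Inc, Steve Jobs, Japan, China, tree, black, house. [Probability values not recovered.]

The probability P(t ∈ T) that a term t is a keyphrase is estimated from Wikipedia in the manner of wikification [7]. With CountDocuments(t) the number of documents that contain t, and CountDocuments(t ∈ Key) the number of documents in which t appears as a keyphrase:

\[ P(t \in T) \approx \frac{\mathrm{CountDocuments}(t \in Key)}{\mathrm{CountDocuments}(t)} \tag{1} \]

[Discussion of Table 2 not recovered. Surviving terms: TFIDF; Apple Inc., Steve Jobs; black, house.]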
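As a rough illustration of the document-frequency ratio behind P(t ∈ T), the sketch below counts, for each term, the fraction of documents containing it in which it is also marked as a keyphrase. The function name `keyphraseness`, the input layout, and the choice to treat link anchor text as the keyphrase marker (as in wikification [7]) are assumptions made for this example, not details confirmed by the paper.

```python
from collections import Counter


def keyphraseness(docs):
    """Estimate P(t in T) as CountDocuments(t in Key) / CountDocuments(t).

    docs: iterable of (text_terms, linked_terms) pairs, one pair per article.
      text_terms   -- terms that occur anywhere in the article
      linked_terms -- terms that occur as anchor text (assumed keyphrase marker)
    """
    appears = Counter()  # CountDocuments(t)
    as_key = Counter()   # CountDocuments(t in Key)
    for text_terms, linked_terms in docs:
        text_terms = set(text_terms)
        appears.update(text_terms)
        # A term only counts as a keyphrase of an article it occurs in.
        as_key.update(set(linked_terms) & text_terms)
    return {t: as_key[t] / n for t, n in appears.items()}


# Toy corpus (hypothetical data, not taken from Wikipedia).
toy = [
    ({"apple inc", "black", "house"}, {"apple inc"}),
    ({"apple inc", "house"}, {"apple inc"}),
    ({"black", "house"}, set()),
    ({"house", "apple inc"}, {"house"}),
]
probs = keyphraseness(toy)
```

On this toy corpus a named entity such as "apple inc" receives a much higher value than a common word such as "black" or "house", which is the behaviour the ratio is meant to capture.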

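[Example not fully recovered: from the text fragment "... and New York Times said ...", the terms New York Times, New York, and York are mentioned.]

Table 3  The probability that a term "Apple" is linked to an entity. Entities listed under P(e|t): Apple Inc., Apple, Apple Records, Apple (album), Apple Corps, Apple Store, Apple (company), App Store. [Probability values not recovered.]

Table 4  Related terms of an entity "Apple Inc." and their probability. Related terms listed under P(c|e): AppleInsider, Apple Store, Steve Jobs, IPhone OS, IPod Touch, FairPlay, Mac OS X, Macworld. [Probability values not recovered.]

The probability P(e|t) that a term t is linked to an entity e is estimated from Wikipedia anchor texts [9], where CountAnchortexts(t, e) counts the anchor texts t that link to e:

\[ P(e \mid t) \approx \frac{\mathrm{CountAnchortexts}(t, e)}{\sum_{e_i \in E} \mathrm{CountAnchortexts}(t, e_i)} \tag{2} \]

E is the set of entities in Wikipedia. Table 3 lists 8 entities for the term "Apple", including the IT company Apple Inc., Apple, and Apple Records.

The probability P(c|e) of a related term c given an entity e can be estimated from the link counts CountLinks(e, c) between e and c:

\[ P(c \mid e) \approx \frac{\mathrm{CountLinks}(e, c)}{\sum_{c_j \in E} \mathrm{CountLinks}(e, c_j)} \tag{3} \]

Alternatively, with Sim(e, c) the similarity between e and c computed by ESA [2]:

\[ P(c \mid e) \approx \frac{\mathrm{Sim}(e, c)}{\sum_{c_j \in E} \mathrm{Sim}(e, c_j)} \tag{4} \]

Table 4 lists 8 related terms of Apple Inc. obtained with ESA. The probability of c given a term t marginalizes over the entities that t may refer to:

\[ P(c \mid t) = \sum_{e_i \in E} P(c \mid e_i)\, P(e_i \mid t) \tag{5} \]

The prior P(c) is estimated from the link count CountLinks(c) of c:

\[ P(c) \approx \frac{\mathrm{CountLinks}(c)}{\sum_{c_j \in E} \mathrm{CountLinks}(c_j)} \tag{6} \]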
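The count-based estimates of this part (P(e|t) from anchor-text counts, P(c|e) from link counts, P(c|t) as the sum over entities of P(c|e) P(e|t), and the prior P(c)) can be sketched as plain normalizations over nested dictionaries. All names and the toy counts below are hypothetical; in particular, reading CountLinks(c) as the total number of links pointing at c is an assumption of this sketch.

```python
from collections import defaultdict


def normalize(counts):
    """Turn a {key: count} mapping into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def related_given_term(anchor_counts, link_counts, term):
    """P(c|t) = sum over entities e of P(c|e) * P(e|t)."""
    p_e_given_t = normalize(anchor_counts.get(term, {}))  # anchor-text ratio
    p_c_given_t = defaultdict(float)
    for e, p_e in p_e_given_t.items():
        p_c_given_e = normalize(link_counts.get(e, {}))   # link-count ratio
        for c, p_c in p_c_given_e.items():
            p_c_given_t[c] += p_c * p_e
    return dict(p_c_given_t)


def prior(link_counts):
    """P(c), assuming CountLinks(c) is the number of links pointing at c."""
    totals = defaultdict(int)
    for targets in link_counts.values():
        for c, n in targets.items():
            totals[c] += n
    return normalize(totals)


# Hypothetical counts: anchor_counts[t][e] and link_counts[e][c].
anchor_counts = {"apple": {"Apple Inc.": 8, "Apple": 2}}
link_counts = {
    "Apple Inc.": {"Steve Jobs": 3, "Mac OS X": 1},
    "Apple": {"Fruit": 4},
}
p_c_t = related_given_term(anchor_counts, link_counts, "apple")
p_c = prior(link_counts)
```

Because every inner distribution sums to one, the resulting P(c|t) also sums to one over c, which is a quick sanity check on real data.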

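For an observed set of keyphrases $T = \{t_1, \ldots, t_K\}$, $P(c \mid T)$ is obtained by naive Bayes, treating the terms $t_k$ as conditionally independent given $c$ [14]:

\[ P(c \mid T = \{t_1, \ldots, t_K\}) \propto P(c) \prod_{k=1}^{K} P(t_k \mid c) \propto \frac{\prod_{k=1}^{K} P(c \mid t_k)}{P(c)^{K-1}} \tag{7} \]

Fig. 1  Naive Bayes for a set of keyphrases in which members are unobservable. [Figure not recovered; it involves terms $t_1$, $t_2$, $t_3$ and the set $T$.]

The members of $T$ are not observable [13]. For the $K$ candidate terms, the probability that $T$ equals a particular subset $T'$ is

\[ P(T = T') = \prod_{t_k \in T'} P(t_k \in T) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr) \tag{8} \]

From (7) and (8), summing over all subsets $T'$:

\[ P(c \mid T) \propto \sum_{T'} \left( P(T = T') \, \frac{\prod_{t_k \in T'} P(c \mid t_k)}{P(c)^{|T'|-1}} \right) \tag{9} \]

Substituting (8) and multiplying numerator and denominator by $P(c)^{K-|T'|}$ gives a common denominator:

\[ P(c \mid T) \propto \sum_{T'} \frac{\prod_{t_k \in T'} P(t_k \in T)\, P(c \mid t_k) \prod_{t_k \notin T'} \bigl(1 - P(t_k \in T)\bigr) P(c)}{P(c)^{K-1}} \tag{10} \]

Because $P(c)^{K-1}$ does not depend on $T'$ and every $t_k$ is either in $T'$ or not, the sum over subsets factorizes into a product over terms [13]:

\[ P(c \mid T) \propto \frac{\prod_{k=1}^{K} \Bigl( P(t_k \in T)\, P(c \mid t_k) + \bigl(1 - P(t_k \in T)\bigr) P(c) \Bigr)}{P(c)^{K-1}} \tag{11} \]

Equation (11) has the form of Equation (7) with each $P(c \mid t_k)$ replaced by a mixture of $P(c \mid t_k)$ and $P(c)$ weighted by $P(t_k \in T)$. When $P(t_k \in T)$ is high, the factor for $t_k$ approaches $P(c \mid t_k)$; when it is low, the factor approaches $P(c)$ and cancels against one $P(c)$ in the denominator, so $t_k$ has little influence.

4. [Section title and body text not recovered. Surviving terms: ESA; Twitter; Fig. 2; 8; (a), (b): Microsoft; (a): brand; (b): Xbox Live.]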
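The closed form of Equation (11), a product over terms of P(t_k ∈ T) P(c|t_k) + (1 − P(t_k ∈ T)) P(c) divided by P(c)^(K−1), can be checked numerically against the explicit sum over all subsets T' of Equation (9). The following is a minimal sketch with made-up probabilities; the function names are hypothetical, and both functions return unnormalized scores for a single related term c.

```python
from itertools import combinations
from math import isclose, prod


def score_product(p_key, p_c_t, p_c):
    """Closed form: one mixture factor per term, divided by P(c)^(K-1)."""
    k_terms = len(p_key)
    numerator = prod(pk * pct + (1 - pk) * p_c for pk, pct in zip(p_key, p_c_t))
    return numerator / p_c ** (k_terms - 1)


def score_enumerate(p_key, p_c_t, p_c):
    """Explicit sum over every subset T' of the K candidate terms."""
    k_terms = len(p_key)
    total = 0.0
    for size in range(k_terms + 1):
        for subset in combinations(range(k_terms), size):
            chosen = set(subset)
            # P(T = T'): terms in T' are keyphrases, all others are not.
            p_subset = prod(
                p_key[k] if k in chosen else 1 - p_key[k] for k in range(k_terms)
            )
            # Naive Bayes score for T'; the empty set reduces to P(c).
            naive_bayes = prod(p_c_t[k] for k in chosen) / p_c ** (len(chosen) - 1)
            total += p_subset * naive_bayes
    return total


# Made-up values: P(t_k in T), P(c|t_k) for three terms, and the prior P(c).
p_key = [0.9, 0.1, 0.5]
p_c_t = [0.3, 0.01, 0.2]
p_c = 0.05
closed = score_product(p_key, p_c_t, p_c)
brute = score_enumerate(p_key, p_c_t, p_c)
```

The enumeration visits 2^K subsets while the closed form needs K multiplications, which is what makes the product form usable on real input text. A term with P(t_k ∈ T) = 0 contributes only a factor P(c), so it leaves the score unchanged.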

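Fig. 2  Related terms obtained by our method (value means probability). [Output lists and values not recovered; the four input texts and the surviving output terms are listed.]

  (a) "Did you know that Microsoft is the most influential brand in Canada?"  Surviving output term: Microsoft
  (b) "Microsoft denies Xbox Live security breach"  Surviving output terms: Xbox, Microsoft
  (c) "Warriors beat the Heat... Happy face!"  Surviving output term: NBA
  (d) "McClennan names Warriors lineup for first pre-season trial"

[Discussion of Fig. 2 not recovered. Surviving terms: Canada; (c), (d): Warriors; (c): NBA; Golden State Warriors, New Zealand Warriors; (c): Heat, NBA; (d): McClennan.]

[Text introducing the evaluation not recovered. Surviving terms: Twitter; K-means; hashtags such as #Obama and #MacBook; citation [5]; Table 5.]

Table 5  Three datasets for evaluation and their statistics. (Per-hashtag counts in parentheses.)

                 U                    IT                    S
  Hashtags       #Obama (779)         #MacBook (1,251)      #NFL (1,043)
                 #Bones (949)         #Silverlight (221)    #NHL (1,045)
                 #PGA (1,243)         #VMWare (890)         #NBA (1,085)
                 #Microsoft (1,040)   #MySQL (1,241)        #MLB (752)
                 #medicine (1,109)    #Ubuntu (988)         #MLS (969)
                 #Christ (871)        #Chrome (1,018)       #UFC (984)
                                                            #NASCAR (857)
  Total          5,991                5,609                 6,735
  [label lost]   83,748               82,608                91,[...]
  [label lost]   [...],636            16,539                18,603

[Following text not recovered. Surviving terms: citation [14]; Table 5; U; IT.]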

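Table 6  The result of clustering. [Numeric values not recovered.]

  Columns: purity (U, IT, S); NMI (U, IT, S); ARI (U, IT, S)
  Rows:
    BOW
    ESA ([...] 10), ([...] 20), ([...] 50), ([...] 100), ([...] 200), ([...] 500), ([...] 1,000), ([...] 2,000), ([...] 5,000)
    [method name lost] ([...] 10), ([...] 20), ([...] 50), ([...] 100), ([...] 200), ([...] 500), ([...] 1,000), ([...] 2,000), ([...] 5,000)
    [method name lost] (ESA [...] 10), (ESA [...] 20), (ESA [...] 50), (ESA [...] 100), (ESA [...] 200), (ESA [...] 500), (ESA [...] 1,000), (ESA [...] 2,000), (ESA [...] 5,000)

[A five-item enumeration 1) to 5) follows dataset S; only "RT" and "URL" in item 3) and "#" in item 4) survive.]

The compared methods are bag-of-words (BOW) and ESA by Gabrilovich et al. [2] (ESA). [Remaining description not recovered.]

The evaluation measures are purity [17], normalized mutual information (NMI) [15], and the adjusted Rand index (ARI) [3]. [Explanation of the measures not recovered. Surviving terms: purity, NMI, ARI; false-positive, false-negative; 0, 1; K-means.]

[Following text not recovered. Surviving terms: (BOW); Wikipedia; ESA; Table 5; Song et al. [14].]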
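For readers who want to reproduce this kind of evaluation, the three clustering measures named here can be computed from the contingency counts between gold labels (for example, hashtags) and cluster assignments. The sketch below uses only the standard library; it follows the usual textbook definitions (majority-class purity, mutual information normalized by the geometric mean of the entropies, and the Hubert-Arabie adjusted Rand index) and is not claimed to match the paper's exact evaluation code.

```python
from collections import Counter
from math import comb, isclose, log, sqrt


def purity(labels, clusters):
    """Fraction of items that belong to the majority label of their cluster."""
    best = Counter()
    for (_, clu), n in Counter(zip(labels, clusters)).items():
        best[clu] = max(best[clu], n)
    return sum(best.values()) / len(labels)


def nmi(labels, clusters):
    """Mutual information divided by sqrt(H(labels) * H(clusters))."""
    n = len(labels)
    n_lab, n_clu = Counter(labels), Counter(clusters)

    def entropy(counts):
        return -sum(v / n * log(v / n) for v in counts.values())

    mi = sum(
        v / n * log(v * n / (n_lab[lab] * n_clu[clu]))
        for (lab, clu), v in Counter(zip(labels, clusters)).items()
    )
    denom = sqrt(entropy(n_lab) * entropy(n_clu))
    # Convention chosen here: return 0.0 when either side has a single group.
    return mi / denom if denom else 0.0


def ari(labels, clusters):
    """Adjusted Rand index; undefined (division by zero) for trivial partitions."""
    n = len(labels)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels, clusters)).values())
    sum_lab = sum(comb(v, 2) for v in Counter(labels).values())
    sum_clu = sum(comb(v, 2) for v in Counter(clusters).values())
    expected = sum_lab * sum_clu / comb(n, 2)
    max_index = (sum_lab + sum_clu) / 2
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical gold labels and two clusterings of six items.
gold = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]   # same partition, cluster ids permuted
coarse = [0, 0, 0, 1, 1, 1]    # two clusters instead of three
```

All three measures are invariant to how cluster ids are numbered, so the permuted clustering `perfect` scores 1 on each of them, while `coarse` is penalized most strongly by ARI.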

[Discussion of Table 6 not recovered. Surviving terms: BOW; ESA; purity, ARI, NMI; (IT, S); (U); IT; S; ESA.]

6. [Section title and body text not recovered; it mentions Wikipedia.]

[Acknowledgment not recovered; only "B( )" survives.]

References

[1] P. Ferragina and U. Scaiella, "TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)," Proc. of ACM Conference on Information and Knowledge Management (CIKM), Oct.
[2] E. Gabrilovich and S. Markovitch, "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis," Proc. of International Joint Conference on Artificial Intelligence (IJCAI), Jan.
[3] L. Hubert and P. Arabie, "Comparing Partitions," Journal of Classification, vol. 2, no. 1.
[4] M. Ito, K. Nakayama, T. Hara, and S. Nishio, "Association Thesaurus Construction Methods based on Link Cooccurrence Analysis for Wikipedia," Proc. of ACM Conference on Information and Knowledge Management (CIKM), Oct.
[5] D. Laniado and P. Mika, "Making Sense of Twitter," Proc. of International Semantic Web Conference (ISWC), Nov.
[6] E. Meij, W. Weerkamp, and M. de Rijke, "Adding Semantics to Microblog Posts," Proc. of ACM International Conference on Web Search and Data Mining (WSDM), Feb.
[7] R. Mihalcea and A. Csomai, "Wikify! Linking Documents to Encyclopedic Knowledge," Proc. of ACM Conference on Information and Knowledge Management (CIKM), Nov.
[8] D. Milne and I.H. Witten, "An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links," Proc. of AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI), pp. 25-30, July.
[9] D. Milne and I.H. Witten, "Learning to Link with Wikipedia," Proc. of ACM Conference on Information and Knowledge Management (CIKM), Oct.
[10] K. Nakayama, T. Hara, and S. Nishio, "Wikipedia Mining for An Association Web Thesaurus Construction," Proc. of International Conference on Web Information Systems Engineering (WISE), Dec.
[11] Y. Ollivier and P. Senellart, "Finding Related Pages Using Green Measures: An Illustration with Wikipedia," Proc. of National Conference on Artificial Intelligence (AAAI), July.
[12] S.P. Ponzetto and M. Strube, "Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution," Proc. of Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), June.
[13] M. Shirakawa, H. Wang, Y. Song, Z. Wang, K. Nakayama, T. Hara, and S. Nishio, "Entity Disambiguation based on a Probabilistic Taxonomy," Tech. Rep. MSR-TR, Microsoft Research, Nov.
[14] Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen, "Short Text Conceptualization Using a Probabilistic Knowledgebase," Proc. of International Joint Conference on Artificial Intelligence (IJCAI), July.
[15] A. Strehl and J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions," Journal of Machine Learning Research, vol. 3, Dec.
[16] M. Strube and S.P. Ponzetto, "WikiRelate! Computing Semantic Relatedness using Wikipedia," Proc. of National Conference on Artificial Intelligence (AAAI), July.
[17] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," Tech. Rep. #01-40, Department of Computer Science, University of Minnesota, Feb.
