Keyphrase Extraction for Scholarly Big Data



Similar documents
MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS November 7, Machine Learning Group

Term extraction for user profiling: evaluation by the user

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL

How To Write A Summary Of A Review

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

Data Mining Yelp Data - Predicting rating stars from review text

TextRank: Bringing Order into Texts

Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help?

Clustering Documents with Active Learning using Wikipedia

Removing Web Spam Links from Search Engine Results

Identifying Focus, Techniques and Domain of Scientific Papers

Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts

Visualization methods for patent data

Folksonomies versus Automatic Keyword Extraction: An Empirical Study

Web Document Clustering

Analysis and Synthesis of Help-desk Responses

Semi-Supervised Learning for Blog Classification

Personalized News Filtering and Summarization on the Web

An Introduction to Data Mining

INTELLIGENT AND PERVASIVE ARCHIVING FRAMEWORK TO ENHANCE THE USABILITY OF THE ZERO-CLIENT- BASED CLOUD STORAGE SYSTEM

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

A Comparison of Word- and Term-based Methods for Automatic Web Site Summarization

A SURVEY ON WEB MINING TOOLS

SUIT: A Supervised User-Item Based Topic Model for Sentiment Analysis

Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining.

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Domain Classification of Technical Terms Using the Web

Sentiment analysis on tweets in a financial domain

International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013

Finding Advertising Keywords on Web Pages. Contextual Ads 101

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Dynamical Clustering of Personalized Web Search Results

A Web Content Mining Approach for Tag Cloud Generation

2014/02/13 Sphinx Lunch

Hexaware E-book on Predictive Analytics

Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis

1 o Semestre 2007/2008

Using Artificial Intelligence to Manage Big Data for Litigation

Partially Supervised Word Alignment Model for Ranking Opinion Reviews

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Social Media Mining. Data Mining Essentials

Classification of Engineering Consultancy Firms Using Self-Organizing Maps: A Scientific Approach

Text Mining: The state of the art and the challenges

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Clustering Technique in Data Mining for Text Documents

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Mining Opinion Features in Customer Reviews

How To Write A Wikipedia Article Automatically On A Wikipedia Stub

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

CHAPTER 1 INTRODUCTION

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

María Elena Alvarado gnoss.com* Susana López-Sola gnoss.com*

An Unsupervised Approach to Domain-Specific Term Extraction

AN IMPROVED FUZZY SYSTEM FOR REPRESENTING WEB PAGES IN CLUSTERING TASKS

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Secure Because Math: Understanding ML- based Security Products (#SecureBecauseMath)

The Open Source Knowledge Discovery and Document Analysis Platform

How To Create A Text Classification System For Spam Filtering

Automatic Advertising Campaign Development

Enhanced Information Access to Social Streams. Enhanced Word Clouds with Entity Grouping

FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Towards a Visually Enhanced Medical Search Engine

Sentiment Analysis and Topic Classification: Case study over Spanish tweets

De la Business Intelligence aux Big Data. Marie- Aude AUFAURE Head of the Business Intelligence team Ecole Centrale Paris. 22/01/14 Séminaire Big Data

COMP9321 Web Application Engineering

WHAT DEVELOPERS ARE TALKING ABOUT?

Inner Classification of Clusters for Online News

MS1b Statistical Data Mining

Spring February. March

End-to-End Sentiment Analysis of Twitter Data

Scientific Report. BIDYUT KUMAR / PATRA INDIAN VTT Technical Research Centre of Finland, Finland. Raimo / Launonen. First name / Family name

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

USE OF INFORMATION SOURCES AMONGST POSTGRADUATE STUDENTS IN COMPUTER SCIENCE AND SOFTWARE ENGINEERING A CITATION ANALYSIS YIP SUMIN

Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value

Viral Marketing in Social Network Using Data Mining

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

Curriculum Vitae Ruben Sipos

Finding Advertising Keywords on Web Pages

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Top 10 Algorithms in Data Mining

CAS CS 565, Data Mining

Topic Extrac,on from Online Reviews for Classifica,on and Recommenda,on (2013) R. Dong, M. Schaal, M. P. O Mahony, B. Smyth

Deep Web Entity Monitoring

Top Top 10 Algorithms in Data Mining

Information Management course

WebInEssence: A Personalized Web-Based Multi-Document Summarization and Recommendation System

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Using R for Social Media Analytics

WEB SITE OPTIMIZATION THROUGH MINING USER NAVIGATIONAL PATTERNS

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Using Data Mining for Mobile Communication Clustering and Characterization

Semi-Supervised and Unsupervised Machine Learning. Novel Strategies

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Question Answering and Multilingual CLEF 2008

Identifying Search Keywords for Finding Relevant Social Media Posts

Transcription:

Keyphrase Extraction for Scholarly Big Data Cornelia Caragea Computer Science and Engineering University of North Texas July 10, 2015

Scholarly Big Data Large number of scholarly documents on the Web PubMed currently has over 24 million documents Google Scholar is estimated to have 160 million documents The growth in the number of papers indexed DBLP: Navigating in these digital libraries has become very challenging.

Keyphrase Extraction in Document Networks Keyphrases: Allow for efficient processing of more information in less time Are useful in many applications: Topic tracking, information filtering and search, classification, clustering, and recommendation. Keyphrase extraction is the task of automatically extracting descriptive phrases or concepts from a document.

Keyphrases Example: A snippet from the 2010 best paper award winner in the WWW conference - the author-input keyphrases are shown in red

Previous Approaches to Keyphrase Extraction Use generally only the textual content of the target document [Mihalcea and Tarau, 2004], [Liu et al., 2010]. Recently, models are proposed that incorporate a local neighborhood of a document [[Wan and Xiao, 2008]. Obtained improvements over models that use only textual content. However, their neighborhood is limited to textually-similar documents. During these Big Data times - access to giant document networks

Our Questions In addition to a document s textual content and textually-similar neighbors, are there other informative neighborhoods in research document collections? Can these neighborhoods improve keyphrase extraction?

From Data to Knowledge A typical scientific research paper: Proposes new problems or extends the state-of-the-art for existing research problems. Cites relevant, previously-published papers in appropriate contexts. Citation contexts capture the influence of one paper on another as well as the flow of information Can serve as micro summaries of a cited paper!

A Small Citation Network Citation contexts are very informative! [Das G. and Caragea, 2014]; [Caragea et al., 2014]

Citation Contexts - Not a New Idea Using terms from citation contexts resembles the analysis of hyperlinks and the graph structure of the Web Web search engines build on the intuition that the anchor text pointing to a page is a good descriptor of its content, and thus use anchor terms as additional index terms for a target webpage. Previously used for other tasks: Scientific paper summarization [Mei and Zhai, 2008; Abu-Jbara and Radev, 2011; Qazvinian et al., 2010] Indexing of cited papers [Ritchie et al. (2006)] Author influence in document networks [Kataria et al., 2011]

Citation Contexts to Keyphrase Extraction How can we use these contexts and how do they help in keyphrase extraction? We proposed: CiteTextRank [Das Gollapalli and Caragea, 2014]: an unsupervised, graph-based algorithm that incorporates evidence from multiple sources (citation contexts as well as document content) in a flexible way to extract keyphrases. Citation-enhanced Keyphrase Extraction (CeKE) [Caragea et al., 2014]: a supervised binary classification model built on a combination of novel features that capture information from citation contexts and existing features from previous works.

Features for CeKE

How Does CeKE Compare with Supervised Models? Table: Comparison of CeKE with Hulth s and KEA methods. Features used in previous supervised methods: Hulth s features: POS, relative position, term frequency and collection frequency. KEA s features: tf-idf and relative position

How Does CeKE Perform in the Absence of Either Cited or Citing Contexts? Table: Results with both contexts and only cited/citing contexts.

Conclusions and Future Directions Our models give significant improvements over baseline models for multiple datasets of research papers in the Computer Science domain Future directions: Citation context lengths: Incorporate more sophisticated approaches to identifying the text that is relevant to a target citation [Abu-Jbara and Radev, 2012; Teufel, 1999] and study the influence of context lengths on the quality of extracted keyphrase Evaluate our models on other domains, e.g., the ACL Anthology, PubMed.

References S. Das Gollapalli and C. Caragea (2014). Extracting Keyphrases from Research Papers using Citation Networks. In: Proceedings of the 28th American Association for Artificial Intelligence (AAAI 14). C. Caragea, F. Bulgarov, A. Godea, and S. Das Gollapalli. (2014). Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 14). R. Mihalcea and P. Tarau. (2004). TextRank: Bringing order into text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 04). X. Wan and J. Xiao (2008). Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI 08). Qiaozhu Mei and ChengXiang Zhai. (2008). Generating impact-based summaries for scientific literature. In: Proceedings of the Conference of the Association for Computational Linguistics, pages 816-824, Columbus, Ohio. V. Qazvinian, D. Radev, and A. Özgür. 2010. Citation summarization through keyphrase extraction. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 10, pages 895-903.

ACL Workshop on Keyphrase Extraction For more information, please visit: www.cse.unt.edu/ ccaragea/acl2015ws.html

Thank you!