Enhanced Information Access to Social Streams. Enhanced Word Clouds with Entity Grouping

Similar documents

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

Exploring Big Data in Social Networks

Research of Postal Data mining system based on big data

Tweets Miner for Stock Market Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Why do statisticians "hate" us?

Task 3 Web Community Sensing & Task 6 Query and Visualization

Analysis of Social Media Streams

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

Sentiment analysis on tweets in a financial domain

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Clustering Technique in Data Mining for Text Documents

Introduction to Data Mining

Big Data and Analytics: Challenges and Opportunities

Sentiment Analysis on Big Data

Automated Collaborative Filtering Applications for Online Recruitment Services

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Predicting stocks returns correlations based on unstructured data sources

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

COMP9321 Web Application Engineering

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

1 o Semestre 2007/2008

Comparison of K-means and Backpropagation Data Mining Algorithms

Digital Collections as Big Data. Leslie Johnston, Library of Congress Digital Preservation 2012

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

Practical Graph Mining with R. 5. Link Analysis

Clustering Data Streams

Programming Tools based on Big Data and Conditional Random Fields

How To Cluster On A Search Engine

Customer Driven Big-Data Analytics for the Companies Servitization

KEYWORD SEARCH IN RELATIONAL DATABASES

Visualization methods for patent data

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search

The Need for Training in Big Data: Experiences and Case Studies

Association rules for improving website effectiveness: case analysis

Financial Trading System using Combination of Textual and Numerical Data

The Big Data Paradigm Shift. Insight Through Automation

THEMIS: Fairness in Data Stream Processing under Overload

How To Write A Summary Of A Review

IC05 Introduction on Networks &Visualization Nov

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Dynamical Clustering of Personalized Web Search Results

Content-Based Discovery of Twitter Influencers

Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Database Marketing, Business Intelligence and Knowledge Discovery

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study

Twitter Analytics: Architecture, Tools and Analysis

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Performance Metrics for Graph Mining Tasks

Final Project Report

Term extraction for user profiling: evaluation by the user

Handling the Complexity of RDF Data: Combining List and Graph Visualization

Implementing Graph Pattern Mining for Big Data in the Cloud

with your eyes: Considerations when visualizing information Joshua Mitchell & Melissa Rands, RISE

Context Aware Predictive Analytics: Motivation, Potential, Challenges

Twitter Data Analysis: Hadoop2 Map Reduce Framework

How Companies are! Using Spark

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval

AN EFFICIENT APPROACH TO PERFORM PRE-PROCESSING

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Text Clustering Using LucidWorks and Apache Mahout

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS.

Big Data Text Mining and Visualization. Anton Heijs

Hybrid model rating prediction with Linked Open Data for Recommender Systems

A Statistical Text Mining Method for Patent Analysis

Social Media Mining. Data Mining Essentials

Canonical Image Selection for Large-scale Flickr Photos using Hadoop

DYNAMIC QUERY FORMS WITH NoSQL

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

Keyphrase Extraction for Scholarly Big Data

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

Information Management course

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

Distributed Computing and Big Data: Hadoop and MapReduce

Contact Recommendations from Aggegrated On-Line Activity

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Client Overview. Engagement Situation. Key Requirements

Microblogging Queries on Graph Databases: An Introspection

Transcription:

Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping Martin Leginus 1, Leon Derczynski 2 and Peter Dolog 1 1 Department of Computer Science, Aalborg University Selma Lagerlofs Vej 300, 9200 Aalborg, Denmark 2 Department of Computer Science, University of Sheffield S1 4DP, United Kingdom 20.05.2015

Agenda Word clouds for social streams Entity redundancies and motivation Grouping entities Graph based word cloud generation Synthetic evaluation User study Discussions, Conclusions and Future work

Word clouds generated from social streams A visual retrieval interface depicting the most important terms of a dataset. Word clouds provide means to minimize an information overload when browsing social media. Leginus, M., Zhai, C., and Dolog, P. (2015). Personalized generation of word clouds from tweets. Journal of the Association for Information Science and Technology.

Word clouds generated from social streams Figure : The user interface of FeedWinnower system. Hong L., Convertino G., Suh B., Chi E. H, and Kairam S. Feedwinnower: layering structures over collections of information streams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 947-950. ACM, 2010

Word clouds generated from social streams Figure : Tweetmotif user interface for exploratory search. OConnor B., Krieger M., and Ahn D. Tweetmotif: Exploratory search and topic summarization for twitter. Proceedings of ICWSM, pages 2-3, 2010.

Word clouds generated from social streams Figure : Eddi - summarization interface of user timeline tweets. Bernstein, M. S., Suh, B., Hong, L., Chen, J., Kairam, S., and Chi, E. H. (2010). Eddi: interactive topic-based browsing of social status streams. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 303312. ACM.

Motivation Word clouds generated from terms often not meaningful - enrich with named entities. (clouds with entities perceived as more useful (Finn et.al 2010)) The football team Manchester United can be referred to as MUFC, Man U, Red Devils or the Reds. Redundancies decrease a quality of word clouds (decreased diversity, user confusion and limited browsing experience).

Goals 1 Condense divergent terms referring to the same entity. 2 Maximize a diversity of word clouds to provide a broad overview of topics. Research hypothesis: Word clouds with grouped named entities improve coverage, relevance and diversity.

The process of word clouds generation 1 Data collection 2 Data preprocessing 3 Word cloud generation

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. 3 Perform lemmatisation: 4 Using the aliases, build a term cluster for each entity 5 Find canonical names for entities.

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. (We employ a Freebase KB)

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. 3 Perform lemmatisation: Group together all the inflicted forms of a word to exploit only the base form of the term e.g., fan, fans. 4 Using the aliases, build a term cluster for each entity 5 Find canonical names for entities.

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. 3 Perform lemmatisation: 4 Using the aliases, build a term cluster for each entity e.g., mufc, manchester united, man united, red devils, devils 5 Find canonical names for entities.

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. 3 Perform lemmatisation: 4 Using the aliases, build a term cluster for each entity 5 Find canonical names for entities. Represent the cluster mufc, manchester united, man united, red devils, devils with Manchester United F.C.

Grouping named entities 1 Recognise named entities (NER) and disambiguate them (entity linking) using the TextRazor service 2 Find alternative names for the recognised entity. 3 Perform lemmatisation: 4 Using the aliases, build a term cluster for each entity 5 Find canonical names for entities. Proceed with word cloud generation.

Graph-based word cloud generation Terms extracted from tweets used for a graph creation. When two terms (vertices) co-occur at least α times, two directed edges are introduced t 1 t 2, t 2 t 1 Tweets are short, hence a parameter α is set to 0.

Graph-based ranking Stochastic traversal of terms graph estimates an importance of a term t. Iterative stationary probability is defined as: ( din (v) ) π(v) (i+1) = (1 β) p(v u)π (i) (u) u=1 + β p p (1) User preferences can be encoded into a vector of prior probabilities p p The resulting global rank of a term t after convergence is: Top-k ranked terms are then used for word cloud generation. I(t) = π(t) (2)

Divrank ranking Intention is to increase the diversity of ranking Transition probabilities change over time - rich gets richer principle Q. Mei, J. Guo, and D. Radev. Divrank: the interplay of prestige and diversity in information networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1009-1018. ACM, 2010.

Divrank ranking A transition probability from a node u to node v at time T is: ( ) p0 (v u) p T (v) p T (v u) = (1 β) + β p p v V p(v u) p T (v) Divrank algorithm is useful for incorrectly disambiguated entities e.g., BBC, BBC NEWS, BBC NEWS WORLD.

Evaluation metrics A term t links to tweets Tw t Tw tq is the set of all tweets that are associated with a query phrase t q Coverage indicates how many tweets are retrievable from the given word cloud. Coverage(WC k ) = t WC k Tw t, (3) Tw tq Overlap captures the extent of redundancies i.e., how many terms link to the same tweet. Overlap(WC k ) = avg ti t j Tw ti Tw tj min{ Tw ti, Tw tj }, (4)

Evaluation metrics I Mean Average Precision: 1 A word cloud is transformed into a query Q WCk. 2 Retrieve and rank tweets matching the query. 3 Measure Mean Average Precision (MAP). Ranking function is Okapi BM25: S(tw,Q WCk ) = c(q i,q WCk ) TF(q i,tw) IDF(q i ) (5) q i Q WCk tw The function c(q i,q WCk ) returns a weight of the term q i. Components TF(q i,tw) and IDF(q i ) are calculated in the standard way.

Evaluation dataset and word cloud generation methods TREC2011 microblogging collection with relevance judgements. Tweets rated as relevant or highly relevant are considered equally relevant. PgRankTerms (baseline) estimates a global importance of terms (extracted from tweets and transformed into a graph). MFE: selects top-k most popular recognized entities. MFEA: selects top-k most popular recognized entities grouped with their aliases. PgRankTermsEntities: ranks top-k phrases from the graph which contains extracted terms and grouped recognized entities.

Results Coverage 0.6 0.5 0.4 Coverage 0.3 10 20 30 40 50 # terms in the word cloud PgRankTerms PgRankTermsEntities MFE MFEA Relevance 0.15 0.1 0.05 Overlap 10 20 30 40 50 # terms in the word cloud MAP@30 0.24 0.23 0.22 0.21 0.2 0.19 MAP@30 0.18 10 20 30 40 50 # terms in the word cloud Figure : Word clouds with grouped entities attain higher Coverage and MAP.

Diversification Overlap 0.14 0.12 PgRankTerms PgRankTermsEntities DivRankTermEntities Overlap 0.1 0.08 0.06 0.04 0.02 10 20 30 40 50 # terms in the word cloud Figure : Divrank decreases redundancies and also improve Coverage wrt. to the baseline.

User study Are word clouds with named entities perceived as more relevant and diverse by the users? Do measured synthetic metrics correlate with the ratings of relevance and diversity by users?

Terms vs. Entity-grouped clouds 160 distinct relevance ratings, 89 positive towards word clouds with named entities, 27 neutral ratings and 44 for the baseline generated word clouds. Diversity ratings, 73 positive towards word clouds with named entities, 51 neutral ratings and 36 for the baseline generated word clouds. 40 40 % of ratings 30 20 10 % of ratings 30 20 10 0 1 2 3 4 5 Relevance ratings 0 1 2 3 4 5 Diversity ratings

User ratings vs. synthetic metrics % of ratings 40 30 20 10 Improved MAP % of ratings 40 30 20 10 Improved MAP % of ratings % of ratings 0 1 2 3 4 5 Relevance ratings Improved diversity ( Overlap) 40 30 20 10 0 1 2 3 4 5 Relevance ratings Decreased MAP and Overlap 40 30 20 10 % of ratings % of ratings 0 1 2 3 4 5 Diversity ratings Improved diversity ( Overlap) 40 30 20 10 0 1 2 3 4 5 Diversity ratings Decreased MAP and Overlap 40 30 20 10 0 1 2 3 4 5 Relevance ratings 0 1 2 3 4 5 Diversity ratings MAP metric predicts extrinsic human evaluations of cloud quality.

Discussions False positives from NER affect relevance ratings e.g., a cloud for Super Bowl, seats" contained Super (2010 American film)". Imprecise named entitiy disambiguation (e.g., BBC, BBC News, BBC News World) increases redundancies. Due to subjective nature of the crowdsourcing task, we disregarded a user qualifying phase.

Conclusions A technique that groups aliases of the same entity and represents them with a canonical term. Significantly decreased redundancy and significantly higher coverage than the baseline. User study supports that word clouds with grouped named entities are significantly more relevant and diverse than baseline. MAP metric predicts extrinsic human evaluations of cloud quality.