Network Big Data: Facing and Tackling the Complexities Xiaolong Jin



Network Big Data: Facing and Tackling the Complexities
Xiaolong Jin
CAS Key Laboratory of Network Data Science & Technology
Institute of Computing Technology, Chinese Academy of Sciences (CAS)
2015-08-10 @ Big Data in China Forum

Big Data is a Big Chance!
Sensing & Understanding the Present
  Monitoring development index (Price, Environment, Health)
  Crime Detection on the Web
  Social media data and log data up to PB size
  Structured and unstructured data
  Historical and real-time streaming data
  PB size Log Data
  EB size Monitoring Data
  Unpredictable criminal activity
Challenges: Large volume, variety, complex connections and heavy noise in the data make measurement fusion and pattern mining difficult.

Big Data is a Big Chance!
Analyzing and Predicting the Future
  Election prediction based on Twitter
  UN Global Pulse: predicting unemployment rate and disease
Challenges: Strong interaction, real-time and dynamic properties fragment the data life cycle, make it hard to balance latency and accuracy, and make trends difficult to predict.

Big Challenges in Big Data
System Complexity
  How to design a big data system? How to optimize it?
Computation Complexity
  Is that computable? How to compute towards Web scale data?
Data Complexity
  How to represent the data? How to measure the data complexity?
Big Data: Volume, Velocity, Variety, Veracity

Data Grave or Data Goldmine
Can we address these big scientific challenges in big data well?
  No? Data Grave
  Yes? Data Goldmine

Our Efforts on Handling Complexities
Data Complexity
  How to represent the data? Finding the semantic representations
Computation Complexity
  How to compute with Web scale data? Bringing order to big data
System
  ADA

Data Complexity
Finding Semantic Representations of Large Scale Web Documents
Topic Modeling
Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
  1. Discover the hidden themes that pervade the collection.
  2. Annotate the documents according to those themes.
  3. Use annotations to organize, summarize, and search the texts.

Topic Modeling Meets Big Data
Topic Modeling + Big Data: Feature Sparseness

Prevalent Short Texts
Uncovering the topics of short texts is crucial for a wide range of content analysis tasks.
Data source                  Average word count (stop words removed)
Weibo (Sina Weibo)           ~9
Questions (Baidu Zhidao)     ~6
Web page titles (logs)       ~5
Queries (query log)          ~3

The Limitations of Conventional Topic Models
Bag-of-words assumption
  Word occurrences play a less discriminative role in short texts
  Not enough word counts to know how words are related
The limited context in short texts
  It is more difficult to identify the senses of ambiguous words in short documents

Key Ideas of Our Approach
Topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents.
  Why not directly model the word co-occurrences for topic learning?
Topic models on short texts suffer from severely sparse patterns in short documents.
  Why not use the rich global word co-occurrence patterns to better reveal topics?

Biterm Topic Model (BTM)
Biterm: a pair of words co-occurring in a short text
  "visit apple store" -> "visit apple", "visit store", "apple store"
Model the generation of biterms with latent topic structures
  a topic ~ a probability distribution over words
  a corpus ~ a mixture of topics
  a biterm ~ two i.i.d. samples drawn from one topic
X. Yan, et al., ..., Biterm Topic Model for Short Text, WWW 2013
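The biterm definition above can be illustrated with a minimal Python sketch. It only shows how biterms are extracted from tokenized short texts and counted over a corpus; it is not the BTM inference procedure, and the function names `extract_biterms` and `corpus_biterm_counts` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of co-occurring words in one short text.

    Assumption: the whole short text is treated as a single context window.
    """
    return list(combinations(tokens, 2))


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus (hypothetical helper).

    Each pair is sorted so that ("apple", "visit") and ("visit", "apple")
    are counted as the same biterm.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tuple(sorted(pair)) for pair in extract_biterms(tokens))
    return counts


biterms = extract_biterms(["visit", "apple", "store"])
counts = corpus_biterm_counts([["visit", "apple", "store"], ["apple", "store"]])
```

Counting at the corpus level rather than per document is what lets the model draw on global co-occurrence patterns even when each text holds only a handful of words.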

Comparison between Different Models
LDA
  Document-level topic distribution: suffers from document sparsity
  Models the generation of each word: ignores context
Mixture of Unigrams
  Corpus-level topic distribution: alleviates document sparsity
  Single-topic assumption for each document: too strong an assumption
BTM
  Corpus-level topic distribution: alleviates sparsity
  Models the generation of word pairs: leverages context

Evaluation on Tweets
Dataset: Tweets2011
  Sample 50 hashtags with clear topics
  Extract tweets with these hashtags
Evaluation metric: H score
  IntraDis: average distance between docs under the same hashtag
  InterDis: average distance between docs under different hashtags
  The smaller the H score, the better the topic representation
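A rough Python sketch of how such a score could be computed follows. The slide does not give the formula, so two things here are assumptions: that the H score is the ratio IntraDis / InterDis (which is consistent with "smaller is better"), and that the distance between two documents is the Jensen-Shannon divergence between their topic distributions. The function names are hypothetical.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def h_score(clusters, dist=js_divergence):
    """Assumed definition: H = IntraDis / InterDis.

    clusters maps a hashtag to the topic distributions of its documents.
    """
    # IntraDis: mean pairwise distance inside each hashtag, averaged over hashtags.
    intra = []
    for docs in clusters.values():
        pairs = list(combinations(docs, 2))
        if pairs:
            intra.append(sum(dist(a, b) for a, b in pairs) / len(pairs))
    intra_dis = sum(intra) / len(intra)

    # InterDis: mean distance between documents of two different hashtags,
    # averaged over all hashtag pairs.
    inter = []
    for docs_a, docs_b in combinations(clusters.values(), 2):
        total = sum(dist(a, b) for a in docs_a for b in docs_b)
        inter.append(total / (len(docs_a) * len(docs_b)))
    inter_dis = sum(inter) / len(inter)

    return intra_dis / inter_dis


# Toy data: two hashtags, two documents each, two topics.
coherent = {"#a": [[0.9, 0.1], [0.8, 0.2]], "#b": [[0.1, 0.9], [0.2, 0.8]]}
mixed = {"#a": [[0.9, 0.1], [0.1, 0.9]], "#b": [[0.8, 0.2], [0.2, 0.8]]}
h_coherent = h_score(coherent)
h_mixed = h_score(mixed)
```

On the toy data, the representation that keeps same-hashtag documents close together receives the lower score.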

Evaluation on Baidu Zhidao
Dataset: Baidu Zhidao Q&A
  Question classification according to their tags

Online Algorithm for BTM
Batch algorithms are no longer suitable for topic learning in real-world applications.
  It is impractical to scan the whole dataset repeatedly because memory is limited.
  The model should stay up to date as new data arrive continuously.
Online BTM (oBTM)
  oBTM assumes documents are divided into time slices.
Incremental BTM (iBTM)
  iBTM updates the model continuously using an incremental Gibbs sampler.
X. Yan, et al., BTM: Topic Modeling over Short Texts, IEEE TKDE, 2014.

Computation Complexity
Bring Order to Big Data
Ranking is a central problem in many applications!
  Recommendation
  Web Search
  Information Filtering

Ranking Meets Big Data
The computation cost is high, especially because ranking is a more complex task!
We can save computation cost if we find the core data of the ranking problem.

Top-k Learning to Rank
Ranking with Big Data WSDM
  n is usually 10^5 ~ 10^6
  Users mainly care about top-k ranking, where k is usually 10-20
[Figure: multi-level ratings (1 2 3 4); full-order ground truth (1 ... n) with global learning; top-k ground truth (1 ... k) with local learning; global prediction.]
S. Niu, et al., Top-k Learning to Rank: Labeling, Ranking and Evaluation, SIGIR 2012; Best Student Paper Award
S. Niu, et al., Is Top-k Sufficient for Ranking?, CIKM 2013

Local = Global? Is Top-k Sufficient for Ranking? (empirical)
[Figure: NDCG@5, NDCG@10, NDCG@5-full and NDCG@10-full for RankNet and ListMLE on MQ2007list and MQ2008list; horizontal axis from 0 to 100.]

Local = Global? Is Top-k Sufficient for Ranking? (theoretical)
[Figure: Losses in Top-k Setting, Losses in Full-Order Setting, Weighted Kendall's Tau, IR Evaluation Measures (NDCG).]
Losses in a top-k setting are tighter bounds of 1 - NDCG than those in a full-order setting!
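For reference, the quantity 1 - NDCG that these losses bound can be computed as in the minimal sketch below. It assumes the common exponential-gain, log2-discount form of NDCG@k; the slides do not state which variant was used, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items.

    Assumption: gain 2^rel - 1 and discount log2(position + 1).
    """
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """NDCG@k of a ranked list, given the relevance label of each position."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0], k=3)   # labels already in the ideal order
reversed_ = ndcg_at_k([0, 1, 2, 3], k=3)  # worst ordering of the same labels
ranking_error = 1 - reversed_             # the quantity the surrogate losses bound
```

Because only the first k positions contribute, errors deep in the list do not change the measure, which matches the observation that users mainly care about the top of the ranking.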

Top-k Ranking Framework
Top-k Labeling
  An efficient labeling strategy to get top-k ground truth
Top-k Ranking
  More powerful ranking algorithms for the big data scenario
Top-k Evaluation
  New evaluation measures for the big data scenario

Top-k Ranking Algorithm: Hybrid FocusedRank
New characteristics of top-k ground truth
  Total ordering of the top k items -> listwise ranking algorithms (Struct-SVM, AdaRank, ListNet)
  Preferences between the top k items and the other n-k items -> pairwise ranking algorithms (RankSVM, RankBoost, RankNet)
Hybrid FocusedRank
  FocusedSVM, FocusedBoost, FocusedNet
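The two kinds of supervision contained in a top-k ground truth can be made concrete with a short Python sketch. This is not the Hybrid FocusedRank algorithm itself; it only shows the training signals that a listwise component and a pairwise component could consume, and the function name `topk_training_signals` is a hypothetical helper.

```python
def topk_training_signals(ranked_top_k, all_items):
    """Split a top-k ground truth into listwise and pairwise supervision.

    ranked_top_k: the k best items, best first (total order is known).
    all_items:    all n candidate items for the query.
    """
    top = list(ranked_top_k)
    top_set = set(top)
    rest = [item for item in all_items if item not in top_set]

    # Listwise signal: the full ordering among the top k items.
    listwise_order = top

    # Pairwise signal: each top-k item is preferred over each of the
    # remaining n-k items; the order among the remaining items is unknown.
    pairwise_prefs = [(better, worse) for better in top for worse in rest]
    return listwise_order, pairwise_prefs


order, prefs = topk_training_signals(["d3", "d1"], ["d1", "d2", "d3", "d4"])
```

With k top items out of n candidates this yields k * (n - k) preference pairs, far fewer than the pairs implied by a full-order ground truth when n is large.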

Experimental Results
[Figure: κ-NDCG@10 and κ-ERR on Top-10 MQ2007 and Top-10 TD2003.]
Performance comparison among Hybrid FocusedRank, pairwise and listwise algorithms on Top-k datasets.

Experimental Results (cont.)
[Figure: κ-NDCG@10 and κ-ERR of Top-k ListMLE, FocusedSVM, FocusedBoost and FocusedNet on Top-10 MQ2007 and Top-10 TD2003.]
Performance comparison between Hybrid FocusedRank and Top-k ListMLE on Top-k datasets.

ADA: A System for Big Data Analysis
Goal: To discover, query and infer the relationships between different objects
Multi-type objects: virtual persons, real persons, events, organizations, etc.
Relation discovery:
  People to People: social, interaction, co-occurrence, action
  People to Event: initiate, participate, involve
  Event to Event: causality, sequence, contain
  Other: People to Org, Event to Org, Org to Org
Relation query:
  Retrieve the relations between objects
Relation inference:
  Inference/prediction
  Reasoning
  Virtual identity recognition
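As a purely illustrative sketch of the "relation query" idea, typed objects and labeled relations can be held in a small in-memory structure. Nothing below reflects how ADA is actually implemented; the classes `Entity` and `RelationStore`, the entity names and the method names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A typed object, e.g. a person, an event or an organization."""
    name: str
    kind: str


class RelationStore:
    """Toy store of directed, labeled relations between entities (hypothetical)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._edges[source].add((relation, target))

    def relations_between(self, source, target):
        """Relation query: all relation labels from source to target."""
        return sorted(
            rel for rel, tgt in self._edges.get(source, ()) if tgt == target
        )


alice = Entity("Alice", "person")
rally = Entity("Rally", "event")
store = RelationStore()
store.add(alice, "initiate", rally)
store.add(alice, "participate", rally)
found = store.relations_between(alice, rally)
```

Relation discovery and inference would then amount to populating and extending such a structure from Web data, which is outside the scope of this sketch.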

Architecture of the ADA System

Case 1: Query the Real Persons on the Web

Case 2: Analyzing and Tracking Events

Summary
Three Scientific Challenges for Big Data
  Data complexity
  Computation complexity
  System complexity
Our research efforts on big data
  Finding the semantic representations
  Bringing order to big data
  A system for network big data analysis
    ADA: Discover, query and infer the relationships between objects

Thank you for your attention!