Network Big Data: Facing and Tackling the Complexities
  Xiaolong Jin
  CAS Key Laboratory of Network Data Science & Technology
  Institute of Computing Technology, Chinese Academy of Sciences (CAS)
  2015-08-10 @ Big Data in China Forum
Big Data is a Big Chance! Sensing & Understanding the Present
  Monitoring development indices (price, environment, health)
  Crime detection on the Web
  Social media data and log data up to PB size
  Structured and unstructured data
  Historical and real-time streaming data
  PB-size log data
  EB-size monitoring data
  Unpredictable criminal activity
  Challenges: Large volume, variety, complex connections, and heavy noise in the data make measurement fusion and pattern mining difficult.
Big Data is a Big Chance! Analyzing and Predicting the Future
  Election prediction based on Twitter
  UN Global Pulse: predicting unemployment rate and disease
  Challenges: Strong interaction, real-time demands, and dynamic properties fragment the life cycle of data, make it hard to balance latency against accuracy, and make trends difficult to predict.
Big Challenges in Big Data
  Big Data: Volume, Velocity, Variety, Veracity
  Data Complexity: How to represent the data? How to measure data complexity?
  Computation Complexity: Is it computable? How to compute with Web-scale data?
  System Complexity: How to design a big data system? How to optimize it?
Data Grave or Data Goldmine
  Can we address these big scientific challenges in big data well?
  No? Data Grave
  Yes? Data Goldmine
Our Efforts on Handling Complexities
  Data Complexity: How to represent the data? Finding the semantic representations
  Computation Complexity: How to compute with Web-scale data? Bringing order to big data
  System: ADA
Data Complexity: Finding Semantic Representations of Large-Scale Web Documents
  Topic Modeling
  Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
  1. Discover the hidden themes that pervade the collection.
  2. Annotate the documents according to those themes.
  3. Use annotations to organize, summarize, and search the texts.
Topic Modeling Meets Big Data
  Topic Modeling, Big Data: Feature Sparseness
Prevalent Short Texts
  Uncovering the topics of short texts is crucial for a wide range of content analysis tasks.
  Data              Source         Average Word Count (stop words removed)
  Weibo             Sina Weibo     ~9
  Questions         Baidu Zhidao   ~6
  Web page titles   Logs           ~5
  Query             Query log      ~3
The Limitations of Conventional Topic Models
  Bag-of-words assumption
    Word occurrences play a less discriminative role in short texts
    There are not enough word counts to learn how words are related
  Limited context in short texts
    It is more difficult to identify the senses of ambiguous words in short documents
Key Ideas of Our Approach
  Topics are basically groups of correlated words, and the correlation is revealed by word co-occurrence patterns in documents.
    Why not directly model the word co-occurrences for topic learning?
  Topic models on short texts suffer from severely sparse co-occurrence patterns within each short document.
    Why not use the rich global word co-occurrence patterns to better reveal topics?
Biterm Topic Model (BTM)
  Biterm: a pair of words that co-occur in a short text
    "visit apple store" -> "visit apple", "visit store", "apple store"
  Model the generation of biterms with latent topic structures
    a topic ~ a probability distribution over words
    a corpus ~ a mixture of topics
    a biterm ~ two i.i.d. samples drawn from one topic
  X. Yan, et al., Biterm Topic Model for Short Text, WWW 2013.
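The biterm definition above can be illustrated with a minimal sketch. The function name `extract_biterms` and the whitespace tokenizer are assumptions made for illustration; the slides do not specify tokenization, stop-word removal, or any context-window limit.

```python
from itertools import combinations


def extract_biterms(text):
    """Return every unordered word pair that co-occurs in one short text.

    Assumption: the whole short text is treated as a single context,
    and tokens are obtained by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    # combinations() keeps the original word order inside each pair
    return list(combinations(words, 2))


if __name__ == "__main__":
    print(extract_biterms("visit apple store"))
    # [('visit', 'apple'), ('visit', 'store'), ('apple', 'store')]
```

A text with a single word yields no biterms, and a repeated word would produce a pair of identical words; whether such pairs should be kept is left open here.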
Comparison between Different Models
  LDA
    Document-level topic distribution: suffers from the sparsity of the document
    Models the generation of each word: ignores context
  Mixture of Unigrams
    Corpus-level topic distribution: alleviates document sparsity
    Single-topic assumption for each document: too strong an assumption
  BTM
    Corpus-level topic distribution: alleviates the sparsity
    Models the generation of word pairs: leverages context
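To make the BTM entry concrete (a corpus-level topic mixture, with both words of a biterm drawn from the same topic), the probability of a biterm can be written as the sum over topics of P(z) * P(w_i | z) * P(w_j | z). The sketch below evaluates that sum; the names `theta` and `phi` and the toy numbers are illustrative assumptions, not values from the talk.

```python
def biterm_probability(theta, phi, w_i, w_j):
    """P(biterm) = sum over topics z of theta[z] * phi[z][w_i] * phi[z][w_j].

    theta: corpus-level topic proportions (list of floats summing to 1)
    phi:   one word distribution per topic (list of dicts word -> probability)
    """
    return sum(
        theta[z] * phi[z][w_i] * phi[z][w_j]
        for z in range(len(theta))
    )


if __name__ == "__main__":
    # Toy parameters, made up for illustration only
    theta = [0.5, 0.5]
    phi = [
        {"apple": 0.5, "store": 0.4, "pie": 0.1},
        {"apple": 0.3, "store": 0.1, "pie": 0.6},
    ]
    print(biterm_probability(theta, phi, "apple", "store"))
```

Because `theta` is shared by the whole corpus rather than estimated per document, a single short document does not need enough words of its own to support a topic distribution.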
Evaluation on Tweets
  Dataset: Tweets2011
    Sample 50 hashtags with clear topics
    Extract tweets with these hashtags
  Evaluation metric: H score
    IntraDis: average distance between docs under the same hashtag
    InterDis: average distance between docs under different hashtags
    The smaller the H score, the better the topic representation
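A minimal sketch of the H score follows. It assumes H is the ratio IntraDis / InterDis, which is consistent with "smaller is better" but is not spelled out on the slide. The Euclidean distance, the pooled averaging over all document pairs, and the names `h_score` and `docs_by_tag` are likewise assumptions for illustration.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Distance between two topic vectors (assumed metric, swap in any other)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def h_score(docs_by_tag, dist=euclidean):
    """Assumed definition: H = IntraDis / InterDis.

    docs_by_tag maps each hashtag to a list of document topic vectors.
    """
    # Distances between documents that share a hashtag
    intra = [
        dist(a, b)
        for docs in docs_by_tag.values()
        for a, b in combinations(docs, 2)
    ]
    # Distances between documents under different hashtags
    inter = [
        dist(a, b)
        for docs_a, docs_b in combinations(docs_by_tag.values(), 2)
        for a in docs_a
        for b in docs_b
    ]
    intra_dis = sum(intra) / len(intra)
    inter_dis = sum(inter) / len(inter)
    return intra_dis / inter_dis


if __name__ == "__main__":
    toy = {
        "#a": [[1.0, 0.0], [0.9, 0.1]],
        "#b": [[0.0, 1.0], [0.1, 0.9]],
    }
    print(h_score(toy))  # well below 1: same-tag docs are close together
```

The sketch needs at least two hashtags and at least one hashtag with two or more documents; otherwise one of the averages is undefined.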
Evaluation on Baidu Zhidao
  Dataset: Baidu Zhidao Q&A
  Task: classification of questions according to their tags
Online Algorithms for BTM
  Batch algorithms are no longer suitable for topic learning in real-world applications.
    It is impractical to scan the whole dataset repeatedly because of memory limits.
    The model should stay up to date as new data arrive continuously.
  Online BTM (oBTM)
    oBTM assumes documents are divided into time slices.
  Incremental BTM (iBTM)
    iBTM updates the model continuously using an incremental Gibbs sampler.
  X. Yan, et al., BTM: Topic Modeling over Short Texts, IEEE TKDE, 2014.
Computation Complexity: Bringing Order to Big Data
  Ranking is a central problem in many applications!
  Ranking: Recommendation, Web Search, Information Filtering
Ranking Meets Big Data
  The computation cost is high, especially because ranking is a more complex task!
  We can save computation cost if we find the core data of the ranking problem.
Top-k Learning to Rank
  Ranking with Big Data: n is usually 10^5 ~ 10^6
  Users mainly care about top-k ranking, where k is usually 10-20
  [Diagram labels: WSDM; multi-level ratings (1 2 3 4); full-order ground truth (1 2 3 4 5 ... n); top-k ground truth (1 2 3 4 ... k); global learning; global prediction; local learning]
  S. Niu, et al., Top-k Learning to Rank: Labeling, Ranking and Evaluation, SIGIR 2012 (Best Student Paper Award).
  S. Niu, et al., Is Top-k Sufficient for Ranking?, CIKM 2013.
Local = Global? Is Top-k Sufficient for Ranking? (empirical)
  [Figure: NDCG@5, NDCG@10, NDCG@5-full, and NDCG@10-full for RankNet and ListMLE on the MQ2007list and MQ2008list datasets; horizontal axis from 0 to 100.]
Local = Global? Is Top-k Sufficient for Ranking? (theoretical)
  Losses in the top-k setting
  Losses in the full-order setting
  Weighted Kendall's Tau
  IR evaluation measures (NDCG)
  Losses in a top-k setting are tighter bounds of 1 - NDCG than those in a full-order setting!
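Both the empirical and the theoretical comparison above are stated in terms of NDCG. As a reminder of what is being bounded, here is a minimal sketch of NDCG@k. It uses one common definition (gain 2^rel - 1, discount log2(rank + 1)); the slides do not say which variant was used, so the formula and the function names are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position is 0-based
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance labels listed in the order a model ranked the documents
    print(ndcg_at_k([3, 2, 1, 0], k=4))  # ideal ordering
    print(ndcg_at_k([0, 1, 2, 3], k=4))  # reversed ordering scores lower
```

The quantity 1 - NDCG is then the ranking error that the surrogate losses are compared against.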
Top-k Ranking Framework
  Top-k Labeling: an efficient labeling strategy to obtain top-k ground truth
  Top-k Ranking: more powerful ranking algorithms for the big data scenario
  Top-k Evaluation: new evaluation measures for the big data scenario
Top-k Ranking Algorithm: Hybrid FocusedRank
  New characteristics of top-k ground truth
    Total ordering of the top k items -> listwise ranking algorithms (Struct-SVM, AdaRank, ListNet)
    Preferences between the top k items and the other n-k items -> pairwise ranking algorithms (RankSVM, RankBoost, RankNet)
  Hybrid FocusedRank: FocusedSVM, FocusedBoost, FocusedNet
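The two kinds of supervision carried by top-k ground truth can be sketched as follows. This is not the FocusedRank loss itself, only an illustration of the inputs a hybrid listwise/pairwise method could consume; the function name `topk_training_signals` and the document IDs are hypothetical.

```python
def topk_training_signals(top_k, others):
    """Split top-k ground truth into listwise and pairwise supervision.

    top_k:  document IDs in ground-truth order, best first
    others: the remaining n-k document IDs, with no order among them
    """
    # Listwise part: the total order of the top-k items
    listwise_order = list(top_k)
    # Pairwise part: every top-k item is preferred to every other item
    pairwise_prefs = [(winner, loser) for winner in top_k for loser in others]
    return listwise_order, pairwise_prefs


if __name__ == "__main__":
    order, prefs = topk_training_signals(["d3", "d1"], ["d2", "d4", "d5"])
    print(order)       # ['d3', 'd1']
    print(len(prefs))  # 6 preference pairs: k * (n - k)
```

No preference is generated among the n-k remaining items, which reflects that top-k ground truth says nothing about their relative order.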
Experimental Results
  [Figure: κNDCG@10 and κERR on the Top-10 MQ2007 and Top-10 TD2003 datasets.]
  Performance comparison among Hybrid FocusedRank, pairwise, and listwise algorithms on top-k datasets.
Experimental Results (cont.)
  [Figure: κNDCG@10 and κERR of Top-k ListMLE, FocusedSVM, FocusedBoost, and FocusedNet on the Top-10 MQ2007 and Top-10 TD2003 datasets.]
  Performance comparison between Hybrid FocusedRank and Top-k ListMLE on top-k datasets.
ADA: A System for Big Data Analysis
  Goal: to discover, query, and infer the relationships between different objects
  Multi-type objects: virtual persons, real persons, events, organizations, etc.
  Relation discovery
    People to People: social, interaction, co-occurrence, action
    People to Event: initiate, participate, involve
    Event to Event: causality, sequence, contain
    Other: People to Org, Event to Org, Org to Org
  Relation query: retrieve the relations between objects
  Relation inference
    Inference/prediction
    Reasoning
    Virtual identity recognition
Architecture of the ADA System
Case 1: Query the Real Persons on the Web
Case 2: Analyzing and Tracking Events
Summary
  Three scientific challenges for big data
    Data complexity
    Computation complexity
    System complexity
  Our research efforts on big data
    Finding the semantic representations
    Bringing order to big data
    A system for network big data analysis
      ADA: discover, query, and infer the relationships between objects
Thank you for your attention!