Research Statement

Feida ZHU
School of Information Systems, Singapore Management University
Tel: (65) 6808-5101; Email: fdzhu@smu.edu.sg
30 April 2013

Introduction

The past decade has seen an unprecedented explosion of data in almost all areas of our lives, from the boom of online social networks drawing hundreds of millions of users to highly accurate GPS systems tracking every move of the mobile devices that carry them. The concept of Big Data has never attracted more attention from the research community, and its importance grows more palpable each day. Yet, for all the wonders it could make happen, Big Data also poses serious research challenges for mining and analysis tasks. My central research theme has therefore been Big Data Mining and Analytics.

The challenge of Big Data, in my understanding, is best characterized by four V's: Volume, Velocity, Variety and Value, as shown in Figure 1. These four V's also serve as a good map of my current and near-future research, which I present one by one below. The settings center on network and social media data, as social networks have been the main data source for my research over the past few years. However, all the results apply equally to other data settings of a similar nature.

[Figure 1: Big Data and its four V's: Volume, Velocity, Variety and Value]

The Four Dimensions of the Big Data Challenge

(1) Volume --- taming data of societal scale

The most noticeable feature of big data is its sheer volume, which is often of societal scale. Mining and analysis on such data become extremely difficult even for simple tasks like frequent pattern discovery. My research along this dimension has focused on a fundamental problem in data mining, the constrained frequent pattern mining problem, particularly on graph/network data, which is the main data representation for social networks and also the most challenging setting compared with item-sets and sequences. Frequent patterns have proved extremely powerful in a wide range of network analysis tasks, including network clustering, classification, community detection and evolution. To add to the complexity, the mining task often comes with user-specified constraints on the pattern result. My research in this dimension can be further grouped into the following three topics.

1.1 Constrained Frequent Pattern Mining For Large Graph/Networks

To use frequent patterns for various knowledge discovery tasks, one must first be able to find the set of frequent patterns in the given data. My research on constrained frequent pattern mining started with my two Best Student Paper Award papers [ICDE'07][PAKDD'07] during my PhD study, in which I proposed a novel randomized mining framework to find colossal frequent patterns in transaction data and a comprehensive constraint-pushing mining framework for graph data.

Frequent pattern mining in the graph setting is notoriously hard, especially at today's network scale. Most work on graph mining has focused on the graph transaction setting, where the input data is a large collection of small graphs. However, the social network applications of today all present us with large single graphs. It has been shown that frequent pattern mining in the single-network setting is much more challenging than its counterpart in the transaction setting, because embeddings can overlap and support computation is accordingly much trickier. My VLDB 2011 paper, "Mining Top-K Large Structural Patterns in Massive Networks" [VLDB'11], was the first work able to find large patterns in massive graph data. We developed a novel concept called the r-spider and a corresponding algorithm called SpiderMine, which uses small spider-shaped frequent patterns to find the top-k large patterns probabilistically within any user-specified error bound. This work gives users, for the first time, the capacity to reach and study the largest frequent patterns in big graph data within a reasonable amount of time.

With the boom of mobile social data and research on information diffusion, another kind of constrained pattern has found important applications: skinny patterns, which are graph patterns with a long backbone from which short twigs branch out. The long backbone has the descriptive power to represent spatial and temporal trajectories in heterogeneous information networks, and the short twigs represent the various kinds of associated information. My work in [SIGMOD'13] proposed a new direct mining paradigm for efficient constrained frequent graph mining, in which frequent patterns with certain structural constraints are generated directly with minimum redundancy, something impossible with traditional mining methodology, where patterns are grown in order of increasing size.

The research agenda in this direction is to systematically explore and tackle the challenges posed by the constrained pattern mining problem for large networks such as those ubiquitous in our daily life. I have a forthcoming book chapter, "Mining Constrained Graph Patterns", to be published by Springer later this year, which will be a good summary of my work in this direction.

1.2 Collaborative Pattern Mining In Distributed Environment

Due to the remarkable size of network data, many of these networks are not stored in a centralized fashion. Different parts of a network could be stored in different data centers around the world, or across a machine farm. All existing mining algorithms assume centralized storage of the entire graph and are therefore powerless in such a distributed environment. Moreover, one way to handle a huge single network could be to first partition the data carefully and then mine the partitions collaboratively. Under this new setting, even the most classic problems in graph mining become fresh and interestingly challenging. This is a whole new direction with little published work. Much foundational work remains to be laid and many directions remain to be charted. My research agenda is to develop efficient algorithms for the fundamental mining problems in this setting and make them work on the societal-scale social network data we have here.

1.3 Sampling and Summarization For Large Networks

The size of today's social networks makes it impossible for a human to comprehend one visually as a whole. Some summarization of the original network becomes necessary for visualizing mining results or navigating the network. On the other hand, sampling of the entire network is also essential, as it is often unrealistic to obtain the whole network. My research agenda here is to examine the principles and algorithms of effective and efficient sampling methods to facilitate our data acquisition, and to find intuitive, informative and interesting ways to summarize large network data such as our Twitter data set.
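
To make the support-computation difficulty noted in Section 1.1 concrete, the following Python sketch counts the embeddings of two tiny patterns in a star-shaped single graph. It is a brute-force illustration under stated assumptions, not the method of the cited papers: the function names, the toy graph and the choice of minimum image-based support (one measure commonly used in the single-graph literature) are all assumptions introduced here.

```python
from itertools import permutations


def embeddings(pattern_labels, pattern_edges, graph_labels, graph_edges):
    """Enumerate all injective, label- and edge-preserving mappings from
    pattern nodes to graph nodes by brute force (tiny inputs only)."""
    graph_edge_set = {frozenset(edge) for edge in graph_edges}
    pattern_nodes = list(pattern_labels)
    found = []
    for image in permutations(graph_labels, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        labels_match = all(
            pattern_labels[p] == graph_labels[g] for p, g in mapping.items()
        )
        edges_match = all(
            frozenset((mapping[u], mapping[v])) in graph_edge_set
            for u, v in pattern_edges
        )
        if labels_match and edges_match:
            found.append(mapping)
    return found


def mni_support(pattern_labels, embs):
    """Minimum image-based support: the smallest number of distinct graph
    nodes that any single pattern node is mapped to across all embeddings."""
    if not embs:
        return 0
    return min(len({m[p] for m in embs}) for p in pattern_labels)


# Hypothetical toy data: one center node labeled "A" with three "B" neighbors.
graph_labels = {"c": "A", "l1": "B", "l2": "B", "l3": "B"}
graph_edges = [("c", "l1"), ("c", "l2"), ("c", "l3")]

# Pattern 1: a single "A" node. Pattern 2: an "A"-"B" edge (a super-pattern).
node_labels, node_edges = {"x": "A"}, []
edge_labels, edge_edges = {"x": "A", "y": "B"}, [("x", "y")]

node_embs = embeddings(node_labels, node_edges, graph_labels, graph_edges)
edge_embs = embeddings(edge_labels, edge_edges, graph_labels, graph_edges)

print(len(node_embs), len(edge_embs))          # raw embedding counts: 1 3
print(mni_support(node_labels, node_embs),     # image-based support: 1
      mni_support(edge_labels, edge_embs))     # image-based support: 1
```

In this toy graph the one-node pattern has a single embedding while the larger edge pattern has three, all overlapping at the center node. A raw embedding count can therefore grow as the pattern grows, so it cannot drive the usual support-based pruning, whereas the image-based measure stays at 1 for both patterns.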

(2) Velocity --- conducting real-time analysis in huge-volume data flow

Perhaps the most important and unique feature of social media, compared with traditional news media, is the real-time responsiveness of the data. For example, it has been observed that, in life-critical disasters of societal scale, Twitter is the most important and timely source from which people find out about and track breaking news, before any mainstream media outlet picks it up and rebroadcasts the footage. Consequently, it is essential that we be able to conduct mining and analysis on the huge-volume data flow in real time.

One important topic in social media study is bursty topics, which capture social events attracting population-wide attention. Our work in [ACL'12] proposed the first algorithm to find such topics from Twitter in an offline fashion. To achieve real-time responsiveness, our work published at KDD'13 proposed a novel mining framework called TopicSketch, which is able to detect bursty topics earlier than traditional news media and can potentially handle hundreds of millions of tweets per day, close to the total number of daily tweets on Twitter. One example of the bursty topics detected from our data is illustrated in the following figure. To the best of our knowledge, this is the first work to achieve real-time detection on social media at the scale of Twitter.

Future work includes incorporating community-awareness and information diffusion structure into the detection algorithm, so that bursty events of different kinds can be distinguished and their potential virality predicted. Other real-time mining and analysis tasks, such as frequent pattern mining and outlier detection, will also be studied as part of the research in this dimension to handle the velocity of big data.

(3) Variety --- understanding data of high heterogeneity

The challenge of big data also comes from the fact that the data is usually highly heterogeneous, i.e., of different formats and types and from different sources. For example, even for the same user, we have text data from his tweets and reviews, multimedia data such as images from his Instagram account and videos from YouTube, trajectory and location data from his mobile devices, and so on. The analytical capacity to integrate, understand and leverage such highly heterogeneous data is immensely important. The key is to find a connecting ingredient or a unifying model to achieve effective integration. My approach in this dimension so far has been to use what I deem the most characteristic feature of social media data, user behavior, as the glue that ties things together. Our tutorial at DASFAA'13, titled "Behavior Driven Social Network Mining and Analysis", gives a selected summary of our recent research along this line. In particular, we pushed the user behavior element into the following three mining tasks and produced interesting results that are otherwise unobtainable.

(1) Behavior-driven Topic Modeling. We proposed in [SDM'13] a B-LDA model that incorporates user behavior into LDA topic modeling to better capture user interactions, which are critically important for topic analysis, user clustering and followee recommendation on social micro-blogging services such as Twitter.

(2) Behavior-driven Anomaly Detection. We used group-level user behavior to characterize anomaly collections and identified spammer groups that are hard to catch with the traditional point-anomaly framework [SDM'12, CIKM'12]. We also used collective user rating behavior to model anomalous users and products in online review settings and proposed a unifying framework based on mutual dependency principles [ICDM'12]. Extensions of these pieces of work have been submitted to DMKD and TKDE.

(3) Behavior-driven Relationship Mining. We studied the user follow links in the Twitter network and developed a novel algorithm which, based on this piece of information alone, is able to identify with high accuracy the offline real-life friends of a target user [WebSci'12]. This work has profound potential impact, as elaborated in the next part. We also studied user follow linkage to dynamically propagate user attribute/relationship labels with user input [DASFAA'13]. In another work, published at [SocInfo'13], we revisited the user ranking problem on social networks and examined it from the user interaction perspective, providing a new angle on the problem based on the interplay between information and interaction.

(4) Value --- translating data analytical results into real-world impact

This dimension of the Big Data challenge has not yet been well explored. In the online social media setting, the central question to ask is: how do all the analytical results about online social data impact our offline real life?
For example, all the research findings on social influence would remain inconsequential if we were not able to establish the linkage between the online and offline worlds. My research agenda here is to fill this gap and establish the connection. As a first effort toward this Holy Grail, we proposed [WebSci'12] a novel algorithm to distinguish a user's online and offline friends from her Twitter follow network, as illustrated in the figure on the right. This work provides a foundation for many exciting applications and future work, including robust user modeling, business competitive analysis, user profile matching and spammer detection. Building on it, our next work [DASFAA'13] dynamically propagates user attribute labels in the relationship network. The corresponding demo system won the Best Demo Award (Runner-Up) at DASFAA'13.

A fundamental task in bridging the online and offline worlds is to integrate the various aspects of information about the same user across different platforms. The problem has a profound impact on user modeling and business intelligence and has begun to attract a great deal of research interest from the community. We provide the first solution that uses the whole range of user data; the result will be published at SIGMOD'14.

Conclusion

My research agenda over the past few years and in the near future focuses on the Big Data challenge, in particular along the four dimensions of Volume, Velocity, Variety and Value, with an emphasis on graph/network data. Besides this main theme, I have also been working on other data mining applications, including program parameter tuning [CoCoMile'12, LION'13], churn prediction [ASONAM'12], game strategy mining [CIG'12] and network experimentation [ICWSM'13].

References

1. "A Direct Mining Approach To Efficient Constrained Graph Pattern Discovery", by Feida ZHU, Zequn ZHANG, Qiang QU, 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD'13), New York, USA, June 2013.
2. "Reviving Dormant Ties in an Online Social Network Experiment", by Ee-Peng LIM, Denzil CORRERA, David LO, Michael FINEGOLD, Feida ZHU, The 7th International AAAI Conference on Weblogs and Social Media (ICWSM'13), Boston, USA, July 2013.
3. "It Is Not Just What We Say, But How We Say Them: LDA-based Behavior-Topic Model", by Minghui QIU, Feida ZHU, and Jing JIANG, 2013 SIAM International Conference on Data Mining (SDM'13), Austin, Texas, USA, May 2013.
4. "TwiCube: A Real-time Twitter Online Community Analysis Tool", by Juan DU, Wei XIE, Cheng LI, Feida ZHU, and Ee Peng LIM, The 18th International Conference on Database Systems for Advanced Applications (DASFAA'13), Wuhan, China, April 2013.
5. "Dynamic Label Propagation in Social Networks", by Juan DU, Feida ZHU, and Ee Peng LIM, The 18th International Conference on Database Systems for Advanced Applications (DASFAA'13), Wuhan, China, April 2013.
6. "Automated Parameter Tuning Framework for Heterogeneous and Large Instances: Case study in Quadratic Assignment Problem", by LINDAWATI, Zhi YUAN, Hoong Chuin LAU, and Feida ZHU, Learning and Intelligent OptimizatioN Conference (LION'13), Catania, Italy, January 2013.
7. "A Survey of Recommender Systems in Twitter", by Su Mon KYWE, Ee Peng LIM, and Feida ZHU, International Conference on Social Informatics (SocInfo'12), Lausanne, Switzerland, December 2012.
8. "On Recommending Hashtags in Twitter Networks", by Su Mon KYWE, Tuan Anh HOANG, Ee Peng LIM, and Feida ZHU, International Conference on Social Informatics (SocInfo'12), Lausanne, Switzerland, December 2012.
9. "Detecting Anomalies in Bipartite Graphs with Mutual Dependency Principles", by Hanbo DAI, Feida ZHU, Ee Peng LIM, and Hwee Hwa PANG, The 12th IEEE International Conference on Data Mining (ICDM'12), Brussels, Belgium, December 2012.
10. "Impact of Multimedia in Sina Weibo: Popularity and Life Span", by Xun ZHAO, Feida ZHU, Weining QIAN, and Aoying ZHOU, The Joint Conference of the Sixth Chinese Semantic Web Symposium and the First Chinese Web Science Conference (CSWS & CWSC'12), Shenzhen, China, November 2012.
11. "Mining Coherent Anomaly Collections On Web Data", by Hanbo DAI, Feida ZHU, Ee Peng LIM, and Hwee Hwa PANG, The 21st Int. Conf. on Information and Knowledge Management (CIKM'12), Hawaii, USA, October 2012.

12. "In-Game Action List Segmentation and Labeling in Real-Time Strategy Games", by Wei GONG, Ee Peng LIM, Feida ZHU, Achananuparp PALAKORN, David LO, and Chong Tat Freddy CHUA, The 8th IEEE Conference on Computational Intelligence and Games (CIG'12), Granada, Spain, September 2012.
13. "Follow Link Seeking Strategy: A Pattern Based Approach", by Agus Trisnajaya KWEE, Ee Peng LIM, Achananuparp PALAKORN, and Feida ZHU, The 6th ACM Workshop on Social Network Mining and Analysis (SNAKDD'12), Beijing, China, August 2012.
14. "Collective Churn Prediction in Social Network", by Jayadi Oentaryo RICHARD, Ee Peng LIM, David LO, Feida ZHU, and Philips Kokoh PRASETYO, Proc. of the 4th Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM'12), Istanbul, Turkey, August 2012.
15. "Instance-specific Parameter Tuning via Constraint-based Clustering", by Lindawati LINDAWATI, Hoong Chuin LAU, and Feida ZHU, Proc. of the 1st Int. Workshop on Combining COnstraint solving with MIning and LEarning (CoCoMile'12), joint with ECAI 2012, Montpellier, France, August 2012.
16. "Finding Bursty Topics From Microblogs", by Qiming DIAO, Jing JIANG, Feida ZHU, and Ee Peng LIM, 50th Annual Meeting of the Association for Computational Linguistics (ACL'12), pp. 536-544, Jeju Island, Korea, July 2012.
17. "Detecting Anomalous Twitter Users by Extreme Group Behaviors", by Hanbo DAI, Ee Peng LIM, Feida ZHU, and Hwee Hwa PANG, Proc. of the 2012 ACM Int. Conf. on Net Science (NetSci'12), Chicago, Illinois, USA, July 2012.
18. "Detecting Extreme Rank Anomalous Collections", by Hanbo DAI, Feida ZHU, Ee Peng LIM, and Hwee Hwa PANG, SIAM International Conference on Data Mining (SDM'12), Anaheim, California, USA, April 2012.
19. "When a Friend in Twitter is a Friend in Life", by Wei XIE, Cheng LI, Feida ZHU, Ee Peng LIM, and Xueqing GONG, The 4th ACM Int. Conf. on Web Science (WebSci'12), Chicago, Illinois, USA, April 2012.
20. "Mining Top-K Large Structural Patterns In Massive Networks", by Feida Zhu, Qiang Qu, David Lo, Xifeng Yan, Jiawei Han and Philip Yu, in Proc. 2011 Int. Conf. on Very Large Data Bases (VLDB'11), USA, August 2011.
21. "Mining Diversity On Networks", by Liu Lu, Feida Zhu, Chen Chen, Xifeng Yan, Jiawei Han, Philip S Yu, and Shiqiang Yang, in Proc. 2010 Int. Conf. on Database Systems for Advanced Applications (DASFAA'10), Japan, April 2010.
22. "Efficient Topological OLAP on Information Networks", by Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip Yu and Hongyan Li, in Proc. 2011 Int. Conf. on Database Systems for Advanced Applications (DASFAA'11), Hong Kong, April 2011.
23. "Top-K Aggregation Queries over Large Networks", by Xifeng Yan, Bin He, Feida Zhu, and Jiawei Han, in Proc. 2010 International Conference on Data Engineering (ICDE'10), USA, March 2010.
24. "gPrune: A Constraint Pushing Framework for Graph Pattern Mining", by Feida Zhu, Xifeng Yan, Jiawei Han, and Philip S Yu, Proc. of the 11th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'07), Nanjing, China, May 2007.

25. "Mining Colossal Frequent Patterns by Core Pattern Fusion", by Feida Zhu, Xifeng Yan, Jiawei Han, Philip S Yu, and Hong Cheng, Proc. of the 23rd Int. Conf. on Data Engineering (ICDE'07), Istanbul, Turkey, April 2007.