
Research Summary
Tao Li
March 31, 2013

Research Overview

I have been pursuing and will continue to pursue a competitive research agenda. My research explores two related topics in mining data: how to efficiently discover useful patterns and how to effectively retrieve information. My interests lie broadly in data mining and information retrieval, covering both algorithmic and application issues. I focus strongly on research challenges grounded in real-world problems, and work to validate my research in this context.

Major Research Achievements

Contents
1 Matrix-based Algorithms for Data Mining
2 General Data Mining Methods
3 System Log and Event Mining
4 Text Data Mining
5 Music Data Mining
6 Malware Detection
7 Applications
7.1 Disaster Management
7.2 Recommendation Systems
7.3 Database Exploration and Navigation
7.4 Business Intelligence
7.5 Bioinformatics

1 Matrix-based Algorithms for Data Mining

Matrix-based methodologies are rapidly becoming a significant part of data mining. I have been pursuing a strong research agenda in matrix-based data mining algorithms and applications and am at the forefront of this area. My accomplishments are summarized below.

I presented a general framework for clustering based on matrix computation and characterized different clustering methods within the framework. My recent studies demonstrate the clustering capability of non-negative matrix factorization (NMF) and establish the theoretical foundation for using NMF to solve unsupervised learning problems: NMF with the sum-of-squared-errors cost function is equivalent to a relaxed K-means clustering, the most widely used unsupervised learning algorithm; NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing, another unsupervised learning method widely used in text analysis. I introduced several important variants of NMF, including Tri-factor NMF, Semi-NMF, and Convex NMF, to enhance clustering capability and improve the interpretability of clustering results, and developed a series of computational algorithms for NMF-type factorization with correctness and convergence analysis. I also extended NMF to solve many other data mining problems, including semi-supervised clustering, consensus clustering, dimensionality reduction, and clique finding. Much of this work was performed in collaboration with Dr. Chris Ding (University of Texas at Arlington) and Dr. Michael I. Jordan (University of California at Berkeley).

2 General Data Mining Methods

I have also developed novel techniques for performing general data mining tasks under different scenarios.
These techniques include 1) adaptive dimension reduction methods that combine linear discriminant analysis (LDA) and K-means clustering into a coherent framework to adaptively select the most discriminative subspace; 2) novel clustering and semi-supervised learning methods that draw on multiple information sources; 3) wavelet-based methods for general data mining and data pre-processing; 4) a general framework for combining hierarchical clustering and partitional clustering, along with algorithms for semi-supervised hierarchical clustering; and 5) a general framework for generating multiple clustering views by combining meta-clustering with consensus clustering.

3 System Log and Event Mining

Many systems, from computing systems and physical systems to business systems and social systems, are observable only indirectly through the events they emit. Events are naturally temporal and are stored as logs, e.g., computer system logs, HTTP requests, database queries, and network traffic data. These events capture system states and activities over time. To understand system behaviors and dynamics, one has to mine event logs to uncover patterns. In collaboration with IBM Research and Xerox Research, I have developed an integrated framework, along with a series of algorithms and tools, for mining log data for computing system and service management, and I am one of the leaders in this direction. Specifically, I have conducted research on the following themes: (i) designing and developing log organization methods that transform log data in disparate formats and contents into a canonical form and extract system events; (ii) designing and developing methods, including temporal dependency pattern mining and event summarization, for data-driven pattern discovery and problem determination; and (iii) designing and developing tools and methods to bridge the gap between applications and intelligent algorithms. Our recent work on optimizing monitoring situations based on ticket resolutions, which reduces the number of non-actionable tickets, has been incorporated into IBM Tivoli Monitoring System products.

4 Text Data Mining

The explosive growth in the volume and complexity of textual data (e.g., news, emails, blogs, web pages, Twitter streams) causes a data overload problem. It has become necessary to understand documents semantically and deliver meaningful information to users. My research goal in text mining is to help users better understand and utilize large real-world document collections via document clustering, categorization, and summarization. More specifically, this research theme has four closely related dimensions:

1) Summarization: Given a collection of documents, how can we generate a concise yet comprehensive summary that presents the information organized around key aspects or topics?
2) Categorization and Clustering: Given a collection of documents, how can we organize them into a list of meaningful categories?
3) Evolution: Given a collection of documents from diverse sources, or documents reporting on temporal events, how can we characterize the differences or the evolution?
4) Applications/Systems: How can we build useful applications and systems?
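
As a rough illustration of the clustering question above, the following sketch factors a small term-by-document matrix with NMF under the squared-error cost and assigns each document to its dominant factor. It is not the algorithm from the work summarized here: it uses the standard multiplicative update rules, and the function name nmf_cluster, the seeding heuristic, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=300, eps=1e-9):
    """Factor a non-negative terms-by-documents matrix X ~ W @ H and
    label each document by its largest coefficient in H."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]

    # Deterministic seeding (an illustrative choice, not from the source):
    # start from the largest-norm document, then repeatedly add the
    # document least similar (by cosine) to the seeds chosen so far.
    norms = np.linalg.norm(X, axis=0) + eps
    unit = X / norms
    seeds = [int(np.argmax(norms))]
    while len(seeds) < k:
        sim = (unit.T @ unit[:, seeds]).max(axis=1)
        seeds.append(int(np.argmin(sim)))
    W = X[:, seeds] + 0.1      # small offset keeps every entry positive
    H = np.ones((k, n_docs))

    # Multiplicative updates for the objective ||X - W H||_F^2;
    # they preserve the non-negativity of W and H.
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)

    return H.argmax(axis=0), W, H

# Toy word-document matrix (rows: terms, columns: documents).
# Documents 0-2 use only the first three terms, documents 3-5 the last three.
X = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 1, 3, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 4, 5, 2],
    [0, 0, 0, 1, 1, 1],
])
labels, W, H = nmf_cluster(X, k=2)
print(labels)  # one cluster index per document
```

The columns of W can be read as term profiles for each cluster and the columns of H as soft memberships, which is one reason NMF-style factorizations are attractive when clusters need to be interpreted.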
My students and I have developed effective data mining algorithms for 1) clustering high-dimensional documents by exploiting the dual relationship in the word-document matrix; 2) hierarchically categorizing documents with a large number of classes; 3) summarizing documents in multiple scenarios to reflect the major or topic-relevant information contained in the document collection; 4) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretations; and 5) summarizing the differences between, and the evolution of, different document sources. We have also developed two software systems: a) ihelp, an intelligent online helpdesk system that automatically finds problem-solution patterns in past customer-representative interactions; and b) Sumview, a web-based review summarization system that effectively and efficiently extracts the most representative sentences in reviews of various product features.

5 Music Data Mining

The rapid growth of the Internet and advances in Internet technologies have given users access to large amounts of online music data. The multifaceted nature of music information provides a wealth of opportunities for mining useful information and using it to create novel ways of interacting with large music collections. In collaboration with Dr. Mitsunori Ogihara (University of Miami), I have been developing advanced data mining techniques in the context of music processing. We have been awarded a patent on using wavelet histograms for music feature representation and are one of the leading groups in content-based music genre classification. We were among the first to develop data mining methods for computational music emotion detection, and many papers have followed our seminal work. We have also developed novel data mining methods that integrate features from different sources (e.g., acoustic signals, lyrics, metadata, and user tags) for music information retrieval.

6 Malware Detection

Because of the damage it does to computer security, malware (such as viruses and Trojan horses) has held the attention of computer security researchers for decades. Malware's trend toward stealth has motivated much research in intelligent malware analysis, where data mining techniques are used to deal with obfuscation. Through long-standing collaborations with the anti-virus laboratory of KingSoft Corporation and with Comodo Corporation, we have developed 1) a series of data mining techniques for building automatic malware detection systems; 2) an ensemble classification framework that combines heterogeneous base-level classifiers; 3) a principled cluster ensemble framework that combines individual clustering solutions for malware categorization; and 4) a semiparametric classifier model that combines file content and file relations for malware detection. Our research has had important theoretical and practical impact, and many related systems have been incorporated into popular industrial products. For example, our cluster-ensemble-based malware detection scheme has been used in the Comodo file verdict service (see: http://valkyrie.comodo.com).

7 Applications

One of the important characteristics of data mining research is the combination of theory and applications. I focus strongly on research challenges grounded in real-world problems, and work to validate my research in this context. I am always looking for novel applications where data mining methods can help.

7.1 Disaster Management

Over the last six years, in collaboration with Dr. Shu-ching Chen and Mr. Steve Luis, I have been developing data-driven techniques for disaster management. We have been working with government and industry partners to build community-driven services and tools for information sharing and exchange during and after a hurricane recovery period. The systems we developed create collaborative solutions, using advanced data mining and information retrieval techniques, to help impacted communities better understand the current disaster situation and how the community is recovering. The systems also help professional organizations such as Chambers of Commerce assist their members, and help government agencies assess damage and prioritize recovery needs. Our work has been recognized by the FEMA (Federal Emergency Management Agency) Private Sector Office as a model for Public-Private Partnerships (see: under Miami-Dade County).

7.2 Recommendation Systems

Recommendation systems have gained increasing attention in recent years. My students and I have developed effective techniques for personalized news recommendation and for reciprocal recommendation. Unlike traditional recommendation systems, news recommendation systems need effective news representation and processing, while reciprocal recommender systems involve two different parties and must account for the preferences of both parties simultaneously. The recommendation techniques we developed have been deployed in Xiamen Rencai Network (XMRC) for Xiamen Talent Service Center.

7.3 Database Exploration and Navigation

My research on database exploration and navigation aims to use data mining and information retrieval techniques to help users quickly find useful information in the large volume of data stored in many different databases. In collaboration with Dr. Zhiyuan Chen (University of Maryland Baltimore County), I have developed several approaches to organize and rank SQL query results and to dynamically generate query forms. The novelty of our approaches is twofold: (1) they take into account the diversity of user preferences, and (2) they use probabilistic models and are robust to poor-quality data.

7.4 Business Intelligence

The leading indicators within Business Intelligence (BI) systems are one type of Key Performance Indicator (KPI); they present key drivers of business value and offer the organization a unique opportunity to effect positive change. My students and I have developed a semi-automatic system that analyzes operational metrics, factors out the KPIs, and then discovers leading indicators using data mining techniques that incorporate domain knowledge.
We have also developed a system for collecting and analyzing customer feedback or requirements (e.g., voice of the customer) with the purpose of extracting useful knowledge nuggets and turning them into desired features in products, solutions, or services.

7.5 Bioinformatics

I have been developing computational tools for microarray data analysis and protein-protein interaction inference. I have performed a comprehensive comparative study of feature selection methods and state-of-the-art classification methods on various multi-class gene expression datasets, and proposed uncorrelated discriminant analysis for multi-class expression data classification. I have also developed several gene selection algorithms to identify marker genes from microarray data with different characteristics, and studied the problem of gene functional classification from multiple information sources.
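
To make the idea of gene selection more concrete, here is a minimal sketch of a generic filter-style baseline: each gene is scored with a one-way ANOVA F-statistic across classes, and the top-scoring genes are kept as candidate markers. This is a textbook baseline chosen for illustration, not one of the algorithms described above; the function names f_scores and top_genes and the toy expression values are assumptions.

```python
import numpy as np

def f_scores(X, y):
    """One-way ANOVA F-statistic per gene.
    X: samples-by-genes expression matrix; y: class label per sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    # Small constant guards against division by zero for constant genes.
    return (between / (k - 1)) / (within / (n - k) + 1e-12)

def top_genes(X, y, m):
    """Indices of the m genes with the highest F-statistic."""
    return np.argsort(f_scores(X, y))[::-1][:m]

# Toy data: 6 samples, 4 genes, 3 classes (hypothetical values).
# Only the third gene (index 2) changes systematically with the class.
X = np.array([
    [5.0, 1.0,  0.1, 3.0],
    [5.2, 1.1,  0.2, 2.0],
    [4.9, 0.9,  5.0, 3.1],
    [5.1, 1.2,  5.1, 2.1],
    [5.0, 1.0,  9.9, 2.9],
    [4.8, 1.1, 10.0, 2.2],
])
y = np.array([0, 0, 1, 1, 2, 2])
print(top_genes(X, y, 1))
```

A univariate filter like this scores each gene independently, so it cannot account for redundancy or correlation among selected genes; that limitation is part of what more elaborate selection methods aim to address.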