FOUNDATIONAL SYSTEMS BRIDGING DATA MINING AND VISUAL ANALYTICS

JAEGUL CHOO RESEARCH STATEMENT

My primary research goal is to develop new methods and systems that firmly unify data mining and visual analytics for solving challenging problems in big data. The data mining community has long proposed scalable methods for big data. However, real-world data do not necessarily satisfy the assumptions and conditions required by these methods. Furthermore, given data, users often have little or no idea what problems to solve, which makes existing methods less useful. Visual analytics, a newly emerging discipline, can handle these situations by allowing users to explore and understand data via interactive visualization. However, visual analytics cannot easily accommodate big data because of the limited scalability of human perception and computer screen space.

An ideal solution is to combine these two complementary disciplines. Data mining methods can solve the scalability issue in visual analytics by summarizing large-scale complex data and extracting intelligent information beyond raw data. Visual analytics can provide users with intuitive visual access to data mining outputs as well as interactive control over data mining methods for users' intended tasks. In practice, however, the two areas have seen little integration so far. Based on my research across both, I think the main hurdles lie in (1) difficulties in understanding and interacting with data mining methods and their outputs and (2) the significant computational time required by the methods.

My research intends to remedy these issues via the following interrelated threads: (1) a foundational visual analytics system providing easy access to a wide variety of data mining methods, (2) novel methodologies achieving flexible interactivity and real-time response of data mining methods, and (3) scalable visual analytics systems targeting real-world domains. Below I describe specific projects in each thread.
FOUNDATIONAL SYSTEMS BRIDGING DATA MINING AND VISUAL ANALYTICS

Big data, e.g., text documents, images, and biological data, are often represented in a high-dimensional space. In visual analytics for large-scale high-dimensional data, dimension reduction and clustering are key techniques: the former visualizes high-dimensional data in a 2D/3D space, while the latter reduces numerous data items to a small number of groups. Recent advances in these methods from the data mining and machine learning communities have not been fully transferred to many real-world applications. The Testbed system [1] is a foundational visual analytics system that fills this gap. It integrates more than 20 dimension reduction methods, including the two-stage methods I developed [2], and about 10 clustering methods, allowing users to effortlessly apply different methods to their own data and perform analysis with the most suitable methods. To facilitate intuitive comparisons, the system also offers alignment capabilities between outputs from different methods based on manifold alignment techniques [3].

Fig 1. The Testbed system providing a visual overview and data details on demand.

The impact of the Testbed system is two-fold. First, it works as a base for experimenting with and improving new dimension reduction and clustering methods in a visual analytic environment. Because of the flexible software architecture of the system, one can seamlessly integrate and evaluate new methods [4]. Second, it can be applied to a wide range of applications and provide deep insight about data. For instance, I applied the system to a novel domain of protein disorder prediction [5], where the knowledge obtained via interactive visualization significantly improved the prediction performance over state-of-the-art methods. The system is currently being applied to many other domains such as healthcare and computer networking in collaboration with Samsung Electronics and Prof. Nick Feamster at Georgia Tech.

DATA MINING METHODS SUPPORTING FLEXIBLE AND REAL-TIME INTERACTIONS

Significant noise in real-world data often causes data mining methods to generate unsatisfactory results. Being able to interact with the methods and the data is critical in steering the results in users' own way to obtain the most meaningful output. However, most methods are not designed to incorporate the various needs of users. In addition, interaction with the methods may be inefficient because recomputing them repeatedly is slow. Thus, I developed novel data mining methods and their integration framework in visual analytics for flexible and real-time interaction support.

p-isomap. An essential interaction with data mining methods is to change their parameters. To make this interaction fast, I have proposed a dynamic parametric updating algorithm for a widely used dimension reduction method, ISOMAP [4]. The proposed approach involves sophisticated algorithmic modules, such as efficient shortest-path updates after edge additions/removals, and it has achieved speed-ups of up to around 100x over the original ISOMAP.

PIVE. I also developed a fundamental methodology called PIVE [6, 7], a Per-Iteration Visualization Environment, which enables continuous real-time interactions with data mining methods. PIVE exploits the fact that many modern data mining algorithms run iteratively until convergence and that major changes in the solution occur mostly during an early stage of iterations.
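To make this observation concrete, the following minimal Python sketch exposes every iteration of a plain k-means run to the caller. It is an illustration only, not the PIVE implementation: the names `kmeans_per_iteration` and `demo`, the synthetic data, and the use of Lloyd's k-means with NumPy are all assumptions chosen for brevity.

```python
import numpy as np

def kmeans_per_iteration(X, k, max_iter=50, seed=0):
    """Lloyd's k-means written as a generator (hypothetical helper).

    Yields (iteration, labels, centroids, changed) after every iteration,
    where `changed` is the number of points whose cluster label changed.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for it in range(1, max_iter + 1):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        changed = int((new_labels != labels).sum())
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        yield it, labels.copy(), centroids.copy(), changed
        if changed == 0:  # no label changed in this iteration: converged
            break

def demo():
    """Run on three synthetic 2D blobs and record label changes per iteration."""
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.5, size=(100, 2))
                   for c in ((0, 0), (4, 0), (2, 3))])
    history = []
    for it, labels, centroids, changed in kmeans_per_iteration(X, k=3):
        # A per-iteration front end would redraw the view here and could
        # accept user input before the next iteration starts.
        history.append(changed)
    return history
```

Calling `demo()` returns the number of changed labels per iteration; on data like this one would expect most changes in the first few iterations, which is the property that makes early intermediate results worth showing.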
Motivated by this idea, PIVE visualizes intermediate results from algorithm iterations in real time, during which users can efficiently interact with the method without having to wait for its convergence. PIVE has a significant impact in that it changes the paradigm of interacting with data mining methods: in the past, such continuous real-time interactions were considered impractical because the methods run too slowly. To demonstrate the advantage with actual methods, we recently developed user interaction capabilities such as repositioning of data items and cluster splitting/merging in t-distributed stochastic neighbor embedding, k-means, and latent Dirichlet allocation [7].

Weakly Supervised NMF. Nonnegative matrix factorization (NMF) is a popular method in data mining tasks such as clustering, collaborative filtering, and outlier detection. Weakly supervised NMF (WS-NMF) [8] is a novel method that supports various user interactions in the context of clustering and topic modeling. Unlike other semi-supervised methods, the underlying philosophy of WS-NMF is to reflect semantically meaningful feedback from users' viewpoints instead of requiring method-centric constraints. We demonstrated the capabilities of WS-NMF, such as incorporating information from other sources, exemplar data items, and features of interest. This work is currently under review at the Data Mining and Knowledge Discovery (DMKD) journal, and it has also led us to an interactive topic modeling system called UTOPIAN [9].

REAL-WORLD VISUAL ANALYTICS SYSTEMS

Based on the above-mentioned foundational research, I have built mature visual analytics systems for diverse real-world applications. First, I have focused on two representative machine learning tasks: classification and clustering. These tasks are usually performed in a fully automated manner, but in practice, many algorithms do not properly handle noisy real-world data.
ivisclassifier [10] and ivisclustering [11] are systems that leverage human-in-the-loop processes in classification (e.g., facial recognition) and clustering (e.g., document clustering), respectively. ivisclassifier, which uses regularized linear discriminant analysis to visualize data with class information, allows users to visually analyze the relationships between classes and interactively improve classifier performance. ivisclustering, by enhancing latent Dirichlet allocation (LDA), a popular document topic modeling method, supports various important interactions such as cluster keyword refinement and hierarchical cluster management.

More recently, I have proposed a system called UTOPIAN (User-driven Topic modeling based on Interactive NMF) [9]. In general, given a large-scale document corpus, it is burdensome to go through individual documents to make sense of them and find those of interest to users. Topic modeling is useful in this context, but the derived topics are often unclear for real-world data. As a way to tackle this fundamental problem, UTOPIAN provides useful interaction capabilities in topic modeling, such as topic merging/splitting and topic creation via seed keywords/documents. This work also highlights the important advantages of NMF over LDA in terms of algorithmic consistency against noisy document data. Furthermore, the interactions offered by UTOPIAN are performed efficiently owing to the incorporated PIVE framework. Since UTOPIAN was published in VAST '13/TVCG [9], the novel idea of bringing NMF into the visual analytics context has received enormous interest from many researchers, which has opened up collaboration opportunities with the research groups of Prof. Daniel Keim at University of Konstanz, Prof. Niklas Elmqvist at Purdue University, and other researchers.

Fig 2. The UTOPIAN system visualizing a topic summary with various interaction capabilities.
RESEARCH AGENDA

My long-term goal is to develop methods and systems that take the best advantage of both data mining and visual analytics for big data: leveraging computational methods to sift through huge data sets to reveal underlying insight, and enabling humans to exploit their visual perception and intuition to delve into data. Although I have taken the first steps toward this goal with my previous research, I plan to broaden and deepen this investigation, including both fundamental re-design of computational methods and application of visual analytics to unexplored domains. In the following, I describe a few of my research directions.

Scaling up Visual Analytics. My future research will proactively scale up visual analytics. Scalability issues arise from two perspectives: back-end computation and front-end interactive visualization. For the former, data mining methods have to scale up to large-scale data. Ongoing efforts include parallel distributed NMF algorithms that I currently work on as a co-PI of the DARPA XDATA project. For the latter, visual analytics systems should support fast interactive visualization of numerous data items. For example, an interactive visual document recommender system, VisIRR [12], which I am currently developing, handles about half a million documents. I plan to further explore various research problems in scalable visual analytics.

Revolutionizing Computing Paradigms in Visual Analytics. Because data mining methods were not originally designed for visual analytics, exploiting the inherent characteristics of visual analytics could significantly decrease computational time. My future research will exploit the fact that human perception and screen space do not require fully accurate results from computations. I envision a completely new paradigm that allows computational methods to immediately generate approximate solutions and incrementally refine them until users are satisfied. To realize this idea, I am looking into literature from other fields, e.g., adaptive mesh refinement in numerical analysis and wavelet transforms in image processing. I have recently published some promising results [6], and I will continue this investigation in my future research.

Visualizing the Quality of Computational Output. When humans face computational outputs, it is crucial to inform them of the output quality. For instance, in dimension reduction, the output quality corresponds to how well the given relationships are preserved in a low-dimensional space. In clustering, it would be how clear and meaningful the resulting clusters are. This notion of output quality can be further applied at different levels: an individual data item, a cluster, and a data set. The current practice of plugging data mining methods into visual analytics does not effectively reveal such information. However, a poor-quality output could significantly mislead subsequent analyses. A 2D snapshot of high-dimensional data that severely distorts their original relationships would not be helpful for understanding the data. Clusters computed from data with no real clusters, e.g., uniformly distributed data, do not convey any meaningful information. My research will focus on how to visualize this quality information along with the output to properly guide humans' analyses.

Building Visual Analytics for Data Comparison and Contrast. At the heart of analysis tasks is comparing and contrasting different data groups to acquire comprehensive knowledge. I plan to develop fundamental data mining methods and visual analytics systems to support these analyses. One method I am currently working on is joint-discriminative topic modeling using NMF, which simultaneously identifies both common and distinct topics among multiple document data sets.
Equipping it with a highly interactive visual environment, where users can dynamically create and compare multiple data groups, will be a promising research direction. Together with Prof. Haesun Park (Georgia Tech) and Prof. Chandan Reddy (Wayne State University), I am preparing to submit an NSF proposal based on this idea in January 2014.

Broadening Real-world Impact. I will continuously widen the real-world applicability of my research. I plan to carry this out by (1) pioneering novel domains and (2) developing web-based systems. For example, I have recently analyzed novel social media data about nonprofit micro-financing activities available at Kiva.org. This work, reported in papers accepted at WSDM '14 [13] and WWW '14 [14], is one of the very first studies to apply machine learning techniques in this domain. I plan to perform deeper analysis on this application using visual analytics approaches as well. In addition, I am currently extending my visual analytics systems to web-based systems. In collaboration with Georgia Tech Research Institute, a web-based version of the Testbed system is under active development. I am also collaborating with Prof. Ji Soo Yi (Purdue University) and Dr. Bum Chul Kwon (University of Konstanz) on a project to build a website (http://www.hivelab.org/caniask.net) where users can interactively label positive and negative aspects with rich contents when writing reviews or answers. In this project, we also plan to integrate the interactive topic modeling capabilities of UTOPIAN for the visual summary of reviews/answers.

Cross-disciplinary research between data mining and visual analytics has given me deep interest and motivation, and I still see its tremendous potential for big data. I have collaborated with more than 40 researchers and engineers in universities, national labs, and companies, who have constantly inspired me with new ideas and directions.
I am also involved in various research funding proposals for NSF, DARPA, NIH, ONR, and industry. For example, we recently received a $2.7 million award from the DARPA XDATA program for big data.

In conclusion, my research seeks to find new methods and systems synthesizing data mining and visual analytics to accomplish interactive in-depth analysis of big data. I hope that my experience and insights spanning both fields will continue to grow, proving the true value of such a synthesis.

SELECTED REFERENCES

1. An Interactive Visual Testbed System for Dimension Reduction and Clustering of Large-Scale High-Dimensional Data. Jaegul Choo, Hanseung Lee, Zhicheng Liu, John T. Stasko, Haesun Park. SPIE Conference on Visualization and Data Analysis (VDA) 2013. Software is available at http://fodava.gatech.edu/fodava-testbed-software.
2. Two-stage Framework for Visualization of Clustered High-Dimensional Data. Jaegul Choo, Shawn Bohn, Haesun Park. IEEE Symposium on Visual Analytics Science and Technology (VAST) 2009.
3. Heterogeneous Data Fusion via Space Alignment Using Nonmetric Multidimensional Scaling. Jaegul Choo, Shawn Bohn, Grant C. Nakamura, Amanda M. White, Haesun Park. SIAM International Conference on Data Mining (SDM) 2012.
4. p-isomap: An Efficient Parametric Update for ISOMAP for Visual Analytics. Jaegul Choo, Hanseung Lee, Chandan K. Reddy, Haesun Park. SIAM International Conference on Data Mining (SDM) 2010.
5. A Visual Analytics Approach for Protein Disorder Prediction. Jaegul Choo, Fuxin Li, Kihyung Joo, Haesun Park. SIAM Expanding the Frontiers of Visual Analytics and Visualization (Book Chapter) 2011.
6. Screen Space- and Perception-Based Framework for Efficient Computational Algorithms in Large-Scale Visual Analytics. Jaegul Choo, Haesun Park. IEEE Computer Graphics and Applications (CG&A) 2013.
7. PIVE: A Per-Iteration Visualization Environment for Supporting Real-time Interactions with Computational Methods. Jaegul Choo, Changhyun Lee, Haesun Park. Technical Report, Georgia Institute of Technology, 2013.
8. Weakly Supervised Nonnegative Matrix Factorization for User-driven Clustering. Jaegul Choo, Changhyun Lee, Chandan K. Reddy, Haesun Park. Data Mining and Knowledge Discovery (DMKD) 2013, Under Review.
9. UTOPIAN: User-driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization. Jaegul Choo, Changhyun Lee, Chandan K. Reddy, Haesun Park. IEEE Transactions on Visualization and Computer Graphics (TVCG) 2013.
10. ivisclassifier: An Interactive Visual Analytics System for Classification based on Supervised Dimension Reduction. Jaegul Choo, Hanseung Lee, Jaeyeon Kihm, Haesun Park. IEEE Conference on Visual Analytics Science and Technology (VAST) 2010.
11. ivisclustering: An Interactive Visual Clustering for Documents via Topic Modeling. Hanseung Lee, Jaeyeon Kihm, Jaegul Choo, John T. Stasko, Haesun Park. Computer Graphics Forum (CGF) 2012.
12. VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-Scale Document Data. Jaegul Choo, Changhyun Lee, Edward Clarkson, Zhicheng Liu, Hanseung Lee, Duen Horng (Polo) Chau, Fuxin Li, Ramakrishnan Kannan, Charles D. Stolper, David Inouye, Nishant Mehta, Hua Ouyang, Subhojit Som, Alexander Gray, John T. Stasko, Haesun Park. Computer Graphics Forum (Eurovis / CGF) 2014, Under Review.
13. Understanding and Promoting Micro-finance Activities in Kiva.org. Jaegul Choo, Changhyun Lee, Daniel Lee, Hongyuan Zha, Haesun Park. ACM Conference on Web Search and Data Mining (WSDM) 2014, Accepted.
14. To Gather Together for a Better World: Understanding and Leveraging Communities in Micro-Lending Recommendation. Jaegul Choo, Daniel Lee, Bistra Dilkina, Hongyuan Zha, Haesun Park. International Conference on World Wide Web (WWW) 2014, Accepted.