FOUNDATIONAL SYSTEMS BRIDGING DATA MINING AND VISUAL ANALYTICS




JAEGUL CHOO
RESEARCH STATEMENT

My primary research goal is to develop new methods and systems that firmly unify data mining and visual analytics for solving challenging problems in big data. Data mining has long proposed scalable methods for big data. However, real-world data do not necessarily follow the assumptions and conditions these methods require. Furthermore, given data, users often have little or no idea which problems to solve, which makes existing methods less useful. Visual analytics, a newly emerging discipline, can handle these situations by allowing users to explore and understand data via interactive visualization. However, visual analytics cannot easily accommodate big data because human perception and computer screen space limit its scalability.

An ideal solution is to combine these two complementary disciplines. Data mining methods can solve the scalability issue in visual analytics by summarizing large-scale complex data and extracting intelligent information beyond the raw data. Visual analytics can provide users with intuitive visual access to data mining outputs as well as interactive control over data mining methods for users' intended tasks.

In practice, the two areas have seen little integration so far. Based on my research across both, I think the main hurdles lie in (1) difficulties in understanding and interacting with data mining methods and their outputs and (2) the significant computational time the methods require. My research intends to remedy these issues via the following interrelated threads: (1) a foundational visual analytics system providing easy access to a wide variety of data mining methods, (2) novel methodologies achieving flexible interactivity and real-time response of data mining methods, and (3) scalable visual analytics systems targeting real-world domains. Below I describe specific projects in each thread.

FOUNDATIONAL SYSTEMS BRIDGING DATA MINING AND VISUAL ANALYTICS

Big data, e.g., text documents, images, and biological data, are often represented in a high-dimensional space. In visual analytics for large-scale high-dimensional data, dimension reduction and clustering are key techniques: the former visualizes high-dimensional data in a 2D/3D space, while the latter reduces numerous data items to a small number of groups. Recent advances in these methods from the data mining and machine learning communities have not been fully transferred to many real-world applications.

The Testbed system [1] is a foundational visual analytics system that fills this gap. It integrates more than 20 dimension reduction methods, including the two-stage methods I developed [2], and about 10 clustering methods, allowing users to effortlessly apply different methods to their own data and perform analysis with the most suitable ones. To facilitate intuitive comparisons, the system also aligns outputs from different methods based on manifold alignment techniques [3].

Fig 1. The Testbed system providing a visual overview and data details on demand.

The impact of the Testbed system is two-fold. First, it serves as a base for experimenting with and improving new dimension reduction and clustering methods in a visual analytics environment. Because of the flexible software architecture of the system, one can seamlessly integrate and evaluate new methods [4]. Second, it can be applied to a wide range of applications and provide deep insight about data. For instance, I applied the system to a novel domain of protein disorder prediction [5], where the knowledge obtained via interactive visualization significantly improved the prediction performance over state-of-the-art methods. The system is currently being applied to many other domains, such as healthcare and computer networking, in collaboration with Samsung Electronics and Prof. Nick Feamster at Georgia Tech.

DATA MINING METHODS SUPPORTING FLEXIBLE AND REAL-TIME INTERACTIONS

Significant noise in real-world data often causes data mining methods to generate unsatisfactory results. Being able to interact with the methods and the data is critical for steering the results in the users' own way to obtain the most meaningful output. However, most methods are not designed to incorporate the various needs of users. In addition, interaction with the methods may be inefficient because recomputing them repeatedly is slow. Thus, I developed novel data mining methods, and a framework for integrating them into visual analytics, that support flexible and real-time interaction.

p-ISOMAP. An essential interaction with data mining methods is changing their parameters. To make this interaction fast, I have proposed a dynamic parametric updating algorithm for a widely used dimension reduction method, ISOMAP [4]. The proposed approach involves sophisticated algorithmic modules, such as an efficient shortest-path update upon edge addition/removal, and it has achieved up to around 100x speed-up compared to the original ISOMAP.

PIVE. I also developed a fundamental methodology called PIVE [6, 7], a Per-Iteration Visualization Environment, which enables continuous real-time interactions with data mining methods. PIVE exploits the fact that many modern data mining algorithms run iteratively until convergence and that major changes in the solution occur mostly during the early iterations.
Motivated by this idea, PIVE visualizes intermediate results from algorithm iterations in real time, so that users can efficiently interact with the method without having to wait for its convergence. PIVE changes the paradigm of interacting with data mining methods: in the past, such continuous real-time interactions were considered impractical because the methods ran too slowly. To demonstrate the advantage with actual methods, we recently developed user interaction capabilities such as repositioning of data items and cluster splitting/merging in t-distributed stochastic neighbor embedding, k-means, and latent Dirichlet allocation [7].

Weakly Supervised NMF. Nonnegative matrix factorization (NMF) is a popular method in data mining tasks including clustering, collaborative filtering, and outlier detection. Weakly supervised NMF (WS-NMF) [8] is a novel method that supports various user interactions in the context of clustering and topic modeling. Unlike other semi-supervised methods, the underlying philosophy of WS-NMF is to reflect semantically meaningful feedback from the users' viewpoints instead of requiring method-centric constraints. We demonstrated the capabilities of WS-NMF, such as incorporating information from other sources, exemplar data items, and features of interest. This work is currently under review at the Data Mining and Knowledge Discovery (DMKD) journal, and it has also led us to an interactive topic modeling system called UTOPIAN [9].

REAL-WORLD VISUAL ANALYTICS SYSTEMS

Based on the foundational research described above, I have built mature visual analytics systems for diverse real-world applications. First, I have focused on two representative machine learning tasks: classification and clustering. These tasks are usually performed in a fully automated manner, but in practice, many algorithms do not properly handle noisy real-world data.
iVisClassifier [10] and iVisClustering [11] are systems that leverage human-in-the-loop processes in classification (e.g., facial recognition) and clustering (e.g., document clustering), respectively. iVisClassifier, which uses regularized linear discriminant analysis to visualize data with class information, allows users to visually analyze the relationships between classes and interactively improve classifier performance. iVisClustering, by enhancing latent Dirichlet allocation (LDA), a popular document topic modeling method, supports various important interactions such as cluster keyword refinement and hierarchical cluster management.

More recently, I have proposed a system called UTOPIAN (User-driven Topic modeling based on Interactive NMF) [9]. In general, given a large-scale document corpus, it is burdensome to go through individual documents to make sense of them and find those of interest to users. Topic modeling is useful in this context, but the derived topics are often unclear for real-world data. To tackle this fundamental problem, UTOPIAN provides useful interaction capabilities in topic modeling, such as topic merging/splitting and topic creation via seed keywords/documents.

Fig 2. The UTOPIAN system visualizing a topic summary with various interaction capabilities.

This work also highlights the important advantages of NMF over LDA in terms of algorithmic consistency against noisy document data. Furthermore, the interactions offered by UTOPIAN are performed efficiently owing to the incorporated PIVE framework. Since UTOPIAN was published in VAST '13/TVCG [9], the novel idea of bringing NMF into the visual analytics context has received enormous interest from many researchers, which has opened up collaboration opportunities with the research groups of Prof. Daniel Keim at University of Konstanz, Prof. Niklas Elmqvist at Purdue University, and other researchers.
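
As a rough illustration of the per-iteration idea behind PIVE, on which UTOPIAN's interactions rely, the following Python sketch runs plain k-means (one of the methods mentioned above) and hands every intermediate result to a callback. It is not the PIVE implementation: the function name `kmeans_per_iteration`, the `on_iteration` hook, and the toy data are all assumptions made for this example, and a real system would redraw a view inside the hook instead of recording a history.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_per_iteration(points, k, on_iteration, max_iter=50, seed=0):
    """Lloyd's k-means that exposes each intermediate result.

    on_iteration(step, centers, labels) is a hypothetical hook. It may
    return None (no interaction) or a list of edited centers, which
    stands in for a user steering the algorithm before it converges.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for step in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Expose the intermediate result instead of waiting for convergence.
        steered = on_iteration(step, centers, new_labels)
        converged = steered is None and new_labels == labels
        labels = new_labels
        if steered is not None:
            centers = [list(c) for c in steered]
        if converged:
            break
    return centers, labels


# Usage: record every intermediate state (a stand-in for redrawing a view).
history = []


def record(step, centers, labels):
    history.append((step, [tuple(c) for c in centers], list(labels)))
    return None  # no user steering in this toy run


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans_per_iteration(points, 2, record)
```

Returning edited centers from the hook is one simple way to model an interaction such as repositioning; the loop then continues from the user's edit instead of restarting from scratch.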

RESEARCH AGENDA

My long-term goal is to develop methods and systems that take the best advantage of both data mining and visual analytics for big data: leveraging computational methods to sift through huge data and reveal underlying insight, and enabling humans to exploit their visual perception and intuition to delve into data. Although I have taken the first steps toward this goal with my previous research, I plan to broaden and deepen this investigation, including both a fundamental re-design of computational methods and the application of visual analytics to unexplored domains. In the following, I describe a few of my research directions.

Scaling up Visual Analytics. My future research will proactively scale up visual analytics. Scalability issues arise from two perspectives: back-end computation and front-end interactive visualization. For the former, data mining methods have to scale up to large-scale data. Ongoing efforts include parallel distributed NMF algorithms, which I am currently working on as a co-PI of the DARPA XDATA project. For the latter, visual analytics systems should support fast interactive visualization of numerous data items. For example, an interactive visual document recommender system, VisIRR [12], which I am currently developing, handles about half a million documents. I plan to further explore various research problems in scalable visual analytics.

Revolutionizing Computing Paradigms in Visual Analytics. Because data mining methods were not originally designed for visual analytics, exploiting the inherent characteristics of visual analytics could significantly decrease computational time. My future research will exploit the fact that human perception and screen space do not require fully accurate results from computations. I envision a completely new paradigm that allows computational methods to immediately generate approximate solutions and incrementally refine them until users are satisfied. To realize this idea, I am looking into literature from other fields, e.g., adaptive mesh refinement in numerical analysis and wavelet transformation in image processing. I have recently published some promising results [6], and I will continue this investigation in my future research.

Visualizing the Quality of Computational Output. When humans face computational outputs, it is crucial to inform them of the output quality. For instance, in dimension reduction, the output quality corresponds to how well the given relationships are preserved in a low-dimensional space. In clustering, it would be how clear and meaningful the resulting clusters are. This notion of output quality can be further applied at different levels: an individual data item, a cluster, and a whole data set. The current practice of plugging data mining methods into visual analytics does not effectively reveal such information, yet a poor-quality output could significantly mislead subsequent analyses. A 2D snapshot of high-dimensional data that severely distorts the original relationships would not help in understanding the data. Clusters computed from data with no real clusters, e.g., uniformly distributed data, do not convey any meaningful information. My research will focus on how to visualize this quality information along with the output to properly guide human analyses.

Building Visual Analytics for Data Comparison and Contrast. At the heart of analysis tasks is comparing and contrasting different data groups to acquire comprehensive knowledge. I plan to develop fundamental data mining methods and visual analytics systems to support these analyses. One method I am currently working on is joint-discriminative topic modeling using NMF, which simultaneously identifies both common and distinct topics among multiple document data sets.
Equipping it with a highly interactive visual environment, where users can dynamically create and compare multiple data groups, will be a promising research direction. Together with Prof. Haesun Park (Georgia Tech) and Prof. Chandan Reddy (Wayne State University), I am preparing to submit an NSF proposal based on this idea in January 2014.

Broadening Real-world Impact. I will continuously widen the real-world applicability of my research. I plan to carry this out by (1) pioneering novel domains and (2) developing web-based systems. For example, I have recently analyzed novel social media data about nonprofit micro-financing activities available at Kiva.org. This work, the papers about which were accepted at WSDM '14 [13] and WWW '14 [14], is one of the very first studies to apply machine learning techniques in this domain. I plan to perform deeper analysis on this application using visual analytics approaches as well. In addition, I am currently extending my visual analytics systems to web-based systems. In collaboration with Georgia Tech Research Institute, a web-based version of the Testbed system is under active development. I am also collaborating with Prof. Ji Soo Yi (Purdue University) and Dr. Bum Chul Kwon (University of Konstanz) on a project to build a website (http://www.hivelab.org/caniask.net) where users can interactively label positive and negative aspects with rich contents when writing reviews or answers. In this project, we also plan to integrate the interactive topic modeling capabilities of UTOPIAN for the visual summary of reviews/answers.

Cross-disciplinary research between data mining and visual analytics has given me deep interest and motivation, and I still see its tremendous potential for big data. I have collaborated with more than 40 researchers and engineers in universities, national labs, and companies, who have constantly inspired me with new ideas and directions.
I am also involved in various research funding proposals for NSF, DARPA, NIH, ONR, and industry. For example, we recently received a $2.7 million award from the DARPA XDATA program for big data.

In conclusion, my research seeks new methods and systems that synthesize data mining and visual analytics to accomplish interactive, in-depth analysis of big data. I hope that my experience and insights spanning both fields will continue to grow, proving the true value of such a synthesis.

SELECTED REFERENCES

1. An Interactive Visual Testbed System for Dimension Reduction and Clustering of Large-Scale High-Dimensional Data. Jaegul Choo, Hanseung Lee, Zhicheng Liu, John T. Stasko, Haesun Park. SPIE Conference on Visualization and Data Analysis (VDA) 2013. Software is available at http://fodava.gatech.edu/fodava-testbed-software.
2. Two-stage Framework for Visualization of Clustered High-Dimensional Data. Jaegul Choo, Shawn Bohn, Haesun Park. IEEE Symposium on Visual Analytics Science and Technology (VAST) 2009.
3. Heterogeneous Data Fusion via Space Alignment Using Nonmetric Multidimensional Scaling. Jaegul Choo, Shawn Bohn, Grant C. Nakamura, Amanda M. White, Haesun Park. SIAM International Conference on Data Mining (SDM) 2012.
4. p-ISOMAP: An Efficient Parametric Update for ISOMAP for Visual Analytics. Jaegul Choo, Hanseung Lee, Chandan K. Reddy, Haesun Park. SIAM International Conference on Data Mining (SDM) 2010.
5. A Visual Analytics Approach for Protein Disorder Prediction. Jaegul Choo, Fuxin Li, Kihyung Joo, Haesun Park. SIAM Expanding the Frontiers of Visual Analytics and Visualization (Book Chapter) 2011.
6. Screen Space- and Perception-Based Framework for Efficient Computational Algorithms in Large-Scale Visual Analytics. Jaegul Choo, Haesun Park. IEEE Computer Graphics and Applications (CG&A) 2013.
7. PIVE: A Per-Iteration Visualization Environment for Supporting Real-time Interactions with Computational Methods. Jaegul Choo, Changhyun Lee, Haesun Park. Technical Report, Georgia Institute of Technology, 2013.
8. Weakly Supervised Nonnegative Matrix Factorization for User-driven Clustering. Jaegul Choo, Changhyun Lee, Chandan K. Reddy, Haesun Park. Data Mining and Knowledge Discovery (DMKD) 2013, Under Review.
9. UTOPIAN: User-driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization. Jaegul Choo, Changhyun Lee, Chandan K. Reddy, Haesun Park. IEEE Transactions on Visualization and Computer Graphics (TVCG) 2013.
10. iVisClassifier: An Interactive Visual Analytics System for Classification based on Supervised Dimension Reduction. Jaegul Choo, Hanseung Lee, Jaeyeon Kihm, Haesun Park. IEEE Conference on Visual Analytics Science and Technology (VAST) 2010.
11. iVisClustering: An Interactive Visual Clustering for Documents via Topic Modeling. Hanseung Lee, Jaeyeon Kihm, Jaegul Choo, John T. Stasko, Haesun Park. Computer Graphics Forum (CGF) 2012.
12. VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-Scale Document Data. Jaegul Choo, Changhyun Lee, Edward Clarkson, Zhicheng Liu, Hanseung Lee, Duen Horng (Polo) Chau, Fuxin Li, Ramakrishnan Kannan, Charles D. Stolper, David Inouye, Nishant Mehta, Hua Ouyang, Subhojit Som, Alexander Gray, John T. Stasko, and Haesun Park. Computer Graphics Forum (Eurovis / CGF) 2014, Under Review.
13. Understanding and Promoting Micro-finance Activities in Kiva.org. Jaegul Choo, Changhyun Lee, Daniel Lee, Hongyuan Zha, Haesun Park. ACM Conference on Web Search and Data Mining (WSDM) 2014, Accepted.
14. To Gather Together for a Better World: Understanding and Leveraging Communities in Micro-Lending Recommendation. Jaegul Choo, Daniel Lee, Bistra Dilkina, Hongyuan Zha, Haesun Park. International Conference on World Wide Web (WWW) 2014, Accepted.