2 Submission ID 238 / Wordonoi: Visualizing the Structure and Textual Contents of Knowledge Networks


Volume 32 (2013), Number 3
Eurographics Conference on Visualization (EuroVis) 2013
B. Preim, P. Rheingans, and H. Theisel (Guest Editors)

Wordonoi: Visualizing the Structure and Textual Contents of Knowledge Networks

Submission ID: 238

Figure 1: The top 26 topics in a Wordonoi for a large research organization for homeland security consisting of 2,000 nodes and 5,000 edges. The cells have been colored by category to help distinguish between different and adjacent topics.

Abstract
Datasets with both relationships and textual content are becoming increasingly common; examples include hypertext documents, rich social networks, and scientific authorship. We call this type of dataset a knowledge network, and present a novel and interactive visualization technique called Wordonoi to visualize them. Wordonoi visualizes both the textual and relational components of knowledge networks by spatializing them into a multi-scale 2D visualization using a Voronoi tessellation and then mapping keywords onto the different cells. Because knowledge networks are often large, we also provide aggregation mechanisms for summarization. We explore and implement several interactions such as interactive coloring, semantic zooming, and searching. We also validate the technique with three examples: a research organizational structure, a hypertext network, and NSF funding data.

Categories and Subject Descriptors (according to ACM CCS): H.5.1 [Information Systems]: Multimedia Information Systems; H.5.2 [Information Systems]: User Interfaces

1. Introduction
Text is one of the most important and common types of data in the world today [Shn96], and there exists a multitude of tools (e.g., [DZG 07, vhwv09, VWF09]) for visualizing such data. However, as we go beyond simple text corpora to more complex datasets, one particular class of data emerges that combines textual labels with their relationships (e.g., as graphs).
We denote such datasets knowledge networks because they exhibit a graph structure with textual data for the nodes and links. Examples include dictionaries, where each word has a definition and relationships to synonyms, antonyms, or modifiers; the web, where each webpage consists of text and hyperlinks to other pages; and research funding networks, where text describes projects and relationships capture investigators, institutions, and program officers. While several techniques exist for visualizing graphs or text alone, visualizing their combination is challenging [KKEE11]. Furthermore, in such graphs sometimes the textual content is more important, and sometimes the relationship structure is more important. For example, consider crime network data consisting of crime reports and their relationships. In this case, the user may either want to focus on how the various crimes occurred (i.e., the contents of the reports), or may want to see associations between particular crimes and individuals (i.e., relationships between reports). Some tools focus only on textual contents and make little use of connections (e.g., [CVW09]), and vice versa (e.g., [vhwv09]). Only a handful of tools [KKEE11, SGL08] visualize both features simultaneously.

Another problem with knowledge networks is their size. Not only are the graphs themselves generally large, but so is their textual content. Therefore, summarization techniques [EF10] are required to provide an effective overview. Existing techniques [DZG 07, VWF09] visualize large textual datasets by extracting important tags and patterns and visualizing those tags. However, these tools do not show the overall landscape or an effective summary of the whole dataset on one screen. In our case, we need to summarize both the networks and their textual contents.

In this work, drawing from previous work such as self-organizing maps (SOMs) [Koh82], WordBridge [KKEE11], and Wordle [VWF09], we propose a visualization technique that we call Wordonoi for visualizing both the relations and textual contents of knowledge networks. Our technique is a multi-scale and space-filling 2D text visualization that supports hierarchical aggregation [EF10] to allow the user to interactively explore the knowledge network. The contribution of this work is the ability to show a summary of an entire textual and relational dataset on a single screen.
We have implemented a Wordonoi prototype that accepts knowledge networks as input and renders an interactive visualization. Our framework initially calculates node positions using a graph layout algorithm and then uses these positions to compute a Voronoi tessellation of the space. Each Voronoi cell represents a node, and the text associated with the node is shown inside the cell as tags extracted from the text. In our implementation, we explore several aspects of the Wordonoi design space, including aggregation, text visualization in Voronoi cells, coloring schemes, and interaction techniques such as semantic zooming, panning, and querying.

We validate our technique by applying it to three examples. The first is a large research structure for relationships between persons, projects, institutions, and centers, where each node contains details about the research. The second example is a hypertext network where the relationships are links and the text is the document contents. In the last example, we study a knowledge network of NSF funding data containing relationships between PIs and projects, and where the text is the project descriptions. All three examples show that our technique works well for practical knowledge networks by summarizing their textual contents while simultaneously considering their relationship structure.

2. Related Work
Digital technology has made text data ubiquitous. However, staying abreast of this onslaught of textual streams such as news articles, academic papers, and crime reports is impossible [Hea09]. Text visualization uses interactive visual representations to summarize, highlight, and characterize the contents of textual data [DZG 07]. However, this method is complicated by the fact that text data is categorical, unstructured, and high-dimensional [SWL 10].
Below we outline the general approaches in the literature.

Frequency-based Visualization
Extracting important words from text according to their frequency and visualizing that metric is a common text visualization technique [Hea09]. Most famous among these techniques are tag clouds (or word clouds) [HR08], which are commonly used on the Internet by Web 2.0 and social media websites. Despite their popularity, tag clouds have several problems, such as drawing too much attention to longer words [VWF09] and not making efficient use of the spatial dimension. The Wordle technique [VWF09] overcomes many of the problems associated with tag clouds and produces highly aesthetic and compact clouds. ManiWordle [KLKS10] proposes several improvements and allows the user to interactively control the cloud layout. Finally, a technique called clustered word clouds [Cla] uses word relatedness to control positioning in a tag cloud layout and thus displays co-occurring and related words in close proximity.

Visual Concordances
A concordance is an alphabetical index of all the words in a text together with their context [Hea09]. Several text visualizations have been designed to visualize texts in this way. SeeSoft [Eic94] visualizes text documents by representing each document as a vertical column and the text in it as color-coded rows of pixels. Similarly, TextArc [Pal02] displays the lines of text in an elliptical layout with frequently occurring words placed in the center. Selecting a central word displays its connections to the lines of text containing it. Finally, DocuBurst [CCP09] and WordTree [WV08] are examples of document concordances built using hierarchies.

Combining Text with Other Visualizations
Some existing work blends text visualization with other visualization techniques to show patterns in the text. ThemeRiver [HHWN02] uses thematic variations over time to visualize the frequencies of topics extracted from the text.
NameVoyager [Wat06] uses stacked bar graphs to show frequencies of baby names across time. TIARA [WLS 10] integrates trend graphs into tag clouds to show important patterns over time. Another text visualization technique called SparkClouds [LRKC10] integrates sparklines into tag clouds to show the trend of each word over time.

Visualizing Relationships in Text
Visualizing the relationships of the words in a text collection is becoming very important, and numerous visualizations have been proposed to show the structure of the text. ArcDiagrams [Wat02] displays patterns of repetition in string data. IN-SPIRE [WTP 95] shows the relationships between the documents in the corpus using a themescape, which is a topic-based projection of document concepts and keywords onto a 2D space. The Word Tree [WV08], a hierarchical document concordance, shows the context of each word by displaying phrases commonly following or preceding a given word. Finally, Phrase Nets [vhwv09] shows a graph of related words based on a user-specified relation. Th [VGD06] parses conversations and portrays relationships between individuals by extracting and analyzing keywords. Parallel Tag Clouds (PTCs) [CVW09] extends Th by combining parallel coordinates [Ins85] with tag clouds, enabling faceted browsing of textual documents and comparisons across facets.

Another common approach is to use node-link diagrams to show relations in the text. Wong et al. [WMP 05] describe a novel method of displaying dynamic text in the place of links in a node-link diagram. TextArc [Pal02] uses node-link diagrams to show all the contexts in which a word appears. WordBridge [KKEE11] replaces the nodes and links in a graph with node and link tag clouds that convey not just connectivity but also the content of the relations. FacetAtlas [CSL 10] combines node-link diagrams with density maps to visualize entity relationships in a text. In particular, Gansner et al. [GHK10a, GHK10b, GHKV09, GHN12] combine geographic maps with node-link diagrams to increase the visual appeal of a graph. These works are closely related to ours from a visual design point of view, but have a very different data model and approach.
Wordonoi visualizes bodies of text associated with a node cluster in the area assigned to the cluster's cell, while Gansner's work focuses on clustering nodes into larger regions and rendering their node labels. In other words, Wordonoi primarily visualizes knowledge networks with textual data, while Gansner uses the geographical map approach to increase the visual appeal of the node-link diagram.

All of these techniques focus on showing the relationship or structure of the text within a text corpus. In contrast, our Wordonoi technique shows the relationships of nodes, each with textual content, in a knowledge network. Below, we will see what impact this focus has on the technique.

3. Generating Knowledge Networks
We define knowledge networks simply as graphs with associated textual content for the nodes and edges in that network, where the textual data provides some form of semantic meaning to the graph structure. In such networks, the capacity to understand not just the relationship between entities, but also the semantics of the connections is important. For example, a standard citation network shows whether or not particular authors have collaborated or cited each other, whereas a knowledge network constructed from such data may be able to tell us the nature of their collaboration or citations. Here we propose a mechanism for constructing knowledge networks from standard graph and textual datasets.

The most straightforward way to generate knowledge networks is by deriving multidimensional graphs from tabular data, e.g., as described by Liu et al. [LNS11]. The process is heavily dependent on the application domain, but often involves recovering an entity-relationship (E-R) model [Che76] from the data before extracting the knowledge network. In such a model, each entity (a person, publication, organization, physical object, or concept) becomes a node in the network, and the relationships between entity types are used to generate the links between nodes.
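As an illustration of this mapping, the following minimal Python sketch turns two small tables into a node and edge list with textual attributes. The table layout, the identifiers, and the function name `build_knowledge_network` are assumptions made for this example only; they are not taken from our implementation or from Liu et al. [LNS11].

```python
# Hypothetical entity table: (entity_type, entity_id, descriptive_text).
ENTITIES = [
    ("person", "p1", "studies food protection and supply chains"),
    ("person", "p2", "works on transportation safety"),
    ("project", "j1", "food protection sensors for supply chains"),
]

# Hypothetical relationship table: (source_id, target_id, relation_text).
RELATIONS = [
    ("p1", "j1", "investigator on"),
    ("p2", "j1", "collaborator on"),
    ("p3", "j1", "advisor to"),  # refers to an entity that was not selected
]


def build_knowledge_network(entities, relations):
    """Map entities to nodes and relationships to text-labelled edges."""
    nodes = {eid: {"type": etype, "text": text} for etype, eid, text in entities}
    edges = []
    for src, dst, label in relations:
        # Keep only relations whose endpoints were both selected as entities.
        if src in nodes and dst in nodes:
            edges.append({"source": src, "target": dst, "text": label})
    return nodes, edges


nodes, edges = build_knowledge_network(ENTITIES, RELATIONS)
```

The user's choice of which entities to include is modeled here simply by which rows appear in the entity table; relations that point outside that selection are dropped.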
For example, relational databases are built from E-R models, so extracting this model from tables and their keys is relatively straightforward (although the user still has to select which entities to include). In other cases, the E-R model must be explicitly specified by the user; for example, in a collection of crime reports, we may first have to identify entities in the text (similar to how Jigsaw [SGL08] extracts entities) and then decide how to generate the relationships (co-occurrence, distance in text, semantic meaning).

Given a basic E-R model and network data extracted from the original dataset, we must now augment the network with textual information describing the semantics of the entities and their relationships. Again, this process is application-specific and depends on which semantics should be highlighted. A common approach is to summarize all of the textual information available in the original database and integrate it within the knowledge network. This can be achieved using text mining techniques such as counting word frequencies, calculating tf-idf [Jon72] and related metrics, or automatically extracting text summaries [Hea99]. The extracted information, ranging from keywords and phrases to entire texts, is then used as node and edge attributes.

4. Wordonoi: Design Space
Wordonoi is an interactive visual representation for knowledge networks that combines both relational and textual content. Figure 2 depicts the Wordonoi pipeline that takes a knowledge network as input, processes the network in stages, and yields an interactive visualization. Below we describe these stages and explore the Wordonoi design space.

Figure 2: From knowledge network to visualization.

Figure 3: Spatializing knowledge networks using Wordonoi: (a) graph layout; (b) Voronoi tessellation; and (c) cell generation.

Spatialization
The first step in visualizing knowledge networks is to assign space on the 2D visual substrate to each entity (node) in the knowledge network. The intuition is to project the high-dimensional network structure onto 2D space so that the textual component associated with each node can be displayed. This spatialization technique should fill the available space in the viewport by allocating disjoint 2D regions to each network node based on the graph structure. To achieve this, we abandon the explicit display of the graph structure itself in favor of conveying the textual content of the knowledge network while maintaining connectivity. Figure 3 illustrates the three parts of the spatialization process: (a) generating a graph layout of the network; (b) tessellating the space into disjoint cells; and (c) allocating a cell to each node.

Graphs are high-dimensional datasets, and graph layout algorithms are concerned with finding projections of such datasets into 2D (or 3D) space by calculating the position of each node in a way that optimizes some metric (typically readability). However, the primary purpose of the layout for the Wordonoi technique is to find a 2D mapping that mimics the structure of the underlying graph, i.e., that places highly connected nodes in close proximity. Any layout algorithm can be used; we prefer Noack's lin-log algorithm [Noa05] because it can cluster nodes based on connectivity.

Having found a graph layout, we now convert the nodes into 2D regions on the viewport where textual content can be visualized. For this purpose, we use a Voronoi tessellation that subdivides the space into disjoint subspaces, or cells, based on node positions (each cell consists of the points that are closer to its node than to any other node).
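The nearest-node definition of a Voronoi cell can be illustrated with a brute-force rasterization. The sketch below is only meant to make the definition concrete: it labels each pixel center with its closest node rather than computing cell polygons, and the node positions and function names are hypothetical. A real implementation would use a standard polygonal Voronoi algorithm, as our prototype does.

```python
def nearest_site(point, sites):
    """Return the id of the site closest to point (squared Euclidean distance)."""
    px, py = point
    return min(
        sites,
        key=lambda s: (sites[s][0] - px) ** 2 + (sites[s][1] - py) ** 2,
    )


def rasterize_voronoi(sites, width, height):
    """Label every pixel center of a width x height viewport with its nearest site."""
    return [
        [nearest_site((x + 0.5, y + 0.5), sites) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical node positions, standing in for the output of a graph layout.
sites = {"a": (1.0, 1.0), "b": (6.0, 1.0), "c": (3.5, 5.0)}
grid = rasterize_voronoi(sites, 8, 6)

# Cell area in pixels; every pixel belongs to exactly one cell.
areas = {s: sum(row.count(s) for row in grid) for s in sites}
```

Because every pixel is assigned to exactly one node, the cells are disjoint and together fill the viewport, which is the space-filling property the spatialization relies on.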
Each cell has an associated node, and its area can be used for the node's textual contents.

Several design decisions were made in arriving at the above spatialization approach. The main tradeoff is that we sacrifice some of the graph structure of a traditional node-link diagram in order to convey more textual content in the visualization. This is different from techniques such as WordBridge [KKEE11] and PhraseNets [vhwv09] that also combine graph and text visualization, but retain more of the relationship structure in the representation. The drawback of such approaches is that less space is available for visualizing textual content, and, as with any node-link diagram, they do not scale well with graph size. As we shall see in Section 4.5, the Wordonoi space-filling representation not only allows devoting virtually the entire space to textual content, it is also highly amenable to hierarchical aggregation to manage scale.

Text Visualization
Spatialization has subdivided the viewport into cells based on network topology, yielding one cell per node in the knowledge network. The next step is to use the 2D cells to visualize the textual content associated with each node:
Most important keyword: Scale the most important keyword (e.g., most frequent) to fit as a single label.
Repeat keyword: Again, use the most important keyword, but fill the cell completely using the keyword.
Word cloud: Draw a word cloud of the tags belonging to the cell using the global frequency of each keyword.

We use all three strategies depending upon the size of the cell on the screen. For small-sized regions, only the most important tag is displayed; for medium-sized regions, the tag is repeatedly displayed; and for large-sized regions, a word cloud is displayed. This choice of visual representation changes as the user zooms in and out in the visualization, giving rise to a form of semantic zooming [BGM04].
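The size-dependent choice between the three strategies can be sketched as a simple threshold function. The pixel thresholds and names below are assumptions chosen for illustration; the text above only states that small, medium, and large cells use different representations.

```python
def choose_representation(cell_area, zoom, small_px=2_000.0, large_px=20_000.0):
    """Pick a text representation from the on-screen area of a cell.

    cell_area is the area in layout units and zoom the current scale factor;
    small_px and large_px are hypothetical thresholds in square pixels.
    """
    screen_area = cell_area * zoom * zoom  # area scales with the square of zoom
    if screen_area < small_px:
        return "single keyword"
    if screen_area < large_px:
        return "repeated keyword"
    return "word cloud"


# The same cell switches representation as the user zooms in.
levels = [choose_representation(1_000.0, zoom) for zoom in (1.0, 2.0, 5.0)]
```

Re-evaluating this function for each visible cell whenever the zoom factor changes yields the semantic zooming behavior described above.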
Most word cloud layout algorithms are designed for rectangular spaces, so an irregular (non-convex) cell may cause parts of keywords to fall outside of the cell polygon. We therefore provide an interaction where users can drill down in any region to see its text without clipping.

Utilizing Color
Color is a free parameter in our design space, and can be used for features such as topology, categories, and affinity:
Random assignment: Cells are assigned random colors to allow for differentiation.
Graph coloring: A graph coloring algorithm is used to color cells such that no two adjacent cells have the same color.
Categories: Node type or textual content can be used to categorize nodes (and their cells), and cells can then be assigned colors based on their category.
Color scale: A color scale (such as gray scale or heat scale) can be used to show quantitative information about each cell. Some examples of such quantitative features include node centrality, cohesiveness, and connectedness.

In Figure 6, a green color scale has been used to assign colors to cells based on cohesion (the ratio of internal edges to total edges associated with a cell). Thus, light green cells represent nodes that have few internal connections, whereas dark green cells have higher connectivity. This can be used to find communities in the data.

Recovering Network Topology
Even though the spatialization is based on the topology of the underlying knowledge network, the Wordonoi representation still sacrifices some of the graph structure in favor of the semantic content of the network. We provide two ways for users to recover topology information:
Cell Border Visualization: A Wordonoi visualization no longer displays the edges in the original node-link diagram that formed its basic structure, but there is nothing to prevent us from decorating the cell representation with this information. More specifically, we can use the borders between adjacent cells to convey information about the connectivity of those cells; for example, by varying the thickness or color of the border.
Interactive Color Diffusion: We can also use color to explicitly show the network topology through an interactive paint metaphor where users assign a color to a cell, and the color is then diffused through the representation based on topology: (1) to all adjacent neighbors of a colored node; or (2) to all cells that are connected in the network. The diffusion proceeds in a breadth-first fashion with diminishing amounts of color for each step. The amount of color also depends on the relational strength between the source and the destination cells; the stronger the relations (i.e., the more edges between the cells), the more color is diffused. We use an alpha blending model where all regions are initially white and are iteratively blended with other colors.
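One possible reading of this diffusion scheme is sketched below in Python. The per-step decay factor of 0.5, the normalization of edge counts by the largest count, and all names are assumptions made for the example; the description above does not fix these details.

```python
from collections import deque

WHITE = (255.0, 255.0, 255.0)


def blend(base, paint, alpha):
    """Alpha-blend paint over base; alpha = 1 replaces base entirely."""
    return tuple((1 - alpha) * b + alpha * p for b, p in zip(base, paint))


def diffuse(colors, weights, source, paint, decay=0.5):
    """Spread paint breadth-first from source with diminishing strength.

    weights maps each cell to {neighbor: number of edges between the cells}
    and must contain at least one edge. decay is a hypothetical per-step
    attenuation; stronger relations pass on more color.
    """
    max_w = max(w for nbrs in weights.values() for w in nbrs.values())
    amount = {source: 1.0}
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        colors[cell] = blend(colors[cell], paint, amount[cell])
        for nbr, w in weights.get(cell, {}).items():
            if nbr not in amount:  # each cell is reached once, by BFS order
                amount[nbr] = amount[cell] * decay * (w / max_w)
                queue.append(nbr)
    return colors


# Hypothetical four-cell network: a-b share two edges, a-c and c-d one each.
weights = {
    "a": {"b": 2, "c": 1},
    "b": {"a": 2},
    "c": {"a": 1, "d": 1},
    "d": {"c": 1},
}
colors = {cell: WHITE for cell in weights}
diffuse(colors, weights, "a", (255.0, 0.0, 0.0))
```

Since `diffuse` blends into whatever colors the cells already hold, calling it again with another source and color layers the new paint over the existing one.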
This technique is incremental, i.e., different colors can be assigned to different cells, each blending with existing colors across the representation (Figure 5).

Aggregation
Knowledge networks are often large, and therefore require summarization. Popular solutions include clustering, filtering, or sampling the graph, but none of these approaches is well suited for use in combination with textual data. For the Wordonoi technique, we simply take advantage of the space-filling visual representation of disjoint cells, one per node, resulting from the spatialization stage by designing a hierarchical aggregation technique [EF10] that incrementally agglomerates adjacent cells until only one cell, the union of all cells, remains. For our Voronoi cells, an agglomeration of two cells is simply the union of the 2D space of each cell, and the textual content of each corresponding node is also combined. This results in a binary clustering tree that can be expanded to any level depending on the user.

Choosing a good distance metric is key to any hierarchical aggregation [EF10]. Examples of possibilities include metrics based on network topology or layout geometry.

Topology-based Distance Metrics
Distance metrics based on network topology use graph structure to determine the order of agglomeration for cells:
Degree: Merge the regions for associated nodes that combine to result in the highest (or lowest) degree.
Edge weights: Combine regions for the nodes that are connected by edges with the highest (or lowest) weight.
Cohesion: We define the cohesion (clustering affinity) of a region with other regions as the ratio of common edges between these regions to its degree. This metric will merge regions that are the most (or least) cohesive.

Spatialization-based Distance Metrics
These metrics use the spatialization data to define distances between cells.
While these do not depend on the graph topology, they result in a more optimized visual representation:
Minimum area: Merge the two adjacent regions that have the minimum combined area. This converges towards uniform cell sizes at each level of the aggregation.
Maximum area: Merge the two adjacent regions with the maximum combined area. This preserves small cells, which for many graph layout algorithms are highly connected, as well as central nodes in the center of the space.
Rectangle completion: To avoid irregular polygons, use a distance metric based on how closely two cells form a complete rectangle (calculated as the ratio between the combined area and that of their bounding box).

5. Wordonoi: Implementation
We implemented a Wordonoi prototype consisting of two components: a preprocessor that performs off-line spatialization, and an interactive tool that displays the visualization.

Preprocessing
The preprocessing tool loads knowledge networks in GraphML format, generated from some earlier generation stage (Section 3), and performs the spatialization process presented in Section 4.1. Our implementation uses Noack's lin-log graph layout [Noa05] to find a 2D layout for the nodes, and then tessellates the space using a standard Voronoi implementation. We also make sure to preserve the textual content (extracted while generating the knowledge network) from the GraphML input file in the representation. Finally, the tool computes the complete aggregation hierarchy (several distance metrics are available for use) and saves the cell shapes, the text summaries, and the aggregation hierarchy to a custom XML format. Typical computing time for the preprocessor on a network consisting of approximately 2,000 nodes and 5,000 edges is on the order of 5 seconds. The resulting XML file, which includes the aggregation hierarchy and the Voronoi tessellation at all levels, is approximately 5 MB in size for the above dataset.

Many stages in our implementation are computationally expensive. By precalculating the aggregation hierarchy, we avoid long run-times and can quickly render all the components at a particular hierarchy level without new computation. The result is a smooth and interactive visualization.

Figure 4: Interactions in our Wordonoi prototype implementation. (a) Search and highlighting. (b) Show text visualization without clipping. (c) Show underlying node-link diagram. (d) Show child cells.

Visualization
The interactive tool is built using the Piccolo [BGM04] toolkit for 2D vector graphics. The tool loads the preprocessed data, including the 2D cell shape vectors and the aggregation hierarchy, and visualizes it as an interactive application. Inside each cell, the tool visualizes the textual content of the cell (and any aggregated children) using a method dependent on the screen space allocation (Section 4.2). We use a deterministic Wordle layout [KKEE11] to avoid the random and unstable layouts of the original Wordle [VWF09].

Interactions
Users can interact with the Wordonoi prototype as follows:
Search: The user can type in a query, and cells that match will be highlighted while others are dimmed (Figure 4(a)).
Pan & zoom: Users can pan and zoom in the visualization. Semantic zooming will change the visual representation of text depending upon the screen size of each cell.
Aggregation: A slider (or mouse wheel) controls how many cells to display. Changing this setting will dynamically drill down or roll up the visual aggregation.
Show text: Disable clipping of the text visualization and show the full contents of the current cell (Figure 4(b)).
Show node-link: Display the node-link diagram for the current cell, as well as for neighbors (Figure 4(c)).
Show children: Cells may be aggregates of multiple cells. This interaction mode will show all of the child cells of the current cell under the mouse cursor (Figure 4(d)).
Interactive coloring: There are two options available for interactive coloring to recover network topology:
Hover diffusion mode: Color is dynamically assigned to the clicked cell and diffused to its neighbors.
Full color diffusion mode: When starting this mode, all cells become white. Users can then iteratively select a color from a palette, assign it to a cell, and the color will be diffused across the representation (Figure 5).
Cohesion color mapping: This mode switches from the default categorical color coding to coloring nodes based on the cohesion of each cell, i.e., the ratio of internal edges to the total number of edges for the cell (Figure 6). This helps in finding communities in the knowledge network.
Reset: A reset option is also available to revert the visualization to its original appearance.

Figure 5: Interactive color diffusion where cells have been colored in the order shown as red, blue, light green, dark green, and cyan, and then diffused based on connectivity.

6. Examples
We showcase Wordonoi by applying it to three examples: a research organization, hypertext documents, and NSF funding data. Below we explain these examples in detail.

Research Organization
The motivation for this example is to support policymakers and researchers alike in understanding the size, scope, and research topics in a research organization for homeland security. The original dataset is an SQL database consisting of a network of research centers, the associated faculty and students, their institutions, their publications, and the projects they work on. Some of the tasks that a program manager might want to perform include the following:
T1: What research is happening in the organization?
T2: What research is done at particular centers?
T3: What research is done by particular researchers?
T4: How well are the partners collaborating?
T5: Which reports deal with specific keywords?

Several full texts are available that characterize the network, including project descriptions, paper abstracts, center mission statements, and investigator websites. In a manner similar to that described by Liu et al. [LNS11], we extract a knowledge network by mapping the tabular data to an entity-relationship model and summarizing the descriptive text for each entity using tf-idf. The resulting GraphML file consists of approximately 2,000 nodes and 5,000 edges. The file is then used as input to the Wordonoi pipeline.

Figure 1 shows a screenshot of the top 26 nodes from the interactive visualization of this knowledge network. It serves as an overview of the research done in the organization, showing that the focus is on topics such as food protection, transportation safety, health and disease management, communication, and training (T1). Users can navigate, drill down, and explore this dataset further using the interactions described in the previous section. This allows the audience to see the research landscape at all levels of scale, identify gaps, and find commonalities between projects. For example, if a program manager searches for a particular center, the summary of topics in the resulting regions gives an idea of the research done in that center (T2). For task T3, the user can use the search and show node-link interactions. The interactive coloring and show node-link interactions reveal collaborations, and the cohesion color mapping indicates the amount of collaboration (T4). After searching for specific keywords, the papers or project reports corresponding to the resulting cells can be opened (T5).

Figure 6: Color scale example showing regions having higher cohesion (more internal edges) as darker green.

Hypertext Documents
Hypertext document networks are extremely common, with the web being the canonical hypertext collection, but visualizing Internet content is also notoriously difficult [Car96].

We use a straightforward approach to generate knowledge networks from the Internet by implementing a simple web crawler that takes the URL of a webpage as a starting point and processes all of the outgoing links from that webpage in a breadth-first fashion. Each processed webpage is added as a connected node with edges for all of its hyperlinks, and we also extract the text on the page: title, metadata, and document content (the latter summarized using tf-idf). The crawl is stopped at a given size (we use 500 documents in these examples), and the resulting data is stored as a GraphML file that can later be visualized. Some of the tasks the user might want to perform on such knowledge networks include the following:

T1: Summarize a given webpage and its neighbors.
T2: Summarize all webpages related to a keyword.
T3: Find webpages with few or many links to other pages.
T4: Given a keyword, find related keywords.
T5: Open all webpages related to a keyword.

Figure 7: Top 50 nodes for a network starting from the Wikipedia article on visualization.

Figure 7 shows a Wordonoi using the Wikipedia article on visualization as root. The top 50 cells shown here convey the gist of the knowledge network through concepts such as graphics, maps, graph, data, crime, theory, and charts. Users not only see an overview of the whole dataset; the Wordonoi technique also lets them see details and search for a particular page or word to get a summary of the text related to it (T1 and T2). They can also use interactions such as show node-link or interactive coloring to see relationships. The cohesion color mapping helps users answer T3. Aggregation and search help answer T4; for example, in our network the fields related to visualization include graphics, data, crime, police, and maps. Searching for a specific keyword and opening all webpages associated with the resulting regions answers T5.

While we make no claims as to the utility of Wordonoi for hypertext documents in comparison to other web visualization techniques (surveyed by Card [Car96]), we do think that the technique provides some very useful new perspectives on hypertext collections. For example, Wordonoi could be applied to web browsing history data to see a summary of visited websites and their relationships.

NSF Funding Data

Federal funding agencies around the world typically make their funding data publicly available, and the U.S. National Science Foundation is no exception; the NSF award search provides fully searchable funding information on more than 300,000 awards from 1976 to today. Analyzing and understanding this portfolio of funded projects has many and diverse applications: for an investigator, this data may yield information on important topics and previous work; for a program officer, coverage and gaps in their funding portfolio; and for a policy maker, the scope of research being funded. Some tasks involving such knowledge networks include the following:

T1: What are the major research areas funded?
T2: Find the areas in which a person is getting funding.
T3: Open all grant proposals related to a keyword.

To apply our Wordonoi visualization to this data, we first downloaded the full set of current awards for our own university. Based on our generation process (Section 3), we mapped this tabular data to an entity-relationship model consisting of projects, investigators, co-investigators, directorates, and program officers. Project descriptions were used as the textual data characterizing the network, and we summarized this using standard tf-idf.
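As an illustration of the tf-idf summarization step, the sketch below ranks the terms of each project description and keeps the top few as keywords. It assumes one common tf-idf variant (relative term frequency times the natural log of inverse document frequency); the tokenizer, the tiny stop-word list, the function name `top_keywords`, and the sample descriptions are hypothetical and not taken from the Wordonoi implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_keywords(docs, k=3):
    """Return {doc_id: top-k terms by tf-idf} for docs given as {doc_id: text}."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    n_docs = len(docs)
    keywords = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)
        scores = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[doc_id] = ranked[:k]
    return keywords


if __name__ == "__main__":
    descriptions = {
        "p1": "earthquake sensor network for earthquake monitoring",
        "p2": "sensor network for crop irrigation",
        "p3": "crop genetics and crop yield",
    }
    for project, terms in top_keywords(descriptions, k=2).items():
        print(project, terms)
```

Terms that occur in every description receive a score of zero under this variant, which is why broadly shared vocabulary drops out of the per-entity summaries.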
We further generated edges between projects based on the co-occurrence of keywords specified for each project, as well as co-occurring concepts derived from the tf-idf process. The resulting network has approximately 1,000 nodes and 5,000 edges.

Figure 8 shows the Wordonoi for this NSF funding network for our university. The aggregation level is again chosen to show about 50 nodes to yield a high-level overview. It is clear from the visualization that most of the grants are related to engineering; this is accurate given the size and prominence of the college of engineering at the university. Other research topics include chemistry, earthquakes, and agriculture, which is also accurate. Another trend is the prominence of workshops and conferences, presumably reflecting awards used to fund such scientific meetings.

7. Discussion and Limitations

Our aim in this work is to visualize both the textual and structural content of knowledge networks. One potential limitation of the space-filling Wordonoi visualization is that the technique does not explicitly maintain relationships in the

visual representation. On the other hand, the tessellation itself is derived from the network topology, and we have also presented an array of interactive and visual methods for recovering topology while retaining the advantages of the space-filling representation.

Figure 8: NSF funding example with rectangle aggregation.

With this in mind, it is clear that the choice of graph layout algorithm has considerable impact on the resulting Wordonoi tessellation. A layout algorithm that clusters highly connected nodes in close proximity will result in a more cohesive representation, meaning that the generated cells at all levels of aggregation will be more faithful to the network topology. The lin-log layout we use has this property, but one of its drawbacks is that it is non-deterministic, i.e., it generates a different layout for each invocation. This may be detrimental for users trying to maintain a mental map of the Wordonoi representation across spatializations.

Our choice of text visualization is solely based on tag clouds, and there may be other and more efficient text visualization techniques to use for this purpose. Nevertheless, we think that any text visualization will benefit from having a maximum of available space inside the Wordonoi cells, and it should therefore be possible to simply plug one into the existing layout. Similarly, our use of tf-idf for text mining and extraction should not be seen as indicative of limitations in the Wordonoi technique itself; we certainly think more advanced text analytics algorithms can be used in its stead.

It should also be noted that we provide distance metrics for aggregation based both on spatial information and on network topology. Different metrics have different strengths and weaknesses. While topology-based metrics are clearly the most faithful to the original network structure, they may yield aggregations that are difficult to use efficiently for text visualization. Furthermore, even spatially based metrics may yield less-than-ideal layouts. These effects were the reason we devised the rectangular completion distance metric, where cell aggregates tend towards a rectangular and thus more efficient shape.

Another limitation of our implementation, although not necessarily of the general technique, is that we do not group or even link keywords that refer to the same concept. For example, the keywords "first responder" and "emergency personnel" are almost synonyms, but our naïve co-occurrence mapping would not catch this connection. A more sophisticated approach, left for future work, would use a word ontology to make such connections.

Finally, there exist several additional examples where the Wordonoi technique can be applied. For example, social media data from sites such as Facebook, Twitter, and MySpace contain both relations and large amounts of textual data, and are therefore potential applications for the technique. Another example is crime and incident reports, where the technique could summarize types of crime, relations between different incidents, and the crime trend in a particular area.

8. Conclusion and Future Work

We have presented a novel visualization technique called Wordonoi that combines both the structure and textual content in knowledge networks. While this approach sacrifices some of the structure from the original network in favor of textual content, it is highly amenable to hierarchical aggregation to combat large scale, and we also provide multiple interactive and visual methods for recovering this lost structure. We have also demonstrated the utility of the Wordonoi technique in three examples of knowledge networks. Several potential future directions exist. We plan to deploy and evaluate the system in a large research organization.
We will bring in more text mining and analytics, such as topic modeling and word ontologies, to improve the text visualization component.

References

[BGM04] BEDERSON B. B., GROSJEAN J., MEYER J.: Toolkit design for interactive structured graphics. IEEE Transactions on Software Engineering 30, 8 (2004).
[Car96] CARD S. K.: Visualizing retrieved information: A survey. IEEE Computer Graphics and Applications 16, 2 (Mar. 1996).
[CCP09] COLLINS C., CARPENDALE M. S. T., PENN G.: DocuBurst: Visualizing document content using language structure. Computer Graphics Forum 28, 3 (2009).
[Che76] CHEN P. P.-S.: The entity-relationship model toward a unified view of data. ACM Transactions on Database Systems 1, 1 (1976).
[Cla] CLARK J.: Clustered word clouds. http://neoformix.com/2008/clusteredwordclouds.html. Oct
[CSL 10] CAO N., SUN J., LIN Y., GOTZ D., LIU S., QU H.: FacetAtlas: Multifaceted visualization for rich text corpora. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010).
[CVW09] COLLINS C., VIÉGAS F. B., WATTENBERG M.: Parallel tag clouds to explore faceted text corpora. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (2009).
[DZG 07] DON A., ZHELEVA E., GREGORY M., TARKAN S., AUVIL L., CLEMENT T., SHNEIDERMAN B., PLAISANT C.: Discovering interesting usage patterns in text collections: integrating text mining with visualization. In Proceedings of the ACM Conference on Information and Knowledge Management (2007).
[EF10] ELMQVIST N., FEKETE J.-D.: Hierarchical aggregation for information visualization: Overview, techniques and design guidelines. IEEE Transactions on Visualization and Computer Graphics 16, 3 (2010).
[Eic94] EICK S. G.: Graphically displaying text. Journal of Computational and Graphical Statistics 3, 2 (1994).
[GHK10a] GANSNER E., HU Y., KOBOUROV S.: GMap: Visualizing graphs and clusters as maps. In Proceedings of the IEEE Pacific Visualization Symposium (2010).
[GHK10b] GANSNER E., HU Y., KOBOUROV S.: Visualizing graphs and clusters as maps. Computer Graphics and Applications 30, 6 (2010).
[GHKV09] GANSNER E., HU Y., KOBOUROV S., VOLINSKY C.: Putting recommendations on the map: visualizing clusters and relations. In Proceedings of the ACM Conference on Recommender Systems (2009).
[GHN12] GANSNER E. R., HU Y., NORTH S. C.: Visualizing streaming text data with dynamic maps. CoRR abs/ (2012).
[Hea99] HEARST M. A.: Untangling text data mining. In Proceedings of the Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (1999).
[Hea09] HEARST M.: Search user interfaces. Cambridge University Press.
[HHWN02] HAVRE S., HETZLER E., WHITNEY P., NOWELL L.: ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics 8, 1 (Jan. 2002).
[HR08] HEARST M. A., ROSNER D. K.: Tag clouds: Data analysis tool or social signaller? In Proceedings of the Hawaii International Conference on System Sciences (2008).
[Ins85] INSELBERG A.: The plane with parallel coordinates. The Visual Computer 1, 2 (1985).
[Jon72] JONES S. K.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28 (1972).
[KKEE11] KIM K., KO S., ELMQVIST N., EBERT D. S.: WordBridge: using composite tag clouds in node-link diagrams for visualizing content and relations in text corpora. In Proceedings of the Hawaii International Conference on System Sciences (2011).
[KLKS10] KOH K., LEE B., KIM B., SEO J.: ManiWordle: Providing flexible control over wordle. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010).
[Koh82] KOHONEN T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 1 (1982).
[LNS11] LIU Z., NAVATHE S. B., STASKO J. T.: Network-based visual analysis of tabular data. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (2011).
[LRKC10] LEE B., RICHE N. H., KARLSON A. K., CARPENDALE S.: SparkClouds: visualizing trends in tag clouds. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010).
[Noa05] NOACK A.: Energy-based clustering of graphs with nonuniform degrees. In Proceedings of the International Symposium on Graph Drawing (2005).
[Pal02] PALEY W. B.: TextArc: Showing word frequency and distribution in text. In Poster Proceedings of the IEEE Symposium on Information Visualization (2002).
[SGL08] STASKO J. T., GÖRG C., LIU Z.: Jigsaw: supporting investigative analysis through interactive visualization. Information Visualization 7, 2 (2008).
[Shn96] SHNEIDERMAN B.: The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages (1996).
[SWL 10] SHI L., WEI F., LIU S., TAN L., LIAN X., ZHOU M. X.: Understanding text corpora with multiple facets. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (2010).
[VGD06] VIÉGAS F. B., GOLDER S., DONATH J.: Visualizing content: portraying relationships from conversational histories. In Proceedings of the ACM Conference on Human Factors in Computing Systems (2006).
[vhwv09] VAN HAM F., WATTENBERG M., VIÉGAS F. B.: Mapping text with phrase nets. IEEE Transactions on Visualization and Computer Graphics 15, 6 (2009).
[VWF09] VIÉGAS F. B., WATTENBERG M., FEINBERG J.: Participatory visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics 15, 6 (2009).
[Wat02] WATTENBERG M.: Arc diagrams: Visualizing structure in strings. In Proceedings of the IEEE Symposium on Information Visualization (2002).
[Wat06] WATTENBERG M.: Visual exploration of multivariate graphs. In Proceedings of the ACM Conference on Human Factors in Computing Systems (2006).
[WLS 10] WEI F., LIU S., SONG Y., PAN S., ZHOU M., QIAN W., SHI L., TAN L., ZHANG Q.: TIARA: a visual exploratory text analytic system. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (2010).
[WMP 05] WONG P. C., MACKEY P., PERRINE K., EAGAN J., FOOTE H., THOMAS J.: Dynamic visualization of graphs with extended labels. In Proceedings of IEEE Symposium on Information Visualization (2005).
[WTP 95] WISE J. A., THOMAS J. J., PENNOCK K., LANTRIP D., POTTIER M., SCHUR A., CROW V.: Visualizing the non-visual: Spatial analysis and interaction with information from text documents. In Proceedings of the IEEE Symposium on Information Visualization (1995).
[WV08] WATTENBERG M., VIÉGAS F. B.: The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics 14, 6 (Nov./Dec. 2008).