Big Data: Rethinking Text Visualization

Dr. Anton Heijs
anton.heijs@treparel.com
Treparel
April 8, 2013

Abstract

In this white paper we discuss text visualization approaches and their importance for text analytics as done with the KMX technology of Treparel. Text visualization is most powerful when it supports understanding complex patterns in data and supports decision making. Statistical and machine learning techniques are used to find patterns and relationships that can then be visualized. Classification and clustering are two fundamental approaches in text analytics, and the visualization of classified documents and of clustered documents are therefore the two important visualization approaches discussed here.

1 Introduction: seeing the unseen

Visualization is the process of constructing a visual image in the mind to understand the data better. Although this is an accurate description of the word, visualization has increasingly become an external process rather than a purely mental one. This shift indicates that a broader definition of the term is needed, such as:

Visualization is a method of computing. It transforms the symbolic into the geometric, enabling researchers to observe their simulations and computations. Visualization offers a method for seeing the unseen. It enriches the process of scientific discovery and fosters profound and unexpected insights.

Visualization enables man to comprehend large data sets, data sets which are too large to grasp by mental imagination. Visualization enables the discovery of previously unknown properties of the data set which may not have been anticipated. The perception of these properties or patterns can lead the user to develop new insights. Visualization often reveals inherent problems of the data; for instance, errors and artifacts may be readily revealed.
Visualization enables the examination of both the large-scale features and the local features of the data set, allowing the user to see local features within a larger frame of reference. Visualization allows the user to form hypotheses based on the (newly) observed phenomena or developed insights. Ideally, visualization should provide a means to overview, explore and navigate large multidimensional data sets.

Visualization is needed to understand data, and data is becoming more important but also more complex. We need to understand our data, in both its unstructured and its structured form, and extract all relevant patterns and trends from it to obtain a clear picture and make well-judged decisions. For this we need data analytics, and especially data/text mining and machine learning techniques that can analyze large complex data sets. Machine learning algorithms help us to model a pattern in the data set and describe it mathematically, which is very powerful since a mathematical description of a pattern or
a trend is the most precise way of capturing and describing the data. We can then compare multiple patterns in the data at the level where one can draw conclusions taking into account all relevant information (captured by these models). This is the level where the model provides a meaning (interpretation) in the real world: the semantic level of the pattern or trend in the data. This is also the level where reasoning on patterns in the data can start.

Now what kinds of patterns can exist? This amounts to asking what kinds of models we can find when analyzing any kind of data. We can find linear models and non-linear models, and the relationships between these models. Their mathematical description is well suited for further analysis, but visualization is a very powerful way to understand the models by seeing the pattern or trend in the data. The visualization techniques needed to support this are techniques that can show the type of relationships between the data points (documents in the case of text analytics). These relationships can be supervised (categorical) or unsupervised (similarity relationships as in clusters), but they can also be hierarchical relationships or relationships over time. There are visualizations of table data (rows and columns), hierarchical data, classified (categorical) data, clustered data, data over time, and correlation data.

The relationships in the data that these visualization techniques reveal can be linked, which is very important for drawing conclusions from the data. In KMX, interaction in one visualization is therefore coupled to the other visualizations. This is called multiple coupled view visualization, and it has recently become more important since in data analysis one looks at all related data, which also means analyzing multiple data sets combined. In the scientific literature one can find many papers on the details behind the visualization techniques we mention here.
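To make the idea of multiple coupled views concrete, here is a minimal sketch in Python. It is not the KMX implementation: the class names `SharedSelection` and `View`, and the document id `DOC-A`, are hypothetical and introduced only for illustration. It assumes a single shared selection model that every view observes.

```python
from typing import Callable, Iterable, List, Set


class SharedSelection:
    """Holds the selected document ids and notifies every registered view.

    Hypothetical illustration of coupled views, not the KMX design.
    """

    def __init__(self) -> None:
        self._selected: Set[str] = set()
        self._listeners: List[Callable[[Set[str]], None]] = []

    def subscribe(self, listener: Callable[[Set[str]], None]) -> None:
        self._listeners.append(listener)

    def select(self, doc_ids: Iterable[str]) -> None:
        # Replace the current selection and push a copy to all views.
        self._selected = set(doc_ids)
        for listener in self._listeners:
            listener(set(self._selected))


class View:
    """A visualization that both shows and can change the shared selection."""

    def __init__(self, name: str, selection: SharedSelection) -> None:
        self.name = name
        self.highlighted: Set[str] = set()
        self._selection = selection
        selection.subscribe(self._on_selection)

    def _on_selection(self, doc_ids: Set[str]) -> None:
        # A real view would redraw here; the sketch only records the ids.
        self.highlighted = doc_ids

    def brush(self, doc_ids: Iterable[str]) -> None:
        # Interaction in this view is forwarded to the shared model.
        self._selection.select(doc_ids)


selection = SharedSelection()
tree_map = View("tree map", selection)
cluster = View("cluster", selection)
timeline = View("over time", selection)

# Brushing in the cluster view highlights the same documents everywhere.
cluster.brush(["EP0058137", "DOC-A"])
```

Because every view talks only to the shared model, a selection made in any one view reaches all the others without the views knowing about each other.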
KMX technology uses these advanced visualizations as part of the analysis pipeline, with support for multiple selections of data points (which can be patents, research, legal or news documents) in different visualizations. These are the multiple coupled views: when a user selects one or more documents, the selection is shown in all available visualizations, and the interaction is supported from every visualization.

Figure 1: GUI showing multiple views with different visualizations of a large patent data set.

Figure 2: (a) One example of a tree map showing patents on chemistry. Interaction with the tree map visualization also requires support for drilling down into the data: if one wants to see all patents in C07, one can update the visualization and show patents deeper in the C07 classification tree. This is one of the strengths of the tree map algorithm. (b) The same example shows how coloring can be used to show a parameter such as the number of patents in a class, from green (large number) to black (small number of patents in that subclass).

Let us first take a brief look at how exactly we arrive at a visualization from the original raw data. The visualization pipeline is the name of the sequence of processes that create a visual representation of data. Before the visualization pipeline is entered, a quantity of data is generated, either from databases or by any other means of data collection. The visualization pipeline basically consists of four steps.

1. Data analysis is the first step in the visualization pipeline. During data analysis the data is prepared for visualization. Basically this means that a number of operations can be performed on the data to make it more suitable for visualization.

2. After completing the data analysis step, the raw data has been transformed into data that can be visualized. However, this does not mean that all of the data is of interest. Only the portions of the data that are of interest should be visualized, and hence the second step in the visualization pipeline is a data selection step, so that only focus data remains in the pipeline. Usually this part of the pipeline features some user interaction to decide on the sections of interest.

3. Now that the focus data has been decided, the next step is the mapping step of the visualization pipeline. In this part of the pipeline the data is mapped to renderable representations. These representations are geometric primitives such as lines, surfaces, points and voxels, with attributes such as color, position, size, transparency and texture.

4. After the data mapping, all that remains is the final rendering of the geometric data. Rendering is creating an image from a model. Operations performed here include viewing transformations, lighting calculations, hidden surface removal, scan conversion and anti-aliasing. The final visualization is created and either written to file or displayed on the screen.
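The four pipeline steps described above can be sketched in Python as four small functions chained together. This is only an illustration under stated assumptions: the record fields (`id`, `ipc`, `year`, `score`), the sample records, the 0.5 color threshold and the text-based "rendering" are all hypothetical and are not taken from KMX.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Point:
    """A renderable primitive with position, color and label attributes."""
    x: float
    y: float
    color: str
    label: str


def analyze(raw: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Step 1, data analysis: drop incomplete records and convert types."""
    prepared = []
    for rec in raw:
        if rec.get("year") and rec.get("score"):
            prepared.append({"id": rec["id"], "ipc": rec["ipc"],
                             "year": int(rec["year"]),
                             "score": float(rec["score"])})
    return prepared


def select(records: List[Dict[str, Any]], ipc_prefix: str) -> List[Dict[str, Any]]:
    """Step 2, data selection: keep only the focus data."""
    return [r for r in records if r["ipc"].startswith(ipc_prefix)]


def map_to_points(records: List[Dict[str, Any]]) -> List[Point]:
    """Step 3, mapping: turn records into geometric primitives."""
    return [Point(x=float(r["year"]), y=r["score"],
                  color="green" if r["score"] >= 0.5 else "black",
                  label=r["id"])
            for r in records]


def render(points: List[Point]) -> List[str]:
    """Step 4, rendering: here a textual stand-in for drawing an image."""
    return [f"{p.label}: point at ({p.x:.0f}, {p.y:.1f}) in {p.color}"
            for p in points]


# Hypothetical raw records; P3 has no score and is dropped in step 1.
raw = [
    {"id": "P1", "ipc": "C07K", "year": "2005", "score": "0.9"},
    {"id": "P2", "ipc": "A61K", "year": "2006", "score": "0.4"},
    {"id": "P3", "ipc": "C07D", "year": "2007", "score": ""},
    {"id": "P4", "ipc": "C07C", "year": "2007", "score": "0.2"},
]

points = map_to_points(select(analyze(raw), "C07"))
image = render(points)
```

Keeping each step a separate function mirrors the pipeline structure: user interaction changes only the argument of `select`, and the mapping and rendering steps can be rerun unchanged.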
The resulting visualization should ideally be expressive, effective and appropriate. Expressive means that the visualization should display only the relevant information of a data set. Effective means that it complements the user's capabilities of perception and the mental image that the user has of the visualization. Finally, an appropriate visualization is one in which the effort of creating the visualization does not outweigh the benefits of the resulting visualization.

The first step in the visualization of patent data is often searching or filtering the data. To extract the patterns, text mining can strongly contribute to the visualization. Some important analysis tasks for a user are:

1. the visualization of a document collection against a known set of classes
2. the visualization of a document collection against an unknown set of classes
3. the visualization of a document collection in the context of its hierarchy
4. the visualization of a document collection over time

1.1 Treemap visualization
The first two tasks can be implemented using supervised and unsupervised machine learning techniques, through which automatic classification and clustering of the data is done. This data is then processed in the visualization pipeline to provide insight into the classified and clustered patent data. Since patent data contains classification codes, the data can be hierarchically ordered, for instance in the IPC classification. To provide insight into a collection of patent data we also provide an approach to visualize hierarchical patent data using a tree map algorithm. Patent data also contains time stamps, through which a collection of patents can be analyzed over time. For this we implemented a visualization of the change over time in the number of patents from a collection that belong to a patent class.

Tree mapping is a method for displaying tree-structured data using nested rectangles, which provide overview and selection of data points. An example is given in figure 6 below, where there are documents in classes A and H; class A has three subcategories (A1, A2 and A3), one of which is selected, and all documents in that class are shown in red. Within the tree map the user has an overview of the classes and the number of patents in those classes for the full collection. With a mouse-over the user can get additional information about a patent and can add or remove one or more patents from the currently selected set. When one selects one box, one selects one document (such as EP0058137 in the example). The tree map visualization is very powerful since, in a fixed screen space, the tree mapping algorithm can show all hierarchical data points (patent documents) and provide both an overview and a good selection mechanism.
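How nested rectangles fill a fixed screen space can be illustrated with the classic slice-and-dice tree map layout. The sketch below is a hedged example: the source does not state which tree map variant KMX uses, and the class names and patent counts are hypothetical. Each node receives a rectangle whose area is proportional to the number of documents beneath it, and the split direction alternates per level.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height


def weight(node) -> int:
    """Number of documents under a node (a leaf is an int count)."""
    if isinstance(node, int):
        return node
    return sum(weight(child) for child in node.values())


def layout(node, x: float, y: float, w: float, h: float,
           depth: int = 0, name: str = "root",
           out: Optional[Dict[str, Rect]] = None) -> Dict[str, Rect]:
    """Slice-and-dice layout: split along x on even depths, along y on odd.

    Assumes node names are unique, since they are used as dictionary keys.
    """
    if out is None:
        out = {}
    out[name] = (x, y, w, h)
    if isinstance(node, int):
        return out
    total = weight(node)
    offset = 0.0  # fraction of the parent already handed out
    for child_name, child in node.items():
        share = weight(child) / total
        if depth % 2 == 0:
            layout(child, x + offset * w, y, share * w, h,
                   depth + 1, child_name, out)
        else:
            layout(child, x, y + offset * h, w, share * h,
                   depth + 1, child_name, out)
        offset += share
    return out


# Hypothetical hierarchy: class A with three subclasses, and class H.
tree = {"A": {"A1": 3, "A2": 1, "A3": 2}, "H": 2}
rects = layout(tree, 0.0, 0.0, 8.0, 6.0)
```

With 8 documents on an 8 by 6 canvas, every document accounts for 6 units of area, so class A (6 documents) takes three quarters of the width and its subclasses divide that strip by height.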
Figure 3: Combined use of two visualizations in KMX (tree map and clustering) to show the patent data hierarchically (tree map) and unsupervised (3D clustering, where the height is the density of the patents, especially prominent for inorganic and organic chemistry); the color is used to display the pattern in the patent data over time.

We can also combine two visualizations, as shown in Figure 3, where the tree map coloring is used to show the patents over three years (2005, 2006, 2007) and the cluster visualization shows the same documents arranged by their similarity, as calculated by the machine learning algorithm for clustering in KMX. The clustering of documents helps to analyze a collection of patents and gain insight into the natural grouping of the patents. In the cluster visualization, the user can easily select documents by brushing, i.e. selecting them using the mouse. By brushing in a cluster or a parallel coordinates visualization the user gets feedback about the selected documents, which greatly helps in the selection of documents and is an example of the multiple coupled views support mentioned earlier. One can use multiple brushes to make both a rough selection and a more precise selection, which gives the user feedback on a larger selected set of documents and also on a smaller set. The use of multiple brushes also helps the analyst to explore the documents directly in a tree map visualization and in a visualization of the documents over time. This helps to understand whether a brushed set of documents that are close together in a cluster visualization are also hierarchically close together in the tree map visualization. Additionally, one can analyze this over time, which lets the user see whether documents that are clustered close together are also close together in time. If one wants to check whether there is a trend in a certain technology over time, this is a logical way to analyze and explore it.

1.2 Parallel coordinates visualization

When we have a set of documents selected, for instance by filtering or brushing (see right cluster image), we can show the distribution of the classification scores for the selected set (in the example, the documents on Ebola, SARS and H5N1). This is done using three parallel, vertically oriented coordinate axes on which the classification score runs from 0.0 (bottom) to 100 (top); each document is a line passing through the three axes. One can immediately see the documents that are selected in one cluster and that have a high score on one class and a low score on the other classes. In the KMX example shown this holds for all classes, which shows the high performance of the classifiers. Parallel coordinates is a very general visualization technique that can map multivariate data belonging to text data.
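As a small illustration of the parallel coordinates mapping described above, the following Python sketch turns each document's classification scores into the vertices of one polyline across three vertical axes. The document names, scores, axis geometry and the `high`/`low` thresholds are hypothetical values chosen for the example, not KMX output.

```python
from typing import Dict, List, Optional, Tuple


def polyline(scores: List[float], axis_spacing: float = 100.0,
             axis_height: float = 200.0,
             max_score: float = 100.0) -> List[Tuple[float, float]]:
    """Map one document's scores to (x, y) vertices, one per axis.

    Axis i stands at x = i * axis_spacing; a score of 0 maps to the bottom
    (y = 0) and max_score to the top (y = axis_height).
    """
    return [(i * axis_spacing, s * axis_height / max_score)
            for i, s in enumerate(scores)]


def dominant_class(scores: List[float], high: float = 80.0,
                   low: float = 20.0) -> Optional[int]:
    """Return the axis index with a high score if all others are low."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    others_low = all(s <= low for i, s in enumerate(scores) if i != top)
    return top if scores[top] >= high and others_low else None


# Hypothetical scores on three classifiers: Ebola, SARS, H5N1.
axes = ["Ebola", "SARS", "H5N1"]
documents: Dict[str, List[float]] = {
    "doc-ebola": [95.0, 5.0, 10.0],
    "doc-sars": [8.0, 90.0, 12.0],
    "doc-mixed": [60.0, 55.0, 10.0],
}

lines = {name: polyline(scores) for name, scores in documents.items()}
labels = {name: dominant_class(scores) for name, scores in documents.items()}
```

A document that peaks on one axis and stays near the bottom on the others is the visual signature of a clean classification; `doc-mixed` has no such peak and gets no dominant class.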
Here we have explained it with an example related to clustering, classification and two types of visualizations.

When we have classified all patents in KMX, we can use the classification scores to calculate the correlation between all patents and visualize it. This provides valuable insights into aspects that one cannot determine with a query-based approach. On both the vertical and the horizontal axis of the correlation visualization (matrix) we have the classification codes (IPC for instance), and therefore the visualization is symmetric. There are documents in different classes (as with pesticides) that, although they are in different classes, can still share a strong correlation, as shown for patents in classes C07K02 and A61K05, which have a correlation coefficient of 0.75 in the visualization. Seeing where these strongly correlated classes are is easy and valuable, and this information cannot be determined with a query-based approach. Seeing where many correlating classes occur is also possible directly in one picture, which shows the strength of overview first and details on demand later when using visualizations.
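One way to read the correlation matrix is as pairwise correlation between the score vectors of the classes. The sketch below assumes Pearson correlation, which the source does not specify, and uses hypothetical scores for four patents; the class code `H01L` is added only as a contrasting example.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Symmetric matrix with the class codes on both axes."""
    codes = list(scores)
    matrix: Dict[str, Dict[str, float]] = {c: {} for c in codes}
    for i, row in enumerate(codes):
        for col in codes[i:]:
            r = pearson(scores[row], scores[col])
            matrix[row][col] = r
            matrix[col][row] = r  # mirror: the matrix is symmetric
    return matrix


# Hypothetical classification scores of four patents per class.
scores = {
    "C07K02": [0.9, 0.8, 0.1, 0.2],
    "A61K05": [0.8, 0.9, 0.2, 0.1],
    "H01L": [0.1, 0.2, 0.9, 0.8],
}

matrix = correlation_matrix(scores)
```

In this toy data the two chemistry and pharmaceutical classes score the same patents highly and therefore correlate strongly, while the third class correlates negatively with both.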
1.3 Visualization of patterns over time in a document set

When one wants to understand patent data over time, it is valuable to be able to analyse the patents as part of a class that captures documents about the same subject, classifications and concepts. This can be done using classification and/or clustering, after which we can visualise the increase or decrease in the number of patents over time, where the width of each class band shows the trend. This is shown for patent classification classes, but it can also be done on non-patent literature, for instance with the MeSH terms of PubMed documents.
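The band widths in such a trend visualization are simply the number of documents per class per year. A minimal Python sketch, using hypothetical (class, year) pairs rather than real patent data:

```python
from collections import Counter
from typing import Dict, List, Tuple


def band_widths(patents: List[Tuple[str, int]],
                years: List[int]) -> Dict[str, List[int]]:
    """Count documents per class for each year; one list of widths per class."""
    counts = Counter(patents)  # (class, year) -> number of documents
    classes = sorted({cls for cls, _ in patents})
    return {cls: [counts[(cls, year)] for year in years] for cls in classes}


# Hypothetical collection: each entry is (patent class, publication year).
patents = [
    ("C07", 2005), ("C07", 2006), ("C07", 2006),
    ("C07", 2007), ("C07", 2007), ("C07", 2007),
    ("A61", 2005), ("A61", 2005), ("A61", 2007),
]
years = [2005, 2006, 2007]

bands = band_widths(patents, years)
# A positive value means the band widens over the period, a negative one
# that it narrows.
trend = {cls: widths[-1] - widths[0] for cls, widths in bands.items()}
```

The same counting works for any label that groups documents, such as cluster ids or subject terms, as long as each document carries a time stamp.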
Figure 4: (a) Parallel coordinates visualization and (b) cluster visualization of 3 Medline clusters (Ebola (purple), H5N1 (blue) and SARS (yellow)).

Figure 5: Here we show the use of parallel coordinates where we sorted the patent scores by the most important classes (coordinates). The decay shows that all patents belong distinctively to the first class shown (first coordinate) and thereafter to one or two additional classes, though less dominantly. The gray cylinders indicate the number of patents in that range of the classification score, which helps to read and interpret the patent data.
Figure 6: (a) Correlation visualization between many patents and their patent classes. (b) Trends of the patents over time for different patent classes.
2 Text Analytics visualizations

Treparel's KMX big data text analytics solution is a client-server software platform. The KMX API makes the system open for integration with existing technologies. The client GUI is a native Windows application, of which a screenshot is shown in Figure 7. The solution comes as a very flexible and scalable system in terms of performance and system management. The scalability of the solution makes it possible to handle both the growing amount of data and the growing complexity of the data at hand at predictable cost.

Figure 7: Overview of the KMX Patent Analytics GUI showing patent titles and their labels, the cluster visualization, a section of the full text of a selected patent (see the cross hair in the visualization) and the brushes (green and red) indicating the training documents of the classifier. The classification score is shown from blue (positive) to yellow (negative) in the patent landscape.

About Treparel

Treparel is a leading global software provider in Big Data Text Analytics and Visualization. The KMX platform allows organizations to enhance innovation processes, improve competitive advantage, mitigate litigation risk and cost, and manage interactions with customers by gaining insights from numerous sources of unstructured data (text, application notes, images, blogs, email and patents). Global companies, government agencies, software vendors and data publishers are using Treparel KMX text analysis software to gain faster, reliable and precise insights into large complex unstructured data sets, allowing them to make better informed decisions.

For more information contact info@treparel.com or go to http://www.treparel.com