Visualization methods for patent data

Treparel, 2013
Dr. Anton Heijs (CTO & Founder), Delft, The Netherlands
Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands, +31 15 2600 455, www.treparel.com, info@treparel.com. IBAN: NL39.ABNA0.5555.05.278. BTW/VAT: NL 8157.09.37.7 B01. Chamber of Commerce: 27.28.57.28.

Introduction

Treparel can provide advanced visualizations for patent data. This document describes them, with a short explanation of the concepts behind them. The scientific literature contains many papers on the details of the visualization techniques mentioned here. Treparel's KMX technology uses these visualizations as part of the analysis pipeline, and it supports multiple selections of data points (patent documents) across different visualizations. We call this multiple coupled views: when a user selects one or more documents, the selection is shown in all available visualisations, and the interaction is supported from every one of them.

Visualisation is the process of constructing a visual image in the mind to understand the data better. Although this is an accurate description of the word, visualisation has increasingly become an external process rather than a purely mental one. This suggests that a broader definition of the term is needed, such as:

"Visualisation is a method of computing. It transforms the symbolic into the geometric, enabling researchers to observe their simulations and computations. Visualisation offers a method for seeing the unseen. It enriches the process of scientific discovery and fosters profound and unexpected insights."

This definition already hints at some of the benefits of computer visualisation. The literature summarises the benefits as follows:

- Visualisation enables people to comprehend large datasets that are too large to grasp by mental imagination.
- Visualisation enables the discovery of previously unknown properties of the dataset which may not have been anticipated. The perception of these properties or patterns can lead the user to develop new insights.
- Visualisation often reveals inherent problems in the data; for instance, errors and artefacts may be readily revealed.
- Visualisation enables the examination of both the large-scale features of the dataset and its local features, allowing the user to see local features within a larger-scale frame of reference.
- Visualisation allows the user to form hypotheses based on the (newly) observed phenomena or developed insights.

Ideally, visualisation should provide a means to overview, explore and navigate large multidimensional datasets. Let us first take a brief look at how we arrive at a visualisation from the original raw data. The visualisation pipeline is the sequence of processes that creates a visual representation of data. Before the pipeline is entered, data is generated from databases or any other means of data collection. The pipeline consists of four steps.

Data analysis is the first step. Here the data is prepared for visualisation: a number of operations can be performed on the data to make it more suitable for visualisation. After this step the raw data has been transformed into data that can be visualised.

This does not mean that all of the data is of interest. Only the portions of interest should be visualised, so the second step is a data selection step that leaves only the focus data in the pipeline. This part of the pipeline usually involves some user interaction to decide on the sections of interest.

Once the focus data has been chosen, the next step is mapping. In this step the data is mapped to renderable representations: geometric primitives such as lines, surfaces, points and voxels, with attributes such as colour, position, size, transparency and texture.

After mapping, all that remains is the final rendering of the geometric data. Rendering is creating an image from a model. Operations performed here include viewing transformations, lighting calculations, hidden surface removal, scan conversion and anti-aliasing.
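The four steps described above (data analysis, selection, mapping, rendering) can be sketched in a few lines of Python. This is only an illustration of the pipeline idea, not the KMX implementation: the record fields, the `Point` type, the score threshold of 50 and the text-based "renderer" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """A renderable primitive: a point with position, size and colour."""
    x: float
    y: float
    size: float
    colour: str


def analyse(raw_records):
    """Step 1, data analysis: drop incomplete records and normalise types."""
    return [
        {"id": r["id"], "year": int(r["year"]), "score": float(r["score"])}
        for r in raw_records
        if r.get("year") is not None and r.get("score") is not None
    ]


def select(records, year_from, year_to):
    """Step 2, data selection: keep only the focus data."""
    return [r for r in records if year_from <= r["year"] <= year_to]


def map_to_primitives(records):
    """Step 3, mapping: turn each record into a geometric primitive."""
    return [
        Point(x=r["year"], y=r["score"], size=4.0,
              colour="red" if r["score"] >= 50 else "grey")
        for r in records
    ]


def render(points):
    """Step 4, rendering: a plain-text stand-in for drawing an image."""
    return [f"{p.colour} point at ({p.x}, {p.y})" for p in points]


# Hypothetical raw records, as they might come from a database export.
raw = [
    {"id": "P1", "year": "2005", "score": "82.5"},
    {"id": "P2", "year": "2006", "score": None},    # incomplete, dropped in step 1
    {"id": "P3", "year": "2007", "score": "12.0"},
    {"id": "P4", "year": "1999", "score": "64.0"},  # outside the selected years
]
image = render(map_to_primitives(select(analyse(raw), 2005, 2007)))
```

Keeping each step a separate function mirrors the pipeline structure: user interaction would typically change only the arguments of `select`, while the other steps stay the same.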
The final visualisation is created and either written to file or displayed on the screen.

[Figure: Visualization pipeline]

The resulting visualisation should ideally be expressive, effective and appropriate. Expressive means that the visualisation displays only the relevant information of a dataset. Effective means that it complements the user's capabilities of perception and the mental image that the user has of the visualisation. Finally, an appropriate visualisation is one in which the effort of creating it does not outweigh the benefits it brings.

An alternative way to show the steps in the visualization process is shown below:

[Figure: Visualization pipeline]

All these steps are part of the visualization pipeline as we use it in KMX.

Visualization of the patent text

The first step in the visualization of patent data is often to search or filter the data. Text mining, used to extract the patterns, can contribute strongly to the visualization. Some important analysis tasks for a user are:

- the visualization of a patent collection against a known set of classes
- the visualization of a patent collection against an unknown set of classes
- the visualization of a patent collection in the context of its hierarchy
- the visualization of a patent collection over time

The first two tasks can be implemented using supervised and unsupervised machine learning techniques, which perform automatic classification and clustering of the data. The result is then processed in the visualization pipeline to provide insight into the classified and clustered patent data. Since patent data contains classification codes, the data can be ordered hierarchically, for instance according to the IPC classification. To provide insight into a collection of patent data we therefore also provide an approach that visualizes hierarchical patent data using a tree map algorithm. Patent data also contains time stamps, through which a collection of patents can be analysed over time. For this we implemented a visualization of how the number of patents in a collection that belong to a patent class changes over time.

Treemap visualization

Tree mapping is a method for displaying tree-structured data using nested rectangles, which provide both an overview and a way to select data points. An example is given in the figure below: there are documents in classes A and H, and class A has three subcategories (A1, A2 and A3), one of which is selected so that all documents in that class are shown in red. Within the tree map the user has an overview of the classes and the number of patents in each class for the full collection. With a mouse-over the user can get additional information about a patent and can add or remove one or more patents from the currently selected set. Selecting one box selects one document (such as EP0058137 in the example). The tree map visualization is very powerful because, within a fixed screen space, the tree mapping algorithm can show all hierarchical data points (patent documents) while providing both an overview and a good selection mechanism.

[Figure: Tree map visualization]

[Figure: One example of a tree map showing patents on chemistry.]

Interaction with the tree map visualization also requires support for drilling down into the data. If one wants to see all patents in C07, one can update the visualization to show the patents deeper in the C07 classification tree. This is one of the strengths of the tree map algorithm.
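A minimal sketch of how nested rectangles can be computed is given below. It uses the simple slice-and-dice layout, which is an assumption for illustration; the source does not say which tree map layout KMX uses. The class tree and all document identifiers other than EP0058137 are hypothetical.

```python
def count(node):
    """Number of documents below a node (leaves are lists of document ids)."""
    if isinstance(node, list):
        return len(node)
    return sum(count(child) for child in node.values())


def layout(node, x, y, w, h, depth=0, path=""):
    """Slice-and-dice layout: yield (path, x, y, w, h) for every class.

    The split direction alternates with depth, and each child receives a
    share of its parent rectangle proportional to its document count.
    Assumes every class contains at least one document.
    """
    if isinstance(node, list):
        return
    total = count(node)
    offset = 0.0
    for name, child in node.items():
        share = count(child) / total
        if depth % 2 == 0:   # even depth: place children side by side
            rect = (x + offset * w, y, share * w, h)
        else:                # odd depth: stack children vertically
            rect = (x, y + offset * h, w, share * h)
        child_path = f"{path}/{name}" if path else name
        yield (child_path, *rect)
        yield from layout(child, *rect, depth + 1, child_path)
        offset += share


# Hypothetical collection: classes A (with A1, A2, A3) and H.
tree = {
    "A": {"A1": ["EP0058137", "doc2"], "A2": ["doc3"], "A3": ["doc4"]},
    "H": ["doc5", "doc6", "doc7", "doc8"],
}
rects = {path: (x, y, w, h) for path, x, y, w, h in layout(tree, 0, 0, 100, 100)}
```

Drilling down into a class, as described above for C07, amounts to calling `layout` again with that subtree as the root and the full screen rectangle as its bounds.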

[Figure: Tree map visualization using colour to indicate the number of patents in each class]

The above example shows how colouring can be used to show a parameter such as the number of patents in a class, ranging from green (a large number of patents) to black (a small number of patents in that subclass). We can also combine two visualizations, as shown below: the tree map colouring shows the patents over three years (2005, 2006, 2007), and the cluster visualization shows the same documents arranged by their similarity, as calculated by the machine learning algorithm for clustering in KMX.
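The green-to-black colouring by class size could be computed with a simple linear colour ramp, as in the sketch below. The linear scaling and the class counts are assumptions for illustration only.

```python
def count_to_colour(count, max_count):
    """Linear ramp from black (no patents) to pure green (max_count patents)."""
    level = round(255 * count / max_count) if max_count else 0
    return f"#00{level:02x}00"


# Hypothetical number of patents per class.
counts = {"C07": 120, "A61": 40, "H01": 0}
largest = max(counts.values())
colours = {cls: count_to_colour(n, largest) for cls, n in counts.items()}
```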

[Figure: Combination of tree map and landscape (cluster) visualization with colouring over the years]

Combined use of two visualizations in KMX: the tree map shows the patent data hierarchically, and the unsupervised 3D clustering shows the density of the patents as height (especially prominent for inorganic and organic chemistry). Colour is used to display the pattern in the patent data over time.

The clustering of documents helps to analyse a collection of patents and to gain insight into the natural grouping of the patents. In the cluster visualization the user can easily select documents by brushing, i.e. selecting them using the mouse. When brushing in a cluster or a parallel coordinates visualization, the user gets feedback about the selected documents, which greatly helps in the selection. This is an example of the multiple coupled views support mentioned earlier. One can use multiple brushes to make both a rough selection and a more precise one, which gives the user feedback on a larger selected set of documents and also on a smaller set.

The use of multiple brushes also helps the analyst to explore the same documents in a tree map visualization and in a visualization of the documents over time. This helps to understand whether a brushed set of documents that are close together in a cluster visualization are also hierarchically close together in the tree map visualization. One can additionally analyse this over time, to see whether documents that are clustered close together are also close together in time. If one wants to check whether there is a trend in a certain technology over time, this is a logical way to analyse and explore it.

Parallel coordinates visualization

When we have a set of documents selected, for example by filtering or brushing (see the cluster image on the right), we can show the distribution of the classification score for the selected set (in the example below, the documents on ebola, sars and h5n1). This is done using three parallel, vertically oriented coordinates on which the classification score, from 0.0 (bottom) to 100 (top), is shown for each document. Each document is a line passing through the three axes. One can immediately see the documents that are selected in one cluster and that have a high score on one class and a low score on the other classes. In the KMX example shown below this is true for all classes, which shows the high performance of the classifiers. Parallel coordinates is a very general visualization technique and can map multivariate data belonging to text data. Here we have explained it with an example related to clustering, classification and two types of visualizations.

[Figure: Left, a parallel coordinates visualization for the patents selected in the cluster visualization on the right.]
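The geometry behind a parallel coordinates plot is straightforward, as the following sketch suggests. Each document becomes a polyline with one vertex per axis. The axis spacing, plot size, document identifiers and scores are hypothetical; only the three class names and the 0 to 100 score range come from the example above.

```python
def polyline(scores, axes, width=200.0, height=100.0, max_score=100.0):
    """Return the (x, y) vertices of one document's line across the axes.

    Axes are evenly spaced vertical lines; y runs from 0 (score 0) at the
    bottom to `height` (score `max_score`) at the top.
    """
    step = width / (len(axes) - 1)
    return [(i * step, height * scores[a] / max_score) for i, a in enumerate(axes)]


def dominant_class(scores):
    """The class on which a document scores highest."""
    return max(scores, key=scores.get)


axes = ["ebola", "sars", "h5n1"]
# Hypothetical classification scores for two brushed documents.
selected = {
    "doc1": {"ebola": 95.0, "sars": 5.0, "h5n1": 10.0},
    "doc2": {"ebola": 10.0, "sars": 20.0, "h5n1": 90.0},
}
lines = {doc_id: polyline(scores, axes) for doc_id, scores in selected.items()}
dominant = {doc_id: dominant_class(scores) for doc_id, scores in selected.items()}
```

A document with one high score and otherwise low scores, such as `doc1`, produces the sharply peaked line that indicates a well-separated class.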

Here we show an example of how parallel coordinates can be integrated in an application where the KMX algorithms calculate the patterns in the data that are shown in the visual interface. The scores of the patents are sorted by the most important coordinate classes. The decay shows that all patents belong distinctively to the first class shown (the first coordinate) and thereafter to one or two additional classes, though less strongly. The grey cylinders indicate the number of patents in each range of the classification score, which helps to read and interpret the patent data.

Another example of using parallel coordinates in KMX. We have 10 classifiers and thus 10 coordinates. When we select patents from the clustering (shown below in blue), we can see which patents (the lines) score high for which classifier. This technology can be used for tagging, and thus for data enrichment, where the visualizations play an important part in the analysis process.

[Figure: Cluster visualization of patents]

When we have classified all patents in KMX we can use the classification scores to calculate the correlation between all patents and visualise it. This provides valuable insights into aspects that cannot be determined with a query-based approach, as shown below. The vertical and horizontal axes of the correlation visualization (matrix) both carry the classification codes (IPC, for instance), and the visualization is therefore symmetric. There are documents which are in different classes (as with pesticides) and which can nevertheless share a strong correlation, as shown in the visualization below for patents in classes C07K02 and A61K05, which have a correlation coefficient of 0.75. Seeing where these strongly correlated classes are is easy and valuable, and this information cannot be determined with a query-based approach. Where many correlating classes occur can also be seen directly in one picture, which shows the strength of "overview first, details on demand" when using visualizations.

[Figure: Correlation visualization of patents]
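One way such a symmetric matrix could be computed is sketched below, assuming the Pearson correlation coefficient over per-class classification scores. The source does not state which coefficient KMX uses, and the class labels and scores here are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical classification scores: one list per class, one entry per patent.
scores = {
    "A61K": [85, 70, 20, 15],
    "C07K": [90, 80, 10, 20],
    "H01L": [5, 10, 90, 80],
}
classes = sorted(scores)
matrix = {a: {b: pearson(scores[a], scores[b]) for b in classes} for a in classes}
```

Because `pearson(a, b)` equals `pearson(b, a)`, only the upper triangle needs to be computed in practice; the full matrix is built here for readability.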

[Figure: Combined use of search, correlation, tree map, parallel coordinates and cluster visualization with brushing and filtering (here for 4 classes indicated by 4 colours).]

[Figure: In more detail, the parallel coordinates for many classes of data from PubMed.]

[Figure: Parallel coordinates for a selected set of documents (shown in yellow) after brushing in a cluster visualisation.]

Visualization of patterns over time in a document set

When one wants to understand patent data over time, it is valuable to be able to analyse the patents as part of a class that captures documents about the same subject, classifications and concepts. This can be done using classification and/or clustering. We can then visualise the increase or decrease in the number of patents over time, where the band width of each class shows the trend. This is shown below for patent classification classes, but it can also be done for non-patent literature, for instance for the MeSH terms of PubMed documents.

[Figure: Visualisation of the increase in the number of patents over time for different patent classes.]
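The band widths in such a trend visualization are per-class counts per year. A minimal sketch, with hypothetical classes and years, counts the patents and stacks the bands on top of each other; a zero baseline is assumed here, whereas real stream-style plots often use a shifted baseline.

```python
from collections import Counter

# Hypothetical (class, publication year) pairs, one per patent.
patents = [
    ("C07", 2005), ("C07", 2006), ("C07", 2006),
    ("C07", 2007), ("C07", 2007), ("C07", 2007),
    ("A61", 2005), ("A61", 2007),
]

counts = Counter(patents)
years = sorted({year for _, year in patents})
classes = sorted({cls for cls, _ in patents})

# Band width of each class per year (0 where a class has no patents).
bands = {cls: [counts[(cls, year)] for year in years] for cls in classes}


def stack(bands, classes):
    """Stack the bands: return (lower, upper) boundaries per class and year."""
    baseline = [0] * len(next(iter(bands.values())))
    stacked = {}
    for cls in classes:
        top = [low + n for low, n in zip(baseline, bands[cls])]
        stacked[cls] = list(zip(baseline, top))
        baseline = top
    return stacked


stacked = stack(bands, classes)
```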

[Figure: Trends of the patents over time for different patent classes.]

[Figure: Trends of the PubMed articles over time for different MeSH terms.]

By selecting a group of documents, one has direct access to and interaction with these documents and can analyse them further.

Visualization of relationships between patents (graph visualizations)

In a patent document set there are many metadata variables that can be analysed in relation to all other documents. For instance, on the left we show a cluster visualization of patents on optical recording. On the right we have the same cluster, but colour coded according to a specific classifier, with relationships additionally shown by connecting lines that are also colour mapped.

[Figure: Cluster visualization showing relations between the patents in the set.]

A very common approach to analysing relationships between documents is citation analysis, where we want to determine and show which patents cite other patents and vice versa (forward and backward citations). This is also very common for scientific papers, where it is even used to estimate impact indicators. These citation visualisations are in fact graph (network) visualizations; an example is shown in the citation network figure below.
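Forward and backward citations are two views of the same directed graph. The sketch below builds both from a list of citation pairs and finds the most cited document as a simple impact indicator. The patent identifiers are hypothetical, and counting forward citations is only one of several possible impact measures.

```python
from collections import defaultdict

# Hypothetical citation pairs: (citing patent, cited patent).
citations = [("P3", "P1"), ("P3", "P2"), ("P4", "P1"), ("P5", "P3")]

backward = defaultdict(set)  # patent -> patents it cites
forward = defaultdict(set)   # patent -> patents that cite it
for citing, cited in citations:
    backward[citing].add(cited)
    forward[cited].add(citing)

# Simple impact indicator: the patent with the most forward citations.
most_cited = max(forward, key=lambda patent: len(forward[patent]))
```

The `citations` list is the edge list of the graph, so the same data can be handed directly to a graph layout routine to draw the network.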

[Figure: Citation network visualisation for three domains over time, with bars on the papers (shown as nodes in the graph) with the highest impact.]

Visualization of meta data from the patents

Patent and non-patent documents contain a lot of metadata (inventor, assignee, year of filing, etc.) that can be visualised. Normally we filter the data, calculate some aggregated data, and then use simple information visualisation representations (bar charts, pie charts, etc.) to show basic data about the analysed patent or document sets. This can be done in a process (workflow) like the one shown below:

[Figure: Example workflow diagram for generating visualisations]

Example workflow to determine and represent basic average data of document sets. The representations that can be used are shown below. Here again it is an advantage if rich interaction is supported, especially with multiple coupled views. This is the case for KMX, but in most cases it is not supported, and then one cannot really interact with the data and learn from it, since one only has static images.

Examples of basic infographic visualizations:

[Figure: Example bar charts]

[Figure: Example point and line charts (including a radar plot on the right)]

[Figure: Example pie charts and ring charts (showing a level of hierarchy) and bubble charts]
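The filter-and-aggregate part of such a workflow could look like the following sketch, which counts patents per assignee and prints a text bar chart. The records, assignee names and the year filter are hypothetical; in an interactive setting the filter would come from the user's current selection.

```python
from collections import Counter

# Hypothetical metadata records for a small document set.
records = [
    {"assignee": "Acme", "year": 2005},
    {"assignee": "Acme", "year": 2006},
    {"assignee": "Acme", "year": 2007},
    {"assignee": "Globex", "year": 2007},
]

# Filter step: keep only documents filed in 2006 or later.
selected = [r for r in records if r["year"] >= 2006]

# Aggregate step: number of documents per assignee.
per_assignee = Counter(r["assignee"] for r in selected)

# Representation step: one text bar per assignee, largest first.
bars = [f"{name:<8}{'#' * n} {n}" for name, n in per_assignee.most_common()]
```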