INTERACTIVE MANIPULATION, VISUALIZATION AND ANALYSIS OF LARGE SETS OF MULTIDIMENSIONAL TIME SERIES IN HEALTH INFORMATICS

Proceedings of the 3rd INFORMS Workshop on Data Mining and Health Informatics (DM-HI 2008)
J. Li, D. Aleman, R. Sikora, eds.

Artur Dubrawski, Maheshkumar Sabhnani, Saswati Ray, Michael Baysek, Lujie Chen, John Ostlund and Michael Knight
The Auton Lab
Carnegie Mellon University
Pittsburgh, Pennsylvania
awd@cs.cmu.edu

Abstract

We present a scalable data representation structure that supports interactive access to large sets of time series data, a type of data frequently encountered in health informatics. This structure, called T-Cube, is a cached sufficient statistic equivalent to the data cube concept known from OLAP applications. It differs from a regular data cube in that it can very efficiently answer all conceivable (not just the most common) queries against multidimensional time series databases, including disjunctive queries. Rapid access to complex data not only makes advanced analytics feasible, but also enables user-level data navigation (drill-downs, roll-ups, visualization) at interactive speeds. T-Cube also allows rapid execution of massive screening through data for statistically significant patterns at different levels of aggregation. The exhaustive search strategy guarantees that no event of interest will ever be missed. Successful applications in the domains of public health and food safety indicate that the combined benefits lead to improved situational awareness of the analysts working with information systems powered by T-Cube. We are on the lookout for other applications where it may be of help.

Keywords: multidimensional time series, OLAP, scalable analytics, rapid retrieval.

Introduction

Time series data is abundant in many domains, including finance, weather forecasting, epidemiology and food safety.
For instance, large-scale bio-surveillance programs monitor the status of public health against adverse events such as outbreaks of infectious diseases and emerging patterns of factors affecting public health. They rely on data collected throughout a health management system (hospital records, health insurance company records, lab test requests and results, issued and filled prescriptions, ambulance and emergency phone service calls, etc.) as well as outside of it (school and workplace absenteeism, sales of non-prescription medicines, etc.). The key objective is to detect, as early and as reliably as possible, changes in the statistics of data sources which may be indicative of a developing public health problem. One of the challenges the users of such systems face is data overload. The actual number of, for example, daily transactions of drug sales in pharmacies across a country may be very large. The users need tools that enable timely analysis of those massive data sources. The analyses can be performed automatically (with data mining software), but typically automatically discovered patterns are subject to careful follow-ups through manual drill-downs. In both scenarios, massive screening of very large collections of data must be executed very quickly in order to make bio-surveillance systems useful in practice. A saving of just a few hours of detection latency for an outbreak of a lethal infectious disease can yield enormous monetary and social benefits.

Most of the kinds of data mentioned above can be interpreted as time series of interval (e.g. daily) counts of events, such as the number of units sold of a certain type of drug (e.g. anti-diarrheals), the number of patients reporting to an emergency department with specific symptoms, or the number of positive results of microbial tests of food samples taken in a production facility. These time series can be sliced and diced across multiple categorical dimensions such as location, gender and age group of patients, and so on. The computational efficiency of data mining operations applied to such data, as well as the efficiency of interactive manual drill-downs, depends heavily on the efficiency of extracting series of counts aggregated for specific values of the categorical dimensions.

We use a data structure called T-Cube to rapidly retrieve such aggregates for any complex query. It achieves its efficiency by pre-computing and caching responses to all possible queries against the underlying temporal database of counts annotated with sets of categorical labels, while keeping the storage size in check. It has already been used successfully in support of health informatics applications in the food safety and disease surveillance domains, but its potential applicability reaches further.

Related Work

The standard approach to handling ad-hoc queries in commercial databases is On-Line Analytical Processing (OLAP).
The idea relies on data cubes: cached data structures extracted from (usually only parts of) the original data and arranged in a form that allows fast ad-hoc querying of pre-selected subsets of aggregated data. For the sake of brevity we do not review the details of OLAP technology here, but these methods are known to often suffer from long build times (typically hours for databases of the sizes and complexities typical of the applications considered in this paper) and huge memory requirements (causing the need to rely on high-end database servers). Additionally, as we observed empirically, data cubes still typically need about a second to respond to a complex query on the datasets which we tested. Such latency is an inconvenience to users who want to perform multiple ad-hoc queries on the fly. It also hampers statistical analyses which may require execution of millions of complex queries, and which could take days of processing time using industry-standard OLAP data cubes.

Data cubes are closely related to another technology originating from computer science research: Cached Sufficient Statistics. Like data cubes, cached statistics structures pre-compute answers to queries; however, they cover all possible future queries, aiming at efficiency not only of data retrieval but also of memory representation. The All-Dimensional Tree (AD-Tree, [1]) is a very good example of such a data structure. AD-Trees are designed to efficiently represent counts of all possible co-occurrences among values of multiple dimensions of categorical data. This is very important in many scenarios involving statistical modeling of such data, where most operations require computing aggregate counts, ratios of counts or their products. Quick access to counts of arbitrary subsets of demographic properties is essential for the overall performance of analytic tools relying on them.
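To make the notion of cached counts concrete, the following is a minimal sketch in Python. It is not an AD-Tree: it is a deliberately naive dense cache that stores a count for every combination of attribute values seen in the data, with none of the pruning that makes AD-Trees memory-efficient. The attribute names, the toy records and the helper names `build_count_cache` and `count` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from itertools import combinations


def build_count_cache(records, attributes):
    """Cache a count for every co-occurrence of attribute values.

    For each record, every subset of the attributes contributes one key,
    so the cache grows exponentially with the number of attributes.
    """
    cache = Counter()
    for rec in records:
        for k in range(len(attributes) + 1):
            for subset in combinations(attributes, k):
                key = tuple((a, rec[a]) for a in subset)
                cache[key] += 1
    return cache


def count(cache, attributes, **conditions):
    """Answer a conjunctive count query with a single cache lookup."""
    key = tuple((a, conditions[a]) for a in attributes if a in conditions)
    return cache[key]  # Counter returns 0 for combinations never seen


# Hypothetical toy data, for illustration only.
ATTRS = ("gender", "age_group", "syndrome")
records = [
    {"gender": "M", "age_group": "adult", "syndrome": "GI"},
    {"gender": "M", "age_group": "child", "syndrome": "GI"},
    {"gender": "F", "age_group": "adult", "syndrome": "Respiratory"},
    {"gender": "M", "age_group": "adult", "syndrome": "Respiratory"},
]
cache = build_count_cache(records, ATTRS)

n_all = count(cache, ATTRS)                                  # all records
n_male_gi = count(cache, ATTRS, gender="M", syndrome="GI")   # a co-occurrence
share = n_male_gi / count(cache, ATTRS, gender="M")          # a ratio of counts
```

Once the cache is built, each count, and hence each ratio of counts, costs one dictionary lookup regardless of how many records were loaded. The price is the size of the cache, which is the problem that more careful structures such as AD-Trees are designed to address.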
AD-Trees have been shown to dramatically speed up notoriously expensive machine learning algorithms including Bayesian Network structure learning [1], Decision Tree learning and Association Rule learning [2]. The attainable speedups range from one to four orders of magnitude with respect to previously known efficient implementations. These efficiencies are available at moderate memory requirements, which are easy to control. Dynamic AD-Trees [3] can grow on demand, allowing for even greater memory efficiency. AD-Trees are the best of the existing solutions for categorical data representation when it comes to responding very quickly to ad-hoc queries against large datasets.

T-Cube

T-Cube is an extension of the idea of AD-Trees designed for very fast retrieval and analysis of additive data streams such as time series of counts. A technical description of the fundamental concepts of T-Cube can be found in [4], which also details techniques leading to further performance improvements, such as specific arrangements of demographic attributes, most-common-value-based pruning, and controlling the depth of the tree. They help to balance the build time, query response time and physical memory requirements of the tool. Here we provide only a brief introduction to the main ideas underlying T-Cube.

T-Cube addresses the algorithmic question of storing and searching the combinatorially large set of possible time series that can be derived from queries on attributes of data. Figure 1 conveys the basic idea and illustrates its extension to time series data. For brevity we discuss here neither the details of the data structure nor the algorithms for construction and querying, but the essential property of the T-Cube is that, once it is built, the time series for any query (in a general class including conjunctions of disjunctions) can be obtained in constant time, independent of the number of records in the raw dataset. One example of such a query is: "get me the time series by day for all males in zip codes 15213 and 15206, excluding children, and specific to GI or Respiratory syndromes."
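The kind of query described above can be sketched in Python as follows. This is an illustrative assumption rather than the T-Cube implementation: it caches one time series per combination of attribute values in a plain dictionary and answers a conjunction of disjunctions by summing the cached series. The record layout (`day`, `count`), the attribute names and the functions `build_series_cache` and `query` are hypothetical. The exclusion of children is expressed as a disjunction over the remaining age groups, and the values listed for each attribute are assumed to be distinct.

```python
from itertools import combinations, product


def build_series_cache(records, attributes, n_days):
    """Cache one series of daily counts per combination of attribute values."""
    cache = {}
    for rec in records:
        for k in range(len(attributes) + 1):
            for subset in combinations(attributes, k):
                key = tuple((a, rec[a]) for a in subset)
                series = cache.setdefault(key, [0] * n_days)
                series[rec["day"]] += rec["count"]
    return cache


def query(cache, attributes, n_days, **allowed):
    """Return the daily series for a conjunction of disjunctions.

    `allowed` maps an attribute to the list of values it may take;
    attributes that are not mentioned are left unconstrained.
    """
    constrained = [a for a in attributes if a in allowed]
    total = [0] * n_days
    # One cache lookup per combination of allowed values.
    for values in product(*(allowed[a] for a in constrained)):
        series = cache.get(tuple(zip(constrained, values)))
        if series is not None:  # all-zero series are never stored
            total = [t + s for t, s in zip(total, series)]
    return total


# Hypothetical toy data, for illustration only.
ATTRS = ("gender", "zip", "age_group", "syndrome")
records = [
    {"day": 0, "count": 2, "gender": "M", "zip": "15213", "age_group": "adult", "syndrome": "GI"},
    {"day": 1, "count": 1, "gender": "M", "zip": "15206", "age_group": "senior", "syndrome": "Respiratory"},
    {"day": 1, "count": 5, "gender": "M", "zip": "15213", "age_group": "child", "syndrome": "GI"},
    {"day": 2, "count": 3, "gender": "F", "zip": "15213", "age_group": "adult", "syndrome": "GI"},
    {"day": 2, "count": 4, "gender": "M", "zip": "15217", "age_group": "adult", "syndrome": "GI"},
]
cache = build_series_cache(records, ATTRS, n_days=3)

result = query(
    cache, ATTRS, 3,
    gender=["M"],
    zip=["15213", "15206"],
    age_group=["adult", "senior"],   # i.e. excluding children
    syndrome=["GI", "Respiratory"],
)
overall = query(cache, ATTRS, 3)     # no constraints: the total series
```

In this sketch the cost of a query depends on the number of value combinations named in the query and on the length of the series, not on the number of raw records. The cache itself holds up to two to the power of the number of attributes entries per record, which is why such a dense layout does not scale beyond a few attributes.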
The drawback of the simplistic approach above is that the size of the T-Cube structure grows impractically large if there are more than a few attributes. There are, however, a few simple innovations which can make this kind of approach practical. Firstly, we do not need to store any node which corresponds to a time series of all zeros. Other series can be stored in space proportional to the number of non-zero values, and thanks to additional compression approaches we can achieve an average of less than four bits per non-zero time step per time series on real health informatics data. Secondly, even when there are no or few nodes with zero counts, an additional trick can make a large difference. It takes every Vary node in the diagram in Figure 1 and considers the most common value (MCV) of the variable that is to be instantiated. It is relatively easy to prove that all nodes corresponding to the MCV can be removed, together with any of their descendants, and the T-Cube will still contain sufficient information to retrieve any requested time series with no loss of accuracy. In our previous work the use of this innovation has taken the memory requirements for a 50-dimensional AD-Tree from more than 100 terabytes down to 200 megabytes. The third trick involves the use of leaf lists, in which nodes that occur infrequently in the raw data are replaced with a set of pointers to the raw data. This can reduce the memory requirements by another 1 to 3 orders of magnitude, with a tradeoff in access time.

Those relatively straightforward efficiencies of data caching allow T-Cube to perform time series queries 2 to 3 orders of magnitude faster than standard state-of-the-art data cube technologies. This speedup has already been found highly beneficial in the practice of bio-surveillance and food safety [5-6], where the need for rapid analysis of massive collections of time series data is very common. T-Cube can be very useful in such applications for two main reasons: (1) it enables fast anomaly detection by simultaneous statistical analysis of many thousands of time series, and (2) it allows the users to perform many complex, ad-hoc time series queries on the fly without inconvenient delays. The potential benefits are manifold and include scalable contextual labeling of queries and retrieval of patterns by content. The users can perform inverted queries in which they ask the server to search through thousands or even millions of previous time series to find series with given properties, and answer the question: "Which demographic features of the series from historical data best explain a situation like this?"

Figure 1. Simple example of a T-Cube built for data typical of the public health domain.

T-Cube has been tested on synthetic and real-world datasets containing millions of records and hundreds of dimensions. Results show that its response time can be 1,000 times shorter than that of state-of-the-art commercial database tools. The utility of the T-Cube structure has already been extensively demonstrated in practice in applications ranging from bio-surveillance, to monitoring food safety, to detection of emerging patterns of failures in maintenance and supply management systems. In one of those applications, the data under consideration included several relatively small sets of about 80 thousand records of transactions with 33 categorical variables of arities varying from 2 to over 100. The application called for massive screening through all combinations of attribute-value pairs of sizes 1 and 2, the total number of such combinations approaching 1 million. The analytic task used an expectation-based temporal scan algorithm to retrospectively detect unusual short-term increases in counts of specific aggregate time series. The total number of individual temporal scan tests for one such data set exceeded 2 billion.
Each such test involved a chi-square test of independence performed on a 2-by-2 contingency table formed by the counts corresponding to the time series of interest (one of the 1 million series) and the baseline counts, within the current temporal window of interest (one of 2,000) and outside of it. The complete sequence, including the time necessary to retrieve and aggregate all the involved time series, compute and store the test results, load the source data and build the T-Cube structure, took about 1 hour of machine time. Using one of the commercial data cube tools, the time needed to retrieve the data corresponding to one of the involved queries was in the range of 280 milliseconds. Therefore, without the T-Cube, it would take about 3 days just to pull the data for the roughly 1 million queries, not including any processing of it or execution of statistical tests.

Table 1 presents the results of a controlled experiment involving a dataset with 12 million records of transactions and 3 categorical fields of arities 1000, 10 and 5 respectively, covering a period of 5 years at daily resolution [4]. It compares complex query response times for 3 commercial data cube tools (their names have been anonymized at the request of their vendors) and two configurations of T-Cube (one favoring rapid responses, the other memory-conscious). Each of the commercial tools required a different amount of memory to represent the test data, and the response time improved as the amount of memory used increased. However, they needed seconds to respond to a complex query on average. T-Cube is able to respond in milliseconds, even in the memory-conscious mode.

Tool                 A    B    C    T-Cube 1    T-Cube 2
Memory [MB]          over 1,
Response time [s]

Table 1. Performance of T-Cube compared against 3 commercial data cube tools.

Figure 2. Screenshots of the T-Cube Web Interface: a time series chart (left) and the spatial distribution of multivariate temporal count data (right).

T-Cube Web Interface

The original uses of the T-Cube data structure were focused on speeding up complex data mining operations rather than on supporting the human user's experience of direct interaction with data.
The T-Cube web interface attempts to fill that gap. It is a publicly accessible tool for interactive visualization and manipulation of large-scale multivariate time series of counts [7]. It allows the user to execute complex queries and to run various types of statistical analyses on an uploaded dataset. It can be accessed using any Java-enabled browser. The interface, still under incremental development and testing, includes a suite of visualization and statistical analysis tools allowing intuitive navigation through the data. Once a data file has been uploaded, the user can run complex queries and statistical tests. The interface also enables massive searches for statistically significant patterns to be run rapidly and at different levels of data aggregation. The left part of Figure 2 shows a time series chart and a menu of selectable categorical attributes for an example bio-event dataset; the right part presents a spatial representation of the same data.

Conclusion

T-Cube is an efficient tool for representing additive time series data labeled with a set of categorical attributes. It is especially useful for retrieving responses to ad-hoc complex queries against large datasets of that kind, where it significantly outperforms the existing commercial data cubes. T-Cubes are simple to set up and easy to use. Typically, it takes only minutes to build one from data. Database users do not need to define any stored procedures or materialized views in order to make that happen. Once it is built, it is ready to respond rapidly to any simple or complex query. It can be used as a general tool for any application requesting access to time series data from a database. From the application's perspective it is transparent: it acts just like the database itself, but one that responds much more quickly.

The T-Cube web interface is intended to become a user-level platform for a variety of analytic endeavors which can benefit from T-Cube efficiency. The key areas include sub-domains of health informatics and tasks in which rapid analyses of large sets of time series data or interactive drill-downs are of interest. We hope that its ease of use and availability will increase the popularity and tangible success of data-driven methods of rapid detection of adverse events.
We hope to see T-Cubes widely used.

Acknowledgements

This material is based upon work that was partially supported by the National Science Foundation under grant number IIS. This work was partially supported by the Centers for Disease Control (award number R01-PH000028).

References

1. A. Moore, M. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8, 67-91.
2. B. Anderson, A. Moore. AD-trees for fast counting and for fast learning of association rules. 4th International Conference on Knowledge Discovery and Data Mining.
3. P. Komarek, A. Moore. A dynamic adaptation of AD-trees for efficient machine learning on large data sets. Proceedings of the 17th International Conference on Machine Learning.
4. M. Sabhnani, A. Moore, A. Dubrawski. T-Cube: Fast extraction of time series from large datasets. Technical Report, Carnegie Mellon University, CMU-ML.
5. M. Sabhnani, A. Dubrawski, J. Schneider. Multivariate time series analyses using primitive univariate algorithms. Advances in Disease Surveillance 3.
6. A. Dubrawski, M. Sabhnani, S. Ray, J. Roure, M. Baysek. T-Cube as an enabling technology in surveillance applications. Advances in Disease Surveillance 3.
7. T-Cube Web Interface: