Stream processing in data-driven computational science


Ying Liu, Nithya N. Vijayakumar and Beth Plale
Computer Science Department, Indiana University, Bloomington, IN, USA
{yingliu, nvijayak,

Abstract

The use of real-time data streams in data-driven computational science is driving the need for stream processing tools that work within the architectural framework of the larger application. Data stream processing systems are beginning to emerge in the commercial space, but these systems fail to address the needs of large-scale scientific applications. In this paper we illustrate the unique needs of large-scale data-driven computational science through an example taken from weather prediction and forecasting. We apply a realistic workload from this application against our Calder stream processing system to determine effective throughput, event processing latency, data access scalability, and deployment latency.

I. INTRODUCTION

The same technology advancements that have driven down the price of handhelds, cameras, phones and other devices have enabled affordable commodity sensors, wireless networks and other devices for scientific use. As a result, scientific computing that was previously static, such as weather forecast prediction models, can now be envisioned as dynamic, with models triggered in response to changes in the environment. The cyberinfrastructure needed to bring about these dynamic capabilities is still evolving.

Stream processing in scientific applications differs from stream processing in other domains in important ways. We define a stream $S$ as a sequence of events, $S = \{e_i\}$, where $i$ is a monotonically increasing number and $0 < i < \infty$. Events often are timestamped. Depending on the source, event flow rates in a stream can range from an event per microsecond to an event per day, and events can range in size from a few bytes to megabytes or gigabytes.
The contents of an event could be, for instance, a new reading of a stock value, or could mark a state change in an application.

Stream processing falls into three general categories: stream management systems, rule engines, and stream processing engines []. In stream management systems, stream processing is similar to a traditional database management system, which could be relational [2] [3] or object-relational []. The interface is through a declarative SQL-style language that has been augmented with operations over time-based tables []. A client invokes pre-built operations or can code his own in a procedural language that is then stored as a stored procedure [].

This work is supported in part by NSF grants EIA and CDA- 0600, and DOE DE-FG02-0ER2600.

Rule engines date from the early 1970s. Clients write rules in a declarative programming language in which patterns of events can be described [6] [7]. The rule language supports relational and temporal operators, as well as subtyping, parallelization, etc. [8]. When events arrive, selected rules in the rule base are fired, causing an action to result. Rule engines include Message Oriented Middleware (MOM) technologies. The latter hold, for instance, a collection of user profiles in the form of XPath expressions as rules [9] [0]. Arriving events are matched against the profiles, with the corresponding action being to forward the event on to the user indicated in the profile.

Stream processing engines (SPEs) are designed specifically for processing data flows on the fly. In many systems described in the literature and available commercially, engines execute queries continuously over arriving streams of data [] [2] [3]. Clients describe their filtering and processing needs through a declarative language or through a graphical user interface (GUI) [] [] that is converted. Events are processed on the fly, without necessarily storing them. Queries can be deployed dynamically [3], and can have their operators reordered on the fly [].
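As a rough illustration of the continuous-query model described above, the following Python sketch applies a filter and a projection to events as they arrive, without storing the stream. The `Event` type, its fields, and the `continuous_select` helper are hypothetical names chosen for this example; they are not taken from any of the systems cited here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Hypothetical event: a timestamped reading from a named source."""
    seq: int          # monotonically increasing sequence number
    timestamp: float  # seconds since an arbitrary epoch
    source: str
    value: float


def continuous_select(
    stream: Iterable[Event],
    predicate: Callable[[Event], bool],
    project: Callable[[Event], object],
) -> Iterator[object]:
    """Evaluate a select/project query over events as they arrive.

    Each event is examined once and then dropped, so memory use does not
    grow with the length of the (possibly unbounded) stream.
    """
    for event in stream:
        if predicate(event):
            yield project(event)


def demo() -> list:
    # A short finite list stands in for an unbounded stream.
    stream = [
        Event(1, 0.0, "sensor-a", 3.0),
        Event(2, 1.0, "sensor-b", 9.5),
        Event(3, 2.0, "sensor-a", 12.0),
    ]
    query = continuous_select(
        stream,
        predicate=lambda e: e.value > 5.0,
        project=lambda e: (e.seq, e.source),
    )
    return list(query)
```

Because `continuous_select` is a generator, results are produced one event at a time; a real engine would read from a transport channel rather than a list.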
The SPE uses constructs such as the time window to deal with the unbounded nature of the streams. The size of the sliding window determines the history over which an operator can execute. Optimizations have been applied to yield memory savings, for instance in [3] [] [6]. The SPE architecture uses an underlying storage and/or transport medium that can be files [2] [], a publish-subscribe system [7], or sockets [8].

The contributions of this paper are as follows. Through our extensive study of stream processing in the context of scientific computing, we have come to understand what we believe to be fundamental differences between stream processing in scientific computing and stream processing elsewhere. We list the resulting requirements here. Having worked with meteorology researchers over the past several years, we understand their needs particularly well. Hence we have developed a

realistic stream workload and stream processing scenario for dynamic weather forecasting and use it to illustrate features of stream processing in data-driven scientific computing, through the Calder system developed at Indiana University. In [3] we evaluated throughput and deployment latency of single queries on a synthetic workload. In this paper we extend that work to encompass distributed collections of queries and users under synthetic and realistic workloads. Specifically, we measure effective throughput, event processing latency, data access scalability, and deployment latency. Our results show that good performance and excellent scalability can be achieved by a service that fits within the context of a data-driven, workflow-orchestrated computational science application.

The remainder of the paper is organized as follows. In Section II, we list and discuss unique features of data streams in data-driven science and the requirements of stream processing systems in scientific domains. In Section III, we describe a dynamic data stream example from weather prediction and forecasting. In Section IV, we briefly describe the Calder stream processing architecture and show how it fits in the framework of meteorology forecasting. In Section V, we experimentally evaluate our system under a realistic meteorological workload. Conclusions and future work are discussed in Section VI.

II. STREAM PROCESSING IN COMPUTATIONAL SCIENCE

Stream processing in computational science introduces challenges not always fully present in domains such as finance, media streaming, and business (such as RFID tags). We characterize the requirements unique to data-driven computational science as follows. We argue that most data-driven applications we have observed have these requirements.

A) Heterogeneous data formats. Science applications use and generate data in many different data formats, including netCDF, HDF, FITS, JPG, XML, and ASCII.
The binary formats can have complex access and retrieval APIs.

B) Asynchronous streams. Stream generation rates can be highly asynchronous. One event stream might generate an event once every millisecond, while another might generate an event only once every 2 hours. Some SPEs fuse or join streams based on the assumption of relatively synchronous streams.

C) Wide variance in event sizes. Events generated by a sensor may be only a few bytes in size, while events generated by large-scale instruments or regularly run models can be in the 0 s of megabytes in size.

D) Timeliness is relative. One application may want to be notified the instant a condition occurs, whereas for a second application a condition may only emerge over days or weeks.

E) Streaming is part of a larger system. Stream processing in data-driven computational science can be one small part of a much larger system. Its architecture must be compatible with the overall system architecture.

TABLE I. Observational data sources used in mesoscale meteorology. Shows the rates and sizes of data products over New Orleans.
Columns: Data source; No. sources; Ev. size (KB); Ev. rate (ev/hr); Cum. rate (event/hr); Cum. BW (Kbps).
Rows: Metars 1st order; Metars 2nd order; Rawinsondes (buoy data); Acars; NexRad II; NexRad III; GOES (model data); Eta (model data); CAPS (sensors).

Fig. 1. Data sources around New Orleans.

F) Scientists need changes as an experiment progresses. One could envision a dynamic weather prediction workflow that data mines a region of the atmosphere looking for tornado signatures, then kicks off a prediction model. The region over which data mining is carried out will change as a storm moves across the Midwest, for instance. As the storm moves, the filtering criteria (e.g., spatial region) must adapt.

G) Domain specific processing. Much stream processing in computational science is domain specific. For instance, a mesoscale detection algorithm classifies vortices detected in Doppler radar data.
Thus, a stream processing system needs to be extensible; that is, it needs to provide mechanisms for scientists to extend stream and query processing with their own operators.

III. METEOROLOGY EXAMPLE

Meteorology is a rich application domain for illustrating the uniqueness of stream processing in scientific domains. Atmospheric scientists have a considerable number and variety of weather observational instruments available to them, due in large part to over 00 years of history in observing the atmosphere. Tools such as the Unidata Internet Data Distribution (IDD) [9] system distribute many of the data products to interested universities for research purposes. The data products range considerably in their sizes and generation

rates. Table I lists nine of the most common data products. These products are moved to the location where the weather forecast model is to run, then ingested into the model at runtime.

To illustrate the use of stream processing engines in this context, suppose that an atmospheric science student is studying Fall severe weather in the region around New Orleans, Louisiana (see Figure 1) and wants to kick off a regional km forecast when a storm cell emerges. Figure 1 shows the region around New Orleans (approximately at degree North Latitude and 90.2 degree West Longitude). The inner box in Figure 1 marks an area of 2 degree Latitude height and 2 degree Longitude width around New Orleans, where one degree of latitude is approximately 70 statute miles and one degree of longitude is approximately 60 statute miles. The figure is taken from the GeoGUI in the LEAD portal [20].

The number of data products, their sizes and rates for the sensors that overlap the 80 mile radius around New Orleans are given in Table I. We call this the New Orleans Workload. The table shows nine data products, and for each type gives the number of sources. The event rate is the rate at which events are generated at the source. The cumulative rate and bandwidth are calculated over all data sources within a data type and under storm mode. An event is a timestamped observation from a data source. For the NexRad Level II Doppler radar, for instance, an event corresponds to a scan, where one scan consists of fourteen 360 degree sweeps of a radar. A scan completes in -7 minutes. The range given in the event size column of the table is bimodal: the small event size occurs during clear skies, and the large event size occurs during storm conditions.
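The cumulative columns of Table I can be related to the per-source columns by simple arithmetic. The sketch below assumes that the cumulative rate is the per-source event rate multiplied by the number of sources, and that bandwidth follows from rate times event size; the text does not spell out the calculation, so this is an assumption. The function names and the sample figures in the usage note are hypothetical and are not values from Table I.

```python
SECONDS_PER_HOUR = 3600.0
KILOBITS_PER_KILOBYTE = 8.0  # assumes KB and Kb use the same prefix convention


def cumulative_rate(num_sources: int, events_per_hour: float) -> float:
    """Events per hour summed over all sources of one data type."""
    return num_sources * events_per_hour


def cumulative_bandwidth_kbps(
    num_sources: int, events_per_hour: float, event_size_kb: float
) -> float:
    """Aggregate bandwidth in Kbps for one data type.

    kilobits per hour = cumulative rate * event size (KB) * 8,
    then divided by 3600 to convert to kilobits per second.
    """
    kilobits_per_hour = (
        cumulative_rate(num_sources, events_per_hour)
        * event_size_kb
        * KILOBITS_PER_KILOBYTE
    )
    return kilobits_per_hour / SECONDS_PER_HOUR
```

For example, three hypothetical sources that each emit 12 events per hour of 1500 KB give a cumulative rate of 36 events per hour and a cumulative bandwidth of 36 × 1500 × 8 / 3600 = 120 Kbps.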
The variability in event rates in Table I, from 0.08 ev/hr to ev/min, and the variability in event sizes, from KB to MB, clearly demonstrate several stream processing requirements of Section II, specifically asynchronous streams (requirement B) and wide variance in event sizes (requirement C). This collection of data products also demonstrates a common requirement of stream processing in scientific domains, that of heterogeneous data products (requirement A). The product formats shown in Table I alone include text, raw radar format, model-specific binary format, images, and netCDF data.

IV. CALDER ARCHITECTURE

Calder, developed at Indiana University, falls into the category of a stream processing engine (SPE). Its purpose is to provide timely access to data streams. Additional details of the system architecture can be found in [3]. In this section, we provide a brief overview of the system architecture and show how stream processing fits into a larger data-driven computational science application. In particular, we discuss a scenario in the context of the mesoscale meteorology forecasting example of Section III.

We view data streams as a virtual data repository that, while constantly changing, has many similarities to a database [2]. Like a database, a collection of streams is bound by coherence, in that the streams belonging to a collection are related to one another, and possesses meaning, in that a collection of streams can be described. We call such a collection of streams a Virtual Stream Store. Calder manages multiple virtual stream stores simultaneously and provides users with access to one or more virtual stream stores.

Fig. 2. Calder architecture. (Diagram labels: data sources; point of presence; pub-sub system channels, one per event type; handlers for incoming channels; runtime container; execution engine; dynamic deployment; service factory; planner service; GDS; rowset request and response chunk; rowset service; ring buffers hold results.)
Calder uses a publish-subscribe system, dquobec [22], as its underlying transport layer. How sensors and instruments are pub-sub enabled is outside our scope of research, but solutions exist, such as [23], which takes an XML approach. This pub-sub enabling is shown in Figure 2 as a single point of presence; however, other approaches exist. In the simplified diagram of Figure 2, the data streams flow to a query execution engine where they are received by handlers. The runtime acts on each incoming event by triggering one or more queries. A query executes on the event and generates zero, one, or more events that either trigger other queries in the system or flow to the Rowset Service, where they are stored in a ring buffer for user access.

User interaction with Calder follows the Globus OGSI model of service interaction, where a grid data service (GDS) is created on behalf of a user to serve an interaction with the virtual stream store. The user submits SQL-like queries through the GDS. Details of the extended GDS interface are given in [2]. The planner service optimizes and distributes queries and query fragments based on local and global optimization criteria. The planner service initiates a request to the rowset service to create a new ring buffer for the query.

Calder supports monotonic time-sequenced SQL Select-From-Where queries. The operators supported are select/project/join operators, where the join operator is an equijoin over the logical or physical time fields; the boolean operations are AND and OR; and relational operations are

=, ≠, ≤, ≥, < and >. We do not currently support aggregate operations like GROUP BY but are working towards it. In addition, our query language supports START and EXPIRE clauses for specifying the lifetime of the query, and the RANGE clause for specifying a user's approximation of the divergence in stream rates that the query will experience. RANGE is an optional clause that is required only for queries that include join operations. The EXEC_FUNC clause specifies a user-defined function to be executed on the resulting events.

As we indicated in requirement E of Section II, an SPE must often operate as part of a larger system. In applications where it makes sense to treat a collection of streams as a coherent and meaningful data resource, Calder provides continuous query access to the resource. Figure 3 illustrates how Calder works in a real application, in this case from mesoscale meteorology. Suppose a storm front is moving across the U.S. Midwest, threatening to spawn tornadoes. A user wants to deploy a data mining agent that can detect precursor conditions for a tornado and, when they are detected, spawn a weather prediction model. A scientist creates an experiment by interacting with an experiment builder [2] accessed through a science gateway. The specification is handed off to a workflow engine, which interacts with component pieces through a notification system. The workflow engine interacts with Calder by passing it a declarative query, similar to how it would interact with a database management system. Calder optimizes the query and deploys it (which includes the data mining classification components [26]) at a computational node located, for instance, on the TeraGrid [27]. The query, when instantiated at the computational node, executes the filtering/data mining loop depicted at the bottom of Figure 3 for every incoming Doppler radar scan. When the classification algorithm detects a vortex pattern that exceeds the threshold, a response trigger is issued to the response channel. The workflow engine reads the response channel and acts on the message to wake the dormant prediction simulation.

Fig. 3. Stream processing to detect vortices in Doppler radar data (below) as part of a larger workflow (above).

V. EXPERIMENTS

As discussed in Section II, data-driven computational science imposes unique demands on a stream processing engine (SPE). While some of these requirements are future work (see Section VI), Calder already addresses several important requirements. One of these is the requirement that the engine adapt to the changing needs of the experiment (requirement F). Calder addresses this through dynamic deployment of queries at runtime. We also experimentally evaluate the scalability of the rowset service because, while not unique to scientific computing, it is important nonetheless. Finally, we examine throughput and event processing latency of a query execution engine for the scenario given in Section III.

A. Experimental Setup

We developed a workload simulator that simulates the instrument types common to mesoscale meteorology. The simulator generates events at realistic sizes and rates as shown in Table I. Our workload generator is a set of highly configurable parallelized processes. Each process takes a channel name, data type, rates, sizes, and modes (clear or storm) for one instrument and produces a stream of events of the required size and rate with pre-set metadata. The streams generate events onto the dquobec publish-subscribe system, one stream per channel. In our experimental setup, each query execution engine registers through the pub-sub system to receive all data products from the workload simulator. The experiment is executed on a 28-node cluster where each node runs RHEL WS release and has dual AMD 2.0 GHz Opteron 64-bit processors with GB memory. The Opteron nodes are interconnected by a Gbps LAN. The simulator processes execute on 9 cluster nodes.

B. Query Deployment Latency

In this first experiment, we examine the time taken to deploy a continuous query into the Calder system while it is operating under the New Orleans workload. We used a set of select-project queries that filter the data products on temporal and spatial aspects. Currently, Calder supports only falls-within (boundary check) spatial queries. Users submit queries through the Grid Data Service. The planner service creates a query execution plan, and then deploys the query to the query execution engine. Microbenchmarks of the steps of query deployment are presented in [3]. Here we examine the scalability of deployment latency by submitting 000 queries across 2 processing engines using 2 nodes of the Opteron cluster. The number of simultaneous users submitting queries is set at 0, based on a study [28] that estimates the number of

users running canonical workflows in LEAD simultaneously at 0. Query deployment latency includes query plan generation, distribution and installation time, plus the overhead for XML and SOAP communication and processing between the different components of Calder. Figure 4 shows the average deployment latency as seen by 0 users for 20 queries. The X axis shows in milliseconds. The deployment latency of the n-th query in the figure was computed by taking the average of the deployment latency of the n-th query for all 0 users. One can see from Figure 4 that latency is high for the first query and low thereafter. The initial high latency can be attributed to the large user proxy creation (GDS setup) time of approximately 200ms. Figure 4 shows the average latency; after the first query, the deployment latency seen by the user is almost constant at around 300 to 00 ms. The table embedded at the top right of Figure 4 shows the overall distribution of deployment latency for all the 000 queries. From this table, it can be observed that the largest number of queries fall in the range of ms.

Fig. 4. Average deployment latency and frequency distribution for 000 queries. (Axes: Query Deployment Latency (ms), Number of Queries; embedded table: Time (ms), Count.)

C. Data Access Scalability

The rowset service provides users and programs with flexibility in data access by synchronizing data generation between the query execution engine and requests by the users. Users request their results through an OGSA-DAI v6 GDS (OGSI) that has been extended to support stream data resources [2]. The GDS maintains a persistent connection to the rowset service, and thus a user can submit any number of rowset requests using a single GDS. In this experiment we study the scalability of this service by measuring the response time as seen by a single user in the presence of multiple other users and the resultant data streams from the New Orleans workload. Each user instantiates a GDS to connect to the rowset service.
The rowset service response time is defined as the time period from the instant a request is submitted to the rowset service until the instant the user receives the first result. The scalability experiment consists of two simultaneous tasks. The rowset service is fed with the New Orleans workload data products defined in Table I. The user request workload is simulated as many users sending requests to the rowset service simultaneously. We measure a single user's response time while gradually increasing the number of users sending requests to the rowset service from 0 to 800. The results appearing in Figure 5 are the average response times calculated over 0 runs. Further scaling beyond 800 users encountered a limit on the maximum number of open sockets for a process, because each GDS maintains an active connection to the rowset service. We can see from the figure that the response time increases in proportion to the number of users, but the response time with 800 users in the system is still in a reasonable range of 20ms. The best fit plot shows the trend of increase. The variation in response time is caused by the variations in rates of the input streams, which in turn influence the rates at which streams arrive at the rowset service.

Fig. 5. Response time of the rowset service as seen by a single user in the presence of n users. (Axes: Average Response Time (ms), Number of users; series: Average RT, Trend.)

Fig. 7. Input and output network bandwidth at a Query Execution Engine under increasing load. Pass-through queries running over New Orleans Workload. (Axes: Bandwidth (BW) in Mbps, Queries; series: Input BW, Output BW Mean, Output BW Trend, Output BW Scatter Plot.)

D. Throughput of query execution engine

Fig. 6. Data types and bandwidth produced by workload simulator (left) operating in storm mode and collective output bandwidth from queries' output streams (right). (Diagram labels: raw data on channels for Metar, Rawinsondes, ACAR, NEXRAD II, NEXRAD III, GOES, Eta and CAPS; Query Execution Engine; output data on channels.)

The purpose of this third experiment is to compute throughput for a single query execution engine on a large number of queries and realistic data streams with high throughput (output bandwidth). The output bandwidth of a single query is the product of the rate and size of the output events produced by the query. The overall output bandwidth of a quoblet is the sum of all output bandwidth produced by all the queries deployed in the system at that time. Figure 6 shows data products reflecting the New Orleans workload given in Table I. The cumulative input bandwidth shown in Figure 6 is calculated by adding the bandwidth of each data product under storm mode.

In this experiment, we capture the scalability of a query execution engine on a single computational node as a function of the output bandwidth of the engine. We gradually increase the number of queries deployed and measure the overall output bandwidth at different times. We use a suite of metadata filtering queries, each of which executes on one of the 9 data products. For the purposes of this experiment, our queries are pass-through (select all) queries that act on one data product at a time. Pass-through queries remove the bias on the output bandwidth. Each additional query produces a specific output stream.
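The two bandwidth definitions above translate directly into a small calculation. This is a minimal sketch, assuming rates in events per second and sizes in bytes; the `QueryOutput` type and both function names are hypothetical and do not correspond to Calder code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class QueryOutput:
    """Hypothetical description of one deployed query's output stream."""
    events_per_second: float
    event_size_bytes: float


def query_bandwidth_mbps(q: QueryOutput) -> float:
    """Output bandwidth of a single query: rate times size, in Mbps."""
    bits_per_second = q.events_per_second * q.event_size_bytes * 8
    return bits_per_second / 1e6


def engine_bandwidth_mbps(queries: Iterable[QueryOutput]) -> float:
    """Overall output bandwidth: the sum over all deployed queries."""
    return sum(query_bandwidth_mbps(q) for q in queries)
```

With two hypothetical queries, one emitting 2 events per second of 250,000 bytes and one emitting 0.5 events per second of 1,000,000 bytes, each contributes 4 Mbps and the engine total is 8 Mbps.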
An example of a pass-through query is as follows:

    SELECT * FROM NexRad Level II START " T00:00: :00" EXPIRE " T00:00: :00";

The query execution engine (the quoblet) is hosted on a single computational node. The output streams generated by the queries are fed to client processes listening on corresponding output channels. The streams are mapped one-to-one to channels in the underlying pub-sub system. The workload simulator was configured to generate the New Orleans workload under clear sky mode. We increased the rates of a few data products while correspondingly decreasing their sizes to maintain a smooth continuous input. The queries were submitted at 0-second intervals and the throughput measured at -second intervals.

Figure 7 shows the input and output bandwidth (Y axis) measured for 00 queries (X axis). We can see that the input bandwidth is steady around 0.Mbps (clear mode), which is less than the cumulative input bandwidth shown in Figure 6 (storm mode). Figure 7 shows that the output bandwidth of the engine (Y axis) increases with the number of queries (X axis). The scatter plot plots all the output bandwidth measurements taken. The average output bandwidth is connected and a trend line superimposed to show the increasing nature of output bandwidth with increase in the number of queries. From Figure 7, we can see that the output bandwidth keeps increasing linearly with increasing number of queries. We tested up to 00 queries and this increasing trend continues, providing an average throughput of 38Mbps for 00 queries. This shows the ability of the quoblet to scale well to hundreds of queries under clear mode. We are currently working on throughput measurements for storm mode as well. The maximum number of queries supported by a query execution engine is influenced by several factors, including the arrival rate of the input streams and the complexity of the queries.
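The trend line superimposed on the bandwidth measurements of Figure 7 can be obtained with a straight-line fit. The text does not say how the line was computed, so the sketch below assumes an ordinary least-squares fit; the `linear_trend` name and the sample points in the usage note are hypothetical and are not measurements from the experiment.

```python
from typing import Sequence, Tuple


def linear_trend(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For example, fitting the made-up points (10, 1.0), (20, 2.0), (30, 3.0), (40, 4.0), read as (number of queries, output bandwidth in Mbps), gives a slope of 0.1 Mbps per query and an intercept of 0.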
The current measurements were taken in a cluster where the stream providers and the processing node existed in the same LAN and were connected by a Gbps Ethernet connection. In a wide area network, the output bandwidth may be restricted by the maximum bandwidth of

the network links.

E. Event Processing Latency

In this final experiment we measure event processing latency (service time) for a typical query in the context of the motivating example of Section III, that is, where the query portion filters out all data products that are not NexRad Level II data or are outside the geospatial region of interest. The data mining portion of the query is a Mesoscale Detection classifier algorithm [26] that operates over the NexRad Level II data. Figure 8 shows the relationship between the two pieces, and a representative query is as follows:

    SELECT * FROM NexRad Level II
    WHERE southbound >= "28.00" and eastbound <= "-89.00" and northbound <= "3.00" and westbound >= "-9.00"
    EXEC_FUNC MDA_Algorithm
    START " T00:00: :00" EXPIRE " T00:00: :00"

Fig. 8. Execution of filter query and data mining algorithm on NexRad II data.

TABLE II. Service time for executing filter query and data mining algorithm on NexRad Level II data.
Columns: Description; Average; Std. Deviation.
Rows: Query Execution Time (ms); MDA Execution Time (ms); Total Service Time (ms).

Table II shows service time distributed across the filtering and mining parts of the query execution. We can see that query execution consumes a small fraction of total service time. More complex queries may consume longer execution time, but this confirms earlier results [29] and also confirms our earlier results that service time is dependent on the rates of the input streams when joins are involved [7].

VI. CONCLUSION AND FUTURE WORK

In this paper we have distinguished the major categories of stream processing systems, and have argued that data-driven science imposes unique demands on stream processing systems. The stream processing needs of meteorology researchers are evidence of the unique requirements of stream processing systems in data-driven applications. We illustrated this point by describing a use scenario. The scenario motivates the experimental evaluation carried out on the Calder stream processing engine.
Specifically, the experiments apply a realistic workload from the meteorology application against the Calder system to experimentally determine its effective throughput, event processing latency, data access scalability, and deployment latency.

The primary focus of our ongoing work is to support querying on XML data streams because Calder, though it currently supports various data formats, lacks the ability to dynamically add new data formats and user-defined functions needed to satisfy requirement A. XML-based language support will allow users to dynamically define new data formats. Our second focus is stream resource discovery. As discussed earlier, a collection of active streams forms a Virtual Stream Store; the store must be describable in a way that clients can discover it, understand the data it contains, and issue suitably formatted queries. Capturing metadata about the streams, the collection of streams, queries, and other details such as data format is key to enabling discovery. Third, optimal placement is dependent upon information about the computational mesh in which the queries exist, so metadata must include performance monitoring information collected about streams in real time. We are also planning to migrate our GDS to OGSA-DAI WSRF.0, which is compatible with Globus Toolkit.0. Finally, we are examining issues of user privacy, approximate processing in the presence of missing data, missing streams and dynamic deployment of user-specified data mining. The latter is needed to satisfy requirement G.

REFERENCES

[1] M. Stonebraker, U. Cetintemel, and S. Zdonik, The 8 requirements of real-time stream processing, SIGMOD Rec., vol. 3, no., 200.
[2] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom, Stream: The Stanford data stream management system, in Data Stream Management, 200.
[3] T. Johnson, C. D. Cranor, and O. Spatscheck, Gigascope: a stream database for network application, in ACM SIGMOD International Conference on Management of Data,
[4] M. G. Koparanova and T. Risch, High-performance stream-oriented grid database manager for scientific data, in 1st European Across Grids Conference,
[5] A. Arasu, S. Babu, and J. Widom, The CQL continuous query language: Semantic foundations and query execution, in Very Large Database (VLDB) Journal, vol., no., 200.
[6] D. Luckham, The Power of Events. Addison Wesley,
[7] L. Brownston, R. Farrell, E. Kant, and N. Martin, Programming Expert Systems in OPS5. Addison Wesley, 98.

[8] D. C. Luckham and J. Vera, An event-based architecture definition language, IEEE Transactions on Software Engineering, vol. 2, no. 9, pp , 99.
[9] M. Altinel and M. J. Franklin, Efficient filtering of XML documents for selective dissemination of information, in The Very Large Database (VLDB) Conference,
[10] B. Nguyen, S. Abiteboul, G. Cobena, and M. Preda, Monitoring XML data on the Web, SIGMOD Record, vol. 30, no. 2, pp. 37 8, 200.
[11] R. Avnur and J. M. Hellerstein, Eddies: continuously adaptive query processing, in ACM SIGMOD International Conference on Management of Data,
[12] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah, TelegraphCQ: Continuous dataflow processing for an uncertain world, in Conference on Innovative Database systems Research (CIDR),
[13] N. Vijayakumar, Y. Liu, and B. Plale, Calder query grid service: Insights and experimental evaluations, in CCGrid Conference,
[14] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik, The Design of the Borealis Stream Processing Engine, in Second Biennial Conference on Innovative Data Systems Research (CIDR), 200.
[15] U. V. Catalyurek, Supporting large scale data driven science in distributed environments, in Minisymposium on Distributed Data Management Infrastructures for Scalable Computational Science and Engineering Applications, SIAM Conference on Computational Science and Engineering (SIAM CSE 0), 200.
[16] B. Plale, Leveraging run time knowledge about event rates to improve memory utilization in wide area data stream filtering, in IEEE International Symposium on High Performance Distributed Computing,
[17] B. Plale and N. Vijayakumar, Evaluation of rate-based adaptivity in joining asynchronous data streams, in 9th IEEE International Parallel and Distributed Processing Symposium, April 200.
[18] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, Aurora: A new model and architecture for data stream management, in Very Large Database (VLDB) Journal, vol. 2, no. 2, pp ,
[19] B. Domenico, Unidata internet data distribution: Real-time data on the desktop, in Science Information Systems Interoperability Conference (SISIC), 200.
[20] K. K. Droegemeier, V. Chandrasekar, R. Clark, D. Gannon, S. Graves, E. Joseph, M. Ramamurthy, R. Wilhelmson, K. Brewster, B. Domenico, T. Leyton, V. Morris, D. Murray, B. Plale, R. Ramachandran, D. Reed, J. Rushing, D. Weber, A. Wilson, M. Xue, and S. Yalda, Linked environments for atmospheric discovery (LEAD): A cyberinfrastructure for mesoscale meteorology research and education, in 20th Conf. on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, Seattle, WA, 200.
[21] B. Plale, Using global snapshots to access data streams on the grid, in Lecture Notes in Computer Science, Volume 36. Springer Verlag, 200, 2nd European Across Grids Conference (AxGrids).
[22] N. Vijayakumar and B. Plale, dquobec event channel communication system, Computer Science Department of Indiana University, Tech. Rep. TR6, 200.
[23] D. McMullen, et al., Instruments and sensors on the grid: Issues and challenges, in GlobusWorld, 200.
[24] Y. Liu, B. Plale, and N. Vijayakumar, Realization of GGF DAIS data service interface for grid access to data streams, Indiana University, Computer Science Department, Tech. Rep. TR63, 200.
[25] B. Plale, D. Gannon, Y. Huang, G. Kandaswamy, S. L. Pallickara, and A. Slominski, Cooperating services for data-driven computational experimentation, in Computing in Science and Engineering (CiSE), 200.
[26] J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and A. Lin, ADaM: A data mining toolkit for scientists and engineers, Computers and Geosciences, vol. 3, 200.
[27] TeraGrid, 200, [Online]. Available:
[28] B. Plale, Usage study for data storage repository in LEAD, 200, LEAD TR00.
[29] B. Plale and K. Schwan, Dynamic querying of streaming data with the dQUOB system, IEEE Transactions on Parallel and Distributed Systems, vol., no., April, 2003.