A Virtual Data Grid for LIGO
Ewa Deelman, Carl Kesselman
Information Sciences Institute, University of Southern California

Roy Williams
Center for Advanced Computing Research, California Institute of Technology

Albert Lazzarini, Thomas A. Prince
LIGO experiment, California Institute of Technology

Joe Romano
Physics, University of Texas, Brownsville

Bruce Allen
Physics, University of Wisconsin

Abstract. GriPhyN (Grid Physics Network) is a large US collaboration to build grid services for large physics experiments, one of which is LIGO, a gravitational-wave observatory. This paper explains the physics and computing challenges of LIGO, and the tools that GriPhyN will build to address them. A key component needed to implement the data pipeline is a virtual data service: a system that dynamically creates the data products requested during the various stages. The data may already have been processed in a certain way, it may be in a file on a storage system, it may be cached, or it may need to be created through computation. The full elaboration of this system will allow complex data pipelines to be set up as virtual data objects, with existing data being transformed in diverse ways.

This document is a data-computing view of a large physics observatory (LIGO), with plans for implementing a concept (Virtual Data) by the GriPhyN collaboration in its support. There are three sections:

- Physics: what data is collected and what kind of analysis is performed on the data.
- The Virtual Data Grid, which describes and elaborates the concept and how it is used in support of LIGO.
- The GriPhyN Layer, which describes how the virtual data will be stored and handled with the use of the Globus Replica Catalog and the Metadata Catalog.

1 Physics

LIGO (Laser Interferometer Gravitational-Wave Observatory) [1] is a joint Caltech/MIT project to directly detect the gravitational waves predicted by Einstein's general theory of relativity: oscillations in the fabric of space-time.
Currently, there are two observatories in the United States (in Washington state and Louisiana), one of which recently went through the first lock phase, in which initial calibration data was collected. In theory, gravitational waves are generated by accelerating masses; however, they are so weak that they have not yet been directly detected (although their indirect influence has been inferred [2]). Because of the extreme weakness of the signals, large data-computing resources are needed to extract them from noise and to differentiate astrophysical gravitational waves from locally generated interference. Some phenomena expected to produce gravitational waves are:

- Coalescence of pairs of compact objects such as black holes and neutron stars. As they orbit, they lose energy and spiral inwards, with a characteristic chirp over several minutes. Estimates predict on the order of one observation per year for the current generation of detectors.
- Continuous-wave signals over many years, from compact objects rotating asymmetrically. Their weakness implies that deep computation will be necessary.
- Supernova explosions: the search may be triggered by observations from traditional astronomical observatories.
- Starquakes in neutron stars.
- Primordial signals from the very birth of the Universe.

LIGO's instruments are laser interferometers, operating in a 4 km high-vacuum cavity, and can measure very small changes in the length of the cavity. Because of the high sensitivity, phenomena such as seismic and acoustic noise, magnetic fields, or laser fluctuations can swamp the astrophysical signal, and must themselves be instrumented. Computing is crucial to digitally remove the spurious signals and search for significant patterns in the resulting multi-channel time series.

The raw data is a collection of time series sampled at various frequencies (e.g., 16 kHz, 16 Hz, 1 Hz). The amount of data expected to be generated and catalogued each year is on the order of tens of terabytes. The data collected comprises a gravitational channel (less than 1% of all data collected) and other instrumental channels produced by seismographs, acoustic detectors, etc. Analysis of the data is performed in both the time and frequency domains. The system must support single-channel analysis over a long period of time as well as multi-channel analysis over a short time period.
1.1 Data in LIGO

Each scalar time series is represented by a sequence of 2-byte integers, though some are 4-byte integers. Time is represented by GPS time, the number of seconds since an epoch in 1980, and it is therefore a 9-digit number, possibly followed by 9 more digits for nanosecond accuracy (it will become 10 digits in Sept. 2011).

Data is stored in Frame files, a standard format accepted throughout the gravitational-wave community. Such a file can hold a set of time series with different sampling frequencies, together with metadata about channel names, time intervals, frequencies, file provenance, etc. In LIGO, the Frames containing the raw data encompass a time interval of one second and hold about 3 MB of data.

In addition to the raw time series, there are many derived data products. Channels can be combined, filtered and processed in many ways, not just in the time domain, but also in the Fourier basis, or in others, such as wavelets. Knowledge is finally extracted from the data through pattern-matching algorithms; these create collections of candidate events, for example, inspiral events or candidate pulsars.
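The GPS time convention above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the GPS epoch of 6 January 1980 and ignoring leap seconds (GPS time is a continuous count, so the result is on the GPS timescale rather than UTC). The function names are illustrative and not part of any LIGO software.

```python
from datetime import datetime, timedelta

# Assumed GPS epoch: 1980-01-06 00:00:00. GPS time has no leap seconds,
# so plain timedelta arithmetic stays on the GPS timescale (not UTC).
GPS_EPOCH = datetime(1980, 1, 6)


def gps_to_datetime(gps_seconds: int) -> datetime:
    """Convert integer GPS seconds to a naive datetime on the GPS timescale."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds)


def gps_digits(gps_seconds: int) -> int:
    """Number of decimal digits needed to write the GPS second count."""
    return len(str(gps_seconds))


# The first 10-digit GPS second is 1,000,000,000.
rollover = gps_to_datetime(10**9)
print(rollover.year, rollover.month)
print(gps_digits(999_999_999), gps_digits(10**9))
```

The rollover date falls in September 2011, which agrees with the statement that GPS times become 10 digits at that point.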
1.2 LIGO Data Objects

The LIGO Data Analysis System (LDAS) [3] is a set of component services for processing and archiving LIGO data. Services deliver, clean, filter, and store data. There is high-performance parallel computing for real-time inspiral search, distributed hierarchical storage, a relational database, and a user interface based on the Tcl language.

The LIGO data model splits data from metadata explicitly. Bulk data is stored in Frame files, as explained above, and metadata is stored in a relational database, IBM DB2. There is also an XML format called LIGO-LW (an extension of XSIL [4]) for representing annotated, structured scientific data, which is used for communication between the distributed services of LDAS.

In general, a file may contain more than one Frame, so we define the word FrameFile for a file that may contain many frames. Raw data files contain only one frame, and they are named by the interferometer that produced the data (H: Hanford, L: Livingston), then the 9-digit GPS time corresponding to the beginning of the data. There is a one- or two-letter indication of what kind of data is in the file (F: full, R: reduced, T: trend, etc.). So an example of a raw data frame might be H F. For long-term archiving, files rather larger than the 3-megabyte, one-second raw frames are wanted, so there are collection-based files, generally stored as multi-frame FrameFiles. In either case, an additional attribute in the file name says how many frames there are; for example, H F.n200 would be expected to contain 200 frames.

One table of the metadatabase contains FrameSets, which are an abstraction of the FrameFile concept, recognizing that a FrameFile may be stored in many places: perhaps on tape at the observatory, on disk in several places, or in deep archive. Each frame file records the names of all of the approximately one thousand channels that constitute that frame.
In general, however, the name set does not change for thousands or tens of thousands of one-second frames. Therefore, we keep a ChannelSet object, which is a list of channel names together with an ID number. Thus the catalog of frames need only store the ChannelSet ID rather than the whole set of names.

The metadatabase also keeps collections of Events. An event may be a candidate for an astrophysical event such as a black-hole merger or pulsar, or it may refer to a condition of the LIGO instrument, such as the breaking of a feedback loop or the RF signal from a nearby lightning strike. The generic Event really has only two variables: type and significance (also called Signal to Noise Ratio, or SNR). Very significant events are examined closely, and insignificant events are used for generating histograms and other statistical reports.

1.3 Computational aspects, Pulsar Search

LDAS is designed primarily to analyze the data stream in real time to find inspiral events, and secondarily to make a long-term archive of a suitable subset of the full LIGO data stream. A primary focus of the GriPhyN effort is to use this archive for a full-scale search for continuous-wave sources. This search can absorb essentially unlimited computing resources, since it can be done at essentially arbitrary depth and detail. A major objective and testbed of the GriPhyN involvement is to do this search with backfill computation (the paradigm), with high-performance computers in the physics community.

If a massive, rotating ellipsoid does not have coincident rotational and inertial axes (a time-dependent quadrupole moment), then it emits gravitational radiation. However, the radiation is very weak unless the object is extremely dense, rotating quickly, and has a large quadrupole moment. While the estimates of such parameters in astrophysically significant situations are vague, it is expected that such sources will be very faint. The search is computationally intensive primarily because it must cover a large parameter space. The principal dimensions of the search space are position in the sky, frequency, and rate of change of frequency.

The search is implemented as a pipeline of data transformations. The first steps are cleaning and reshaping of the data, followed by a careful removal of instrumental artifacts to make a best estimate of the actual deformation of space-time geometry at the LIGO site. From this, increasingly specialized data products are made, along with new ways to calibrate and filter the raw and refined data.

The pulsar search problem in particular can be thought of as finding features in a large, very noisy image. The image is in time-frequency space, and the features are curves of almost-constant frequency: the base frequency of the pulsar modulated by the Doppler shifts caused by the motion of the Earth and of the pulsar itself. The pulsar search can be parallelized by splitting the possible frequencies into bins, with each processor searching a given bin. The search involves selecting a sky position and a frequency slowing (the rate of change of frequency), and searching for statistically significant signals. Once a pulsar source has been detected, the result is catalogued as an event data structure, which describes the pulsar's position in the sky, the signal-to-noise ratio, time, etc.
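The frequency-bin parallelization described in this section can be sketched as a simple partition of the search band. This is a minimal illustration only: the equal-width split, the band limits and the function name are assumptions, and the paper does not say how bins are actually chosen.

```python
def split_into_bins(f_min: float, f_max: float, n_workers: int) -> list:
    """Split [f_min, f_max) into n_workers contiguous, equal-width bins.

    Each returned (low, high) pair is the frequency range one processor
    would search. Equal-width bins are an assumption of this sketch.
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    if f_max <= f_min:
        raise ValueError("f_max must exceed f_min")
    width = (f_max - f_min) / n_workers
    # Use f_max itself as the final edge to avoid rounding drift.
    edges = [f_min + i * width for i in range(n_workers)] + [f_max]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical band of 100 Hz to 500 Hz shared among 4 processors.
bins = split_into_bins(100.0, 500.0, 4)
for worker_id, (low, high) in enumerate(bins):
    print(worker_id, low, high)
```

Because the bins are independent, each processor can run its search without communicating with the others, which is what makes the problem suitable for backfill computation.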
1.4 Event Identification Computation Pipeline

During the search for astrophysical events, a long-duration, one-dimensional time series is processed by a variety of filters. These filters produce new time series, which may represent the signal-to-noise ratio in the data. A threshold is applied to each of the new time series in order to extract possible events, which are catalogued in the LIGO database. To determine whether an event is significant, the raw data containing instrumentation channels needs to be re-examined. It is possible that the event was triggered by some phenomenon such as a lightning strike, acoustic noise, or seismic activity. These are recorded by various instruments present in the LIGO system and can be found in the raw data channels. To eliminate their influence, the instrumental and environmental monitor channels must be examined and compared to the occurrence of the event. The location of the raw data channels can be found in the LIGO database. Since the event is pinpointed in time, only small portions of the many channels that may need to be processed have to be examined.

This computational pipeline clearly demonstrates the need for efficient indexing and processing of data in two views: single-channel data over a long time interval, such as the initial data being filtered, and many-channel data over a short time interval, such as the instrument data needed to add confidence to the observation of events.
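The threshold step and the follow-up window on the instrumental channels can be sketched as below. This is an illustration under stated assumptions: one SNR sample per second, an Event record that adds a GPS time to the type and significance fields mentioned in Section 1.2, and hypothetical function names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_type: str   # e.g. "inspiral" or "pulsar"
    gps_time: int     # assumed field: GPS second at which the threshold was crossed
    snr: float        # significance (signal-to-noise ratio)


def extract_events(snr_series, t0, threshold, event_type="candidate"):
    """Return one Event per sample whose SNR exceeds the threshold.

    Assumes one sample per second, starting at GPS time t0.
    """
    return [
        Event(event_type, t0 + i, snr)
        for i, snr in enumerate(snr_series)
        if snr > threshold
    ]


def followup_window(event, half_width):
    """Short time interval around an event, used to fetch the many
    instrumental and environmental channels for comparison."""
    return (event.gps_time - half_width, event.gps_time + half_width)


snr_series = [1.0, 2.5, 9.1, 3.0, 8.2]
events = extract_events(snr_series, t0=1000, threshold=8.0)
for ev in events:
    print(ev, followup_window(ev, half_width=5))
```

The first function works on the long, single-channel view; the second yields the short interval over which the multi-channel view is needed.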
2 GriPhyN/LIGO Virtual Data Grid

GriPhyN [5] (Grid Physics Network) is a collaboration funded by the US National Science Foundation to build tools that can handle the very large (petabyte-scale) data requirements of cutting-edge physics experiments. In addition to the LIGO observatory, GriPhyN works with the CMS and ATLAS experiments at CERN's Large Hadron Collider [6] and the Sloan Digital Sky Survey [7].

A key component needed to implement the data pipeline is a virtual data service: a system that dynamically creates the data products requested during the various stages. The data may already have been processed in a certain way, it may be in a file on a storage system, it may be cached, or it may need to be created through computation. The full elaboration of this system will allow complex data pipelines to be set up as virtual data objects, with existing data being transformed in diverse ways.

2.1 Virtual Data

An example of Virtual Data is this: "The first 1000 digits of Pi". It defines a data object without computing it. If this request comes in, we might already have the result in deep archive: should we get it from there, or just rerun the computation? Another example is: "Pi to 1000 places". How can we decide if these are the same request even though the words are different? If someone asks for "Pi to 30 digits", but we already have the results of the first two requests, how can we decide that the new result can be derived easily from the existing ones? These questions lie at the heart of the GriPhyN Virtual Data concept.

At one extreme, only the raw data is kept. Other requests for data, such as obtaining a single channel of data ranging over a large time interval, can be derived from the original data set. At the other extreme, every single data product that has been created (even if it represents an intermediate step not referred to again) can be archived.
Clearly, neither extreme is an efficient solution; however, with Virtual Data Grid (VDG) technology, one can bridge the two. The raw data is of course kept, and some of the derived data products are archived as well. Additionally, data can be distributed among various storage systems, providing opportunities for intelligent data retrieval and replication. The VDG will provide transparent access to virtual data products. To satisfy requests for data efficiently, the VDG needs to make decisions about the instantiation of the various objects. The following are some examples of VDG support for LIGO data:

- Raw data vs. cleaned data channels. Most likely, only the virtual data representing the most interesting clean channels should be instantiated.
- Data composed from smaller pieces of data, such as long-duration frames that could already have been assembled from many short-duration frames.
- Time-frequency images, such as the one constructed during the pulsar search. Most likely the entire time-frequency image will not be archived. However, all its components (short power spectra) might be instantiated. The VDG can then compose the desired time-frequency images on demand.
- Interesting events. Given a strong signal representing a particularly promising event, the engineering data related to the time period of the event will most likely be accessed and filtered often. In this case, the VDG might instantiate preprocessed instrumental and environmental data channels, data that might otherwise exist only in its raw form.

2.2 Simple Virtual Data Request Scenario

A VDG is defined by its Virtual Data Domain, meaning the (possibly infinite) set of all possible data objects that can be produced, together with a naming scheme ("coordinate system") for the Domain. There are functions on the domain that map a name in the domain to a data object. The VDG software is responsible for caching and replicating instantiated data objects.

In our development of the LIGO VDG, we begin with a simple Cartesian product domain, and then add complexity. Let D be the Cartesian product of a time interval with a set of channels; we think of the LIGO data as a map from D to the value of the channel at a given time. In reality, of course, it is complicated by missing data, multiple interferometers, data recorded in different formats, and so on.

In the following, we consider requests for the data from a subdomain of the full domain. Each request is for the data from a subset of the channels for a subinterval of the full time interval. Thus, a request might be written as:

T0= , T1= ; IFO_DCDM_1, IFO_Seis_*

where IFO_DCDM_1 is a channel, and IFO_Seis_* is a regular expression on channel names. Our first task is to create a naming scheme for the virtual data object, each name being a combination of the name of a time interval and the name of a set of channels. The request above could be satisfied if there is a suitable superset file in the replica database, for example this one:

T0= ,T1= ;IFO_*

We need to be able to decide if a given Virtual Data Object (VDO) contains another, or what set of VDOs can be used to create the requested VDO. If C_i is a subset of the channels, and I_i is a subinterval, then tools could be used to combine multiple files (C_1, I_1), (C_2, I_2), ... perhaps as:

- The new file could be (C, I), where C = union C_i and I = intersect I_i: a channel union tool, or
- The new file could be (C, I), where C = intersect C_i and I = union I_i: an interval union tool.

We could thus respond to requests by composing existing files from the distributed storage to form the requested file.

To be effective, we need to develop a knowledge base of the various transformations (i.e., how they are performed, which results are temporary, and which need to be persistent). We need to know the nature of the context in which these transformations occur: a simple case is the Fourier transform, but a more difficult one would be a transformation that uses a certain code, compiled in a certain way, running on a specific machine. This description of context is needed in order to be able to execute the transformations on data sets as well as to determine how to describe them.
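The containment test and the two combination tools described in this section can be sketched as operations on (channel set, time interval) pairs. This is a minimal illustration, assuming half-open integer GPS intervals and explicit channel names rather than patterns; the class and function names are hypothetical and are not part of any GriPhyN software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VDO:
    channels: frozenset   # channel names held by this object
    t0: int               # start of interval (inclusive)
    t1: int               # end of interval (exclusive)

    def contains(self, other: "VDO") -> bool:
        """True if `other` can be cut out of this object alone."""
        return (other.channels <= self.channels
                and self.t0 <= other.t0 and other.t1 <= self.t1)


def channel_union(files):
    """C = union C_i, I = intersect I_i."""
    channels = frozenset().union(*(f.channels for f in files))
    t0 = max(f.t0 for f in files)
    t1 = min(f.t1 for f in files)
    if t0 >= t1:
        raise ValueError("intervals do not overlap")
    return VDO(channels, t0, t1)


def interval_union(files):
    """C = intersect C_i, I = union I_i (intervals must leave no gap)."""
    ordered = sorted(files, key=lambda f: f.t0)
    end = ordered[0].t1
    for f in ordered[1:]:
        if f.t0 > end:
            raise ValueError("gap between intervals")
        end = max(end, f.t1)
    channels = frozenset.intersection(*(f.channels for f in ordered))
    return VDO(channels, ordered[0].t0, end)


a = VDO(frozenset({"A"}), 0, 10)
b = VDO(frozenset({"B"}), 5, 15)
c = VDO(frozenset({"A", "B"}), 0, 10)
d = VDO(frozenset({"A", "C"}), 10, 20)
print(channel_union([a, b]))
print(interval_union([c, d]))
```

A request is then satisfiable from storage if some stored or composed object `contains` the requested one.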
2.3 Generalized Virtual Data Description

The goal of the GriPhyN Virtual Data Grid system is to make it easy for an application or user to access data. As explained above, as a starting point we will service requests consisting of a range of time t_0 to t_1 (specified in GPS seconds), followed by a list of channels. However, we now extend the idea of channel to virtual channel.

Virtual Channels

A virtual channel is a time series, like a real channel, but it may be derived from actual channels, and need not correspond to a channel in the raw data. Some examples of virtual channels are:

- An actual recorded channel, raw.
- An actual recorded channel, but downsampled or resampled to a different sampling rate.
- An arithmetic combination of channels, for example 2C_1 + 3C_2, where C_1 and C_2 are existing channels.
- The actual channel, convolved with a particular calibration response function, and scaled. For example, the X component of the acceleration.
- A channel computed from the actual data channels in different ways depending upon what time interval is requested (e.g., the calibrations changed, the channels were hooked up differently, etc.).
- A channel defined in terms of transformations applied to other virtual channels.

In short, the virtual channels are a set of transformations applied to the raw data. The set of virtual channels would be extendable by the user. As the project progresses, one may want to extend the set of virtual channels to include additional, useful transformations. Thus, if a user is willing to define a virtual channel by specifying all the necessary transformations, it will be entered in the catalog and will be available to all users, programs, and services. New channels can be created from the raw data channels by parameterized filters, for example decimation, heterodyning, whitening, principal components, autocorrelation, and so forth.
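One way to picture a virtual channel is as a named recipe that is evaluated against raw channels only when requested. The following sketch is illustrative and makes several assumptions: channels are plain Python lists, the decimation is naive (it keeps every n-th sample with no anti-alias filter), and the catalog is an in-memory dictionary with hypothetical names.

```python
def combine(c1, c2, a=2.0, b=3.0):
    """Arithmetic combination a*C_1 + b*C_2 of two equally sampled channels."""
    if len(c1) != len(c2):
        raise ValueError("channels must have the same length")
    return [a * x + b * y for x, y in zip(c1, c2)]


def decimate(samples, factor):
    """Naive decimation: keep every `factor`-th sample (no filtering)."""
    return samples[::factor]


# Hypothetical catalog: each virtual channel is a function of the raw data.
virtual_channels = {
    "V1": lambda raw: combine(raw["C1"], raw["C2"]),
    # A virtual channel defined in terms of another virtual channel.
    "V1_dec2": lambda raw: decimate(virtual_channels["V1"](raw), 2),
}

raw = {"C1": [1, 2, 3, 4], "C2": [1, 0, 1, 0]}
v1 = virtual_channels["V1"](raw)
v1_dec2 = virtual_channels["V1_dec2"](raw)
print(v1, v1_dec2)
```

Adding a user-defined virtual channel then amounts to registering one more entry in the catalog.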
Data naming

A crucial step in the creation of the GriPhyN Virtual Data model is the naming scheme for virtual data objects. Semantically, we might think of names as a set of keyword-value pairs, extended by transformations, perhaps something like (T0=123,T1=456,Chan=[A,B*,C?]).pca().decimate(500). The first part, in parentheses, is the keyword-value set; the rest is a sequence of (perhaps parameterized) filters. We could also think of using names that contain an SQL query, or even names that include an entire program to be executed. The syntax could also be expressed in other ways, such as XML, or with metacharacters escaped to build a POSIX-like file name. We could use a syntax like protocol://virtual-data-name to express these different syntaxes in one extensible syntax. However, decisions on naming Virtual Data must be premised on the existing schemes described in Section 1.2.
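To make the example name above concrete, here is a minimal parsing sketch. It assumes the syntax is exactly the illustrative form shown (a parenthesized keyword-value set followed by dotted filter calls) and that channel patterns such as B* and C? follow shell-style wildcard rules; neither is specified by the paper, and the function names are hypothetical.

```python
import fnmatch
import re

NAME_RE = re.compile(r"^\((?P<kv>[^()]*)\)(?P<filters>(?:\.\w+\([^()]*\))*)$")
KV_RE = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\[\]]+)")
FILTER_RE = re.compile(r"\.(\w+)\(([^()]*)\)")


def parse_name(name):
    """Split a virtual data name into (keyword dict, list of filters)."""
    m = NAME_RE.match(name.strip())
    if m is None:
        raise ValueError("not a valid virtual data name: %r" % name)
    keys = {}
    for key, value in KV_RE.findall(m.group("kv")):
        if value.startswith("["):
            # Bracketed value: a comma-separated list of channel patterns.
            keys[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        elif value.isdigit():
            keys[key] = int(value)
        else:
            keys[key] = value.strip()
    filters = [
        (fname, [a.strip() for a in args.split(",") if a.strip()])
        for fname, args in FILTER_RE.findall(m.group("filters"))
    ]
    return keys, filters


def select_channels(available, patterns):
    """Channels matching any pattern (shell-style wildcards assumed)."""
    return [c for c in available
            if any(fnmatch.fnmatchcase(c, p) for p in patterns)]


keys, filters = parse_name("(T0=123,T1=456,Chan=[A,B*,C?]).pca().decimate(500)")
chosen = select_channels(["A", "B1", "B22", "C9", "C10", "D"], keys["Chan"])
print(keys, filters, chosen)
```

Separating the keyword-value part from the filter chain keeps the data selection independent of the transformations applied afterwards.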
3 GriPhyN Support

The goal of GriPhyN is to satisfy user data requests transparently. When a request for a set of virtual channels spanning a given time interval is made, the application (user program) does not need to have any knowledge about the actual data location, or even whether the data has been pre-computed. GriPhyN will deliver the requested virtual channels by retrieving existing data from long-term storage, by retrieving it from data caches containing previously requested virtual channels, or by calculating the desired channels from the available data (if possible).

When satisfying requests from users, data may be in slow or fast storage, on tape, at great distance, or on nearby spinning disk. The data may be in small pieces (~1 second) or in long contiguous intervals (~1 day), and conversion from one to another requires computational and network resources. A given request for the data from a given time interval can thus be constructed by joining many local, small files, by fetching a distant file that contains the entire interval, or by a combination of these techniques. The heart of this project is the understanding and solution of this optimization problem, and an implementation of a real data server using GriPhyN tools.

3.1 User Requests

The initial implementation of the GriPhyN system will accept requests in the semantic form: t0,t1; A,B,C,..., where t0, t1 is a time interval, and A,B,C,... are virtual channels. We assume that the order in which the channels are listed does not affect the outcome of the request. The syntax of the virtual channel is yet to be determined. In the simplest form, a virtual channel resulting from a transformation Tr_x on a channel C_y would be Tr_x(C_y); additional attributes (such as transformation parameters or other input channels) can be specified as additional parameters in the list: Tr_x(C_y, C_z; ?, ?, ), depending on the transformation.
The transformation-specific information will be stored in the Transformation Catalog described below.

The semantic content of the request is defined above, but not the syntax. The actual formulation for the GriPhyN-LIGO VDG will be an XML document, though the precise schema is not yet known. We intend to implement requests as a very small document (for efficiency), but with links to context, so that a machine can create the correct environment for executing the request. We expect much of the schema development to be implemented with nested XML namespaces [8].

3.2 Data access, cost performance estimation

When a user or computer issues a request to the Virtual Data Grid, it is initially received by a Request Manager service and sent for processing to the Metadata Catalog, which provides the set of logical files that satisfies the request, if such a set exists. The file names are retrieved from the Metadata Catalog based on a set of attributes. The logical files found in the Metadata Catalog are sent to the Replica Catalog, which maps each of them to a unique physical file, perhaps the closest, in some sense, of many such replicas.
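The lookup chain just described, from request attributes to logical files to a single replica, can be sketched with two in-memory tables. Everything here is hypothetical: the dictionaries stand in for the Metadata Catalog and Replica Catalog, the file names and locations are invented, and "closest" is modelled as a made-up access cost. It is not the Globus API.

```python
# Hypothetical Metadata Catalog: attributes describing each logical file.
metadata_catalog = [
    {"lfn": "lfn-A", "t0": 0, "t1": 100, "channels": {"X", "Y"}},
    {"lfn": "lfn-B", "t0": 100, "t1": 200, "channels": {"X"}},
]

# Hypothetical Replica Catalog: logical file -> [(location, access cost)].
replica_catalog = {
    "lfn-A": [("tape-at-observatory/lfn-A", 50.0), ("local-disk/lfn-A", 1.0)],
    "lfn-B": [("remote-disk/lfn-B", 10.0)],
}


def find_logical_files(t0, t1, channels):
    """Logical files whose interval and channel set cover the request."""
    return [
        entry["lfn"]
        for entry in metadata_catalog
        if entry["t0"] <= t0 and t1 <= entry["t1"]
        and set(channels) <= entry["channels"]
    ]


def pick_replica(lfn):
    """Map a logical file to one replica: the cheapest to access, or None."""
    replicas = replica_catalog.get(lfn, [])
    if not replicas:
        return None
    return min(replicas, key=lambda replica: replica[1])[0]


logical = find_logical_files(10, 20, {"X"})
print(logical, [pick_replica(lfn) for lfn in logical])
```

A request that matches no logical file, or a logical file with no replica, is the case in which the data would have to be computed instead.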
The information about the actual file existence and location (provided by the Replica Catalog) is passed to the Request Manager, which determines how to deliver the data. If the requested data is present, the Request Manager still needs to determine whether it is cheaper to recalculate the data or to access it. When considering the cost of referencing data, the cost of accessing the various replicas of the data (if present) needs to be estimated. If the data is not present, the possibility and cost of calculating it need to be evaluated. In order to make these decisions, the Request Manager queries the Information Catalog. The latter can provide information about the available computational resources, network latencies, bandwidth, etc.

The next major service is the Request Planner. It is in charge of creating a plan for the execution of the transformations on a given data set and/or creating a plan for the retrieval of data from a storage system. The Request Planner has access to the system information retrieved by the Request Manager. To evaluate the cost of re-computation, the cost of the transformations needs to be known. This information, as well as the input and parameters required by a given transformation, code location, etc., is stored in the Transformation Catalog. The Request Planner uses information about the transformations that have been requested to estimate the relative costs of computation vs. caching, and so on. The system could keep a record of how long the various transformations took, and could use this performance history to estimate costs. This record and the analytical performance estimates will be maintained in the Transformation Catalog. The performance data needed in the evaluation of re-computation and replica access costs (such as network performance, availability of computational resources, etc.) will be provided by other information services, which are part of the Globus toolkit [9], a software environment designed for the Grid infrastructure.

Once the Request Planner decides on a course of action, the Request Executor is put in charge of carrying out the plan, which involves the allocation of resources, data movement, fault monitoring, etc. The Request Executor will use the existing Globus infrastructure to access the data and the computational Grid, and run large jobs reliably under the control of systems such as Condor [10]. As a result of the completion of the request, the various catalogs might need to be updated.

3.3 Proactive Data Replication

Simply retrieving data from the replica catalog is not sufficient. The system must take a proactive approach to creating replica files and decide whether the results of transformations will be needed again. For example, if there is a request for a single channel spanning a long time interval and the replica catalog contains only multi-channel files spanning short time periods, then a significant amount of processing is needed to create the requested file (many files need to be opened and a small amount of data needs to be retrieved from each of them). However, once this transformation is performed, the resulting data can be placed in the replica catalog for future use, thus reducing the cost of subsequent requests. New replicas should also be created for frequently accessed data, if accessing the data from the available locations is too costly. For example, it may be useful to replicate data that initially resides on tape to a local file system. Since the Request Manager has information about data existence, location and computation costs, it will also be responsible for making decisions about replica creation.

4 Conclusions

Although the GriPhyN project is only in its initial phase, its potential to enhance the research of individual physicists and enable their widespread collaboration is great. This paper presents a first step in bridging the needs of the physics community and the research focus of computer science, to further the use and deployment of grid technologies [11] on a wide scale, in the form of a Virtual Data Grid.

References

[1] Barish, B. C. and Weiss, R., Physics Today, Oct 1999, pp ; also
[2] Taylor, G., and Weisberg, J. M., Astrophys. J. 345, 434 (1989).
[3] Anderson, S., Blackburn, K., Lazzarini, A., Majid, W., Prince, T., Williams, R., The LIGO Data Analysis System, Proc. of the XXXIVth Rencontres de Moriond, January 23-30, 1999, Les Arcs, France, also
[4] Blackburn, K., Lazzarini, A., Prince, T., Williams, R., XSIL: Extensible Scientific Interchange Language, Lect. Notes Comput. Sci (1999)
[5] GriPhyN, Grid Physics Net,
[6] CERN Large Hadron Collider,
[7] Sloan Digital Sky Survey,
[8] For information on XML and schemas, see
[9] Globus:
[10] Condor:
[11] Gridforum:
Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007
Data Management in an International Data Grid Project Timur Chabuk 04/09/2007 Intro LHC opened in 2005 several Petabytes of data per year data created at CERN distributed to Regional Centers all over the
Neutron Stars. How were neutron stars discovered? The first neutron star was discovered by 24-year-old graduate student Jocelyn Bell in 1967.
Neutron Stars How were neutron stars discovered? The first neutron star was discovered by 24-year-old graduate student Jocelyn Bell in 1967. Using a radio telescope she noticed regular pulses of radio
(Possible) HEP Use Case for NDN. Phil DeMar; Wenji Wu NDNComm (UCLA) Sept. 28, 2015
(Possible) HEP Use Case for NDN Phil DeMar; Wenji Wu NDNComm (UCLA) Sept. 28, 2015 Outline LHC Experiments LHC Computing Models CMS Data Federation & AAA Evolving Computing Models & NDN Summary Phil DeMar:
The Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
Distributed Database Access in the LHC Computing Grid with CORAL
Distributed Database Access in the LHC Computing Grid with CORAL Dirk Duellmann, CERN IT on behalf of the CORAL team (R. Chytracek, D. Duellmann, G. Govi, I. Papadopoulos, Z. Xie) http://pool.cern.ch &
Gravitational waves: Understanding and Detection
Gravitational waves: Understanding and Detection Final draft Physics 222 November 11, 1999 Aaron Astle Dan Hale Dale Kitchen Wesley Krueger Abstract Gravitational waves carry information about catastrophic
StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data
: High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,
REALIZING EINSTEIN S DREAM Exploring Our Mysterious Universe
REALIZING EINSTEIN S DREAM Exploring Our Mysterious Universe The End of Physics Albert A. Michelson, at the dedication of Ryerson Physics Lab, U. of Chicago, 1894 The Miracle Year - 1905 Relativity Quantum
Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data
Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data Yinong Chen 2 Big Data Big Data Technologies Cloud Computing Service and Web-Based Computing Applications Industry Control Systems
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
FAWN - a Fast Array of Wimpy Nodes
University of Warsaw January 12, 2011 Outline Introduction 1 Introduction 2 3 4 5 Key issues Introduction Growing CPU vs. I/O gap Contemporary systems must serve millions of users Electricity consumed
Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001
A comparison of the OpenGIS TM Abstract Specification with the CIDOC CRM 3.2 Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001 1 Introduction This Mapping has the purpose to identify, if the OpenGIS
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Big Data Analytics. for the Exploitation of the CERN Accelerator Complex. Antonio Romero Marín
Big Data Analytics for the Exploitation of the CERN Accelerator Complex Antonio Romero Marín Milan 11/03/2015 Oracle Big Data and Analytics @ Work 1 What is CERN CERN - European Laboratory for Particle
High-Volume Data Warehousing in Centerprise. Product Datasheet
High-Volume Data Warehousing in Centerprise Product Datasheet Table of Contents Overview 3 Data Complexity 3 Data Quality 3 Speed and Scalability 3 Centerprise Data Warehouse Features 4 ETL in a Unified
August, 2000 LIGO-G000313-00-D. http://www.ligo.caltech.edu/~smarka/sn/
LIGO-G000313-00-D Proposal to the LSC for Entry of LIGO into the Supernova Nova Early Warning System (SNEWS) and Prototype Development of Real-Time LIGO Supernova Alert 1 Barry Barish, Kenneth Ganezer(CSUDH),
The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets
The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets Large data collections appear in many scientific domains like climate studies. Users and
CHAPTER 3: DIGITAL IMAGING IN DIAGNOSTIC RADIOLOGY. 3.1 Basic Concepts of Digital Imaging
Physics of Medical X-Ray Imaging (1) Chapter 3 CHAPTER 3: DIGITAL IMAGING IN DIAGNOSTIC RADIOLOGY 3.1 Basic Concepts of Digital Imaging Unlike conventional radiography that generates images on film through
telemetry Rene A.J. Chave, David D. Lemon, Jan Buermans ASL Environmental Sciences Inc. Victoria BC Canada [email protected] I.
Near real-time transmission of reduced data from a moored multi-frequency sonar by low bandwidth telemetry Rene A.J. Chave, David D. Lemon, Jan Buermans ASL Environmental Sciences Inc. Victoria BC Canada
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 Twenty-first
Data Centric Computing Revisited
Piyush Chaudhary Technical Computing Solutions Data Centric Computing Revisited SPXXL/SCICOMP Summer 2013 Bottom line: It is a time of Powerful Information Data volume is on the rise Dimensions of data
16th International Conference on Control Systems and Computer Science (CSCS16 07)
16th International Conference on Control Systems and Computer Science (CSCS16 07) TOWARDS AN IO INTENSIVE GRID APPLICATION INSTRUMENTATION IN MEDIOGRID Dacian Tudor 1, Florin Pop 2, Valentin Cristea 2,
ASKAP Science Data Archive: Users and Requirements CSIRO ASTRONOMY AND SPACE SCIENCE (CASS)
ASKAP Science Data Archive: Users and Requirements CSIRO ASTRONOMY AND SPACE SCIENCE (CASS) Jessica Chapman, Data Workshop March 2013 ASKAP Science Data Archive Talk outline Data flow in brief Some radio
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
165 points. Name Date Period. Column B a. Cepheid variables b. luminosity c. RR Lyrae variables d. Sagittarius e. variable stars
Name Date Period 30 GALAXIES AND THE UNIVERSE SECTION 30.1 The Milky Way Galaxy In your textbook, read about discovering the Milky Way. (20 points) For each item in Column A, write the letter of the matching
LIGO Cybersecurity Status
LIGO Cybersecurity Status Albert Lazzarini NSF Annual Review of LIGO LIGO Hanford Observatory October 23-25, 2006 1 Outline Recommendations from last review Overview of cybersecurity within LIGO Laboratory
Data Grids. Lidan Wang April 5, 2007
Data Grids Lidan Wang April 5, 2007 Outline Data-intensive applications Challenges in data access, integration and management in Grid setting Grid services for these data-intensive application Architectural
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
EII - ETL - EAI What, Why, and How!
IBM Software Group EII - ETL - EAI What, Why, and How! Tom Wu 巫 介 唐, [email protected] Information Integrator Advocate Software Group IBM Taiwan 2005 IBM Corporation Agenda Data Integration Challenges and
Harmonics and Noise in Photovoltaic (PV) Inverter and the Mitigation Strategies
Soonwook Hong, Ph. D. Michael Zuercher Martinson Harmonics and Noise in Photovoltaic (PV) Inverter and the Mitigation Strategies 1. Introduction PV inverters use semiconductor devices to transform the
Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities
Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling
M3039 MPEG 97/ January 1998
INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO INFORMATION ISO/IEC JTC1/SC29/WG11 M3039
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
World-wide online monitoring interface of the ATLAS experiment
World-wide online monitoring interface of the ATLAS experiment S. Kolos, E. Alexandrov, R. Hauser, M. Mineev and A. Salnikov Abstract The ATLAS[1] collaboration accounts for more than 3000 members located
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Gravitational waves from neutron stars binaries: accuracy and tidal effects in the late inspiral S. Bernuzzi TPI-PAF FSU Jena / SFB-TR7
Gravitational waves from neutron stars binaries: accuracy and tidal effects in the late inspiral S. Bernuzzi TPI-PAF FSU Jena / SFB-TR7 M. Thierfelder, SB, & B.Bruegmann, PRD 84 044012 (2011) SB, MT, &
Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT)
Page 1 Electronic Communications Committee (ECC) within the European Conference of Postal and Telecommunications Administrations (CEPT) ECC RECOMMENDATION (06)01 Bandwidth measurements using FFT techniques
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
WebSphere Business Monitor
WebSphere Business Monitor Monitor models 2010 IBM Corporation This presentation should provide an overview of monitor models in WebSphere Business Monitor. WBPM_Monitor_MonitorModels.ppt Page 1 of 25
Reference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide
SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
Deploying a distributed data storage system on the UK National Grid Service using federated SRB
Deploying a distributed data storage system on the UK National Grid Service using federated SRB Manandhar A.S., Kleese K., Berrisford P., Brown G.D. CCLRC e-science Center Abstract As Grid enabled applications
SQL Server Administrator Introduction - 3 Days Objectives
SQL Server Administrator Introduction - 3 Days INTRODUCTION TO MICROSOFT SQL SERVER Exploring the components of SQL Server Identifying SQL Server administration tasks INSTALLING SQL SERVER Identifying
White Paper. How Streaming Data Analytics Enables Real-Time Decisions
White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream
Non-Data Aided Carrier Offset Compensation for SDR Implementation
Non-Data Aided Carrier Offset Compensation for SDR Implementation Anders Riis Jensen 1, Niels Terp Kjeldgaard Jørgensen 1 Kim Laugesen 1, Yannick Le Moullec 1,2 1 Department of Electronic Systems, 2 Center
Technical. Overview. ~ a ~ irods version 4.x
Technical Overview ~ a ~ iRODS version 4.x The Integrated Rule-Oriented Data System (iRODS) is open-source data management software that lets users access, manage, and share data across any type or number
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don't move data to workers; move workers to the data! Store data on the local disks of nodes
Online Filtering for Radar Detection of Meteors
1, Gustavo O. Alves 1, José M. Seixas 1, Fernando Marroquim 2, Cristina S. Vianna 2, Helio Takai 3 1 Signal Processing Laboratory, COPPE/Poli, Federal University of Rio de Janeiro, Brazil. 2 Physics Institute,
Testing the In-Memory Column Store for in-database physics analysis. Dr. Maaike Limper
Testing the In-Memory Column Store for in-database physics analysis Dr. Maaike Limper About CERN CERN - European Laboratory for Particle Physics Support the research activities of 10 000 scientists from
Timing Errors and Jitter
Timing Errors and Jitter Background Mike Story In a sampled (digital) system, samples have to be accurate in level and time. The digital system uses the two bits of information the signal was this big
High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances
High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances Highlights IBM Netezza and SAS together provide appliances and analytic software solutions that help organizations improve
Planning for workflow construction and maintenance on the Grid
Planning for workflow construction and maintenance on the Grid Jim Blythe, Ewa Deelman, Yolanda Gil USC Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292 USA {blythe,deelman,gil}@isi.edu
Gravitational wave group
ICRR Review @Kashiwa 19-20, October 2006 Gravitational wave group Kazuaki Kuroda LCGT/CLIO/TAMA Collaboration Gravitational wave (GW) is the propagation of spacetime distortion predicted by Einstein's
Use Data Budgets to Manage Large Acoustic Datasets
Use Data Budgets to Manage Large Acoustic Datasets Introduction Efforts to understand the health of the ocean have increased significantly in the recent past. These efforts involve among other things,
Business Analytics and the Nexus of Information
Business Analytics and the Nexus of Information 2 The Impact of the Nexus of Forces 4 From the Gartner Files: Information and the Nexus of Forces: Delivering and Analyzing Data 6 About IBM Business Analytics
Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce
Elastic Application Platform for Market Data Real-Time Analytics Can you deliver real-time pricing, on high-speed market data, for real-time critical for E-Commerce decisions? Market Data Analytics applications
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Chapter 1: Introduction
Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases
2 Associating Facts with Time
TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points
Data Deduplication: An Essential Component of your Data Protection Strategy
WHITE PAPER: THE EVOLUTION OF DATA DEDUPLICATION Data Deduplication: An Essential Component of your Data Protection Strategy JULY 2010 Andy Brewerton CA TECHNOLOGIES RECOVERY MANAGEMENT AND DATA MODELLING
Real-time Big Data Analytics with Storm
Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap
An Active Packet can be classified as
Mobile Agents for Active Network Management By Rumeel Kazi and Patricia Morreale Stevens Institute of Technology Contact: rkazi,[email protected] Abstract-Traditionally, network management systems
The LIGO Data Monitoring Tool. John G. Zweizig LIGO, Caltech
The LIGO Data Monitoring Tool John G. Zweizig LIGO, Caltech GDS Final Design Review Pasadena, May 7, 1999 1 A) Introduction 1. Purpose Contents 2. S/W Requirements 3. H/W Requirements 4. Architecture B)
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Platforms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Jeff Thomas Tom Holmes Terri Hightower. Learn RF Spectrum Analysis Basics
Jeff Thomas Tom Holmes Terri Hightower Learn RF Spectrum Analysis Basics Learning Objectives Name the major measurement strengths of a swept-tuned spectrum analyzer Explain the importance of frequency
Email: [email protected]
USE OF VIRTUAL INSTRUMENTS IN RADIO AND ATMOSPHERIC EXPERIMENTS P.N. VIJAYAKUMAR, THOMAS JOHN AND S.C. GARG RADIO AND ATMOSPHERIC SCIENCE DIVISION, NATIONAL PHYSICAL LABORATORY, NEW DELHI 110012, INDIA
Implementation of Digital Signal Processing: Some Background on GFSK Modulation
Implementation of Digital Signal Processing: Some Background on GFSK Modulation Sabih H. Gerez University of Twente, Department of Electrical Engineering [email protected] Version 4 (February 7, 2013)
NoSQL. Thomas Neumann 1 / 22
NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,
Circumventing Picture Archiving and Communication Systems Server with Hadoop Framework in Health Care Services
Journal of Social Sciences 6 (3): 310-314, 2010 ISSN 1549-3652 2010 Science Publications Circumventing Picture Archiving and Communication Systems Server with Hadoop Framework in Health Care Services 1
An IDL for Web Services
An IDL for Web Services Interface definitions are needed to allow clients to communicate with web services Interface definitions need to be provided as part of a more general web service description Web
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Many corporations and Independent Software Vendors considering cloud computing adoption face a similar challenge: how should
Doppler. Doppler. Doppler shift. Doppler Frequency. Doppler shift. Doppler shift. Chapter 19
Doppler Doppler Chapter 19 A moving train with a trumpet player holding the same tone for a very long time travels from your left to your right. The tone changes relative the motion of you (receiver) and
SAP SE - Legal Requirements and Requirements
Finding the signals in the noise Niklas Packendorff @packendorff Solution Expert Analytics & Data Platform Legal disclaimer The information in this presentation is confidential and proprietary to SAP and
Storage in Database Systems. CMPSCI 445 Fall 2010
Storage in Database Systems CMPSCI 445 Fall 2010 1 Storage Topics Architecture and Overview Disks Buffer management Files of records 2 DBMS Architecture Query Parser Query Rewriter Query Optimizer Query
Cloud Computing. Lecture 5 Grid Case Studies 2014-2015
Cloud Computing Lecture 5 Grid Case Studies 2014-2015 Up until now Introduction. Definition of Cloud Computing. Grid Computing: Schedulers Globus Toolkit Summary Grid Case Studies: Monitoring: TeraGRID
