NSLS-II Data Management Framework Arman Arkilic Brookhaven National Lab NSLS-II



Similar documents
Big Data Solutions. Portal Development with MongoDB and Liferay. Solutions

How To Scale Out Of A Nosql Database

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

NoSQL for SQL Professionals William McKnight

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Detailed Design Report

MS 10978A Introduction to Azure for Developers

MongoDB Developer and Administrator Certification Course Agenda

THE TESLA TEST FACILITY AS A PROTOTYPE FOR THE GLOBAL ACCELERATOR NETWORK

MS 20487A Developing Windows Azure and Web Services

Date: December 2009 Version: 1.0. How Does Xen Work?

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

MEAN/Full Stack Web Development - Training Course Package

NoSQL - What we ve learned with mongodb. Paul Pedersen, Deputy CTO paul@10gen.com DAMA SF December 15, 2011

NSLS-II Control System Network Architecture

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

MicroStrategy Course Catalog

Developing Windows Azure and Web Services

MedBroker A DICOM and HL7 Integration Product. Whitepaper

How To Handle Big Data With A Data Scientist

Big Data? Definition # 1: Big Data Definition Forrester Research

IO Visor: Programmable and Flexible Data Plane for Datacenter s I/O

Course 10978A Introduction to Azure for Developers

Can the Elephants Handle the NoSQL Onslaught?

Evolving Data Warehouse Architectures

Computer Programming for the Social Sciences

SPring-8 Experimental Data Repository [SP8DR]

Designing a Microsoft SQL Server 2005 Infrastructure

Storage of the Experimental Data at SOLEIL. Computing and Electronics

Overview of IBM Cloud Integration

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Hadoop. Sunday, November 25, 12

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Cloud Storage Specialist Certification Self-Study Kit Bundle

Case Study : 3 different hadoop cluster deployments

PES. High Availability Load Balancing in the Agile Infrastructure. Platform & Engineering Services. HEPiX Bologna, April 2013

Big Data Challenges in Bioinformatics

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Cisco WAAS for Isilon IQ

Performance Evaluation of NoSQL Systems Using YCSB in a resource Austere Environment

Realtime Motion Control -Update. Uwe Ristau


DataNet Flexible Metadata Overlay over File Resources

So What s the Big Deal?

Course Code NCS2013: SharePoint 2013 No-code Solutions for Office 365 and On-premises

Educational Collaborative Develops Big Data Solution with MongoDB

Course Description for the Bachelors Degree in Library and Information Science

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Accelerate the Performance of Virtualized Databases Using PernixData FVP Software

CHESS DAQ* Introduction

Milestone Solution Partner IT Infrastructure MTP Certification Report Scality RING Software-Defined Storage

Big Data Visualization and Dashboards

Architecting Applications to Scale in the Cloud

Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

Middleware- Driven Mobile Applications

Building Heavy Load Messaging System

SOA REFERENCE ARCHITECTURE: WEB TIER

Upgrading a Telecom Billing System with Intel Xeon Processors

Performance Tuning and Optimizing SQL Databases 2016

Software Performance, Scalability, and Availability Specifications V 3.0

MS 50547B Microsoft SharePoint 2010 Collection and Site Administration

How To Switch A Layer 1 Matrix Switch On A Network On A Cloud (Network) On A Microsoft Network (Network On A Server) On An Openflow (Network-1) On The Network (Netscout) On Your Network (

Software Architecture Document

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

Accelerating and Simplifying Apache

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January Permabit Technology Corporation

Data Modeling for Big Data

Oracle Big Data SQL Technical Update

Information Retrieval Elasticsearch

WOS. High Performance Object Storage

Server Virtualization with Windows Server Hyper-V and System Center

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Ad Hoc Analysis of Big Data Visualization

The Lab and The Factory

Inside Track Research Note. In association with. Enterprise Storage Architectures. Is it only about scale up or scale out?

This module provides an overview of service and cloud technologies using the Microsoft.NET Framework and the Windows Azure cloud.

Transcription:

NSLS-II Data Management Framework Arman Arkilic Brookhaven National Lab NSLS-II

Motivation Modern synchrotron experiments require frameworks that provide scientists with data mining/analysis tools Given the variety of experiments a good data management framework must accommodate semi-structured and unstructured data A centralized framework that serves to multiple scientific applications is more feasible than various frameworks that serve to specific scientific applications for facilities File I/O is the bottleneck for most data analysis: Text files are not query-able and impossible to maintain in large scale distributed systems Experiment-specific databases requires reinventing the wheel for each new experiment and can not perform additional tasks that occur with changing needs Text files do not support RESTful interfaces(i.e. one can t easily develop a web service as they would using a database) And pretty much any reason that makes databases better than text files

Beamline Service Architecture

Beamline Data Management Architecture databroker Data from experiments pvaccess OR HTTP pvaccess OR HTTP Experimental Data for Analysis DataBroker Inner Facing API metadatastore framestore Channel Archiver* Olog* Long term storage

Components databroker databroker is a server that provides write and read access to experimental data frameworks underneath it. Acts as a glue for various underlying services metadatastore metadatastore is a service that is used in order to record metadata in beamline experiments. framestore framestore is a service that provides means to manage images and/or any NDimensional data from experiments Channel Archiver Channel archiver records signals from various hardware and provides means to query time series data Olog Olog is an operation logbook that keeps track of a user s experimental activity providing various APIs and web clients

MongoDB Web Service Deployment + motor Primary Secondary

metadatastore, not so non-relational sample_specs event_descriptor runstart event_descriptor events beamline_config events custom run_stop

Middle-layer Services The middle layer of the beamline applications are composed of: metadatastore, filestore, olog, archiver appliance, channel finder, analysisstore, and ophyd. databroker and bluesky are the higher level applications that provides users with the ability to control their experiment, while databroker acts as a gateway to experimental data. metadatastore is the primary source of storage for scans, sweeps, etc. metadatastore is implemented as a RESTful web service on Tornado with a NoSQL mongodb backend. It consists of 4 major collections: RunStart, EventDescriptor, Event, RunStop. One can think of RunStart and RunStop as the head and tail of a series of events that happen throughout an experiment. EventDescriptor(s) contain information regarding the data acquisition Event(s). In other words, they contain information about what are the data_keys that are being captured, what their shapes and data types are. This design pattern allows NSLS-II middle-layer to tackle many challenges including asynchronous scans, custom hardware brought from an external source, and varying data formats among experiments. filestore is the component of the middle-layer that keeps track of experimental files. These experimental files are generated via EPICS areadetector module or any custom area detector. file store s NoSQL mongo backend allows users to save any sort of information about their images, while providing links to metadatastore runs. This way, any data point in any scan can easily be corresponded to an image, providing scientists with data mining techniques they didn t have access earlier. databroker is the pathway for data analysis tools to access data without any knowledge of experimental data (implementation of Daron Chabot s architecture). It is implemented in Python and provides users with broker objects that are Pandas data frames, allowing semi and un-structured experiemental data to be accessed within scientific Python framework

databroker

Production Library Performance 110K run_start/event_descriptor and 657 million events 275GB of data in total Finding a unique descriptor or runstart takes ~180 microseconds Finding events given descriptor(the most common application) takes ~ 43ms Finding a runstart out of 110.6K takes ~2.5 ms

For more information: http://nsls-ii.github.io/index.html

Acknowledgements Bob Dalesio Daron Chabot Daniel Allan Thomas Caswell