DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID



Similar documents
How To Use Game To Learn From Data

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Instruments in Grid: the New Instrument Element

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

VisIVO, an open source, interoperable visualization tool for the Virtual Observatory

VisIVO, a VO-Enabled tool for Scientific Visualization and Data Analysis: Overview and Demo

Mr. Apichon Witayangkurn Department of Civil Engineering The University of Tokyo

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence

Web Cloud Architecture

Information Management course

The Scientific Data Mining Process

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Client/server is a network architecture that divides functions into client and server

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Apache Hama Design Document v0.6

THE CCLRC DATA PORTAL

Knowledge Discovery from patents using KMX Text Analytics

A Collaborative Approach to Building Personal Knowledge Networks or How to Build a Knowledge Advantage Machine?

Learn Oracle WebLogic Server 12c Administration For Middleware Administrators

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

How To Retire A Legacy System From Healthcare With A Flatirons Eas Application Retirement Solution

Building Semantic Content Management Framework

IT Infrastructure and Emerging Technologies

Computational Science and Informatics (Data Science) Programs at GMU

Massive Cloud Auditing using Data Mining on Hadoop

Cloud application for water resources modeling. Faculty of Computer Science, University Goce Delcev Shtip, Republic of Macedonia

What Is the Java TM 2 Platform, Enterprise Edition?

Amit Sheth & Ajith Ranabahu, Presented by Mohammad Hossein Danesh

Learning from Big Data in

NASA s Big Data Challenges in Climate Science

CHAPTER 1 INTRODUCTION

Master of Science in Health Information Technology Degree Curriculum

Introduction. A. Bellaachia Page: 1

zen Platform technical white paper

Introduction to Data Mining

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Data Mining Challenges and Opportunities in Astronomy

Introduction to Data Mining

Google Web Toolkit (GWT) Architectural Impact on Enterprise Web Application

Cluster, Grid, Cloud Concepts

An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

How To Teach Data Science

Figure 1: Architecture of a cloud services model for a digital education resource management system.

The PACS Software System. (A high level overview) Prepared by : E. Wieprecht, J.Schreiber, U.Klaas November, Issue 1.

A Service for Data-Intensive Computations on Virtual Clusters

2012 LABVANTAGE Solutions, Inc. All Rights Reserved.

The CMS analysis chain in a distributed environment

CYBERINFRASTRUCTURE FRAMEWORK FOR 21 st CENTURY SCIENCE AND ENGINEERING (CIF21)

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Zhenping Liu *, Yao Liang * Virginia Polytechnic Institute and State University. Xu Liang ** University of California, Berkeley

How to Build an E-Commerce Application using J2EE. Carol McDonald Code Camp Engineer

OBIEE 11g Analytics Using EMC Greenplum Database

Data analysis of L2-L3 products

White Paper: 1) Architecture Objectives: The primary objective of this architecture is to meet the. 2) Architecture Explanation

High Productivity Data Processing Analytics Methods with Applications

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Visualisation in the Google Cloud

Sanjeev Kumar. contribute

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

EDG Project: Database Management Services

The Impact of PaaS on Business Transformation

KNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery

ENTERPRISE DOCUMENTS & RECORD MANAGEMENT

CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series

Data Lab System Architecture

Data Center Virtualization and Cloud QA Expertise

Google App Engine f r o r J av a a v a (G ( AE A / E J / )

SAP HANA Data Center Intelligence - Overview Presentation

Journal of Global Research in Computer Science RESEARCH SUPPORT SYSTEMS AS AN EFFECTIVE WEB BASED INFORMATION SYSTEM

(Possible) HEP Use Case for NDN. Phil DeMar; Wenji Wu NDNComm (UCLA) Sept. 28, 2015

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Transcription:

DAME Astrophysical DAta Mining & Exploration on GRID M. Brescia S. G. Djorgovski G. Longo & DAME Working Group Istituto Nazionale di Astrofisica Astronomical Observatory of Capodimonte, Napoli Department of Physics Sciences, Università Federico II, Napoli California Institute of Technology, Pasadena

The Problem Astrophysics communities share the same basic requirement: dealing with massive distributed datasets that they want to integrate together with services In this sense Astrophysics follows same evolution of other scientific disciplines: the growth of data is reaching historic proportions while data doubles every year, useful information seems to be decreasing, creating a growing gap between the generation of data and our understanding of it Required understanding include knowing how to access, retrieve, analyze, mine and integrate data from disparate sources But on the other hand, it is obvious that a scientist could not and does not want to become an expert in its science and in Computer Science or in the fields of algorithms and ICT In most cases (for mean square astronomers) algorithms for data processing and analysis are already available to the end user (sometimes himself has implemented over the years, private routines/pipelines to solve specific problems). These tools often are not scalable to distributed computing environments or are too difficult to be migrated on a GRID infrastructure

A Solution So far, our idea is to provide: User friendly GRID scientific gateway to easy the access, exploration, processing and understanding of the massive data sets federated under standards according Vobs (Virtual Observatory) rules There are important treasons why toadopt texisting iti Vb Vobsstandards: d long term interoperabilityof data, available e infrastructure support for data handling aspect in the future projects Standards for data representation are not sufficient. This useful feature needs to be extended to data analysis and mining methods and algorithms standardization process. It basically means to define standards in terms of ontologies and well defined taxonomy of functionalities to be applied in the astrophysical use cases The natural computing environment for the MDS processing is GRID, but again, we need to define standards in the development of higher level interfaces, in order to: isolate end user (astronomer) from technical details of VObs and GRID use and configuration; make it easier to combine existing services and resources into experiments;

The Required Approach At the end, to define, design andimplement all these standards, a new scientific discipline profile arises: the ASTROINFORMATICS, whose paradigm is based on the following scheme Unsupervised methods Associative networks Clustering Principal components Self Organizing Maps Data Sources Images Catalogs Time series Simulations Information Extracted Shapes & Patterns Science Metadata Distributions & Frequencies Model Parameters KDD Tools New Knowledge or causal connections between physical events within the science domain Supervised methods Neural Networks Bayesian Networks Support Vector Machines

The new science field Any observed (simulated) datum p defines a point (region) in a subset of R N Example: experimental setup (spatial and spectral resolution, limiting mag, limiting surface brightness, etc.) parameters RA and dec λ, time fluxes polarization The computational cost of DM: N = no. of data vectors, D = no. of data dimensions K = no. of clusters chosen, K max = max no. of clusters tried I = no. of iterations, M = no. of Monte Carlo trials/partitions K means: K N I D Expectation Maximization: K N I D 2 Monte Carlo Cross Validation: M K max2 N I D 2 Correlations ~ N log N or N 2, ~ D k (k 1) Likelihood, Bayesian ~ N m (m 3), ~ D k (k 1) SVM > ~ (NxD) 3 ASTROINFORMATICS (emerging field) N points in a DxK dimensional parameter space: N >10 9 D>>100 K>10

The SCoPE GRID Infrastructure SCoPE : Sistema Cooperativo distribuito ad alte Prestazioni per Elaborazioni Scientifiche Multidisciplinari (High Performance Cooperative distributed system for multidisciplinary scientific applications) Objectives: Innovative and original software for fundamental scientific research High performance Data & Computing Center for multidisciplinary applications Grid infrastructure and middleware INFNGRID LCG/gLite Compatibility with EGEE middleware Interoperability with the other three PON 1575 projects and SPACI in GRISU Integration in the Italian and European Gid Grid Infrastructure CAMPUS GRID ASTRONOMICAL MEDICINE OBSERVATORY CSI ENGINEERING ENGINEERING Fiber Optic Already Connected Work in Progress The SCoPE Data Center 33 Racks (of which 10 for Tier2 ATLAS) 304 Servers for a total of 2.432 procs 170 TeraByte storage 5 remote sites (2 in progress)

What is DAME DAME is a joint effort between University Federico II, INAF OACN and Caltech aimed at implementing (as web application) a suite (scientific gateway) of data analysis, exploration, mining and visualization tools, on top of virtualized distributed computing environment. http://voneural.na.infn.it/ na it/ Technical and management info Documents Science cases http://dame.na.infn.it/ Web application PROTOTYPE

What is DAME In parallel with the Suite R&D process, all data processing algorithms (foreseen to be plugged in) have been massively tested on real astrophysical cases. http://voneural.na.infn.it/ Technicaland and management info Documents Science cases Also, under design a web application for data exploration on globular clusters (VOGCLUSTERS)

DAME Work breakdown Data (storage) Semantic construction of BoKs BoK Models & Algorithms Transparent computing Infrastructure (GRID, CLOUD, etc.) PCA MLP SVM SOM PPS MLPGA NEXT Application DAME engine Catalogs and metadata results

user The DAME architecture FRONT END WEB APPL. GUI XML Restful, Stateless Web Service experiment data, working FRAMEWORK flow trigger and supervision WEB SERVICE Servlets based on XML Suite CTRL protocol DRIVER FILESYSTEM & HARDWARE I/F Library CALL HW env virtualization; Storage + Execution LIB Data format conversion Client server AJAX (Asynchronous JAva Xml) based; interactive web app based on Javascript (GWT EXT); servlet XML CALL REGISTRY & DATABASE & EXPERIMENT INFORMATION DATA MINING MODELS Model Functionality LIBRARY RUN regression MLP Stand Alone GRID CLOUD INFO SESSIONS EXPERIMENTS

Two ways to use DAME -1 Simple user FRONT END WEB APPL. GUI DATA MINING MODELS Model Functionality LIBRARY RUN regression FRAMEWORK WEB SERVICE Suite CTRL servlet MLP DRIVER FILESYSTEM & HARDWARE I/F REGISTRY & DATABASE & EXPERIMENT INFORMATION Stand Alone GRID CLOUD INFO SESSIONS EXPERIMENTS

Two ways to use DAME -2 FRONT END WEB APPL. GUI DATA MINING MODELS Model Functionality LIBRARY RUN regression developer user FRAMEWORK WEB SERVICE Suite CTRL DMPLUGIN MLP DRIVER FILESYSTEM & HARDWARE I/F REGISTRY & DATABASE & EXPERIMENT INFORMATION Stand Alone GRID CLOUD INFO SESSIONS EXPERIMENTS

Two ways to use DAME -2 FRONT END WEB APPL. GUI DATA MINING MODELS Model Functionality LIBRARY RUN regression developer user FRAMEWORK WEB SERVICE Suite CTRL DMPLUGIN MLP DRIVER FILESYSTEM & HARDWARE I/F REGISTRY & DATABASE & EXPERIMENT INFORMATION Stand Alone GRID CLOUD INFO SESSIONS EXPERIMENTS

DAME on GRID Scientific Gateway GRID CE DM Models Job Execution The two DR component processes DR Execution DR Storage make GRID environment embedded to other components GRID SE User & Experiment tdata Archives REDB Logical DB for user and working session archive management XML XML FW GRID UI FE Browser Requests (registration, accounting, experiment configuration and submission) Client

Coming soon now: suite deployed on SCoPE GRID, currently under testing; package under test (beta SW & Manual already available for download); End of October 2009: beta version of Suite and released to the community; http://dame.na.infn.it/ Web applicationprototype http://voneural.na.infn.it/ Technicaland and management info Documents Science cases